How to Read the URL Report

The URL Report is useful for troubleshooting errors in bulk jobs and for general logging purposes.

The report is a comma-separated values (CSV) file. You can download it from the bulk job or crawl job status page on the Dashboard as soon as a job has begun.


The URL Report may also be retrieved programmatically via the Bulk Job Data API or the Crawl Job Data API.
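
As a rough illustration of programmatic retrieval, the sketch below builds a request URL and saves the response to disk. The endpoint paths, the `token`, `name`, and `type=urls` query parameters, and both function names are assumptions made for this example, not details confirmed by this page; check them against the Bulk Job Data API and Crawl Job Data API references before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# ASSUMPTION: these endpoint paths are illustrative and must be verified
# against the Crawl Job Data API and Bulk Job Data API references.
ASSUMED_DATA_ENDPOINTS = {
    "crawl": "https://api.diffbot.com/v3/crawl/data",
    "bulk": "https://api.diffbot.com/v3/bulk/data",
}


def build_url_report_request(job_type, token, job_name):
    """Return the request URL for a job's URL Report (hypothetical helper)."""
    if job_type not in ASSUMED_DATA_ENDPOINTS:
        raise ValueError(f"job_type must be 'crawl' or 'bulk', got {job_type!r}")
    # ASSUMPTION: 'token', 'name', and 'type=urls' are the expected parameters.
    query = urlencode({"token": token, "name": job_name, "type": "urls"})
    return f"{ASSUMED_DATA_ENDPOINTS[job_type]}?{query}"


def download_url_report(job_type, token, job_name, dest_path):
    """Download the report CSV to dest_path and return that path."""
    url = build_url_report_request(job_type, token, job_name)
    with urlopen(url, timeout=60) as response, open(dest_path, "wb") as out:
        out.write(response.read())
    return dest_path
```

Calling `download_url_report("crawl", "YOUR_TOKEN", "my-crawl", "urls.csv")` would then write the report to `urls.csv`, provided the assumptions above hold for your account.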

The Bulk API shares much of its underlying architecture with the Crawl API. As a result, the two APIs share the URL Report as well as other operational conventions.

Each row of the URL Report corresponds to a single evaluated URL and provides the following information:

| Column | Description |
| --- | --- |
| URL | Web page URL (normalized). Because of URL normalization, URL Report values may not match submitted URLs exactly. |
| Doc ID | Document ID of the crawled page. This corresponds to the parentUrlDocId field returned in crawl or bulk job JSON data. |
| URL Discovered Time | Time the URL was first encountered. |
| Crawled Time | Time the URL was crawled (downloaded and its source spidered for links). |
| Content Length | Number of characters in the HTML source. |
| Duplicate Of | If the page source is an exact duplicate of another page, the Doc ID of the page it duplicates. |
| Redirects | Number of redirects followed before arriving at the final destination URL. |
| Redirected To | Final destination URL, if the request was redirected. |
| Robots.txt Crawl Delay (ms) | If the page is subject to a robots.txt "crawl delay", the delay in milliseconds. |
| Crawl Round | If the job is a repeating/recurring job, the crawl "round" in which this URL was evaluated. Note: a URL appears once for each round in which it is processed. |
| Crawl Try # | Crawl only: if there is an error crawling the page (spidering for links), each retry is enumerated. |
| Hop Count | Crawl only: the page's distance from the seed(s). "1" indicates the URL was linked to from a seed; "2" indicates the URL appeared on a page that was itself linked to from a seed. Hops can be used to narrow crawling via Crawlbot's maxHops argument. |
| Crawl Status | Returns "Success" if the page was successfully crawled (spidered for links). |
| Diffbot URI | If the page was processed via a Diffbot API and an object (product, article, image, discussion, etc.) was found, the object's diffbotUri. |
| Process Attempted | Indicates whether the page was sent to a Diffbot API for processing. |
| Process Response | Indicates whether the Diffbot processing was successful. |
| Proxy Used | Indicates whether a proxy IP address was used for the URL. Read more on proxies. |
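
For bulk troubleshooting, a downloaded report can be scanned with a short script. The following is a minimal sketch that assumes the CSV header row uses the column names listed above verbatim (`URL`, `Crawl Round`, `Crawl Status`); the actual header spelling may differ, so adjust the keys to match your file. The function names and the sample rows, including the "Example error" status text, are invented for illustration.

```python
import csv
import io
from collections import Counter


def read_report(csv_text):
    """Parse report text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def crawl_failures(rows):
    """Rows whose Crawl Status is anything other than "Success"."""
    return [r for r in rows if (r.get("Crawl Status") or "").strip() != "Success"]


def latest_round_only(rows):
    """Keep one row per URL, preferring the highest Crawl Round.

    Recurring jobs repeat a URL once per round; an empty round counts as 0.
    """
    latest = {}
    for row in rows:
        url = row["URL"]
        round_no = int(row.get("Crawl Round") or 0)
        if url not in latest or round_no >= int(latest[url].get("Crawl Round") or 0):
            latest[url] = row
    return list(latest.values())


def status_counts(rows):
    """Tally rows by Crawl Status value."""
    return Counter((r.get("Crawl Status") or "").strip() or "(empty)" for r in rows)


# Invented sample data; real reports contain many more columns.
SAMPLE = """URL,Crawl Round,Crawl Status
https://example.com/a,1,Success
https://example.com/b,1,Example error
https://example.com/b,2,Success
"""

if __name__ == "__main__":
    rows = read_report(SAMPLE)
    for row in crawl_failures(rows):
        print(row["URL"], row["Crawl Round"], row["Crawl Status"])
    print(status_counts(latest_round_only(rows)))
```

Filtering to the latest round before counting avoids reporting a URL as failed when a later round crawled it successfully.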