Getting Started with Bulk Extract

Bulk Extract lets you send a large quantity of URLs through any Diffbot Extract API for fast, asynchronous processing.

The Bulk API sends all submitted page URLs to an Extract API (either automatic or custom). All structured page results are then compiled into a single "collection," which can be downloaded in full or searched.

Note: Bulk Extract is not a crawler; it does not spider a site for additional links. You must supply each URL you wish to process. For crawling/spidering, see Crawl.

For documentation on how to use Bulk Extract via API, check out Introduction to Bulk API.

🚧

Access to Bulk API is Limited to Plus Plans and Up

Upgrade to a Plus plan anytime at diffbot.com/pricing, or contact Diffbot for more information.

How to Use Bulk Extract on the Dashboard

Each bulk job requires the following:

  • A name (e.g. NewProducts).
  • Multiple URLs to process, one per line. Jobs require at least 50 URLs.
  • An Extract API to be used for processing pages.

The above image shows how to extract all the articles from Diffbot's old knowledgebase.
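As a rough illustration of the URL requirements above, the following sketch cleans a pasted list before you submit it in the Dashboard. The function name `prepare_bulk_urls` is hypothetical, and removing duplicates is an added convenience that the requirements above do not mention; only the one-URL-per-line format and the 50-URL minimum come from this page.

```python
def prepare_bulk_urls(raw_text, min_urls=50):
    """Turn pasted text into a clean list of URLs, one per input line.

    Hypothetical helper: only the 50-URL minimum comes from the docs above.
    """
    urls = []
    seen = set()
    for line in raw_text.splitlines():
        url = line.strip()
        if not url or url in seen:
            continue  # skip blank lines and exact duplicates (our own choice)
        seen.add(url)
        urls.append(url)
    if len(urls) < min_urls:
        raise ValueError(
            f"bulk jobs need at least {min_urls} URLs, got {len(urls)}"
        )
    return urls
```

Joining the returned list with newlines gives text that can be pasted into the URL field of a new job.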

Passing Extract API Querystring Arguments

Bulk Extract acts as a controller that sends each page to the selected Extract API for processing. By default, each page is sent as a generic request and returns that API's default fields. Each Extract API also accepts optional querystring arguments that modify the information returned, most commonly the fields argument, which adds optional fields to the Diffbot response.

For example, handing Bulk URLs to the Article API is equivalent to calling https://api.diffbot.com/v3/article?url=[url] for each one. Adding a querystring such as &timeout=10000 applies the same argument to every API call: https://api.diffbot.com/v3/article?url=[url]&timeout=10000
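The equivalence can be sketched in a few lines of Python. This only builds the per-URL call string described above; `extract_call_url` is a hypothetical helper, not part of any Diffbot client, and a real request would also need your account credentials, which are left out here just as they are in the example URLs.

```python
from urllib.parse import urlencode

API_BASE = "https://api.diffbot.com/v3"  # base URL taken from the example above


def extract_call_url(api, page_url, extra_args=None):
    """Build the Extract API call that one bulk URL corresponds to.

    `extra_args` stands in for the job's querystring arguments.
    """
    params = {"url": page_url}
    params.update(extra_args or {})
    # urlencode percent-encodes the page URL so it survives as a single value
    return f"{API_BASE}/{api}?{urlencode(params)}"


print(extract_call_url("article", "https://example.com/a", {"timeout": 10000}))
# https://api.diffbot.com/v3/article?url=https%3A%2F%2Fexample.com%2Fa&timeout=10000
```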

Notifications

You can choose to be notified at the conclusion of each bulk job, either by webhook or email.

For webhook notifications, you will need to supply a URL that is capable of receiving a POST request.
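A minimal sketch of an endpoint that can receive such a POST, using only the Python standard library. The shape of the notification body is not described on this page, so the handler makes no assumptions about it: it tries to parse JSON and otherwise stores the raw text. The names `BulkWebhookHandler`, `make_server`, and `received` are illustrative.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # notification bodies seen so far (in memory, for the sketch only)


class BulkWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the sender declared
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = json.loads(body)
        except ValueError:
            # Payload format is an unknown here, so fall back to raw text
            payload = body.decode("utf-8", errors="replace")
        received.append(payload)
        self.send_response(200)  # acknowledge receipt
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence per-request logging


def make_server(host="127.0.0.1", port=8000):
    """Create (but do not start) the webhook listener."""
    return HTTPServer((host, port), BulkWebhookHandler)
```

Calling `make_server().serve_forever()` starts the listener. To receive real notifications, the URL you supply must be reachable from the public internet, which a localhost address is not.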

Accessing Bulk Job Data

🚧

Completed or paused bulk jobs will be automatically deleted after 10 days.

You can access processed data anytime during your bulk job, or after it completes. There are two download options within the Dashboard interface:

  • Full JSON Output: A single file, in valid JSON, containing all of the processed objects from your job.
  • CSV Output: A single comma-separated-values file of the top-level objects. Nested elements (article images, tags, etc.) will not be returned in the CSV.
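To illustrate what "top-level only" means for the CSV, here is a sketch that flattens already-parsed JSON objects in a similar spirit. It is not Diffbot's exporter: it assumes the Full JSON Output has been loaded into a list of dictionaries, and the field names in the usage example are made up for illustration.

```python
import csv
import io


def top_level_csv(objects):
    """Write only the scalar, top-level fields of each object as CSV text."""
    # Drop nested values (lists and dicts), e.g. images or tags
    rows = [
        {k: v for k, v in obj.items() if not isinstance(v, (list, dict))}
        for obj in objects
    ]
    # Collect column names in first-seen order across all rows
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

For example, `top_level_csv([{"title": "A", "images": [{"url": "x"}]}])` yields a CSV with a single `title` column, because the nested `images` list is dropped.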

If you only need a subset of your data, Search (DQL API) is more flexible: it lets you query your bulk collection and retrieve only the matching items.

Speed and Results Ordering

The Bulk API/Bulk Processing service extracts data from multiple pages simultaneously and indexes the data as it is returned from the Extract APIs.

Many factors affect when data is successfully returned, among them the speed of a site’s response, the need for retries if a site returns a temporary error, and the potential for incorrect or invalid URLs. As a result, there is no reliable order to the output of a downloaded CSV or JSON result set.
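If your downstream code needs results in the order you submitted them, you can restore that order yourself after downloading. The sketch below assumes each result object carries the URL it was extracted from in a field called `pageUrl`; that field name is an assumption here, so check your own output and pass a different `url_field` if needed. It also assumes the stored URL matches the submitted URL exactly, which may not hold after redirects.

```python
def reorder_by_input(input_urls, results, url_field="pageUrl"):
    """Return results in submission order, plus the URLs with no result.

    `url_field` is an assumed field name; adjust it to match your data.
    """
    by_url = {}
    for obj in results:
        # Keep the first object seen for each URL
        by_url.setdefault(obj.get(url_field), obj)
    ordered = [by_url[u] for u in input_urls if u in by_url]
    missing = [u for u in input_urls if u not in by_url]
    return ordered, missing
```

The `missing` list is a quick way to spot URLs that failed or have not been processed yet.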

The performance of Diffbot’s Bulk processing service depends on many of the same factors. The most common reason a bulk job returns data more slowly than expected, however, is that its URLs come from a limited number of sources, from very popular sources, or both. Diffbot maintains a global queue to stay polite toward individual domains and IP addresses and to avoid overloading individual servers and sites. A bulk job with URLs from a single domain will therefore finish much more slowly than one with URLs from many different domains.

Note that spreading URLs across many different jobs will not improve performance, because the global queue limits how often any part of Diffbot’s infrastructure visits a single site.
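Since jobs concentrated on a few domains tend to run slower, it can help to check how concentrated a URL list is before submitting it. This sketch simply counts URLs per hostname; the helper names are hypothetical, and the result is only a rough indicator, not a prediction of how Diffbot's queue will actually schedule the job.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_counts(urls):
    """Count URLs per hostname (case-insensitive)."""
    return Counter(urlparse(u).netloc.lower() for u in urls)


def largest_domain_share(urls):
    """Return (hostname, fraction of the list) for the most common hostname."""
    if not urls:
        raise ValueError("URL list is empty")
    domain, count = domain_counts(urls).most_common(1)[0]
    return domain, count / len(urls)
```

A share close to 1.0 means nearly every URL points at one site, which is the case the section above describes as slowest.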