Article API: HTML Field Specification
HTML Field Specification
Diffbot's html
field returns normalized HTML maintaining the structure and layout of the source article, while standardizing its element and attributes for reliable parsing and processing. Content will be normalized into the following elements and attributes:
Element | Attributes | Description |
---|---|---|
* |
data-* |
As of January 2017 normalized HTML will retain and return data-* attributes. |
Block elements | ||
p |
— | Unless returned within a more specific element below, all text will be returned within p elements at the top-level of the HTML response. |
h1 - h5 |
— | Headers will be maintained if originally provided. |
aside |
— | Returned at top-level of HTML response. |
blockquote |
— | Returned at top-level of HTML response. |
code , pre |
— | Returned at top-level of HTML response. |
ul , ol |
start |
Returned at top-level of HTML response. |
li |
— | |
table |
Original content within table elements will be largely retained, including images and other media items. |
|
tbody |
— | |
th |
— | |
tr |
— | |
td |
valign , colspan |
|
dl |
— | Returned at top-level of HTML response. |
dt |
— | |
dd |
— | |
Inline elements | ||
br |
— | Single linebreaks entities will be maintained in markup and returned as <br> . Double-linebreaks will be removed and surrounding content will be returned within p block elements. |
b , strong |
— | Inline emphasis tags will be retained inside of other elements. |
i , em |
— | |
u |
— | |
sup |
— | |
sub |
— | |
a |
href |
Anchor tags and their href values will be retained. |
Media | ||
figure |
— | Media elements will be returned at the top-level of the HTML content and contained within figure tags. |
img |
src , alt , srcset , sizes |
Image layout specifics (floats, etc.) and CSS-specified widths/heights will be discarded. |
video /audio |
src |
The child source elements within video and audio elements will be retained along with the type attribute, if provided. |
source |
src , type , srcset , sizes |
|
figcaption |
— | If present, media captions will be returned as figcaption elements within the figure container. |
iframe |
src , frameborder |
|
embed , object |
src , type |
Example HTML Response
<p>Diffbot's human wranglers are proud today to announce the release of our newest product: an API for... products!</p> <p>The <a href="http://www.diffbot.com/products/automatic/product">Product API</a> can be used for extracting clean, structured data from any e-commerce product page. It automatically makes available all the product data you'd expect: price, discount/savings amount, shipping cost, product description, any relevant product images, SKU and/or other product IDs.</p> <p>Even cooler: pair the Product API with <a href="http://www.diffbot.com/products/crawlbot">Crawlbot</a>, our intelligent site-spidering tool, and let Diffbot determine which pages are products, then automatically structure the entire catalog. Here's a quick demonstration of Crawlbot at work:</p> <figure> <iframe frameborder="0" src="http://www.youtube.com/embed/lfcri5ungRo?feature=oembed"></iframe> </figure> <p>We've developed the Product API over the course of two years, building upon our core vision technology that's extracted structured data from billions of web pages, and training our machine learning systems using data from tens of thousands of unique shopping sites. We can't wait for you to try it out.</p> <p>What are you waiting for? Check out the <a href="http://www.diffbot.com/products/automatic/product">Product API documentation</a> and dive on in! If you need a token, check out our <a href="http://www.diffbot.com/pricing">pricing and plans</a> (including our Free plan).</p> <p>Questions? Hit us up at <a href="mailto:support@diffbot.com">support@diffbot.com</a>.</p>