Frontpage API
Request
To use the Frontpage API, perform an HTTP GET request on the following endpoint:
http://www.diffbot.com/api/frontpage?token=...&url=...
Provide the following arguments:
Parameter | Description |
---|---|
token | Developer token |
url | Frontpage URL from which to extract items (URL encoded) |
Optional parameters | |
timeout | Specify a value in milliseconds (e.g., &timeout=15000) to override the default API timeout of 5000ms. |
format | Format the response output in xml (default) or json |
all | Returns all content from page, including navigation and similar links that the Diffbot visual processing engine considers less important / non-core. |
Basic authentication | |
To access pages that require a login/password (using basic access authentication), include the username and password in your url parameter, e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com |
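The request parameters above can be assembled into a GET URL as in the following minimal sketch. The helper name `build_frontpage_url` and its keyword arguments are hypothetical and chosen only for illustration; the endpoint and the parameter names (`token`, `url`, `timeout`, `format`) come from the table. The `all` parameter is left out because its value syntax is not specified here.

```python
from urllib.parse import urlencode

# Endpoint as given in this document.
ENDPOINT = "http://www.diffbot.com/api/frontpage"


def build_frontpage_url(token, page_url, timeout_ms=None, fmt=None):
    """Build a Frontpage API GET URL (hypothetical helper, not part of any SDK).

    urlencode() percent-encodes the page URL, which the `url` parameter requires.
    """
    params = {"token": token, "url": page_url}
    if timeout_ms is not None:
        params["timeout"] = str(timeout_ms)  # milliseconds; the API default is 5000
    if fmt is not None:
        params["format"] = fmt  # "xml" (default) or "json"
    return f"{ENDPOINT}?{urlencode(params)}"


# YOUR_TOKEN is a placeholder for a real developer token.
print(build_frontpage_url("YOUR_TOKEN", "http://www.example.com/",
                          timeout_ms=15000, fmt="json"))
```

The resulting string can be passed to any HTTP client, for example `urllib.request.urlopen`.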
Alternatively, you can POST the content to analyze directly to the same endpoint. Specify the Content-Type header as either text/plain or text/html.
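A POST request of this kind might be prepared as in the sketch below. It assumes, without confirmation from this document, that `token` and `url` are still passed in the query string when the page content is sent in the request body; the helper name `build_post_request` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint as given in this document.
ENDPOINT = "http://www.diffbot.com/api/frontpage"


def build_post_request(token, page_url, html):
    """Prepare (but do not send) a POST carrying the markup to analyze.

    Assumption: token and url stay in the query string for POST requests.
    """
    query = urlencode({"token": token, "url": page_url})
    return Request(
        f"{ENDPOINT}?{query}",
        data=html.encode("utf-8"),              # request body: the content to analyze
        headers={"Content-Type": "text/html"},  # or "text/plain"
        method="POST",
    )


# YOUR_TOKEN and the markup are placeholders.
req = build_post_request("YOUR_TOKEN", "http://www.example.com/",
                         "<html><body><h1>Example</h1></body></html>")
print(req.get_method(), req.full_url)
```

The prepared request could then be sent with `urllib.request.urlopen(req)`.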
Response
DML (Diffbot Markup Language) is an XML format for encoding the structural information extracted from the page. A DML document consists of a single info section and a list of items.
Info field | Type | Description |
---|---|---|
id | long | DMLID of the URL |
title | string | Extracted title of the page |
sourceURL | url | The URL from which this document was extracted |
icon | url | A link to a small icon/favicon representing the page |
numItems | int | The number of items in this DML document |
Some of the fields found in Items
Item field | Type | Description |
---|---|---|
id | long | Unique hashcode/id of item |
title | string | Title of item |
description | string | innerHTML content of item |
xroot | xpath | XPath of the location where the item was found on the page |
pubDate | timestamp | Timestamp when item was detected on page |
link | URL | Extracted permalink (if applicable) of item |
type | {IMAGE,LINK,STORY,CHUNK} | Extracted type of the item, whether the item represents an image, permalink, story (image+summary), or html chunk. |
img | URL | Extracted image from item |
textSummary | string | A plain-text summary of the item |
sp | double in [0,1] | Spam score - the probability that the item is spam/ad |
sr | double in [1,5] | Static rank - the quality score of the item on a 1 to 5 scale |
fresh | double in [0,1] | Fresh score - the percentage of the item that has changed compared to the previous crawl |
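The score fields can be used to filter and order extracted items. The sketch below assumes the items have already been parsed into Python dictionaries keyed by the field names in the table (for instance from a `format=json` response, whose exact layout is not described here). The function name `rank_items`, the 0.5 spam threshold, and the sample data are all hypothetical.

```python
def rank_items(items, max_spam=0.5):
    """Drop likely spam/ad items, then sort the rest by static rank (best first).

    `items` is assumed to be a list of dicts with the documented fields:
    sp (spam probability, 0 to 1) and sr (static rank, 1 to 5).
    """
    kept = [it for it in items if float(it.get("sp", 0.0)) <= max_spam]
    return sorted(kept, key=lambda it: float(it.get("sr", 1.0)), reverse=True)


# Made-up sample items for illustration only.
sample = [
    {"title": "Story A", "type": "STORY", "sp": 0.05, "sr": 4.2},
    {"title": "Banner", "type": "LINK", "sp": 0.93, "sr": 1.0},
    {"title": "Story B", "type": "STORY", "sp": 0.10, "sr": 4.8},
]

for item in rank_items(sample):
    print(item["title"], item["sr"])
```

The threshold is a tuning choice, so a stricter or looser cutoff may suit a particular page better.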