Diffbot Challenge

Welcome to Diffbot's Machine Learning Challenge!

My name is Diffy. I will be your guide in the challenge.

If you have questions that are not answered on the website, don't hesitate to contact the Diffbot team!

The goal

In this contest, you will compete on a content extraction task. The goal is to extract relevant images from article web pages. The correct images are the ones actually related to the article. All other images are considered irrelevant. For instance, ads and share buttons are to be discarded, as are website banners and images in the related-articles section.

Example

OK, it's time to look at an example. Here is a link to an Ars Technica article:

http://arstechnica.com/gadgets/2012/12/the-ultimate-smartphone-guide-part-ii-being-entertained-and-going-social/

The link does not lead you to arstechnica.com but to an archive stored by Diffbot. This way, everybody will be able to access the page even after it is deleted from the Ars Technica website. It also ensures we all see the exact same page. Otherwise, advertisements and other elements on the page may differ from one user to another.

On the page, you can easily see which images are relevant to the article. Your goal will be to build a program that selects the relevant images from any article page.

Machine Learning

The extraction task can be solved using different machine learning techniques. For instance, it can be considered as a classification problem: for every image on the page, classify it as relevant or irrelevant. Another way to see it is as a candidate selection problem: given all the images on the page, select the ones that are most relevant to the article.
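
To make the two views concrete, here is a minimal sketch in Python. It is only an illustration, not a recommended solution: the rendered area used as a score, the `min_area` threshold, and the dictionary layout (`src`, `w`, `h`) are all assumptions made for this example.

```python
def area(image):
    """Hypothetical relevance score: rendered area in square pixels."""
    return image["w"] * image["h"]


def classify(images, min_area=40000):
    """Classification view: decide for each image independently."""
    return [img["src"] for img in images if area(img) >= min_area]


def select_top(images, k=1):
    """Candidate-selection view: rank all images of the page, keep the best k."""
    ranked = sorted(images, key=area, reverse=True)
    return [img["src"] for img in ranked[:k]]


# Made-up images for one page.
page_images = [
    {"src": "hero.jpg", "w": 640, "h": 360},
    {"src": "share.png", "w": 32, "h": 32},
    {"src": "inline.jpg", "w": 300, "h": 200},
]
print(classify(page_images))    # images whose area passes the threshold
print(select_top(page_images))  # the single largest image
```

The classification view can return any number of images per page, while the selection view always compares the images of one page against each other. A learned model would replace the hand-written `area` score in either case.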

Challengers are free to use any technique they see fit to extract images relevant to the article. We encourage innovative ways of solving this problem.

We would also like to point out that we will not use your code for commercial purposes.

If you are interested in participating, it is now time to hit the 'Getting Started' button!

To get started on the project, you should follow these steps:

Register your team.
Download the sample data set.
Build a first version of your program with minimal logic.
Run your program on the sample set and construct a valid submission file.

Data set

Reading the data

The sample zip contains a JSON file ('labels.json') and a set of files containing raw HTML. Each of these files contains the source code of one HTML page.

The JSON file contains labeling information for all of the archives. Here is how it is organized:

[
    {   
        "url": "original url of the web page", 
        "id": "name of the file containing the corresponding HTML.",
        "xpath": "xpath string pointing to the images labeled as relevant"
    },
    ...
]

Every field is pretty self-explanatory for those familiar with web development. Each URL has an associated archive. The archive file contains the HTML code of the page. You should be able to transform the string found in this file into a DOM object (a tree) using an HTML parser. Most languages have plenty of HTML parsers available.

The labeling information is contained in the "xpath" field. XPath is a query language for selecting nodes in a DOM object. Here, the "xpath" points to all the image nodes labeled as relevant. There are plenty of open-source libraries you can use for XPath node selection.
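
As a starting point, the sketch below loads the labels with Python's standard library only. It assumes the sample zip was extracted into a single folder with 'labels.json' next to the archive files; the helper names `load_labels` and `read_archive` are made up for this example. Evaluating the "xpath" string itself requires an XPath-capable HTML library, which is left out here.

```python
import json
from pathlib import Path


def load_labels(sample_dir):
    """Return a dict mapping each archive id to its label entry.

    Assumes 'labels.json' sits in `sample_dir` next to the HTML archives.
    """
    sample_dir = Path(sample_dir)
    with open(sample_dir / "labels.json", encoding="utf-8") as f:
        entries = json.load(f)
    return {entry["id"]: entry for entry in entries}


def read_archive(sample_dir, entry):
    """Read the raw HTML of the archive named by the entry's "id" field."""
    path = Path(sample_dir) / entry["id"]
    # Undecodable bytes are replaced so that one odd page does not stop the run.
    return path.read_text(encoding="utf-8", errors="replace")
```

With these helpers, `load_labels("sample")` gives access to each entry's "url" and "xpath", and `read_archive("sample", entry)` returns the HTML string to hand to a parser. The folder name "sample" is a placeholder.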

Diffbot additions

Diffbot added rendering information to the archives to help compute features for the different images. You are free to ignore it, but we strongly encourage you to use it, since it makes it much easier to compute precise features. To 'render' a page means to determine how it should be shown to a user in their browser. Diffbot renders the page (as browsers do) to fit a width of 1024 pixels.

So in the archives, each element in the DOM has an additional HTML attribute called "_". If you parse this attribute, you can access the following information:

x : x-coordinate (in pixels)
y : y-coordinate (in pixels)
w : width (in pixels)
h : height (in pixels)

Example: <img src="smartphone-feat2.jpg" _="x=22,y=392,w=640,h=360,dis=block">
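
A minimal sketch of how this attribute could be read, using only Python's built-in `html.parser`. The names `parse_render_info`, `ImageCollector` and `collect_images` are made up for this example, and the comma-separated key=value layout is inferred from the single example above, so treat it as an assumption.

```python
from html.parser import HTMLParser


def parse_render_info(value):
    """Parse a "_" attribute such as "x=22,y=392,w=640,h=360,dis=block".

    Integer-looking values become ints; anything else (e.g. "dis=block")
    is kept as a string.
    """
    info = {}
    for pair in value.split(","):
        key, sep, val = pair.partition("=")
        if not sep:
            continue  # skip fragments that are not key=value
        key, val = key.strip(), val.strip()
        try:
            info[key] = int(val)
        except ValueError:
            info[key] = val
    return info


class ImageCollector(HTMLParser):
    """Collect the src and render info of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        self.images.append({
            "src": attr.get("src"),
            "render": parse_render_info(attr.get("_") or ""),
        })


def collect_images(html):
    """Return one dict per <img> found in the HTML string."""
    collector = ImageCollector()
    collector.feed(html)
    return collector.images
```

Calling `collect_images` on the HTML of an archive gives, for each image, its `src` and a dictionary with `x`, `y`, `w` and `h`, from which features such as the rendered area (`w * h`) follow directly.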

Submission format

You will need to submit a JSON file. It must contain the name of your team under "team_name" and a list called "answers". Each JSON object in "answers" corresponds to what you extracted from one page.

{
    "team_name": "the name of your team",
    "answers":
    [
        {
            "url": "original url of the web page",
            "id": "name of the file containing the corresponding raw html",
            "image_src":
            [
                "source of selected image 1",
                "source of selected image 2",
                ...
            ]
        },
        ...
    ]
}
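
One possible way to produce such a file in Python is sketched below. The `(url, archive_id, image_srcs)` tuples are a hypothetical intermediate format chosen for this example, and so are the function names. The key spelling "answers" follows the prose description above; it is worth confirming the exact key names with the Diffbot team before submitting.

```python
import json


def build_submission(team_name, pages):
    """Build the submission structure.

    `pages` is an iterable of (url, archive_id, image_srcs) tuples,
    a hypothetical format used only in this sketch.
    """
    answers = [
        {"url": url, "id": archive_id, "image_src": list(image_srcs)}
        for url, archive_id, image_srcs in pages
    ]
    return {"team_name": team_name, "answers": answers}


def write_submission(path, submission):
    """Serialize the submission to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(submission, f, indent=4)
```

Letting `json.dump` do the serialization avoids hand-written mistakes such as missing or trailing commas, which would make the file invalid JSON.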

This file will be compared to our reference files to evaluate your program. More information about that in the 'evaluation' tab!

How can I get the data and how can I use it?

You can download a sample of the data using this link: Download Sample

Full training set (about 600 pages): Download train set

Public test set (about 200 pages, without labels): Download public test set

It may not be easy to get an idea of what a page looks like just by reading the HTML, so we provide a complete replay of the given samples, with images.

You can see every example in the train and test sets using the following URLs:

  http://diffbot.com/robotlab/DiffbotContest/replay/archive/< id >/RENDERED

#	Archive Replay
1	http://arstechnica.com/gadgets/2012/12/the-ultimate-smartphone-guide-part-ii-being-entertained-and-going-social/
2	http://www.texasmonthly.com/2012-12-01/feature5.php
3	http://www.anandtech.com/show/6506/msi-fm2a85xag65-review-know-your-platform
4	http://www.slate.com/articles/technology/holidays/2012/12/parrot_ar_drone_2_0_quadricopter_and_air_hogs_battle_tracker_reviews_of.html
5	http://www.nytimes.com/2012/12/22/us/politics/next-move-is-obamas-after-boehners-tax-plan-fails.html
6	http://www.shape.com/weight-loss/weight-loss-strategies/ask-diet-doctor-how-can-shift-workers-lose-weight
7	http://www.shape.com/blogs/shape-your-life/new-year%E2%80%99s-eve-beauty-wear-glitter
8	http://gigaom.com/2013/03/13/chris-wetherll-google-reader/
9	http://www.thecmuwebsite.com/article/new-copyright-rules-in-spain-will-target-advertisers-on-piracy-sites/
10	http://www.latimes.com/news/local/la-me-pot-measures-20130422,0,2113230.story

Deadline

The submissions are due on December 10th, 2013, before 11:59pm.

Evaluation

The full training set and a public test set are now available (see "Data set"). You should run your program on this test set and submit your JSON extraction file (see Getting Started and Submission). We will then score your extraction and update the leaderboard.

On the day the contest ends, we will distribute the true test set and give each group a short time window to run their program and return their extraction JSON to us. The score on this final test set will determine the final ranking.

Ranking

Your image extraction algorithm will be scored using the F-score, which combines precision (the number of correct images returned divided by the total number of images returned) and recall (the number of correct images returned divided by the total number of correct images to return).
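
To check your program locally on the training set, a sketch like the following may help. It assumes the balanced F-score (the harmonic mean of precision and recall) and compares images by their source string for a single page. This page does not say which F-score variant is used or how scores are aggregated across pages, so official scores may differ.

```python
def precision_recall_fscore(predicted, relevant):
    """Return (precision, recall, F-score) for one page.

    `predicted` and `relevant` are collections of image sources.
    The balanced F-score used here is an assumption, not the official formula.
    """
    predicted, relevant = set(predicted), set(relevant)
    correct = len(predicted & relevant)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore
```

For example, returning four images of which two are among three relevant ones gives a precision of 2/4 and a recall of 2/3.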

The team with the best F-score will win the contest.

Submission

Send your JSON extraction file to support@diffbot.com. We will update the leaderboard shortly.

Leaderboard

#	Group name	Precision	Recall	F-score
1	Relevanseek	55.3	72.2	62.7
2	groupB	0.0	0.0	0.0
3	groupC	0.0	0.0	0.0
4	groupD	0.0	0.0	0.0

What can I gain if my algorithm performs best?

The top three (3) teams will each receive one Raspberry Pi Model B per team member.

We will not use your code for commercial purposes. However, we may try to hire you if you perform well.

For any questions concerning the contest, send an email to support@diffbot.com.

Check out our website at diffbot.com.

Register your team

Fill out the form here