case study

How we helped Avast analyze the Web's Privacy Policies

When the world's largest consumer cybersecurity company—Avast—was looking to develop a universal privacy score for every site on the web, they turned to Diffbot, the experts in automated web-scale extraction, to help them ship the project in record time.

Their system is so flexible that they fine-tuned it and were able to develop the Privacy Policy API really quickly. I would definitely recommend working and collaborating with Diffbot to any company that needs structured data from the web. Partnering with Diffbot was absolutely the right decision.

- Galina Alperovich, Staff Scientist, Avast

About Avast

Founded in 1988, Avast makes the popular suite of consumer security applications Avast Anti-virus, Browser, and VPN, which are used by over 435 million monthly active users and generate revenues of over $900 million. Avast recently merged with NortonLifeLock (formerly Symantec) to establish the world’s largest consumer cybersecurity company, which together employs thousands of people across the US and Europe working on AI, machine learning, and security.

Avast started out protecting consumers’ desktop computers from viruses and malware, but as people shifted more of their lives online, the company broadened its product suite to protect consumers on any device and against a wide range of internet threats, spanning web security, data breaches, online tracking, and privacy.

The Problem

Avast knew that its customers wanted tools to help them feel safe browsing the web, but its scoring models, which rate the trustworthiness of websites and applications, lacked a key input: information about each site’s privacy practices.

Data about which sites have suffered data breaches or been reported for phishing provides very clear signals, but these datasets are relatively small: a drop in the bucket compared to the vast number of websites that a user might want assessed.

In contrast, due to recent regulations like GDPR and CCPA, a large portion of the internet’s websites have a privacy policy that can be analyzed to assess the site’s privacy posture, and the omission of a privacy policy says something about a company’s privacy stance as well.

But gathering all of the web’s privacy policies, and maintaining a pipeline that receives updates to the privacy policy of every company in the world, would be a massive undertaking. Not only do very few organizations have expertise in real-time crawling of the entire web (Diffbot is the only US entity besides Google and Bing that maintains a commercial web crawl), but Avast would also have to develop sophisticated machine learning models to identify the subset of the web that consists of privacy policies, along with NLP models that read and understand the text of those policies at a human level.

Developing this pipeline in-house would be expensive in human resources, in the machine infrastructure needed to host a web-scale crawl, and, most importantly, in time to market.

The Solution

This is when Chandler Givens, Product Director of Consumer Privacy at Avast, connected with our founder and CEO Mike Tung, recognizing our expertise in automated web extraction and in developing the pipelines that construct the world’s largest Knowledge Graph. After a quick meeting with Chandler and the Avast data science team, Diffbot proposed leveraging two key components of its Knowledge Graph pipeline to solve the problem:

Automatic page classification and text extraction. Diffbot has a world-class system for webpage classification and automatic webpage extraction. Diffbot’s extraction API visually classifies pages into one of the 20 common types of pages on the web at greater than 97% accuracy, without rules, and has been battle-tested on nearly every page on the web after years of production crawling and serving customers like Bing, StumbleUpon, and DuckDuckGo. Additionally, Diffbot has the most accurate webpage-to-text extraction algorithms, relying on its visual techniques, a customized fork of the Chromium rendering engine, and years of extracting clean article text for companies like Instapaper, Snapchat, and AOL. By fine-tuning these models instead of developing new ones from scratch, we were able to build a production-grade extractor for privacy policies with a small amount of training data in a fraction of the time.
Firmographic enrichment from the web domain. To model the privacy stance of an organization, you first need to understand which organization corresponds to the website you are currently visiting. While this might sound trivial to a human browsing the website, implementing it in an automated way is not trivial at all. To implement this mapping, we used Diffbot’s Enhance API to pull in the Organization record from the over 300M Organizations available in the Diffbot Knowledge Graph, including all of the rich firmographic detail available, such as the firm’s number of employees, location, and industry, which provided additional signals for the trustworthiness of the entity.
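
As a rough illustration of how these two components fit together, here is a minimal Python sketch. It is not Diffbot’s or Avast’s actual code: the names `PageResult`, `PrivacyRecord`, `build_record`, and the `"privacy_policy"` label are hypothetical, and the two callables stand in for the page classification/extraction step and the Enhance lookup described above.

```python
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urlparse


@dataclass
class PageResult:
    """Hypothetical output of the classification + text extraction step."""
    page_type: str
    text: str


@dataclass
class PrivacyRecord:
    """Hypothetical combined record handed to a downstream scoring model."""
    domain: str
    policy_text: Optional[str]
    organization: Optional[dict]


def domain_of(url: str) -> str:
    """Reduce a URL to a bare host name, dropping a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def build_record(
    url: str,
    classify_and_extract: Callable[[str], PageResult],
    enrich_domain: Callable[[str], Optional[dict]],
) -> PrivacyRecord:
    """Combine both components for a single page.

    The two callables are placeholders (assumptions) for the real services.
    """
    page = classify_and_extract(url)
    domain = domain_of(url)
    # Keep the text only when the page was classified as a privacy policy.
    policy_text = page.text if page.page_type == "privacy_policy" else None
    return PrivacyRecord(domain, policy_text, enrich_domain(domain))
```

Passing the two steps in as callables keeps the sketch independent of any particular API client, so each step can be swapped or stubbed out when testing.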

Results

By combining its automatic classification and extraction with its Knowledge Graph technology, Diffbot was able to work with Avast’s data science team to ship the Privacy Policy analysis endpoint and integrate it into the production pipeline within 4 weeks.

Classification accuracy of the Privacy Policy API:

	Precision	Recall
Overall Performance	93.4%	100%
- English	94.9%	100%
- Non-English	91.1%	100%
Training Performance	100%	100%

(10-fold cross-validation error, N=1108)
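
For readers less familiar with these metrics, the short Python sketch below shows how precision and recall are computed from classification counts. The counts are made up for illustration and are not taken from the Avast evaluation; the function name `precision_recall` is likewise just a label chosen here.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Compute precision and recall from true/false positive and false negative counts."""
    # Precision: of the pages flagged as privacy policies, the share that really are.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real privacy policies, the share the classifier found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only (NOT the study's data):
# 90 policies found, 10 pages wrongly flagged, 0 policies missed.
p, r = precision_recall(tp=90, fp=10, fn=0)
print(f"precision={p:.1%} recall={r:.1%}")  # precision=90.0% recall=100.0%
```

Read this way, a recall of 100% means no privacy policy in the evaluation set was missed, while a precision below 100% means some pages flagged as privacy policies were not.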

Because we were able to ship a production-quality pipeline so quickly and had so much fun working with Avast’s data science team in the process, we decided to collaborate with Avast on a series of blog posts to educate consumers about privacy and on a dataset for academic researchers studying privacy. You can read more about the collaboration on the official Avast blog here: