Undergrad Research Project - Longitudinal Data Collection for Internet Censorship

Spring 2016

Srujana Peddada
Nicolas Christin

Project description

The extent of Internet censorship varies greatly by country. This research aims to better understand which topics and content are censored around the world. To observe censorship patterns, a large set of websites is accessed through proxy servers, so that each request appears to originate from within the country under study. If accessing a webpage returns a government notice stating that the content is censored, the page is classified as censored in that country. This covers only the easy case, however: a page that fails to load may also be censored, and it is important to classify it correctly. A distinction must be drawn between persistent failures, where the page is always inaccessible in a country, and transient failures, where the page happens to be inaccessible only at the time it is scanned. Longitudinal data collection must therefore run continuously, observing each page at many points in time, to make this distinction possible. Running the collector continuously requires automating several operations that are currently manual, which is the primary goal of my project. The automated system will populate a database with results for further analysis, and a status display will be created to help track data collection. Once the system is running automatically, the final part of the project will focus on determining which changes to a webpage indicate a censorship event.
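
As an illustration of the persistent versus transient distinction, the following is a minimal sketch in Python of how repeated observations of a page might be turned into a label. It is not the project's actual implementation: the record layout (Observation), the outcome names ("ok", "block_notice", "load_failure"), the label names, and the min_scans threshold are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One scan of one page through a proxy in one country (hypothetical record)."""
    country: str
    url: str
    timestamp: int  # scan time, e.g. seconds since the epoch
    outcome: str    # assumed values: "ok", "block_notice", "load_failure"


def classify(observations, min_scans=3):
    """Label each (country, url) pair using its whole scan history.

    min_scans is an assumed threshold: with fewer failed scans than this,
    a page that has only ever failed is left undetermined.
    """
    history = defaultdict(list)
    for obs in observations:
        history[(obs.country, obs.url)].append(obs.outcome)

    labels = {}
    for key, outcomes in history.items():
        if "block_notice" in outcomes:
            # Easy case: an explicit censorship notice was received.
            labels[key] = "censored"
        elif "load_failure" not in outcomes:
            labels[key] = "accessible"
        elif "ok" in outcomes:
            # Failed at some scans but loaded at others.
            labels[key] = "transient_failure"
        elif len(outcomes) < min_scans:
            # Only failures so far, but too few scans to call it persistent.
            labels[key] = "undetermined"
        else:
            labels[key] = "persistent_failure"
    return labels
```

A single scan cannot separate the two kinds of failure, which is why the sketch reports "undetermined" until enough observations have accumulated. A persistent failure is still only a candidate for censorship, since a page may also be down for everyone, so such a label would need further analysis.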

