Publication Date

2025-05

First Advisor

Shinyoung Cho

Document Type

Honors Project

Degree Name

Bachelor of Arts

Department

Computer Science

Keywords

BERT, OCSVM, isolation forest, machine learning, deep learning, block pages, OONI, fingerprints, measurement data, automated detection, feature selection

Abstract

Block pages are web pages delivered in place of expected content, ranging from blank screens or generic error messages to detailed notices explaining access restrictions. While they are not the primary mechanism of censorship, block pages serve as valuable signals that, when systematically detected, can reveal critical insights into how and why Internet censorship is implemented. Traditional heuristic techniques—such as fingerprint matching, word frequency analysis, response length comparisons, and manual verification—are often labor-intensive, brittle, and difficult to scale. As web content becomes increasingly dynamic and region-specific, these methods quickly become outdated, especially in global detection contexts. This thesis introduces a scalable and adaptive detection pipeline that leverages machine learning (ML) and deep learning (DL) techniques to automate the identification of block pages and discover emerging censorship fingerprints. Drawing on up-to-date measurement data from the Open Observatory of Network Interference (OONI), our system is designed to adapt to the evolving nature of censorship practices across diverse networks and geographies. Our best-performing model achieved an accuracy of 96%, demonstrating strong potential for real-world deployment. We evaluate the pipeline using two weeks of OONI data, capturing temporal, geographic, and network-level diversity in web connectivity. Our key contributions include: (1) integrating state-of-the-art ML and DL models for adaptive block page detection; (2) uncovering previously undocumented censorship fingerprints; and (3) offering practical insights into the effectiveness and limitations of automated detection methods. To support continued research and transparency in this domain, we will release our models, detection pipeline, and source code via a public GitHub repository. 
By advancing the automation of censorship detection, this work contributes to global efforts to safeguard Internet freedom and uphold digital rights.
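The traditional heuristics the abstract contrasts against — fingerprint matching and response-length comparison — can be sketched in a few lines. The fingerprint strings and length threshold below are illustrative assumptions, not the actual patterns or parameters used in the thesis:

```python
# Sketch of heuristic block-page detection: substring fingerprint
# matching plus a response-length comparison. Fingerprints and
# thresholds are hypothetical examples.

KNOWN_FINGERPRINTS = [
    "this site has been blocked",      # hypothetical ISP notice
    "access to this page is denied",   # hypothetical firewall message
]

TYPICAL_LENGTH = 5000   # assumed typical length of an uncensored response (bytes)
LENGTH_RATIO = 0.1      # responses far shorter than typical are suspicious

def looks_like_block_page(body: str) -> bool:
    """Flag a response body as a likely block page."""
    text = body.lower()
    # Strong signal: a known censorship fingerprint appears verbatim.
    if any(fp in text for fp in KNOWN_FINGERPRINTS):
        return True
    # Weak signal: the response is far shorter than expected content.
    return len(body) < TYPICAL_LENGTH * LENGTH_RATIO

print(looks_like_block_page("<html>Access to this page is denied.</html>"))  # True
```

The brittleness motivating the thesis is visible here: every new censorship notice requires a hand-curated fingerprint, which is why the pipeline replaces these rules with ML/DL classifiers trained on OONI measurement data.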

Rights

©2025 Paola Calle. Access limited to the Smith College community and other researchers while on campus. Smith College community members also may access from off-campus using a Smith College log-in. Other off-campus researchers may request a copy through Interlibrary Loan for personal use.

Language

English

Comments

vii, 47 pages: color illustrations, charts. Includes bibliographical references (pages 43-47).
