The Anatomy of a Small-Scale Hypertextual Web Search Engine

Adhish Ramkumar
Robert Maratos
Carnegie Mellon University

ABSTRACT

This paper studies the feasibility of implementing a search engine on
small web domains and how the web and commodity hardware have changed
since the early 2000s. To quantify this, the paper measures the
fan-out a web crawler encounters when crawling Carnegie Mellon
University subdomains, how quickly crawlers recover from failure, the
percentage of dead links encountered, and how long it takes to index
specific subdomains.

The system detailed in this paper was initially limited to scraping
pages under the domain of Carnegie Mellon University's Department of
Electrical & Computer Engineering (ece.cmu.edu), in order to avoid
storage limitations that could arise from the sheer volume of pages
published by the university. After evaluating the storage needs of the
smaller domain, the paper analyzes the performance of the system as it
scales up to include all pages published by Carnegie Mellon University
(www.cmu.edu).