C Implementation of Google's MapReduce: Simplified Data Processing on Large Clusters

Supriya Kher, Xiaohui Wang, and Shivani Kirubanandan

The project aims to implement the MapReduce programming model. MapReduce is designed to process large datasets quickly: programs written in its functional style are automatically parallelized to run on large clusters. In this model, a user program forks into a master and a set of workers. The workers retrieve the input data in parallel, in splits of 16-64 MB, and the data is then turned into key-value pairs.

The project shall initially implement MapReduce on a single system, with the master and workers as processes. In the future, the implementation shall be ported to a distributed environment, with appropriate master and worker stubs that communicate via RPCs.

For now, we are considering two kinds of HTTP applications to implement with this model:

- Distributed Grep: A particular keyword is searched for across a set of documents located on different servers, and the keyword is retrieved along with the id of each matching document.
- Page Ranking: Pages are ranked by the relative frequency with which a particular word occurs in each document. Hence, if a word "a" occurs more frequently in doc 1 than in doc 2, then doc 1 is ranked higher in relevance than doc 2.

MapReduce achieves reliability in a distributed environment by parceling out a number of operations on the set of data to each node in the network. Each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than the reporting interval, the master node records the node as dead and sends the data assigned to that node out to other nodes.

The end user shall see the system as a web service that supports the applications mentioned above. The target implementation shall be tested on Emulab to evaluate its scalability and fault tolerance. It shall also be tested on a cluster of machines to evaluate its performance in a real-world scenario.
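
The failure-handling rule described earlier (a node that stays silent for too long is recorded as dead and its data is sent to other nodes) can be sketched in C as follows. This is only an illustration under stated assumptions, not the project's actual design: the names master_t, master_init, master_heartbeat, and master_sweep are hypothetical, the worker and split counts are fixed constants, timestamps are plain integers supplied by the caller, and the round-robin reassignment policy is an assumption that the proposal does not specify.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_WORKERS 3      /* assumed cluster size, for illustration only */
#define NUM_SPLITS  6      /* assumed number of input splits */
#define UNASSIGNED  (-1)   /* marks a split that no live worker owns */

/* Hypothetical bookkeeping kept by the master. */
typedef struct {
    long last_seen[NUM_WORKERS]; /* time of each worker's latest report */
    bool alive[NUM_WORKERS];     /* false once a worker is recorded dead */
    int  owner[NUM_SPLITS];      /* worker currently assigned each split */
} master_t;

/* Start with every worker alive and the splits dealt out round-robin. */
void master_init(master_t *m, long now)
{
    for (int w = 0; w < NUM_WORKERS; w++) {
        m->last_seen[w] = now;
        m->alive[w] = true;
    }
    for (int s = 0; s < NUM_SPLITS; s++)
        m->owner[s] = s % NUM_WORKERS;
}

/* Record a status report from worker w. Reports from a worker that has
 * already been recorded dead are ignored in this sketch. */
void master_heartbeat(master_t *m, int w, long now)
{
    assert(w >= 0 && w < NUM_WORKERS);
    if (m->alive[w])
        m->last_seen[w] = now;
}

/* Return the first live worker at or after index start (wrapping around),
 * or UNASSIGNED if no worker is alive. */
static int next_live_worker(const master_t *m, int start)
{
    for (int i = 0; i < NUM_WORKERS; i++) {
        int w = (start + i) % NUM_WORKERS;
        if (m->alive[w])
            return w;
    }
    return UNASSIGNED;
}

/* Record as dead every worker silent for longer than timeout, then hand
 * the splits of dead workers to the remaining live workers.
 * Returns the number of workers newly recorded dead. */
int master_sweep(master_t *m, long now, long timeout)
{
    int newly_dead = 0;

    /* Pass 1: detect silent workers. */
    for (int w = 0; w < NUM_WORKERS; w++) {
        if (m->alive[w] && now - m->last_seen[w] > timeout) {
            m->alive[w] = false;
            newly_dead++;
        }
    }

    /* Pass 2: reassign any split whose owner is dead or missing. */
    int cursor = 0;
    for (int s = 0; s < NUM_SPLITS; s++) {
        int w = m->owner[s];
        if (w != UNASSIGNED && m->alive[w])
            continue;
        int target = next_live_worker(m, cursor);
        m->owner[s] = target;
        if (target != UNASSIGNED)
            cursor = target + 1; /* spread work over the live workers */
    }
    return newly_dead;
}
```

In a single-system version, the master process could call master_heartbeat whenever a worker process reports in and call master_sweep on a timer. Detection is done in a separate pass before reassignment so that a split is never handed to a worker that is about to be recorded dead in the same sweep.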