18845 Group Project abstract
Title: MapReduce Application and Evaluation
Authors: Jipeng Han (jipengh) and Xia Wu (xiaw)
Working with large datasets is a major and time-consuming problem in
machine learning. Training requires processing large amounts of data,
which often takes several days to complete; even the smallest datasets
may need several hours to finish. From our research, we found that
machine learning algorithms that fit the Statistical Query model can
be written in summation form and can therefore be easily parallelized.
The aim of this project is to speed up several machine learning
algorithms using MapReduce, the programming model proposed by Google
for parallel computation over large-scale datasets.
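
To illustrate what "written in summation form" means, here is a minimal
sketch in plain Python (not Hadoop code) for one-variable linear
regression. It assumes the data is already split into partitions; the
names map_partition, combine, and fit_line are hypothetical and chosen
only for this example. Each partition is reduced to a few partial
sums, the partial sums are added together, and the fit is computed
from the totals.

```python
from functools import reduce

def map_partition(points):
    """Map step: collapse one partition of (x, y) pairs into partial sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in points:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Reduce step: add two tuples of partial sums element-wise."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(partitions):
    """Least-squares slope and intercept from the combined sums."""
    n, sx, sy, sxx, sxy = reduce(combine, map(map_partition, partitions))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Toy data on the line y = 2x + 1, split into three partitions.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
partitions = [data[0:3], data[3:7], data[7:10]]
slope, intercept = fit_line(partitions)
```

Because the map step touches each partition independently and the
reduce step is a plain addition, the partitions could in principle be
processed on different nodes without changing the fitted line.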
In this project, we plan to implement the following learning
algorithms using Hadoop: K-means, Naive Bayes, SVM, and Linear
Regression. We will compare their results with those of traditional
sequential implementations; specifically, we expect to show that the
two approaches produce the same results while the MapReduce versions
run substantially faster. We will also compare performance as the
number of computing nodes in the MapReduce cluster changes.
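
As a rough sketch of the planned correctness comparison, the plain
Python below (not Hadoop code) runs one K-means update step two ways:
sequentially, and as a map phase over partitions followed by a reduce
phase. The function names, the toy points, and the starting centroids
are all assumptions made for this illustration.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def map_phase(points, centroids):
    """Map step: emit (cluster index, (point, 1)) for every point."""
    return [(nearest(p, centroids), (p, 1)) for p in points]

def reduce_phase(pairs, centroids):
    """Reduce step: sum vectors and counts per cluster, then average."""
    totals = {}
    for idx, (vec, count) in pairs:
        if idx in totals:
            acc, n = totals[idx]
            totals[idx] = ([a + v for a, v in zip(acc, vec)], n + count)
        else:
            totals[idx] = (list(vec), count)
    updated = []
    for i, old in enumerate(centroids):
        if i in totals:
            acc, n = totals[i]
            updated.append(tuple(a / n for a in acc))
        else:
            updated.append(tuple(old))  # empty cluster keeps its centroid
    return updated

def kmeans_step_mapreduce(partitions, centroids):
    """One update step: map each partition, then reduce all emitted pairs."""
    pairs = []
    for part in partitions:
        pairs.extend(map_phase(part, centroids))
    return reduce_phase(pairs, centroids)

def kmeans_step_sequential(points, centroids):
    """One update step computed directly over the whole dataset."""
    members = [[] for _ in centroids]
    for p in points:
        members[nearest(p, centroids)].append(p)
    return [tuple(sum(col) / len(m) for col in zip(*m)) if m else tuple(c)
            for m, c in zip(members, centroids)]

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
partitions = [[points[0], points[2]], [points[1], points[3]]]

seq_result = kmeans_step_sequential(points, centroids)
mr_result = kmeans_step_mapreduce(partitions, centroids)
```

On this toy input the two update steps agree exactly. With real data,
floating-point summation order can differ between the two versions, so
a comparison within a small tolerance may be more appropriate than
strict equality.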