Undergrad Research Project - Distributed Image Processing Database with Spark

Spring 2015

Xiaofan Li
Kayvon Fatahalian
Project description

We aim to create a distributed database of images that uses the RDD (Resilient Distributed Dataset) abstraction to run high-performance image processing and machine learning algorithms on large datasets. Such a system could be useful in many real-world applications, such as image search, 3D reconstruction, facial recognition, or simply applying filters to billions of pictures. We will first use the Spark framework and HDFS (Hadoop Distributed File System) to interact with the dataset and build the APIs needed to run currently available algorithms in the cloud. We will then take a closer look at the low-level details of Spark and, where possible, improve its performance by modifying Spark or by adding specialized hardware to nodes in the cluster. The anticipated result is a system that supports image processing on large datasets with both reliability guarantees and good performance, to a degree that was not previously achievable. Required skills include an understanding of large parallel/distributed systems, the ability to analyze and improve performance, coding skills in C/C++ and Java/Scala, and an understanding of image processing and machine learning algorithms.
