Stream Processing Design and Implementation

Anuj Patel

Carnegie Mellon University
anujp@andrew.cmu.edu

ABSTRACT

Distributed streaming architectures are becoming more and more
necessary in building effective realtime applications. Requirements of
these systems include efficient message passing, fault tolerance, and
keeping latency to a minimum. Existing solutions such as Google’s
MillWheel and LinkedIn’s Samza address these issues
differently. MillWheel uses a distributed time keeping service called
watermarks to minimize latency. Samza uses the publish-subscribe
service Kafka in order to allow for a consumer driven approach to
processing data to give low latency . This is a great advantage for
companies who lack MillWheel’s complex supporting storage, BigTable.
However, it is does not provide the same set of features that can be
created by keeping some notion of time in the system.  

This paper will
explain the different approaches and stages to stream processing and
evaluate a new design drawing from both services. In particular, we
will see how an optimized messaging system, a fault tolerant log, a
timer monitor service, and a simple API can be used to build streaming
applications. Our results show that different messaging designs, fault
tolerance techniques, and monitoring methods can have a significant
impact on performance.