Stream Processing Design and Implementation Anuj Patel Carnegie Mellon University anujp@andrew.cmu.edu ABSTRACT Distributed streaming architectures are becoming more and more necessary in building effective realtime applications. Requirements of these systems include efficient message passing, fault tolerance, and keeping latency to a minimum. Existing solutions such as Google’s MillWheel and LinkedIn’s Samza address these issues differently. MillWheel uses a distributed time keeping service called watermarks to minimize latency. Samza uses the publish-subscribe service Kafka in order to allow for a consumer driven approach to processing data to give low latency . This is a great advantage for companies who lack MillWheel’s complex supporting storage, BigTable. However, it is does not provide the same set of features that can be created by keeping some notion of time in the system. This paper will explain the different approaches and stages to stream processing and evaluate a new design drawing from both services. In particular, we will see how an optimized messaging system, a fault tolerant log, a timer monitor service, and a simple API can be used to build streaming applications. Our results show that different messaging designs, fault tolerance techniques, and monitoring methods can have a significant impact on performance.