ABSTRACT
Stream processing has emerged as a paradigm for applications that require low-latency evaluation of operators over unbounded sequences of data. Defining the semantics of stream processing is challenging in the presence of distributed data sources. The physical and logical order of data in a stream may become inconsistent in such a setting. Existing models either neglect these inconsistencies or handle them by means of data buffering and reordering techniques, thereby compromising processing latency.
In this paper, we introduce the Dual Streaming Model to reason about physical and logical order in data stream processing. This model presents the result of an operator as a stream of successive updates, which induces a duality of results and streams. As such, it provides a natural way to cope with inconsistencies between the physical and logical order of streaming data in a continuous manner, without explicit buffering and reordering. We further discuss the trade-offs and challenges faced when implementing this model in terms of correctness, latency, and processing cost. A case study based on Apache Kafka illustrates the effectiveness of our model in the light of real-world requirements.
- Daniel Abadi et al. 2003. Aurora: A New Model and Architecture for Data Stream Management. The VLDB Journal 12, 2 (2003), 120--139. Google ScholarDigital Library
- Daniel Abadi et al. 2005. The Design of the Borealis Stream Processing Engine. In CIDR, 2nd Biennial Conf. on Innovative Data Systems Research. 277--289.Google Scholar
- Tyler Akidau et al. 2013. MillWheel: Fault-tolerant Stream Processing at Internet Scale. Proc. VLDB Endow. 6, 11 (2013), 1033--1044. Google ScholarDigital Library
- Tyler Akidau et al. 2015. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-scale, Unbounded, Out-of-order Data Processing. Proc. VLDB Endow. 8, 12 (2015), 1792--1803. Google ScholarDigital Library
- Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2003. CQL: A Language for Continuous Queries over Streams and Relations. In Database Programming Languages, 9th Int. WS. 1--19.Google Scholar
- Brian Babcock et al. 2002. Models and Issues in Data Stream Systems. In Proc. of the 21st ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Systems. 1--16. Google ScholarDigital Library
- Shivnath Babu and Jennifer Widom. 2001. Continuous Queries over Data Streams. SIGMOD Records 30, 3 (2001), 109--120. Google ScholarDigital Library
- Roger Barga et al. 2007. Consistent Streaming Through Time: A Vision for Event Stream Processing. In CIDR, 3rd Biennial Conf. on Innovative Data Systems Research. 363--374.Google Scholar
- Jose A. Blakeley, Per-Ake Larson, and Frank Wm Tompa. 1986. Efficiently Updating Materialized Views. SIGMOD Record 15, 2 (1986), 61--71. Google ScholarDigital Library
- Badrish Chandramouli et al. 2014. Trill: A High-performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (2014), 401--412. Google ScholarDigital Library
- Gianpaolo Cugola and Alessandro Margara. 2012. Processing flows of information: From data stream to complex event processing. ACM Comput. Surv. 44, 3 (2012), 15:1--15:62. Google ScholarDigital Library
- Nihal Dindar et al. 2013. Modeling the Execution Semantics of Stream Processing Engines with SECRET. The VLDB Journal 22, 4 (Aug. 2013), 421--446. Google ScholarDigital Library
- Jim Gray et al. 1997. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery 1, 1 (1997), 29--53. Google ScholarDigital Library
- H. V. Jagadish, Inderpal Singh Mumick, and Abraham Silberschatz. 1995. View Maintenance Issues for the Chronicle Data Model (Extended Abstract). In Proc. of the 14th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems. 113--124. Google ScholarDigital Library
- Namit Jain et al. 2008. Towards a Streaming SQL Standard. Proc. VLDB Endow. 1, 2 (2008), 1379--1390. Google ScholarDigital Library
- Jay Kreps, Neha Narkhede, and Jun Rao. 2011. Kafka: A distributed messaging system for log processing. In Proceedings of the NetDB. 1--7.Google Scholar
- Sailesh Krishnamurthy et al. 2010. Continuous Analytics over Discontinuous Streams. In Proc. of the 2010 ACM SIGMOD Int. Conf. on Management of Data. 1081--1092. Google ScholarDigital Library
- Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. 2004. Query Languages and Data Models for Database Sequences and Data Streams. In Proc. of the 13th Int. Conf. on Very Large Data Bases. 492--503. Google ScholarDigital Library
- James Lewis and Martin Fowler. 2014. Microservices: a definition of this new architectural term. https://www.martinfowler.com/articles/microservices.htmlGoogle Scholar
- Jin Li et al. 2005. Semantics and Evaluation Techniques for Window Aggregates in Data Streams. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data. 311--322. Google ScholarDigital Library
- Jin Li et al. 2008. Out-of-order Processing: A New Architecture for High-performance Stream Systems. Proc. VLDB Endow. 1, 1 (2008), 274--288. Google ScholarDigital Library
- Ling Liu, Calton Pu, and Wei Tang. 1999. Continual Queries for Internet Scale Event-Driven Information Delivery. IEEE Transactions on Knowledge Data Engineering 11, 4 (1999), 610--628. Google ScholarDigital Library
- Ling Liu et al. 1996. Differential Evaluation of Continual Queries. In Proc. of the 16th Int. Conf. on Distributed Computing Systems. 458--465. Google ScholarDigital Library
- David Maier et al. 2005. Semantics of Data Streams and Operators. In Proc. of the 10th Int. Conf. on Database Theory. 37--52. Google ScholarDigital Library
- Utkarsh Srivastava and Jennifer Widom. 2004. Flexible Time Management in Data Stream Systems. In Proc. of the 23rd ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Systems. 263--274. Google ScholarDigital Library
- Douglas Terry et al. 1992. Continuous Queries over Append-only Databases. In Proc. of the 1992 ACM SIGMOD Int. Conf. on Management of Data. 321--330. Google ScholarDigital Library
- Peter Tucker et al. 2003. Exploiting Punctuation Semantics in Continuous Data Streams. IEEE Transactions on Knowledge Data Engineering 15, 3 (2003), 555--568. Google ScholarDigital Library
Index Terms
- Streams and Tables: Two Sides of the Same Coin
Recommendations
Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka
SIGMOD '21: Proceedings of the 2021 International Conference on Management of DataAn increasingly important system requirement for distributed stream processing applications is to provide strong correctness guarantees under unexpected failures and out-of-order data so that its results can be authoritative (not needing complementary ...
Dual-Paradigm Stream Processing
ICPP '18: Proceedings of the 47th International Conference on Parallel ProcessingExisting stream processing frameworks operate either under data stream paradigm processing data record by record to favor low latency, or under operation stream paradigm processing data in micro-batches to desire high throughput. For complex and mutable ...
Pre-processing and data validation in IoT data streams
DEBS '20: Proceedings of the 14th ACM International Conference on Distributed and Event-based SystemsIn the last few years, distributed stream processing engines have been on the rise due to their crucial impacts on real-time data processing with guaranteed low latency in several application domains such as financial markets, surveillance systems, ...
Comments