December 30, 2019


Stream analytics with SQL on Apache Flink


Although the most widely used language for data analysis, SQL is only slowly being adopted by open source stream processors. One reason is that SQL's semantics and syntax were not designed with streaming data in mind. Fabian Hueske explores Apache Flink's two relational APIs for streaming analytics, standard SQL and the LINQ-style Table API, discussing their semantics and showcasing their usage.

Talk Title Stream analytics with SQL on Apache Flink
Speakers Fabian Hueske (data Artisans)
Conference Strata Data Conference
Conf Tag Make Data Work
Location New York, New York
Date September 26-28, 2017

SQL is undoubtedly the most widely used language for data analytics. It is declarative; many database systems and query processors feature advanced query optimizers and highly efficient execution engines; and, last but not least, it is the standard that everybody knows and uses. With stream processing technology becoming mainstream, a question arises: why isn't SQL widely supported by open source stream processors? One answer is that SQL's semantics and syntax were not designed with the characteristics of streaming data in mind. Consequently, systems that want to support SQL on data streams have to bridge a conceptual gap.

Apache Flink is a distributed stream processing system. Thanks to its support for event-time processing, exactly-once state semantics, and high throughput, Flink is very well suited for streaming analytics. Flink features two relational APIs for unified stream and batch processing: the Table API and SQL. The Table API is a language-integrated relational API, while the SQL interface is compliant with standard SQL. Both APIs are semantically compatible and share the same optimization and execution path, based on Apache Calcite. A core principle of both APIs is to provide the same semantics for batch and streaming data sources: a query should compute the same result regardless of whether it runs on a static dataset, such as a file, or on a data stream, such as a Kafka topic.

Fabian Hueske offers an overview of the semantics of Apache Flink's relational APIs for stream analytics. Both APIs are centered around the concept of dynamic tables, which are defined on data streams and behave like regular database tables. Queries on dynamic tables produce new dynamic tables, which are similar to materialized views as known from relational database systems.
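To picture what a query on a dynamic table means, consider a minimal sketch (a hypothetical illustration in plain Python, not Flink's API): a continuous GROUP BY count over a stream maintains an ever-updating result table, and running the "same query" over the complete stream as a static dataset yields the same final answer.

```python
from collections import Counter

def continuous_count(stream):
    """Simulate a continuous 'SELECT user, COUNT(*) ... GROUP BY user' query.

    After each input row, yield a snapshot of the dynamic result table,
    mimicking how a query on a dynamic table produces an updating result.
    """
    counts = Counter()
    for user in stream:
        counts[user] += 1
        yield dict(counts)

stream = ["alice", "bob", "alice"]
snapshots = list(continuous_count(stream))
# The last snapshot equals the result of the batch query over the
# full dataset, illustrating the unified batch/stream semantics:
# snapshots[-1] == {"alice": 2, "bob": 1}
```

The names `continuous_count` and the sample data are invented for illustration; the point is only that the streaming result converges to the batch result.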
Fabian demonstrates how dynamic tables are defined on streams, how dynamic tables are queried, and how the results are converted back into changelog streams or written as materialized views to external systems, such as Apache Kafka or Apache Cassandra, and updated in place with low latency.
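One way to picture the conversion of an updating result back into a changelog stream is a retract-style encoding: whenever a row of the dynamic table changes, the old version is retracted before the new one is inserted. The sketch below is a hypothetical plain-Python illustration of that idea, not Flink's actual API.

```python
def to_retract_stream(updates):
    """Encode per-key updates as a retract-style changelog.

    For each (key, value) update, emit ('-', key, old_value) to retract
    the previous row (if any), then ('+', key, value) to insert the new one.
    """
    current = {}
    changelog = []
    for key, value in updates:
        if key in current:
            changelog.append(("-", key, current[key]))
        changelog.append(("+", key, value))
        current[key] = value
    return changelog

# Updating alice's count from 1 to 2 retracts the old row first:
log = to_retract_stream([("alice", 1), ("bob", 1), ("alice", 2)])
# log == [("+", "alice", 1), ("+", "bob", 1),
#         ("-", "alice", 1), ("+", "alice", 2)]
```

A downstream system such as Kafka would receive this changelog, while a system that supports in-place updates by key, such as Cassandra, could apply only the latest `+` row per key.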
