This article explains in detail how FlinkSQL is introduced in Cloudera Streaming Analytics. It is shared here for reference; I hope you will have a good understanding of the topic after reading it.
Version 1.2.0.0 of Cloudera Streaming Analytics, powered by Apache Flink, delivers a wide range of new features, including support for lineage and metadata tracking via Apache Atlas, support for connecting to Apache Kudu, and the long-awaited first iteration of the FlinkSQL API.
Flink's SQL interface democratizes stream processing, because it caters to a much larger community than the currently widely used Java and Scala APIs, which target data engineers. Extending SQL to stream processing and streaming analytics use cases poses a set of challenges: we have to express unbounded streams and the timeliness of records. Consider the following query:

SELECT
  userId,
  COUNT(*) AS count,
  SESSION_START(clicktime, INTERVAL '30' MINUTE)
FROM clicks
GROUP BY
  SESSION(clicktime, INTERVAL '30' MINUTE),
  userId;

This query produces a click count per user session, where a session is defined by 30 minutes of inactivity, and the count is updated in real time as new sessions are encountered. This is an example of a concept well established in stream processing; here, the session window is introduced into the SQL syntax to express the timeliness of records. It is important to emphasize that the syntax Flink supports is ANSI SQL, not a specific dialect. In fact, the Flink community is working together with the Apache Beam and Apache Calcite communities to tackle the challenges of FlinkSQL in a unified manner. Looking at the query above, it is clear that a much larger user base can now develop queries effectively and thereby add value to the enterprise. However, this raises the following questions for an organization:
1) How much of the business logic in the streaming domain can be developed in SQL?
2) How does this change the streaming journey from development to production?
3) How does this affect the scope of the data engineering team?
We believe that most of the streaming queries written today can be expressed through FlinkSQL. To venture an educated guess, we would put it at around 80% of the streaming queries encountered today being suitable for implementation via this SQL API. At first this may sound like an exaggeration; we will cover it in more detail in the next section. Today we often encounter organizations using Flink in which deriving business value from streams in near real time is the prerogative of data engineers. Data analysts, who are usually the domain-specific experts, instead tend to work with snapshots of these streams stored in standard MPP or OLAP systems, for example by querying data stored in Kudu through Apache Impala. This effectively puts a wall between the search for insight and its productionization as a streaming job. After confirming their hypotheses, analysts must engage several data engineers and secure weeks or even months of project funding to carefully re-implement, in another language, business logic that was typically developed in SQL. FlinkSQL enables analysts to interact with streams directly and deploy streaming jobs at the click of a button. This, in turn, frees data engineers to focus on the challenging 20% of queries and on building reusable domain-specific libraries that can be invoked directly from SQL as sets of user-defined functions, as sketched below.
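To illustrate that last point, here is a minimal sketch of how a domain-specific Java function might be exposed to SQL users. The function name mask_user and the class com.example.udf.MaskUser are hypothetical, and the CREATE FUNCTION DDL assumes a Flink version that supports it:

-- Hypothetical: register a Java scalar function so that analysts
-- can call domain logic directly from SQL.
CREATE FUNCTION mask_user AS 'com.example.udf.MaskUser' LANGUAGE JAVA;

-- The registered function can then be used like any built-in function.
SELECT userId, mask_user(userId) AS maskedUser
FROM clicks;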
Features of FlinkSQL

To demonstrate the functionality of FlinkSQL, we recently released SQL tutorials under our standard tutorial suite. Let's focus on a few of the features here.
We will operate on an Apache Kafka topic that contains transaction entries in JSON format. Let's define a table schema for it and specify that we want to measure the passage of time by the timestamp column (so-called event-time semantics):

CREATE TABLE ItemTransactions (
  transactionId BIGINT,
  `timestamp` BIGINT,
  itemId STRING,
  quantity INT,
  event_time AS CAST(from_unixtime(floor(`timestamp` / 1000)) AS TIMESTAMP(3)),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector.type' = 'kafka',
  'connector.version' = 'universal',
  'connector.topic' = 'transaction.log.1',
  'connector.startup-mode' = 'earliest-offset',
  'connector.properties.bootstrap.servers' = '',
  'format.type' = 'json'
);

Note that when using event-time semantics we must specify a watermark, which gives Flink a heuristic for measuring the passage of event time; it can be any expression that returns a timestamp. At a high level, the watermark expresses a trade-off between correctness (waiting indefinitely for potentially late records to arrive) and latency (producing results as quickly as possible). After creating the above table, we can submit the following queries:
SELECT * FROM ItemTransactions LIMIT 10;

SELECT
  TUMBLE_START(event_time, INTERVAL '10' SECOND) AS window_start,
  itemId,
  SUM(quantity) AS volume
FROM ItemTransactions
GROUP BY itemId, TUMBLE(event_time, INTERVAL '10' SECOND);

The first query provides a direct sample of the data. The LIMIT clause is optional; omitting it causes the result to be continuously updated as a stream. The second query implements a simple windowed aggregation. The results of these queries can be returned to the interactive Flink SQL CLI, or written directly to an output table with an INSERT INTO statement (a sketch of this follows the next example). FlinkSQL also supports more complex clauses; for example, you can use the following query to find the top 3 items with the most transactions in every 10-minute window:
SELECT * FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY window_start ORDER BY num_transactions DESC) AS rownum
  FROM (
    SELECT
      TUMBLE_START(event_time, INTERVAL '10' MINUTE) AS window_start,
      itemId,
      COUNT(*) AS num_transactions
    FROM ItemTransactions
    GROUP BY itemId, TUMBLE(event_time, INTERVAL '10' MINUTE)
  )
)
WHERE rownum <= 3;
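As for writing results to an output table, the following is a minimal sketch. The sink table ItemVolumes, its topic name, and the connector properties are illustrative assumptions, not taken from the tutorial:

-- Hypothetical sink table; the connector properties are illustrative.
CREATE TABLE ItemVolumes (
  window_start TIMESTAMP(3),
  itemId STRING,
  volume INT
) WITH (
  'connector.type' = 'kafka',
  'connector.version' = 'universal',
  'connector.topic' = 'transaction.volumes.1',
  'connector.properties.bootstrap.servers' = '',
  'format.type' = 'json'
);

-- Continuously write the windowed aggregation into the sink table
-- instead of returning it to the interactive CLI.
INSERT INTO ItemVolumes
SELECT
  TUMBLE_START(event_time, INTERVAL '10' SECOND) AS window_start,
  itemId,
  SUM(quantity) AS volume
FROM ItemTransactions
GROUP BY itemId, TUMBLE(event_time, INTERVAL '10' SECOND);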