SLTechnology News & Howtos > Internet Technology
Shulou (Shulou.com), 06/03 report (updated 2025-04-10)
1. Static Data and Stream Data
Static data: data at rest, such as the large volume of historical data held in a data warehouse built to support decision analysis.
Stream data: data that arrives continuously as a large, fast, time-varying stream (for example, logs generated in real time, or live user transaction records).
Stream data has the following characteristics:
(1) Data arrives quickly and continuously, and the total volume is potentially unbounded.
(2) There are many data sources, with complex formats.
(3) The volume is large, but long-term storage matters little: once processed, data is either discarded or archived (stored in a data warehouse).
(4) What matters is the overall value of the data, not any individual record.
(5) Elements may arrive out of order or be incomplete; the system cannot control the order in which newly arrived elements are processed.
In the traditional processing model, data is first collected and stored in a database (DB), and only then is the data in the DB processed.
Stream computing: to preserve the timeliness of the data, the acquired data is consumed in real time as it arrives.
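The difference between the two models can be sketched in a few lines. This is an illustrative toy (all names are ours, not from any framework): both functions compute the same per-user counts, but the batch version only has a result after the whole collection finishes, while the stream version has an up-to-date result after every event.

```python
def event_source():
    """Simulated stream of page-view events."""
    for user in ["alice", "bob", "alice", "carol", "alice"]:
        yield {"user": user, "event": "page_view"}

def batch_count(events):
    """Batch model: collect everything first, then process the stored data."""
    stored = list(events)          # "collect first and put into the DB"
    counts = {}
    for e in stored:               # processing starts only after collection ends
        counts[e["user"]] = counts.get(e["user"], 0) + 1
    return counts

def stream_count(events):
    """Stream model: consume each event the moment it arrives."""
    counts = {}
    for e in events:
        counts[e["user"]] = counts.get(e["user"], 0) + 1
        # `counts` is already usable here, e.g. for a live dashboard
    return counts
```

Both return the same final answer; the timeliness of the intermediate state is what stream computing buys.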
2. Batch Computing and Stream Computing
Batch computing: processes static data with ample time, as Hadoop does; real-time requirements are low.
Stream computing: acquires massive data in real time from different sources and analyzes it in real time to extract valuable information (real-time, multiple data structures, massive scale).
Stream computing rests on the basic idea that the value of data decreases over time; consider a user clickstream. Events should therefore be processed immediately when they occur rather than cached for batch processing. Because stream data has complex formats, many sources, and enormous volume, batch computing is unsuitable; real-time computation with second-level response times is required. In short, batch computing emphasizes throughput, stream computing emphasizes real-time performance.
Characteristics of stream computing:
1. Real-time and unbounded data streams. Stream computing is real-time and streaming: stream data is subscribed to and consumed in the order in which it occurs. Because the data keeps being produced, the data stream flows into the stream computing system continuously over a long period. For example, a website's access-log stream keeps being generated and entering the stream computing system for as long as the site stays up. For a streaming system, therefore, the data is real-time and non-terminating (unbounded).
2. Continuous and efficient computation. Stream computing is an "event-triggered" computation mode, and the trigger source is the streaming data described above. As soon as new stream data arrives, the system immediately launches a computation task, so the whole stream computation is one continuous computation.
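Event-triggered computation typically means each arriving record performs a small incremental update rather than launching a full recomputation. A minimal sketch (our own toy class, not any framework's API): a running average whose state is updated in O(1) per event, so a fresh result is available the instant each record lands.

```python
class RunningAverage:
    """Each arriving event triggers an O(1) incremental update."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def on_event(self, value):
        # invoked once per arriving record ("event trigger" mode)
        self.count += 1
        self.total += value
        return self.total / self.count   # result available immediately

avg = RunningAverage()
results = [avg.on_event(v) for v in [10.0, 20.0, 30.0]]
# results -> [10.0, 15.0, 20.0]
```

Nothing is cached for a later batch run; the state carried between events is the whole "database".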
3. Streaming and real-time result integration. The result triggered by each round of stream computation can be written directly to the destination data store; for example, computed report data can be written straight to RDS for report display. The results of stream computation thus flow continuously into the destination store, just like the stream data itself.
3. Stream Computing Frameworks
To process stream data in time, a low-latency, scalable, and highly reliable processing engine is needed. A stream computing system should meet the following requirements:
High performance: the basic requirement for processing big data, e.g., hundreds of thousands of records per second.
Massive scale: support for TB- or even PB-level data volumes.
Real-time: guaranteed low latency, at the second or even millisecond level.
Distributed: the basic architecture for big data; it must scale smoothly.
Ease of use: fast development and deployment.
Reliability: the ability to process stream data reliably.
Currently there are three common categories of stream computing frameworks and platforms: commercial stream computing platforms, open-source stream computing frameworks, and frameworks developed by companies to support their own business.
(1) Commercial: InfoSphere Streams (IBM) and StreamBase (TIBCO).
(2) Open-source stream computing frameworks, represented by Storm (Twitter) and S4 (Yahoo!).
(3) Frameworks developed by companies to support their own business: Puma (Facebook), DStream (Baidu), and the Galaxy data processing platform (Taobao).
4. The Stream Computing Framework: Storm
Storm is an open-source distributed real-time big data processing framework from Twitter. As stream computing is applied ever more widely, Storm's popularity and importance keep growing. Below we introduce Storm's core components and compare its performance with other engines.
Core components of Storm
Nimbus: the Master of Storm, responsible for resource allocation and task scheduling. A Storm cluster has only one Nimbus.
Supervisor: the Slave of Storm, responsible for accepting tasks assigned by Nimbus and managing the Workers on its node. A Supervisor node runs multiple Worker processes.
Worker: a worker process; each Worker process runs multiple Tasks.
Task: a task. Every Spout and Bolt in a Storm cluster is executed by a number of tasks, and each task corresponds to one thread of execution.
Topology: a computation topology. A Storm topology encapsulates the logic of a real-time computing application. It plays much the same role as a MapReduce Job, except that a MapReduce Job eventually finishes once it produces its result, whereas a topology keeps running in the cluster until you terminate it manually. A topology can also be understood as a graph of Spouts and Bolts connected to one another through stream groupings.
Stream: the data stream (Streams) is the core abstraction in Storm. A data stream is a sequence of tuples created and processed in parallel in a distributed environment. A stream's schema defines the fields of the tuples it carries.
Spout: the data source (Spout) is the source of streams in a topology. Typically a Spout reads tuples from an external data source and emits them into the topology. Depending on requirements, a Spout can be defined as either reliable or unreliable: a reliable Spout can re-emit a tuple when that tuple fails, ensuring all tuples are processed correctly, whereas an unreliable Spout does no further processing once a tuple has been emitted. A single Spout can emit multiple streams.
Bolt: all data processing in a topology is done by Bolts. Through filtering, functions, aggregations, joins, database interaction, and so on, Bolts can satisfy almost any data processing need. A single Bolt can implement a simple stream transformation; more complex transformations usually require multiple Bolts and multiple steps.
Stream grouping: deciding the input streams of each Bolt is an important part of defining a topology. A stream grouping defines how a stream is partitioned among a Bolt's tasks. Storm has eight built-in stream groupings.
Reliability: Storm guarantees, per topology, that every emitted tuple is fully processed. By tracking the tuple tree produced by each tuple a Spout emits, it determines whether that tuple has completed processing. Each topology has a message timeout parameter; if Storm cannot confirm within the timeout that a tuple has completed processing, it marks the tuple as failed and re-emits it later.
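The Spout → Bolt pipeline and a fields grouping can be illustrated with a toy, single-process simulation. This is emphatically not Storm's API (real topologies are built with Storm's Java API, e.g. `TopologyBuilder`); all class and function names here are our own. The point it demonstrates is fields grouping: tuples with the same value in the grouping field are always routed to the same Bolt task, here via a stable hash, so per-key state like a word count never splits across tasks.

```python
class WordSpout:
    """Toy data source: emits tuples into the 'topology'."""
    def __init__(self, words):
        self.words = words
    def emit(self):
        for w in self.words:
            yield {"word": w}

class CountBolt:
    """Toy processing unit: keeps a running count per word."""
    def __init__(self):
        self.counts = {}
    def execute(self, tup):
        w = tup["word"]
        self.counts[w] = self.counts.get(w, 0) + 1

def fields_grouping(tup, field, num_tasks):
    # Stable hash: a given field value always maps to the same task index.
    return sum(ord(c) for c in str(tup[field])) % num_tasks

# Wire it up: one spout, two bolt tasks, grouped by the "word" field.
spout = WordSpout(["storm", "flink", "storm", "storm", "flink"])
bolts = [CountBolt(), CountBolt()]
for tup in spout.emit():
    bolts[fields_grouping(tup, "word", len(bolts))].execute(tup)

merged = {}
for b in bolts:
    merged.update(b.counts)   # each word lives in exactly one task
```

Because the routing is deterministic per field value, merging the tasks' partial counts needs no reconciliation; that is precisely what a fields grouping guarantees in Storm.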
(figure 1: Storm core components)
(figure 2:Storm programming model)
Comparison of mainstream stream computing engines
The popular real-time processing engines today are Storm, Spark Streaming, and Flink. Each has its own characteristics and application scenarios; the table below gives a simple comparison of the three.
(figure 3: performance comparison of mainstream engines)
Conclusion: the emergence of stream computing broadens our ability to meet complex real-time computing needs, and Storm, as a sharp tool of stream computing, greatly eases our applications. Stream computing engines are still evolving: engines such as JStorm and Blink, built on Storm and Flink respectively, have improved greatly in every aspect of performance. Stream computing deserves our continued attention.