
Basic concepts of Storm installation and testing


Introduction to Storm

Official website: http://storm.apache.org/

Storm is a free, open-source distributed real-time computation system. With it you can easily process streams of data in real time. Storm is very simple and can be used with any programming language.

Storm use cases: real-time online analytics, machine learning, continuous computation, distributed RPC, ETL, etc.

Characteristics of Storm: fast: a benchmark clocked each node at processing more than one million tuples (which you can think of as packets) per second.

Simple to operate: Storm is scalable and fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Storm real-time stream computing system

Comparison between a Storm cluster and a Hadoop cluster (MapReduce)

MapReduce is batch processing // Hadoop handles massive amounts of historical data, which cannot be done in real time.

Storm has no buffer: raw data flows straight into the processing system, which is stream processing // real-time stream computing that keeps running until it is stopped.

Topology and MapReduce jobs

A key difference is that a MapReduce job eventually finishes, while a topology runs forever (unless you kill it manually).

Nimbus (the master process for job control and resource management) and ResourceManager

There are two types of nodes in a Storm cluster: the control node (master node) and the worker nodes. A daemon called Nimbus runs on the control node; its role is similar to Hadoop's JobTracker // JobTracker is the single global manager of the whole cluster, and its responsibilities include job control and resource management.

Nimbus is responsible for distributing code in the cluster, assigning computing tasks to machines, and monitoring status.

Supervisor (worker process) and NodeManager (YarnChild)

Each worker node runs a daemon called Supervisor. The Supervisor listens for work assigned to its machine and starts or stops worker processes as needed. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.

The working mechanism of Storm real-time stream computing

2 roles

Nimbus is the master node of the cluster: it is responsible for distributing code in the cluster, assigning computing tasks to machines, and monitoring status. / / Job control and resource management

Supervisor is the slave role in the cluster: each worker node runs a daemon called Supervisor, and each Supervisor hosts worker processes running on that server; these workers do the actual work.

Nimbus and the Supervisors do not communicate with each other directly; coordination is done through a third-party tool, ZooKeeper.

The worker in the first supervisor calls a class we wrote (say, the water-collection step). After processing, the result is packaged up again inside this worker and sent to the next worker. The next worker takes the result passed to it by the previous worker, calls another piece of logic we wrote (say, the filtering class), processes it in that second worker, packages it into some data packet format, and sends it on to the next worker.

The next worker likewise does not know what to do on its own; it simply calls the logic our program provides (say, the precipitation class), and the processed data is again packaged into a packet and passed to the next worker.

In the last processing step, the results are continuously written to an in-memory database, and whoever needs the data can read it from there directly.

Summary:

The organization and coordination of the whole process does not need to be handled by the user; the user only needs to define the specific business processing logic for each step.

The role that actually executes the task is the worker; what a worker does while executing a task is determined by the business logic we define.

Summary of the Storm data processing flow

1. The client client submits the topology to the nimbus

2. Nimbus will put some information about task assignment on zookeeper.

3. Supervisor will get the assignment through zookeeper.

4. The Supervisor assigns the task to a worker, which then runs it

The official explanation is as follows: / / for those who do not understand the basic concepts here, see below.

This can also be called the topology running mechanism.

(1) After a topology is submitted, its code is first stored in the inbox directory of the Nimbus node. A stormconf.ser file, generated from the topology's run configuration, is then placed in the stormdist directory of the Nimbus node, together with the serialized topology code file.

(2) When defining the Spouts and Bolts of a Topology, you can set the number of executors and the number of tasks of each Spout and Bolt at the same time. By default, the total number of tasks in a Topology equals the total number of executors. The system then distributes these tasks evenly across the workers; which supervisor node a worker runs on is decided by Storm itself.

(3) After the tasks have been assigned, the Nimbus node submits the task assignment information to the ZooKeeper cluster. There is also a workerbeats node in the ZooKeeper cluster, which stores the heartbeat information of all worker processes of the current Topology.

(4) The Supervisor nodes constantly poll the ZooKeeper cluster. All Topology task assignments, code storage directories, relationships between tasks, and so on are stored in ZooKeeper's assignments node; each Supervisor polls this node to obtain its own tasks and starts worker processes to run them.

(5) Once a Topology is running, the Spouts keep emitting streams and the Bolts keep processing the streams they receive; the stream is unbounded.

The last step will be carried out without interruption unless the Topology is manually terminated.

There are several points that need to be explained:

(1) the constructor and declareOutputFields methods of each component (Spout or Bolt) are called only once.

(2) The open method and the prepare method may be called multiple times. The parallelism parameter set in setSpout or setBolt in the entry (main) function is the number of executors, i.e. the number of threads responsible for running that component's tasks, and it is also the number of times the two methods above are called: once per executor when it starts. They are the equivalent of a thread's constructor. (A minimal sketch of a spout and a bolt follows this list.)

(3) The nextTuple method and the execute method run continuously: nextTuple keeps emitting tuples, and a Bolt's execute keeps receiving and processing tuples. Only by running continuously in this way can an unbounded tuple stream be produced, which is what makes the processing real-time. They are the equivalent of a thread's run method.

(4) After a topology is submitted, Storm creates the spout/bolt instances and serializes them. The serialized components are then sent to the machines where their tasks are located (that is, the Supervisor nodes) and deserialized for each task.

(5) Communication between Spouts and Bolts, and between Bolts, is implemented through ZeroMQ message queues.

(6) The ack method and the fail method have not been covered above. After a Tuple has been processed successfully, the ack method must be called to mark it as succeeded; otherwise the fail method marks it as failed and the Tuple is reprocessed.
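
To make these method names concrete, here is a minimal, illustrative sketch of one spout and one bolt. It assumes the Storm 0.9.x Java API (the backtype.storm packages shipped with the apache-storm-0.9.2-incubating release used later in this article); the class names and the "sentence" field are made up for the example.

import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

// Spout: the message source of the topology.
public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    // open: called once per executor when that executor starts (like a thread's constructor).
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    // nextTuple: called over and over for as long as the topology runs.
    public void nextTuple() {
        collector.emit(new Values("hello storm"));
        Utils.sleep(1000); // slow the example down; a real spout would read from a data source
    }

    // declareOutputFields: called only once; declares the schema (field names) of the emitted tuples.
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}

// Bolt: a processing logic unit. BaseBasicBolt acks each tuple automatically when execute
// returns normally (throw FailedException to fail one), so ack/fail need not be called by hand.
class UpperBolt extends BaseBasicBolt {

    // execute: called for every tuple received from the upstream component. prepare (not
    // overridden here) plays the same once-per-executor role that open plays for a spout.
    public void execute(Tuple input, BasicOutputCollector collector) {
        String sentence = input.getStringByField("sentence");
        collector.emit(new Values(sentence.toUpperCase()));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("upper_sentence"));
    }
}

BaseRichSpout and BaseBasicBolt are convenience base classes; with the lower-level IRichBolt interface you would call ack and fail on the OutputCollector yourself.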

Terminating a Topology

Terminate the operation of a Topology by using the following command on the Nimbus node:

bin/storm kill topologyName

After the kill, you can watch the topology's status through the UI: it first changes to KILLED, and once the local directory and the information related to this Topology in the ZooKeeper cluster have been cleaned up, the Topology disappears completely.

Summary of the role of ZooKeeper in Storm

1. Nimbus will put some information about task assignment on zookeeper.

2. Supervisor will get the assignment through zookeeper.

3. Nimbus needs to sense the health status of the supervisors through zookeeper.

The concept of a Topology is similar to a job submitted in MapReduce

There will be multiple worker processes on each supervisor

There are several executor threads running in each worker process

Several tasks of the same component run in each executor (see the parallelism sketch below)
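
As a rough illustration of this worker / executor / task hierarchy, the sketch below (reusing the illustrative SentenceSpout and UpperBolt from the earlier example; the topology name parallelism-demo is made up) shows where each of these numbers is set in the Java API.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class ParallelismExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // 1 executor (thread) for the spout
        builder.setSpout("sentence-spout", new SentenceSpout(), 1);

        // 2 executors for the bolt, 4 tasks in total: each executor runs 2 tasks
        builder.setBolt("upper-bolt", new UpperBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("sentence-spout");

        Config conf = new Config();
        // 2 worker processes for this topology, spread over the supervisors' slots
        conf.setNumWorkers(2);

        StormSubmitter.submitTopology("parallelism-demo", conf, builder.createTopology());
    }
}

By default the number of tasks equals the number of executors; setNumTasks only matters when you want more tasks than threads.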

Deploying a Storm cluster

Nimbus and the Supervisors in Storm rely on ZooKeeper, so make sure ZooKeeper is installed before installing Storm.

Configuration and deployment of Storm

Download Storm and upload it to the Linux machine.

After decompressing, we modify the configuration in the conf directory.

cd conf/

vi storm.yaml

Tell Storm which machines ZooKeeper is deployed on:

storm.zookeeper.servers:
- "hadoop-server-00"
- "hadoop-server-01"
- "hadoop-server-02"

Tell Storm which host Nimbus runs on:

nimbus.host: "hadoop-server-00"

Save and exit.

Supervisors do not need to be specified; their number can be increased or decreased dynamically.

Then distribute the installation to every machine:

scp -r apache-storm-0.9.2-incubating/ hadoop-server-01:/usr/local/apps/

scp -r apache-storm-0.9.2-incubating/ hadoop-server-02:/usr/local/apps/

To start storm, start zookeeper first.

Go to the bin directory of zookeeper and start zookeeper

./zkServer.sh start

./zkServer.sh status (check its status)

Start Storm

In the bin directory:

./storm nimbus (start nimbus on the machine where nimbus.host is configured)

Start Supervisor on the other two machines

In the bin directory on the 01 machine:

./storm supervisor

In the bin directory on the 02 machine:

./storm supervisor

You can check the running processes with jps.

Storm can also be monitored through a web page, but you must start the process that serves the web interface, and it must be started on the machine that runs nimbus.

The command to start this service: cd into the Storm installation directory and run bin/storm ui directly.

jps shows the process started by ./storm ui.

The ui process is named core.

Now we can see the status of Storm through the web page:

http://hadoop-server-00:8080

Summary: // starting everything as background processes.

On the nimbus host

// start nimbus, which coordinates and manages the cluster

./storm nimbus 1>/dev/null 2>&1 &

// start the web management interface. Once it is running, you can visit it at http://<nimbus hostname>:8080.

./storm ui 1>/dev/null 2>&1 &

On the supervisor host

./storm supervisor 1>/dev/null 2>&1 &

Slots represent worker processes, the processes started inside a supervisor; 4 are started by default. If your machine has plenty of CPU cores, you can modify the configuration to increase the number of slots.

If nothing is specified, the number of workers defaults to 4.

(All processes need to be stopped before changing the configuration; press Ctrl+C to stop a process.)

Finally, add the following configuration item with vi storm.yaml (write it at the top of the file):

supervisor.slots.ports:
- 6701
- 6702
- 6703
- 6704
- 6705
- 6706

// these numbers are the ports used by the worker processes

Save and exit.

After that, we will distribute this configuration file to the other two machines

scp storm.yaml hadoop-server-01:/usr/local/apps/storm (installation directory)/conf/

scp storm.yaml hadoop-server-02:/usr/local/apps/storm (installation directory)/conf/

Now the maximum number of workers on each node is 6.

Start Storm as background processes

On the 00 machine

bin/storm nimbus 1>/dev/null 2>&1 & (that is, start nimbus with standard output redirected to /dev/null; 2>&1 redirects standard error to the same place, and the trailing & runs the command as a background process)

On the 01 and 02 machines

bin/storm supervisor 1>/dev/null 2>&1 &

Note: if the process exits with an error, we can look at the log file:

cd logs/

ll

less supervisor.log

On the 00 machine

bin/storm ui 1>/dev/null 2>&1 &

(in order to be visible on the web page, we have to start ui on the machine that starts nimbus.)

Let's switch to zookeeper and open the zookeeper client.

cd /apps/zookeeper (installation directory)/bin

./zkCli.sh

You will find a node of storm.

ls /storm

You will see the nodes under /storm.

ls /storm/supervisors

You will see the nodes under supervisors; each supervisor has a corresponding id that matches the id shown on the web page.

Configuration summary:

Storm-related configuration items

Several options commonly used in storm.yaml

storm.zookeeper.root

The root directory used by Storm in the ZooKeeper cluster; the default is "/storm".

topology.workers

The default number of workers per running Topology; it is overridden if the value is set in the code.

storm.zookeeper.servers

List of nodes for zookeeper cluster

storm.local.dir

Local storage directory where Storm stores jar packages and temporary files

ui.port

The UI address and port number of the Storm cluster. Default is 8080.

nimbus.host

Host of the Nimbus node

supervisor.slots.ports

The worker slots of a Supervisor node; they are shared by all Topologies in the cluster. Even if a larger number of workers is requested at submission time, the system allocates them according to the slots actually remaining in the cluster. When all slots are taken, a newly submitted Topology can only wait; the system keeps watching for vacant slots and assigns them to the waiting Topology once they become available.

Basic programming concepts of Storm

Topology: a topology can be thought of as a task, but once started it never stops. It is similar to a job in MapReduce, except that a job stops automatically once it has finished its work.

A topology is composed of spouts and bolts.

Spouts: the message sources of the topology, similar to map in MapReduce; they read from the data source (fetch the data) for subsequent processing.

Bolts: the processing logic units of the topology (the components after the spouts are called bolts). Bolts can be arranged in many stages handling different functions, similar to MapReduce's reduce, except that the bolts can form any number of data-processing stages.

Tuple: message tuple // the unit in which data is passed from spouts to bolts and between bolts. Once data is encapsulated it is called a tuple; the framework uses tuples to move data from spouts to bolts.

A tuple can carry multiple fields, and each field can be given a name.

// Data transferred between spout and bolt components must be encapsulated in tuples. A tuple can define a schema that specifies which fields it carries.

The route along which data is transferred between components is called a stream.

Stream: stream / / data flow direction

Stream grouping: the grouping policy within a stream, also called the data distribution policy. It can be understood like the shuffle phase in MapReduce: it defines the rules for distributing data between the running instances of two connected components.

It is analogous to the partitioning strategy between map tasks and reduce tasks in MapReduce (and there are many such strategies).

Tasks: task processing unit

Executor: an executor thread (a thread running inside a worker)

Workers: worker processes (each worker is a multithreaded process)

Tasks run inside executors; executors run inside workers.

Configuration: the configuration of a topology. (A combined sketch of these concepts follows.)
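
Here is a combined sketch of these concepts, again reusing the illustrative SentenceSpout and UpperBolt from the lifecycle example (the "sentence" field name and the topology name grouping-demo are made up). It shows named tuple fields and two stream grouping choices.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class GroupingExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // the spout declares its field names once in declareOutputFields ("sentence")
        builder.setSpout("sentence-spout", new SentenceSpout());

        // shuffle grouping: tuples are distributed randomly and evenly among this bolt's executors
        builder.setBolt("shuffle-bolt", new UpperBolt(), 2)
               .shuffleGrouping("sentence-spout");

        // fields grouping: tuples with the same value of the named field "sentence" always go to
        // the same executor, much like the partitioning between map tasks and reduce tasks
        builder.setBolt("fields-bolt", new UpperBolt(), 2)
               .fieldsGrouping("sentence-spout", new Fields("sentence"));

        // run locally for a quick test; on a cluster you would submit with StormSubmitter instead
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("grouping-demo", new Config(), builder.createTopology());
        Thread.sleep(10000);
        cluster.shutdown();
    }
}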

When programming, you need to import the Storm jar packages. When working against a cluster, each machine in the cluster should have the directory created when Storm is unpacked.

If you have three machines, it should be created on all three.

After we finish the Java program, we package it into a jar and upload it to the Linux machine.

In fact, the programming of storm is similar to that of mapreduce.

The command to execute storm is

Under the bin directory

./storm jar ~/phonetopo.jar <client main class> <parameters>

~/phonetopo.jar: the phonetopo.jar under the user's home directory

The parameter is the name given to the topology when it is submitted to the cluster.
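
For illustration, the client main class named in that command might look like the sketch below. The jar name, the component ids, and the reuse of the illustrative SentenceSpout and UpperBolt are assumptions; the one fixed point is that the topology name passed to StormSubmitter comes from the command-line parameter.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class PhoneTopoMain {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("phone-spout", new SentenceSpout());
        builder.setBolt("phone-bolt", new UpperBolt(), 2).shuffleGrouping("phone-spout");

        Config conf = new Config();
        conf.setNumWorkers(2);

        // args[0] is the name given to the topology when it is submitted, e.g.
        //   ./storm jar ~/phonetopo.jar PhoneTopoMain phone-topo
        StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
    }
}

The submitted name (phone-topo in this example) is what bin/storm list shows and what bin/storm kill takes.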

After startup, we can check it through the command.

bin/storm list

The program will run all the time for real-time online analysis.

We use a command to stop the program:

bin/storm kill phone-topo (phone-topo: the name given when the topology was submitted)
