==> What is Storm?
--> Twitter officially open-sourced Storm, a distributed, fault-tolerant real-time computation system released under the Eclipse Public License 1.0. Storm was originally developed at BackType, which is now part of Twitter.
--> Storm provides a set of general-purpose primitives for distributed real-time computation. It can be used for "stream processing", handling messages and updating databases in real time; for "continuous computation", running standing queries over data streams and emitting the results to clients as streams; and for "distributed RPC", running expensive operations in parallel.
--> Nathan Marz, Storm's lead engineer, put it this way: Storm makes it easy to write and scale complex real-time computations on a cluster of computers; Storm is to real-time processing what Hadoop is to batch processing. Storm guarantees that every message gets processed, and it is fast: a small cluster can process millions of messages per second. Better yet, you can develop in any programming language.
==> Offline computing vs. streaming computing (Storm is a streaming system)
--> Offline computing: data is collected in batches, transferred in batches, computed periodically, and the results are then presented (Sqoop --> HDFS --> MapReduce --> HDFS).
    - Representative technologies:
        -- Sqoop: batch data import
        -- HDFS: batch data storage
        -- MapReduce: batch computation
        -- Hive: batch SQL-style analysis
--> Streaming computing: data is generated, transferred, computed, and displayed in real time (Flume --> Kafka --> stream computing --> Redis); a sketch of the final Storm-to-Redis hop follows this list.
    - Representative technologies:
        -- Flume: real-time data collection
        -- Kafka/MetaQ: real-time data buffering and storage
        -- Storm/JStorm: real-time computation
        -- Redis: real-time result cache, with persistent storage in MySQL
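To make the streaming pipeline concrete, here is a minimal sketch of its last hop: a terminal Storm bolt that caches word counts into Redis via the Jedis client. The class name, the Redis address, and the upstream field names ("word", "count") are illustrative assumptions, not from the original text.

    import java.util.Map;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;
    import redis.clients.jedis.Jedis;

    // Terminal bolt of a Flume -> Kafka -> Storm -> Redis pipeline:
    // it writes each (word, count) pair into Redis as a simple key/value.
    public class RedisSinkBolt extends BaseBasicBolt {
        private transient Jedis jedis;

        @Override
        public void prepare(Map stormConf, TopologyContext context) {
            jedis = new Jedis("192.168.10.210", 6379); // hypothetical Redis node
        }

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            // expects upstream tuples declared with fields ("word", "count")
            jedis.set(input.getStringByField("word"),
                      String.valueOf(input.getLongByField("count")));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: nothing is emitted downstream
        }
    }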
==> The difference between Storm and Hadoop
    - Storm: used for real-time computing; the data being processed is kept in memory and arrives continuously over the network.
    - Hadoop: used for offline computing; the data being processed is stored in the file system and is read from the HDFS platform.
==> Storm architecture
==> Storm run process
--> Responsibilities of each component in the Storm architecture:
    - Nimbus:
        Responsible for resource allocation and task scheduling.
    - Supervisor:
        Responsible for accepting tasks assigned by Nimbus and for starting and stopping the worker processes it manages.
        (*) The number of worker processes a supervisor may start is set in the configuration file.
    - Worker:
        Runs the concrete processing-component logic. There are two kinds of tasks:
        -- Spout tasks
        -- Bolt tasks
    - Executor:
        Since Storm 0.8, an Executor is a physical thread inside a Worker process. Tasks of the same Spout/Bolt may share one Executor, and only tasks belonging to the same Spout/Bolt can run in a single Executor.
    - Task:
        Each instance of a spout/bolt running in a worker is called a task. Since Storm 0.8, a task no longer corresponds one-to-one to a physical thread; several tasks of the same spout/bolt may share one physical thread, which is called an executor. (See the sketch after this list.)
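A minimal sketch of how workers, executors, and tasks are set when building a topology. MySpout and MyBolt stand for user-defined components; they and the numbers used here are illustrative assumptions, not from the original text.

    import org.apache.storm.Config;
    import org.apache.storm.topology.TopologyBuilder;

    public class ParallelismSketch {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            // parallelism hint 2 -> 2 executors running 2 tasks (one each)
            builder.setSpout("spout", new MySpout(), 2);
            // 2 executors (threads) sharing 4 tasks: 2 tasks per executor
            builder.setBolt("bolt", new MyBolt(), 2)
                   .setNumTasks(4)
                   .shuffleGrouping("spout");
            Config conf = new Config();
            // spread the executors across 2 worker processes
            conf.setNumWorkers(2);
        }
    }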
==> Storm pseudo-distributed installation and deployment
--> A Zookeeper environment must be deployed before installing. See: https://blog.51cto.com/songqinglong/2062909
--> Unpack the archive:
    tar zxf apache-storm-1.0.3.tar.gz -C /app
--> Configure environment variables:
    vim ~/.bash_profile
        # storm_home
        STORM_HOME=/app/apache-storm-1.0.3
        export STORM_HOME
        PATH=$STORM_HOME/bin:$PATH
        export PATH
--> Edit the configuration file:
    vim $STORM_HOME/conf/storm.yaml
        # specify the zookeeper node(s)
        storm.zookeeper.servers:
            - "192.168.10.210"
        # specify the nimbus node(s)
        # nimbus.seeds: ["host1", "host2", "host3"]
        nimbus.seeds: ["192.168.10.210"]
        # worker slots on each slave node
        supervisor.slots.ports:
            - 6700
            - 6701
            - 6702
            - 6703
        # enable the task Debug (event logger) function
        "topology.eventlogger.executors": 1
        # directory where uploaded task code and state are saved
        storm.local.dir: "/data/storm_data"
--> Start Storm
    - Pseudo-distributed mode:
        storm nimbus &
        storm ui &          # web UI available over http at http://ip:8080
        storm supervisor &
        storm logviewer &
    - Fully distributed mode:
        -- Master node:
            storm nimbus &
            storm ui &
            storm logviewer &
        -- Slave nodes:
            storm supervisor &
            storm logviewer &
--> View the web UI at: http://ip:8080
==> Storm fully distributed installation and deployment
--> The procedure is basically the same as the pseudo-distributed installation; simply copy the installation directory to the other nodes.
==> Storm HA
--> Simply edit nimbus.seeds in the storm.yaml file: add the additional hosts to the list (for example, extending nimbus.seeds: ["bigdata1"] to nimbus.seeds: ["bigdata1", "bigdata2"], where bigdata2 is a hypothetical second host), then start nimbus on each listed host.
==> Storm common commands
--> Submit a task
    - Format: storm jar *.jar [main class of the topology] [topology alias]
    - Example:
        storm jar storm-starter-topologies-1.0.3.jar org.apache.storm.starter.WordCountTopology MyWordCountExample
--> Kill a task
    - Format: storm kill [task name] -w 10    (note: -w gives the seconds to wait before killing)
    - Example:
        storm kill MyWordCountExample -w 10
--> Deactivate a task
    - Format: storm deactivate [task name]
    - Example:
        storm deactivate MyWordCountExample
--> Activate a task
    - Format: storm activate [task name]
    - Example:
        storm activate MyWordCountExample
--> Rebalance (redeploy) a task
    - Format: storm rebalance [task name]
    - Example:
        storm rebalance MyWordCountExample
    -- (*) When the cluster changes, this command deactivates the topology, waits for the given timeout, then restarts the topology and reassigns its tasks. (Recent versions also accept -w wait-secs, -n new-worker-count, and -e component=parallelism.)
==> Flow analysis of the WordCount program in Storm
--> You can inspect the messages sent by each Storm component (spout/bolt) through the "events" link of each component on the Storm UI.
--> This requires the Debug function to be enabled: add the following parameter to the configuration file and restart Storm:
    "topology.eventlogger.executors": 1
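The same setting can also be supplied per topology at submission time instead of cluster-wide. A minimal sketch, assuming the standard org.apache.storm.Config API:

    import org.apache.storm.Config;

    public class EventLoggerConfigSketch {
        public static Config withEventLogger() {
            Config conf = new Config();
            // same key as in storm.yaml: run one event-logger executor for this topology
            conf.put("topology.eventlogger.executors", 1);
            return conf;
        }
    }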
==> Storm programming model
--> Topology: the name for a real-time application running in Storm
--> Spout: obtains the source data stream in a topology and converts it into the topology's internal source data
--> Bolt: receives data and processes it; this is where users implement the operations they want
--> Tuple: the basic unit of message passing
--> Stream: represents the flow of data
--> Stream grouping: the data partitioning strategy (a wiring sketch follows this list)
    - Shuffle grouping: random grouping; tuples are distributed evenly across the downstream Bolt's tasks
    - Fields grouping: grouping by field; tuples with the same value of the grouping field are sent to the same Task
    - All grouping: broadcast; every downstream task receives a copy of each tuple
    - Global grouping: all tuples are routed to one single Task of the Bolt
    - None grouping: no particular grouping (currently behaves like shuffle grouping)
    - Direct grouping: the emitter of a tuple decides directly which task of the consumer receives it
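A minimal wiring sketch tying these concepts together in Java. RandomSentenceSpout, SplitSentenceBolt, and WordCountBolt are hypothetical component names used only for illustration:

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class WordCountWiringSketch {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // spout emits sentences; the bolts split and count them
            builder.setSpout("sentences", new RandomSentenceSpout(), 1);
            builder.setBolt("split", new SplitSentenceBolt(), 2)
                   .shuffleGrouping("sentences");                 // shuffle grouping
            builder.setBolt("count", new WordCountBolt(), 2)
                   .fieldsGrouping("split", new Fields("word"));  // fields grouping on "word"
            // run the topology in-process for testing
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("wordcount", new Config(), builder.createTopology());
        }
    }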
==> Data structures saved by Storm clusters in Zookeeper
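--> Broadly (summarizing Storm's documented Zookeeper layout as a rough guide): Storm keeps its runtime state under the /storm root, including /storm/supervisors (supervisor heartbeats), /storm/workerbeats (worker heartbeats), /storm/storms (active topologies), /storm/assignments (task assignments), and /storm/errors (component errors).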
==> Storm task submission
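--> Roughly, the flow is: the client runs storm jar, which uploads the topology code to Nimbus; Nimbus stores the code locally, computes the task assignments, and writes them to Zookeeper; the Supervisors watch Zookeeper, download the code, and start Workers; the Workers spawn Executors, which run the Spout/Bolt tasks.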