
How to install and use a JStorm cluster on CentOS 6.8

2025-07-12 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to install and use a JStorm cluster on CentOS 6.8, walking through the procedure with a real deployment.

JStorm is a real-time stream computing framework modeled on Apache Storm. It has been continuously improved in network IO, threading model, resource scheduling, availability, and stability, and is used by a growing number of enterprises.

From the application point of view, a JStorm application is a distributed application that follows a certain programming specification. From the system point of view, JStorm is a scheduling system similar to MapReduce. From the data point of view, JStorm is a pipeline-based message processing mechanism.

Real-time computing is currently the hottest direction in big data. Requirements on data keep rising, and so do requirements on timeliness. Traditional Hadoop MapReduce increasingly cannot meet them, so demand in this area keeps growing.

Comparison of Storm components and Hadoop components

                        JStorm        Hadoop
Role                    Nimbus        JobTracker
                        Supervisor    TaskTracker
                        Worker        Child
Application name        Topology      Job
Programming interface   Spout/Bolt    Mapper/Reducer

Advantages

Before Storm and JStorm appeared, there were many real-time computing engines on the market. Since their arrival, the field has essentially consolidated around them. Their advantages are:

Fast development: the interface is simple and easy to use. As long as you follow the Topology, Spout, and Bolt programming specifications, you can develop a highly scalable application without having to think about the underlying RPC, redundancy between Workers, data distribution, and similar concerns.

Excellent scalability: when a processing stage is not fast enough, simply raising its configured parallelism scales performance linearly.

Robustness: when a Worker or a machine fails, a new Worker is automatically assigned to replace the failed one.

Data accuracy: the Ack mechanism can be used to ensure that no data is lost. If accuracy requirements are stricter, the transaction mechanism can be used to guarantee it.

High real-time performance: JStorm's design is oriented toward processing records one at a time, so it has lower latency than similar products.

JStorm processes data as a message-based pipeline, so it is especially suitable for stateless computation, that is, computation in which all the data a computing unit depends on can be found in the message it receives. Ideally, one data stream does not depend on another.

Therefore, it is often used for:

Log analysis: extract specific data from logs and store the analysis results in external storage such as a database. Mainstream log analysis today uses JStorm or Storm.

Pipeline systems: transfer data from one system to another, such as synchronizing a database to Hadoop.

Message converters: convert received messages into a given format and store them in another system, such as message middleware.

Statistical analyzers: extract a field from logs or messages, compute a count or sum, and finally store the statistics in external storage. The intermediate processing may be more complex.

Real-time recommendation systems: run the recommendation algorithm in JStorm to produce recommendations within seconds.

Basic concepts

First, JStorm is somewhat similar to Hadoop's MR (MapReduce). The difference is that an MR job submitted to Hadoop ends once it has executed and its processes exit, whereas a JStorm task (called a topology in JStorm) runs around the clock until the user explicitly kills it.

JStorm components

The following is the classic high-level structure diagram of Storm (JStorm is the same):

The faucet in the diagram (admittedly a little tacky) is called a spout, and the lightning bolt is called a bolt.

A JStorm topology has two kinds of components: spouts and bolts.

# spout

A spout represents the input data source, which can be anything: Kafka, a database, HBase, or even HDFS. JStorm continuously reads data from this source and sends it to downstream bolts for processing.

# bolt

A bolt represents processing logic. After a bolt receives a message, it processes it (that is, it executes the user's business logic). Once done, it can either send the processed message on to a downstream bolt, which forms a processing pipeline (more accurately, a directed graph), or simply end there.

Usually, the last bolt of a pipeline does some data storage work, such as writing the real-time results into a database or HBase, for front-end services to query and display.

Component interfaces

The JStorm framework defines one interface for the spout component: nextTuple. As the name implies, it fetches the next message. At run time, you can think of the JStorm framework as calling this interface continuously to pull data from the data source and send it to the bolts.

Likewise, the bolt component defines one interface: execute, which is where the user implements the business logic.

Each topology can have multiple spouts, which means receiving messages from multiple data sources at once, and multiple bolts to execute different business logic.
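As a rough analogy only, the spout and bolt roles described above can be mimicked with an ordinary Unix pipeline. The real nextTuple and execute interfaces are Java methods. The function names spout, extract_level, and count_levels below are hypothetical and are not part of JStorm, and the sample messages are made up for illustration.

```shell
# "Spout": emits one message per line (a fixed list stands in for Kafka, a database, ...).
spout() {
  printf '%s\n' "error disk full" "info started" "error timeout"
}

# First "bolt": business logic applied to every message; keeps only the log level.
extract_level() {
  while read -r level _rest; do
    printf '%s\n' "$level"
  done
}

# Last "bolt": aggregates, producing the result that would be written to external storage.
count_levels() {
  sort | uniq -c | awk '{print $2 "=" $1}'
}

# The pipeline: spout -> bolt -> bolt.
spout | extract_level | count_levels
```

Each stage only sees the messages handed to it by the previous stage, which is the stateless, message-driven style the text describes.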

Scheduling and execution

Next is how a topology is scheduled and executed. JStorm eventually schedules a topology onto one or more workers. Each worker is a real operating system process, and the workers are distributed across one or more machines in the cluster to run in parallel.

Each worker can contain multiple tasks, each of which represents one execution thread. Each task is an instance of one of the components described above, either a spout or a bolt.

When submitting a topology, the user will specify the following execution parameters:

# Total number of workers

The total number of processes. For example, if I submit a topology and specify 3 workers, there may be 3 processes executing in the end. "May" because, depending on the configuration, JStorm may add internal components such as __acker or __topology_master (both are special bolts), so that more processes end up running than the user specified. By default, if the user sets fewer than 10 workers, __topology_master exists only as a task and does not occupy a worker of its own. If the user sets 10 or more workers, __topology_master, as a task, occupies one worker exclusively.
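The default rule for __topology_master can be written down as a tiny function. This is only a restatement of the threshold described above. The function name topology_master_placement and the labels "shared" and "dedicated" are hypothetical, not JStorm terminology.

```shell
# Prints where __topology_master runs by default for a given worker count:
#   shared    - it is just a task inside an ordinary worker (fewer than 10 workers)
#   dedicated - it occupies one worker exclusively (10 or more workers)
topology_master_placement() {
  if [ "$1" -lt 10 ]; then
    echo "shared"
  else
    echo "dedicated"
  fi
}

topology_master_placement 3
topology_master_placement 10
```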

# parallelism of each component

As mentioned above, each topology can contain multiple spouts and bolts, and each spout and bolt can specify its own degree of parallelism, which is the number of threads (tasks) that execute that spout or bolt at the same time.

In JStorm, each execution thread has a task id, which increments from 1, and the task ids within one component are contiguous.

Take the topology above again: it contains one spout and one bolt, with a spout parallelism of 5 and a bolt parallelism of 10. So we end up with 15 execution threads: 5 spout threads and 10 bolt threads.

In this case, the spout's task ids may be 1 to 5 and the bolt's task ids may be 6 to 15. "May" because, when scheduling, JStorm does not guarantee that task ids are assigned to the spout first and then to the bolt. The task ids within one component, however, are always contiguous.
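The numbering can be sketched as follows. This is an illustration of contiguous ranges only, not JStorm's actual assignment code. The helper assign_task_ids is hypothetical, and the order in which components are numbered is simply the order of the arguments.

```shell
# Takes name:parallelism pairs and prints "name first-last" for each component.
# Ids start at 1 and each component gets one contiguous range.
assign_task_ids() (
  next=1
  for spec in "$@"; do
    name=${spec%%:*}          # text before the colon
    parallelism=${spec##*:}   # text after the colon
    echo "$name $next-$(( next + parallelism - 1 ))"
    next=$(( next + parallelism ))
  done
)

# Spout numbered first: spout 1-5, bolt 6-15.
assign_task_ids spout:5 bolt:10

# Bolt numbered first is equally valid: bolt 1-10, spout 11-15.
assign_task_ids bolt:10 spout:5
```

Either ordering satisfies the only guarantee stated in the text: ids inside one component are contiguous.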

# relationship between each component

That is, the user needs to specify which bolt processes the data emitted by a particular spout and, for an intermediate bolt, which bolt processes the data it emits.

Again, take the topology above as an example. It is distributed across 3 processes. JStorm uses an even scheduling algorithm, so at run time you will see 5 threads executing in each process. Of course, because the spout has 5 threads, they cannot be spread evenly over 3 processes, so one process will have only 1 spout thread. Correspondingly, one process will have 4 bolt threads.
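The numbers in this example can be reproduced with a simple round-robin placement. Round-robin is an assumption made here for illustration and is not claimed to be JStorm's real scheduling algorithm. The sketch also assumes the spout holds the lowest task ids. The function name distribute is hypothetical.

```shell
# distribute WORKERS SPOUT_TASKS BOLT_TASKS
# Places task ids 1..N on workers in round-robin order and prints per-worker counts.
distribute() (
  workers=$1 spouts=$2 bolts=$3
  total=$(( spouts + bolts ))
  w=0
  while [ "$w" -lt "$workers" ]; do
    s=0 b=0 id=1
    while [ "$id" -le "$total" ]; do
      # Task id lands on worker (id - 1) mod workers.
      if [ $(( (id - 1) % workers )) -eq "$w" ]; then
        if [ "$id" -le "$spouts" ]; then s=$(( s + 1 )); else b=$(( b + 1 )); fi
      fi
      id=$(( id + 1 ))
    done
    echo "worker$w spout=$s bolt=$b"
    w=$(( w + 1 ))
  done
)

distribute 3 5 10
```

With 3 workers, 5 spout tasks, and 10 bolt tasks, every worker ends up with 5 threads, and exactly one worker holds 1 spout thread and 4 bolt threads, matching the description above.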

While a topology is running, if a process (worker) dies, JStorm detects this and keeps trying to restart it. This is what round-the-clock execution means.

Communication of messages

As mentioned above, a spout sends messages to specific bolts, and a bolt can send messages to other bolts. So how do they communicate?

When a message is sent from a spout, JStorm first computes the list of target task ids, and then checks whether each target task is in the current process or in another process. If it is in the current process, the message is delivered in-process (for example, by putting it directly into the execution queue of the target task). If it is in another process, JStorm uses Netty to send the message to the target task.
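The routing decision can be sketched as below. This is a toy model of the local versus remote choice, not JStorm code. The placement table in worker_of, the worker names, and the function route are all hypothetical.

```shell
# Hypothetical placement table: which worker process hosts each task id.
worker_of() {
  case $1 in
    1|4|7) echo worker0 ;;
    2|5|8) echo worker1 ;;
    *)     echo worker2 ;;
  esac
}

# route CURRENT_WORKER TARGET_TASK_ID...
# For each target task, prints whether delivery stays in-process or goes over the network.
route() (
  current=$1
  shift
  for target in "$@"; do
    host=$(worker_of "$target")
    if [ "$host" = "$current" ]; then
      echo "task $target: in-process queue"
    else
      echo "task $target: Netty to $host"
    fi
  done
)

# A spout task running in worker0 emits to bolt tasks 6 and 7.
route worker0 6 7
```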

Real-time calculation result output

JStorm runs around the clock, so external systems that need the processing results at a particular point in time do not query JStorm directly (DRPC can support this, but its performance is not very good). Typically, a spout or bolt contains logic that periodically writes the computed results to external storage, so the data is stored in real time or near real time as the business requires, and results are then queried directly from that external storage.

The content above is pasted directly from the official JStorm website, so please don't complain.

II. JStorm cluster installation

1. System environment preparation

# OS: CentOS 6.8 minimal
# host.ip: 10.1.1.78 aniutv-1
# host.ip: 10.1.1.80 aniutv-2
# host.ip: 10.1.1.97 aniutv-5

2. Installation directory customization

JStorm: /opt/jstorm (source code installation)

Zookeeper: /opt/zookeeper (source code installation)

Java: /usr/java/jdk1.7.0_79 (rpm package installation)

3. Zookeeper cluster installation

Zookeeper Cluster reference (http://blog.csdn.net/wh311212/article/details/56014983)

4. ZeroMQ installation

ZeroMQ download address: http://zeromq.org/area:download/

Download zeromq-4.2.1.tar.gz to /usr/local/src

cd /usr/local/src && tar -zxf zeromq-4.2.1.tar.gz -C /opt

cd /opt/zeromq-4.2.1 && ./configure && make && sudo make install && sudo ldconfig

5. jzmq installation

cd /opt && git clone https://github.com/nathanmarz/jzmq.git
# enter the cloned directory before building
cd /opt/jzmq
./autogen.sh && ./configure && make && make install

6. JStorm installation

wget https://github.com/alibaba/jstorm/releases/download/2.1.1/jstorm-2.1.1.zip -P /usr/local/src
cd /usr/local/src && unzip jstorm-2.1.1.zip -d /opt
cd /opt && mv jstorm-2.1.1 jstorm
# mkdir /opt/jstorm/jstorm_data
echo '# jstorm env' >> ~/.bashrc
echo 'export JSTORM_HOME=/opt/jstorm' >> ~/.bashrc
echo 'export PATH=$PATH:$JSTORM_HOME/bin' >> ~/.bashrc
source ~/.bashrc
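After sourcing ~/.bashrc, it can be worth confirming that the two exported variables took effect before continuing. The check below is a minimal sketch. The function name jstorm_env_ok is hypothetical and is not a JStorm command.

```shell
# Succeeds only when JSTORM_HOME points at an existing directory
# and $JSTORM_HOME/bin is one of the PATH entries.
jstorm_env_ok() {
  [ -n "${JSTORM_HOME:-}" ] && [ -d "${JSTORM_HOME:-}" ] || return 1
  case ":$PATH:" in
    *":$JSTORM_HOME/bin:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if jstorm_env_ok; then
  echo "JStorm environment looks fine: $JSTORM_HOME"
else
  echo "JSTORM_HOME or PATH is not set up yet"
fi
```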

# JStorm configuration

sed -i '/storm.zookeeper.servers:/a\ - "10.1.1.78"' /opt/jstorm/conf/storm.yaml
sed -i '/storm.zookeeper.servers:/a\ - "10.1.1.80"' /opt/jstorm/conf/storm.yaml
sed -i '/storm.zookeeper.servers:/a\ - "10.1.1.97"' /opt/jstorm/conf/storm.yaml
sed -i '/storm.zookeeper.root/a\ nimbus.host: "10.1.1.78"' /opt/jstorm/conf/storm.yaml

YAML is indentation-sensitive, so make sure the appended lines end up with the same indentation as the existing entries in storm.yaml.

Configuration items:

storm.zookeeper.servers: the addresses of the ZooKeeper servers.

nimbus.host: the address of Nimbus.

storm.zookeeper.root: the root directory of JStorm in ZooKeeper. Set this option when multiple JStorm clusters share one ZooKeeper. The default is "/jstorm".

storm.local.dir: the directory where JStorm stores temporary data. Make sure the JStorm process has write permission to this directory.

java.library.path: the installation directories of the ZeroMQ and Java ZeroMQ libraries. The default is "/usr/local/lib:/opt/local/lib:/usr/lib".

supervisor.slots.ports: the list of slot ports provided by the Supervisor. Take care not to conflict with other ports. The default is 68xx, whereas Storm uses 67xx.

topology.enable.classloader: false. The classloader is disabled by default. If the application's jars conflict with jars that JStorm depends on (for example, the application uses thrift9 but JStorm uses thrift7), the classloader needs to be enabled. It is recommended to keep it off at the cluster level and turn it on only for the topologies that need isolation.
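A quick way to see which of these items are actually set in a given storm.yaml is sketched below. The helper missing_keys is hypothetical, the choice of the four keys it checks is an assumption made for illustration, and it only looks for uncommented "key:" lines rather than parsing YAML properly.

```shell
# missing_keys FILE
# Prints each checked key that has no uncommented "key:" line in FILE.
missing_keys() (
  file=$1
  for key in storm.zookeeper.servers nimbus.host storm.zookeeper.root storm.local.dir; do
    # Strip leading whitespace, then require the line to start with "key:".
    awk -v k="$key:" '{ sub(/^[ \t]+/, ""); if (index($0, k) == 1) found = 1 }
                      END { exit !found }' "$file" || echo "$key"
  done
)

# Only runs the check when the configuration file from this article exists.
conf="${JSTORM_HOME:-/opt/jstorm}/conf/storm.yaml"
if [ -f "$conf" ]; then
  missing_keys "$conf"
fi
```

An empty result means all four keys are present. Any key printed is either absent or commented out.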

# the following commands only need to be executed on the machines where the JStorm UI is installed and from which jars are submitted

mkdir ~/.jstorm
cp -f $JSTORM_HOME/conf/storm.yaml ~/.jstorm

7. Install JStorm Web UI

Tomcat 7.0 or above is required. Remember to copy storm.yaml to ~/.jstorm/storm.yaml. The Web UI is installed on the same node as Nimbus.

mkdir ~/.jstorm
cp -f $JSTORM_HOME/conf/storm.yaml ~/.jstorm
# download Tomcat 7.x (apache-tomcat-7.0.75 is used in the commands below)
tar -xzf apache-tomcat-7.0.75.tar.gz
cd apache-tomcat-7.0.75
cd webapps
cp $JSTORM_HOME/jstorm-ui-2.1.1.war ./
mv ROOT ROOT.old
# link to the unpacked directory, not to the war file: do not use "ln -s jstorm-ui-2.1.1.war ROOT"
ln -s jstorm-ui-2.1.1 ROOT
cd ../bin
./startup.sh

8. Starting JStorm

1. Execute "nohup jstorm nimbus &" on the nimbus node (10.1.1.78) and check $JSTORM_HOME/logs/nimbus.log for errors.

2. Execute "nohup jstorm supervisor &" on the supervisor nodes (10.1.1.78, 10.1.1.80, 10.1.1.97), and check $JSTORM_HOME/logs/supervisor.log for errors.
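Checking the two log files can be scripted. The sketch below assumes that problem lines contain the literal string "ERROR", which is an assumption about the log format rather than something stated in this article. The helper count_log_errors is hypothetical.

```shell
# count_log_errors FILE
# Prints how many lines of FILE contain "ERROR" (assumed marker for problems).
count_log_errors() {
  grep -c 'ERROR' "$1" || true   # grep exits non-zero when the count is 0
}

# Report on whichever of the two logs exist on this node.
for name in nimbus supervisor; do
  log="${JSTORM_HOME:-/opt/jstorm}/logs/$name.log"
  if [ -f "$log" ]; then
    echo "$name.log: $(count_log_errors "$log") ERROR line(s)"
  fi
done
```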

