How to Implement a Big Data Platform Based on Storm

This article explains how to implement a big data platform based on Storm, using the unified real-time data platform built at Ctrip as a case study: why the platform was needed, how it was built, the pitfalls along the way, and newer explorations with Streaming CQL and JStorm.
Why build a real-time data platform?
First, some background: why did we need to build this data platform at all? If you know Ctrip's business, you know it has many business divisions. Besides the two major businesses of hotels and air tickets, there are nearly 20 SBUs and shared departments whose business models vary greatly and change quickly. The original batch-style data processing could no longer meet the data acquisition and analysis needs of these businesses; they needed to analyze and process data in closer to real time.
Before this unified real-time platform existed, various departments had already built some real-time data analysis applications, but there were many problems:
First, technology choices were all over the map: for message queues some used ActiveMQ, some RabbitMQ, and some Kafka; for the processing layer some used Storm, some Spark Streaming, and some wrote their own programs. Because the technical strength of the business departments was uneven, and their energy was mainly focused on implementing business requirements, the stability of these real-time data applications was often hard to guarantee.
Second, peripheral facilities such as alerting and monitoring were missing.
Third, data and information sharing was not smooth: if, say, the vacation division wanted to use the hotel division's real-time data, analyzing and processing it across different systems was difficult. Under these circumstances, it became necessary to build a unified real-time data platform.
What kind of real-time data platform do we need?
This unified data platform needs to meet four requirements:
The first is stability, which is the lifeline of any platform and system.
The second is complete supporting facilities, including a test environment, release tooling, monitoring and alerting.
The third is convenient information sharing, in two senses: first, data sharing; second, sharing of application scenarios; for example, one department may be inspired by another department's real-time analysis scenario and build something similar in its own business domain.
The fourth is timely service response: users run into all kinds of problems throughout development, testing, launch and maintenance, and need timely help and support.
How to implement it
With these requirements clarified, we began to build the platform. The first step, of course, was technology selection. On the message-queue side, Kafka had already become the de facto standard; for the real-time processing platform, however, there were still many candidates, such as LinkedIn's Samza and Apache S4, with Storm and Spark Streaming being the most mainstream.
For stability and maturity, we chose Storm as the real-time platform at the time. If I were choosing again today, I think either Storm or Spark Streaming would be fine, because both platforms have since matured.
The architecture is relatively simple: logs and business data are collected from the business servers and written into Kafka in real time; Storm jobs read the data from Kafka, perform the computation, and emit the results to whatever external storage each business line depends on.
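As a rough illustration of this pipeline, here is a minimal Storm 0.9.x topology skeleton that reads from Kafka; the topic, the ZooKeeper address and ComputeAndStoreBolt are placeholders for illustration, not Ctrip's actual code.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

public class PipelineSkeleton {
    public static void main(String[] args) throws Exception {
        // Kafka spout config: ZooKeeper hosts, topic, ZK root path, consumer id
        SpoutConfig spoutConf = new SpoutConfig(
                new ZkHosts("zk1:2181"), "ubt_action", "/kafka-spout", "demo-consumer");

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 4);
        // ComputeAndStoreBolt is a hypothetical bolt that computes results and
        // writes them to the business line's external storage.
        builder.setBolt("compute", new ComputeAndStoreBolt(), 8)
               .shuffleGrouping("kafka-spout");

        Config conf = new Config();
        conf.setNumWorkers(4);
        StormSubmitter.submitTopology("pipeline-demo", conf, builder.createTopology());
    }
}
```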
So is it enough to just build this? Of course not, because so far this is only operations work: we have merely stood up the modules of a system.
The two most critical requirements mentioned above, data sharing and overall platform stability, were still hard to guarantee, so we needed system governance to meet them.
First, data sharing. We believe the prerequisite for data sharing is that users clearly know the business meaning of a data source and the Schema of the data in it, and can see this information in one centralized place. Our solution is to define the data's Schema with Avro and put the information on a unified Portal site: the data producer creates the Topic and uploads the Schema in Avro format; the system generates the Java classes from the Avro Schema, builds the corresponding JAR, and publishes the JAR to the Maven repository. A data consumer then only needs to add the dependency to their project.
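For example, a producer might upload a schema along these lines (a hypothetical sketch; the record and field names are illustrative):

```json
{
  "type": "record",
  "name": "UbtAction",
  "namespace": "com.example.ubt",
  "fields": [
    {"name": "page",     "type": "string"},
    {"name": "type",     "type": "string"},
    {"name": "action",   "type": "string"},
    {"name": "category", "type": "string"},
    {"name": "ts",       "type": "long"}
  ]
}
```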
In addition, we encapsulated the Storm API to handle deserialization for the user: the user simply extends a base class and specifies the class corresponding to the message, and the system deserializes the message automatically. What arrives in the process method is the already-deserialized object, which is very convenient for the user.
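A minimal sketch of what such a wrapper might look like, assuming Avro-encoded byte payloads; the class and method names are illustrative, not the actual muise-core API:

```java
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificRecordBase;

// Users extend this class with the Avro-generated type from the Portal JAR
// and implement process() against the typed object.
public abstract class AvroBolt<T extends SpecificRecordBase> extends BaseRichBolt {
    private final Class<T> recordClass;
    private transient SpecificDatumReader<T> reader;
    protected transient OutputCollector collector;

    protected AvroBolt(Class<T> recordClass) {
        this.recordClass = recordClass;
    }

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.reader = new SpecificDatumReader<T>(recordClass);
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            byte[] payload = tuple.getBinary(0);  // raw Avro bytes from Kafka
            T message = reader.read(null, DecoderFactory.get().binaryDecoder(payload, null));
            process(message, tuple);              // user code sees a typed object
            collector.ack(tuple);
        } catch (Exception e) {
            collector.fail(tuple);
        }
    }

    // Business logic goes here; subclasses also declare their output fields.
    protected abstract void process(T message, Tuple tuple);
}
```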
Second, resource control, which is the basis of platform stability. Storm is not very good at resource isolation, so we need to control the concurrency of users' Storm jobs. Our approach is to encapsulate the Storm interface, remove the original methods for setting topology and executor concurrency, and move those settings to the Portal.
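A sketch of the idea (the class name and the Portal lookup are illustrative; in the real platform the worker and executor counts come from the job's Portal configuration):

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public final class PlatformSubmitter {
    public static void submit(String jobName, TopologyBuilder builder) throws Exception {
        Config conf = new Config();
        // Stand-in for fetching the concurrency settings from the Portal;
        // users themselves have no API for setting parallelism.
        int workers = Integer.getInteger("portal.workers." + jobName, 2);
        conf.setNumWorkers(workers);
        StormSubmitter.submitTopology(jobName, conf, builder.createTopology());
    }
}
```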
In addition, as mentioned earlier, we built a unified Portal to make management easy for users: they can view Topic-related information and manage their own Storm jobs there; configuration, startup, rebalance, monitoring and other functions can all be done on the Portal.
After completing these functions, we brought in the initial business. At first we connected only two data sources, both with fairly heavy traffic: UBT (Ctrip's user behavior data) and Pprobe data (application traffic logs), which together essentially cover Ctrip's behavior and access logs. The main applications focused on real-time data analysis and data reports.
From the early stage of building the platform, we have a few lessons to share:
The most important design and planning must be done up front, because the later an adjustment comes, the higher its cost.
Concentrate efforts on implementing the core functions.
Connect business as early as possible: once the core functions are complete and stable, the sooner business is onboarded the better, because a system only keeps evolving when it is actually used.
There must be a certain volume of onboarded business. Our initial access was Ctrip's entire UBT user-behavior stream, which helped the whole platform stabilize quickly: a newly built platform inevitably has all kinds of problems, and surviving the verification of heavy traffic both stabilizes the platform by flushing out bugs and helps the team accumulate technical and operational experience.
After that, we did a series of work to improve the "peripheral facilities" of the platform:
First, we import the Storm logs into ES and display them on a dashboard. The native Storm logs are inconvenient to view and have no search function; once imported into ES, the data can be displayed as charts with full-text search, which is very handy when troubleshooting.
Second, improvements related to metrics. Besides Storm's built-in metrics, we added some general instrumentation, such as the latency from a message's arrival in Kafka to the moment it is consumed. We also implemented a custom MetricsConsumer that writes all metrics information in real time to Ctrip's own dashboard system Dashboard and to Graphite, and the data in Graphite is used for alerting.
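A minimal sketch of such a consumer against Storm 0.9.x's IMetricsConsumer interface, emitting Graphite's plaintext protocol; the endpoint and the metric naming are assumptions:

```java
import java.io.PrintWriter;
import java.net.Socket;
import java.util.Collection;
import java.util.Map;
import backtype.storm.metric.api.IMetricsConsumer;
import backtype.storm.task.IErrorReporter;
import backtype.storm.task.TopologyContext;

public class GraphiteMetricsConsumer implements IMetricsConsumer {
    private String host;
    private int port;

    @Override
    public void prepare(Map stormConf, Object registrationArgument,
                        TopologyContext context, IErrorReporter errorReporter) {
        this.host = "graphite.example.com";  // assumed endpoint
        this.port = 2003;                    // Graphite plaintext listener
    }

    @Override
    public void handleDataPoints(TaskInfo taskInfo, Collection<DataPoint> dataPoints) {
        // Graphite plaintext protocol: "<path> <value> <timestamp>\n".
        // Sketch only: a production version would pool the connection and retry.
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            for (DataPoint dp : dataPoints) {
                String path = "storm." + taskInfo.srcComponentId + "." + dp.name;
                out.printf("%s %s %d%n", path, dp.value, taskInfo.timestamp);
            }
        } catch (Exception e) {
            // swallowed for the sketch; real code would report the error
        }
    }

    @Override
    public void cleanup() { }
}
```

Registering it in the topology config would look like `conf.registerMetricsConsumer(GraphiteMetricsConsumer.class, 1);`.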
Third, we established a sound alerting system. Based on the metrics data output to Graphite, users can configure their own alert rules and set alert priorities: for high-priority alerts the system uses TTS to automatically call the contacts, while low-priority alerts go out as email. By default, we add alerts on the number of failed tuples and on consumption backlog for every user.
Fourth, we provide a universal Spout adapting Ctrip's message queue, and general-purpose Bolts for writing to Redis, HBase and databases, to simplify users' development work.
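As an illustration of what such a general-purpose sink Bolt might look like, here is a sketch assuming the Jedis client and a simple key/value tuple layout; it is not Ctrip's actual code:

```java
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;
import redis.clients.jedis.Jedis;

public class RedisWriterBolt extends BaseRichBolt {
    private transient Jedis jedis;
    private transient OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.jedis = new Jedis("redis.example.com", 6379);  // assumed endpoint
    }

    @Override
    public void execute(Tuple tuple) {
        // Assumes the upstream component emits (key, value) string fields.
        jedis.set(tuple.getStringByField("key"), tuple.getStringByField("value"));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal sink: no output stream
    }
}
```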
Finally, we also worked on dependency management to ease API upgrades. In version 2.0 of muise-core (our packaged Storm API project), we reorganized the relevant API interfaces; subsequent versions try to keep those interfaces backward compatible, and we then pushed all businesses to upgrade once. After that, we placed the muise-core JAR as one of the standard JARs in the lib folder of each supervisor's Storm installation directory. For later upgrades, if an upgrade is mandatory, we contact the users and restart their Topologies one by one; if it does not have to be forced, it simply takes effect the next time the user restarts the Topology.
After this work, we began onboarding business at scale; by now essentially all of Ctrip's technical teams are covered, and the types of applications are much richer than in the initial stage.
The following is a brief introduction to some real-time applications at Ctrip, mainly in the following four categories:
Real-time data reports
Real-time business monitoring
Marketing based on users' real-time behavior
Risk control and security applications
The first is cDataPortal, Ctrip's website data monitoring platform. Ctrip monitors the performance of every web page visit in detail and presents it through various charts.
The second application is AB Testing at Ctrip. As we all know, AB Testing only yields results after a fairly long period, and it needs to reach a certain volume before it is statistically significant. So where is real-time computation needed? Here it mainly plays a monitoring and alerting role: when an AB test goes live, users need a series of real-time metrics to observe the traffic-splitting effect and confirm the configuration is correct; they also need to watch the impact on orders, and if there is a significant impact they must be able to detect it and stop the test in time.
The third application is personalized recommendation, which combines a user's historical preferences with real-time preferences to recommend content in certain scenarios; the collection of real-time preferences is done through this real-time platform. Similar applications push strategies based on a user's real-time access behavior; for example, the group-tour business pushes coupons to users based on their real-time visits.
Pitfalls we have stepped in
Having covered the applications of the real-time data platform at Ctrip, let me briefly talk about some of our experience along the way.
First, the technical side: the pitfalls we ran into.
The Storm version we use is 0.9.4, and we hit two bugs in Storm itself. Both are fairly sporadic; if you run into the corresponding symptoms, you can refer to:
STORM-763: Nimbus has reassigned a worker to another node, but the Netty clients of the other workers do not connect to the new worker. Emergency response: kill the affected worker process or restart the related job.
STORM-643: when the failed list is not empty and some offsets are out of range, KafkaUtils repeatedly fetches the related messages.
There are also some problems in users' own usage. For example, where possible we generally recommend localOrShuffleGrouping; when using it, the parallelism of the upstream and downstream components should match, otherwise most of the downstream Bolt instances will receive no data. Users should also make sure that the member variables in a Bolt are serializable, otherwise errors will be thrown when running on the cluster.
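For example, matching the parallelism when using localOrShuffleGrouping (a fragment; the KafkaSpout config and ParserBolt are placeholders):

```java
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new KafkaSpout(spoutConf), 8);  // upstream parallelism 8
builder.setBolt("parser", new ParserBolt(), 8)            // downstream matches: 8
       .localOrShuffleGrouping("spout");
```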
Then there is the experience around support and the team. First, alerting and monitoring facilities must be in place before large-scale onboarding; these two systems are the prerequisite for taking on volume, otherwise it is difficult to discover, quickly locate and resolve abnormal problems.
Second, clear instructions, guidelines and a Q&A document can save a lot of support time: before users start developing, hand them the documentation, and they only consult you for the questions that remain.
Third, control the pace of onboarding. Our platform has relatively few developers, only three or four people, and although we serve every BU and handle their various problems, taking on too many projects at the same time would overwhelm us. Another important point of support is to "teach people to fish": explain things in detail while supporting them, so that they understand the basics of Kafka and Storm, can digest simple problems internally, and do not always have to come to your team for support.
New exploration
What I have described so far is essentially what we did last year; this year we have made new attempts in two directions: Streaming CQL and JStorm. Let me share the progress in these two areas.
Streaming CQL is a real-time streaming SQL engine open-sourced by Huawei. Its principle is to convert SQL directly into a Storm Topology and submit it to the Storm cluster. Its syntax is very similar to standard SQL, except that some window functions are added for real-time processing scenarios.
Next, a simple example to give you an intuitive feel (a sketch of the CQL follows below). The example:
Reads data of type ubt_action from Kafka.
Takes out the page, type, action and category fields, and aggregates them by the page and type fields every five seconds.
Writes the result to the console.
Implementing this with Storm generally requires four classes and a main method; with Streaming CQL, you only need to define the input Stream and output Stream, and a single SQL statement implements the business logic, which is very simple and clear.
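A hedged sketch of what that CQL might look like; the window clause and the source/sink names follow Huawei's StreamCQL examples but are approximate, and the stream and field names are illustrative:

```sql
CREATE INPUT STREAM ubt
    (page STRING, type STRING, action STRING, category STRING)
SOURCE KafkaInput
    PROPERTIES ( topic = "ubt_action", groupid = "cql-demo" );

CREATE OUTPUT STREAM result
    (page STRING, type STRING, cnt LONG)
SINK consoleOutput;

INSERT INTO STREAM result
SELECT page, type, COUNT(*) AS cnt
FROM ubt[RANGE 5 SECONDS BATCH]
GROUP BY page, type;

SUBMIT APPLICATION ubt_demo;
```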
We have also done some further work on top of Huawei's open-source code:
Added Redis, HBase, and Hive (small tables, loaded into memory) as Data Sources
Added HBase, MySQL / SQL Server, and Redis as Sinks for data output
Fixed a parsing error in the MultiInsert statement and fed the fix back to the community
Added support for In in the where statement
Added support for reading data from Ctrip's message queue Hermes
The biggest advantage of Streaming CQL is that it lets BI colleagues who cannot write Java conveniently implement real-time reports and applications with simple logic. For example, the vacation case described below was completed in roughly 70 lines; development and testing originally took about a week, but can now be finished in one day, greatly improving their development efficiency.
[case]
The vacation BU needs to count, in real time, the proportion of each user's visits to "independent tour", "group tour" and "semi-independent tour" products, to further enrich the user-profile data:
Data stream: UBT data
Data source: the product dimension table in Hive
Output: HBase
The second direction we are trying this year is JStorm. Storm's kernel is written in Clojure, which makes in-depth research and maintenance difficult. JStorm is an Alibaba open-source project fully compatible with Storm's programming model, with a kernel written entirely in Java, which makes follow-up research and deep investigation easier. Alibaba's JStorm team is very open and professional, and we have worked together to solve some problems encountered in use. Besides the Java kernel, JStorm also has some performance advantages over Storm, and it provides resource isolation and a Heron-like backpressure mechanism, so it copes better with message congestion.
We have moved roughly one third of our Storm applications to JStorm, currently on version 2.1. Some experience from the process to share:
The first point is some problems we encountered integrating with Kafka, which have been fixed in the new version:
In JStorm there are two different ways to implement a Spout: Multi Thread (the nextTuple and ack/fail methods are called from different threads) and Single Thread; the native Storm Kafka Spout must run in Single Thread mode. We ran into a problem with the Single Thread mode, which has been fixed in the new version.
The second point is that JStorm's metrics mechanism is completely incompatible with Storm's, so the relevant code had to be rewritten, mainly adapting the metrics in the Kafka Spout and in our Storm API, and using the MetricsUploader facility to write data to Dashboard and Graphite. In addition, we combined the two APIs behind a unified interface that is compatible with both environments, making it convenient for users to record custom metrics.
That is all I wanted to share. To finish, a brief summary of our overall architecture:
The bottom layer consists of the open-source message queue and real-time processing frameworks, together with some of Ctrip's monitoring and operations tools; the second layer is the API and services; and the top layer provides all of the functionality to users in the form of the Portal.