2025-02-24 Update From: SLTechnology News&Howtos
The Internal Mechanism of Yelp's PaaStorm

This article introduces the internal mechanism of PaaStorm, the stream-processing framework at the heart of Yelp's data pipeline: where its name comes from, why it exists, and how its runtime, Spolt abstraction, and offset-tracking scheme work.
What's the meaning of the name?
The name PaaStorm is actually a combination of PaaSTA and Storm. So what exactly does PaaStorm do? To answer this question, let's first look at the basic architecture of the data pipeline:
If you look mainly at the "Transformer" step, you will see that most of the messages stored in Kafka cannot be imported directly into the target system. Imagine a Redshift cluster used to store ad push data. The ad push cluster only wants to store a certain field from the upstream system (such as the average weight of a business) rather than the raw data, or it may want the data aggregated first. If the Redshift ad push cluster stored all of the upstream data, storage space would be wasted and system performance would suffer.
In the past, each service wrote complex MapReduce jobs to process the data before writing it to the target data store. However, these MapReduce jobs ran into the performance and scaling issues described above. One of the benefits of the data pipeline is that a consumer program can get the data in whatever form it needs, no matter what the upstream data originally looked like.
Reducing boilerplate code
We could have let each consumer program do the data transformation in its own way. For example, the ad push system could write its own transformation service that extracts viewing statistics from the business data in Kafka, and maintain that service itself. This approach worked well at first, but eventually we ran into scaling problems.
We want to provide a transformation framework based on the following considerations:
Much of the transformation logic is generic and can be shared among multiple teams. For example, convert flag bits into meaningful fields.
Such transformation logic usually requires a lot of boilerplate code, such as connecting to data sources and sinks, saving state, monitoring throughput, and recovering from failures. Such code should not be duplicated across services.
To ensure that the data can be processed in real time, the data conversion operation should be as fast as possible and should be stream-based.
The most natural way to reduce boilerplate code is to provide a transformation interface. A service implements the interface with its specific transformation logic, and our stream-processing framework does the rest of the work.
Using Kafka as the message bus
Originally PaaStorm was a Kafka-to-Kafka transformation framework, but it slowly evolved to support other types of end nodes as well. Using Kafka as PaaStorm's end node simplifies a lot of things: any service interested in the data can subscribe to a Topic and handle new messages, whether transformed or raw, regardless of who created the Topic. The transformed data is persisted according to Kafka's retention policy. And because Kafka is a publish-subscribe system, downstream systems can consume the data whenever they want.
Handle everything with Storm
With PaaStorm in place, how can we visualize the relationships between our Kafka Topics? Because data flows from some Topics to others in a source-to-sink manner, we can think of our topology as a directed acyclic graph:
Each node is a Kafka Topic, and the arrows represent the transformation operations that PaaStorm provides. The name "PaaStorm" becomes more meaningful at this point: like Storm, PaaStorm transforms data in real time from stream sources (like Spouts) through transformation modules (like Bolts).
PaaStorm internal mechanism
The core abstraction of PaaStorm is called a Spolt (a portmanteau of Spout and Bolt). As the name suggests, the Spolt interface defines two important things: an input data source, and some kind of processing applied to the message data from that source.
The following example defines the simplest Spolt:
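The original code listing did not survive extraction; below is a minimal Python sketch of such a Spolt, matching the description in the next paragraph. The `Spolt` base class and message shape are hypothetical stand-ins for Yelp's internal data pipeline client library.

```python
class Spolt:
    """Hypothetical base class: declares a source and a processing hook."""
    spolt_source = None  # which upstream Topic to consume from

    def process_message(self, message):
        raise NotImplementedError


class UppercaseNameSpolt(Spolt):
    # The upstream Kafka Topic this Spolt consumes from.
    spolt_source = "refresh_primary.business.abc123efg456"

    def process_message(self, message):
        # Messages are immutable: build a new payload rather than
        # mutating the one we received.
        new_payload = dict(message["payload"])
        new_payload["uppercase_name"] = new_payload["name"].upper()
        # In production, the schema id for the new message shape would
        # be obtained from the Schematizer service, never hard-coded;
        # that step is omitted here for brevity.
        yield {"topic": message["topic"], "payload": new_payload}
```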
The Spolt processes each message in the "refresh_primary.business.abc123efg456" Topic, adds a field containing the uppercase value of the 'name' field from the original message, and then emits the new version of the processed message.
It is worth mentioning that all messages in the data pipeline are immutable: to get a modified message, you create a new object. And because we are adding a new field to the message body (the "uppercase name" field), the schema of the new message has changed. In a production environment, the schema ID of a message should never be hard-coded. We rely on the Schematizer service to register modified messages and provide the appropriate schema.
By the way, the client library of the data pipeline provides several very similar ways to generate "spolt_source" from a combination of namespace, Topic name, source name, and schema ID. This makes it easy for a Spolt to find all the sources it needs and read data from them.
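As a rough illustration of that naming scheme, here is a tiny helper that assembles a source identifier from its parts. The exact format is an assumption inferred from the Topic name shown earlier ("refresh_primary.business.abc123efg456"), not Yelp's actual API.

```python
def make_spolt_source(namespace, source, schema_id=None):
    """Build a dotted source identifier (illustrative, not Yelp's API).

    e.g. namespace="refresh_primary", source="business",
    schema_id="abc123efg456".
    """
    parts = [namespace, source]
    if schema_id is not None:
        parts.append(str(schema_id))
    return ".".join(parts)
```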
What is the processing related to Kafka?
You may have noticed that the Spolt above contains no code that interacts with Kafka Topics. This is because in PaaStorm, all of the actual Kafka interfacing is handled by an internal runtime instance (which also happens to be named PaaStorm). The PaaStorm instance wires a specific Spolt to its source and destination, feeds messages to the Spolt for processing, and publishes the messages the Spolt outputs to the correct Topic.
Each PaaStorm instance is initialized with a Spolt. For example, the following command starts a process with the UppercaseNameSpolt defined above:
PaaStorm(UppercaseNameSpolt()).start()
This means that anyone interested in writing a new converter can simply define a new Spolt subclass without changing anything related to the PaaStorm runtime at all.
From an internal point of view, the main method of the PaaStorm runtime is surprisingly simple; in pseudocode:
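The pseudocode listing was lost in extraction; below is a runnable Python sketch matching the description in the next paragraph. The in-memory `Consumer` and `Producer` stubs are illustrative stand-ins for Yelp's internal Kafka client library.

```python
class Consumer:
    """Stub: iterates over an in-memory list of upstream messages."""
    def __init__(self, messages):
        self.messages = messages

    def __iter__(self):
        return iter(self.messages)


class Producer:
    """Stub: collects published messages instead of writing to Kafka."""
    def __init__(self):
        self.published = []

    def publish(self, message):
        self.published.append(message)


class PaaStorm:
    def __init__(self, spolt, consumer, producer):
        self.spolt = spolt
        self.consumer = consumer
        self.producer = producer
        # Throughput counters for monitoring, as described later.
        self.messages_consumed = 0
        self.messages_published = 0

    def start(self):
        # Wait for new messages in the upstream Topic, hand each one to
        # the Spolt, and publish whatever the Spolt emits downstream.
        for message in self.consumer:
            self.messages_consumed += 1
            for output in self.spolt.process_message(message):
                self.producer.publish(output)
                self.messages_published += 1
```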
The runtime first does some setup: it initializes the producer and consumer, as well as the message counters. Then it waits for new data in the upstream Topic. When new data arrives, the Spolt processes it; the Spolt outputs one or more messages, which the producer publishes to the downstream Topic.
Briefly, the PaaStorm runtime also provides features such as consumer registration and a heartbeat mechanism (called "tick"). For example, if a Spolt wants to flush its buffer periodically, it can use tick as the trigger.
About state saving
PaaStorm guarantees reliable recovery from failures: after a crash, we should resume consuming from the correct offset. Unfortunately, that correct offset is usually not that of the last message we consumed from the upstream Topic, because even though we consumed it, we may not yet have published its transformed version.
Therefore, the correct position to restart from is the position in the upstream Topic corresponding to the last message successfully published downstream. Once we know a message was delivered downstream, we need to know which upstream message it corresponds to, so that we can resume from there.
To make this possible, when PaaStorm's Spolt processes an original message, it attaches the message's Kafka offset in the upstream Topic to the transformed packet. The transformed message then passes this offset back in the producer's callback, so we know which upstream offset corresponds to each message in the downstream Topic. Because the callback is invoked only after the producer has successfully published the transformed message, meaning the original message has been fully processed, the consumer can safely commit that offset inside the callback. After a crash, we can then resume directly from the first upstream message that was not fully processed.
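The offset-tracking scheme described above can be sketched as follows. All names here are illustrative stand-ins, not Yelp's actual API: the key point is that the upstream offset rides along with the transformed message and is committed only in the publish-success callback.

```python
class OffsetTracker:
    """Commits an upstream offset only after the downstream publish
    succeeds, so a crash resumes from the first unprocessed message."""

    def __init__(self):
        self.committed_offset = None  # last safely-committed upstream offset

    def attach_offset(self, transformed, upstream_offset):
        # Carry the upstream Kafka offset alongside the transformed payload.
        transformed["upstream_offset"] = upstream_offset
        return transformed

    def on_publish_success(self, transformed):
        # Called by the producer only after a successful publish: the
        # original message is now fully processed, so committing its
        # upstream offset here is safe.
        self.committed_offset = transformed["upstream_offset"]
```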
As you can see from the pseudocode above, PaaStorm also counts the number of messages consumed and the number of messages published. This lets interested users check the throughput of the upstream and downstream Topics, making it easy to monitor the performance of any transformation. At Yelp, we send these statistics to SignalFX:
The SignalFX diagram can show the throughput of producers and consumers in a PaaStorm instance. In this example, the number of input and output messages does not match.
One benefit of keeping separate statistics for producers and consumers in PaaStorm is that we can compare the two throughputs to see where the bottleneck is. Without this granularity, it would be hard to find performance problems in the pipeline.