Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to count the number of views of each post in Reddit

2025-01-19 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Share

Shulou(Shulou.com)05/31 Report--

Today, I will talk to you about how to count the number of views of each post in Reddit. Many people may not know much about it. In order to make you understand better, the editor has summarized the following content for you. I hope you can get something according to this article.

Counting mechanism

We have four main requirements for the counting system:

The number of views on posts must be real-time or near real-time, not a daily or hourly summary.

If the same user visits the post multiple times in a short period of time, it can only be counted as a number of views.

There is a small error of several percent between the number of page views displayed and the actual number of views.

Reddit is the eighth most visited website in the world, and the system should be able to operate on the scale of a production environment with a delay of only a few seconds.

The difficulty of meeting all these four needs is much greater than it sounds. In order to count accurately in real time, we need to know if a user has ever visited this post. To know this information, we need to maintain a collection of visiting users for each post, and then check the collection each time the pageviews are calculated. A naive is implemented by storing the collection of visiting users in the hashMap in memory, with the post Id as the key.

This implementation is feasible for posts with low traffic, but once a post becomes popular, it is difficult to control when traffic increases dramatically. Some posts even have more than 1 million unique visitors! For such posts, storing the ID of individual visitors and frequently querying whether a user has visited it before can put a heavy burden on memory and CPU.

Because we could not provide an accurate count, we looked at several different cardinality estimation algorithms. There are two options that meet our needs:

One is the linear probability counting method, which is very accurate, but the memory required becomes linearly larger when the counting set becomes larger.

The second is the counting method based on HyperLogLog (hereinafter referred to as HLL). The complexity of HLL space is low, but the accuracy is not as good as linear counting.

Let's take a look at how much memory HLL saves. If we need to store 1 million unique visitors'ID, each user's ID is 8 bytes long, then we need 8 megabytes of memory in order to store individual visitors to a post. Conversely, adopting HLL will significantly reduce the memory footprint. Different HLL implementations consume different amounts of memory. If you use the implementation method of this article, it takes only 12 KB to store 1 million ID, which is 0.15% storage!

Big Data Counting: How to count a billion distinct objects using only 1.5KB of Memory-High Scalability-this article summarizes the above algorithms very well.

Many HLL implementations combine the above two algorithms. Linear counting is used when the set is small, and when the set size reaches a certain threshold, it is switched to HLL. The former is usually called "sparse" HLL, while the latter is called "dense" HLL. The implementation of this combination of the two algorithms has great benefits, because it ensures accuracy for both small and large sets, as well as moderate memory growth.

Now we have decided to use the HLL algorithm, but when choosing a specific implementation, we considered the following three different implementations. Because our data engineering team uses Java and Scala, we only consider the implementation of Java and Scala.

The Algebird provided by Twitter is implemented by Scala. Algebird is well documented, but they are not easy to understand the implementation details of sparse and dense HLL.

The HyperLogLog++, provided in stream-lib is implemented in Java. The code in stream-lib is well documented, but it is difficult to understand how to use it properly and adapt it to meet our needs.

Redis HLL implementation, which is our final choice. We believe that the implementation of HLLs in Redis is well documented and easy to configure, and the relevant API provided is also easy to integrate. Another benefit is that we can deploy with a dedicated server to reduce the pressure on performance.

Reddit's data pipeline depends on Kafka. When a user visits a blog post, an event is triggered, and the event is sent to the event collection server and persisted in Kafka.

After that, the counting system runs the two components sequentially. In our counting system architecture, the * * part is a consumer of Kafka, which we call Nazar. Nazar reads each event from the Kafka and uses a series of configured rules to determine whether the event needs to be counted. We chose the name simply because Nazar is an amulet in the shape of an eye, and the "Nazar" system, like the eye, keeps our counting system from being destroyed by malicious people. One of the reasons we don't count an event is that the same user visits repeatedly in a very short period of time. Nazar modifies the event, adds a Boolean flag indicating whether it should be counted, and puts the event back into Kafka.

Here comes the second part of the system. We call the consumer of the second Kafka Abacus, which is used to calculate the number of real page views and display the results on the website or client. Abacus reads events that have been processed by Nazar from Kafka and decides whether to skip this event or add it to the count based on the processing results of Nazar. If the result of processing in Nazar is that a count can be added, Abacus first checks to see if the post associated with this event already has a HLL counter in Redis. If it already exists, Abacus will send Redis a request for PFADD. If it does not exist, Abacus sends a request to the Cassandra cluster (which Cassandra uses to persist HLL counters and count values), and then sends a SET request to Redis. This usually happens when netizens visit an older post, when the counter for that post is likely to have expired in Redis.

To store the pageviews of old posts whose counters are out of date in Redis. Abacus periodically writes all the HLL in Redis and the pageviews of each post to the Cassandra cluster. In order to avoid cluster overload, we write in batches with a cycle of 10 seconds.

The following figure shows the approximate flow of the event flow:

After reading the above, do you have any further understanding of how to count the number of views of each post in Reddit? If you want to know more knowledge or related content, please follow the industry information channel, thank you for your support.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Database

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report