How Sina Weibo, Pinterest, and Viacom Use Redis

Updated 2025-07-02. Source: SLTechnology News&Howtos, Shulou (Shulou.com), report of 06/01.

This article summarizes how Sina Weibo, Pinterest, and Viacom use Redis databases. Many people have questions about these deployments in day-to-day operation, so the editor has gathered material from several sources and organized it into the practical overview below.

I. Sina Weibo: the largest Redis cluster in history

Tape is Dead, Disk is Tape, Flash is Disk, RAM Locality is King. - Jim Gray

Redis is not a mature replacement for memcache or MySQL, but it is a good complement to them in the architecture of large-scale Internet applications, and more and more applications are being re-architected around it. First, here are the current figures for our Redis platform:

220 billion commands/day

500 billion reads/day

50 billion writes/day

18 TB+ of memory

500+ servers in 6 IDCs

2000+ instances

This is probably one of the larger Redis deployments in China or abroad. Today we look at the Redis service platform mainly from the application's point of view.

Redis usage scenarios

1. Counting

Counting is covered in more detail in another article (http://www.xdata.me/?p=262), so the optimization of counting scenarios is not repeated here.

As expected, many engineers think that keeping all counters in memory is very expensive, and I would like to make my point with a chart here. People often imagine that a pure in-memory approach must cost a great deal, but the actual situation is often a little different:

Cost. An application with real throughput requirements will request DB and cache resources separately anyway. Many engineers who worry about DB write performance also queue DB updates asynchronously, and the utilization of these three resources (DB, cache, queue) is generally not high. Add the resources up and you may be surprised to find that the pure in-memory solution is actually leaner.

The KISS principle. This is very friendly to developers: I only need to set up one set of connection pools, with no data-consistency maintenance and no asynchronous queue to look after.

The risk of cache penetration. If the backend is a DB, it cannot provide high throughput, and a cache outage that is not handled properly ends in disaster.

Most storage requirements are small at the start.

2. Reverse cache

Weibo often sees hot spots, for example a short link that recently became popular and drew tens of thousands of clicks and redirects within a short time. Requirements tend to appear at exactly this point: during the redirect we want to determine quickly the user's level, whether any accounts are bound, gender, preferences, and so on, so that we can show different content or information.

The usual solution is memcache+MySQL, which supports high throughput as long as the ids being requested are legitimate. But when the ids are not under control and many junk users are calling, a large number of requests miss in memcache and fall through to the MySQL servers, so the connection count grows wildly, overall throughput drops, and responses slow down.

Here we can use Redis to hold the decision information for the full set of users, for example a string key holding the uid and an int value holding the type, and use it as a reverse cache. A user whose level and similar information is found quickly in Redis then proceeds to the MC+MySQL layer for the full record.

Of course, this is not the optimal approach; using Redis as a Bloom filter, for example, might save more memory.
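The reverse-cache lookup described above can be sketched roughly as follows. This is a minimal illustration, not Weibo's implementation: the key format `uid:<id>`, the function `lookup_profile`, and the `fake_backend` loader are all hypothetical. The decision store only needs a `get` method, which both a redis-py client and a plain dict provide, so a dict stands in for Redis here.

```python
from typing import Callable, Optional


def lookup_profile(uid: int, decision_store, load_full_profile: Callable[[int], dict]) -> Optional[dict]:
    """Consult the full uid -> type table first; only known uids reach the backend."""
    user_type = decision_store.get(f"uid:{uid}")
    if user_type is None:
        # Unknown or junk uid: answer immediately, never touching memcache/MySQL.
        return None
    profile = load_full_profile(uid)  # hypothetical memcache + MySQL read path
    profile["type"] = int(user_type)
    return profile


# A plain dict stands in for Redis so the sketch runs without a server.
decisions = {"uid:1001": b"3"}
backend_calls = []


def fake_backend(uid: int) -> dict:
    """Hypothetical loader that records which uids reached the backend."""
    backend_calls.append(uid)
    return {"uid": uid, "name": "example"}


known = lookup_profile(1001, decisions, fake_backend)
junk = lookup_profile(999999, decisions, fake_backend)
```

Because the table covers every user, a miss is a definitive "no such user", so junk ids are rejected before they can add load to the MySQL layer.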

3. Top 10 lists

Product and operations staff always want leaderboards of the newest, hottest, most clicked, most active items, and so on. Lists that are updated frequently are prone to cache invalidation if they are maintained with MC+MySQL. Since they take little memory, storing them in Redis works quite well.
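One common way to keep such a list in Redis is a sorted set, sketched below. The article does not say which data structure Weibo uses, so the sorted set is an assumption. `record_click` and `top_n` are hypothetical helpers that call `zincrby` and `zrevrange` as they exist on a redis-py client; `InMemorySortedSets` is only a small stand-in so the example runs without a server.

```python
class InMemorySortedSets:
    """Tiny stand-in for the two sorted-set calls used below (not a real Redis client)."""

    def __init__(self):
        self._data = {}

    def zincrby(self, name, amount, value):
        scores = self._data.setdefault(name, {})
        scores[value] = scores.get(value, 0.0) + amount
        return scores[value]

    def zrevrange(self, name, start, end, withscores=False):
        # Highest score first; ties broken by member, also descending.
        ranked = sorted(self._data.get(name, {}).items(),
                        key=lambda kv: (kv[1], kv[0]), reverse=True)
        stop = None if end == -1 else end + 1
        window = ranked[start:stop]
        return window if withscores else [member for member, _ in window]


def record_click(client, board, item_id):
    """Add one click to an item's score on the given leaderboard."""
    client.zincrby(board, 1, item_id)


def top_n(client, board, n=10):
    """Return the n highest-scoring items as (member, score) pairs."""
    return client.zrevrange(board, 0, n - 1, withscores=True)


client = InMemorySortedSets()
for item in ["a", "b", "a", "c", "a", "b"]:
    record_click(client, "hot:links", item)
leaders = top_n(client, "hot:links", n=2)
```

Every update adjusts the ranking in place, so there is no cached copy of the list to invalidate.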

4. Last Index

A user's recent access records are also a good fit for a Redis list: with lpush and lpop, old login records expire automatically, which is very friendly for developers.
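A capped "recent items" list can be sketched as below. Note the substitution: this sketch uses LPUSH followed by LTRIM, a common way to bound a Redis list, rather than the lpush/lpop pair named in the text. The key format `recent:<uid>` and the helper names are hypothetical, and `InMemoryLists` is a stand-in for the three redis-py list calls so the example runs without a server.

```python
class InMemoryLists:
    """Stand-in for lpush/ltrim/lrange (non-negative indexes, plus end=-1 in lrange)."""

    def __init__(self):
        self._lists = {}

    def lpush(self, name, *values):
        items = self._lists.setdefault(name, [])
        for value in values:
            items.insert(0, value)
        return len(items)

    def ltrim(self, name, start, end):
        items = self._lists.get(name, [])
        self._lists[name] = items[start:end + 1]
        return True

    def lrange(self, name, start, end):
        items = self._lists.get(name, [])
        return items[start:] if end == -1 else items[start:end + 1]


def record_visit(client, uid, page, keep=3):
    """Push the newest visit and drop everything beyond the newest `keep` entries."""
    key = f"recent:{uid}"
    client.lpush(key, page)
    client.ltrim(key, 0, keep - 1)


def recent_visits(client, uid):
    """Return the stored visits, newest first."""
    return client.lrange(f"recent:{uid}", 0, -1)


client = InMemoryLists()
for page in ["p1", "p2", "p3", "p4", "p5"]:
    record_visit(client, 42, page)
latest = recent_visits(client, 42)
```

The list never grows past `keep` entries, so old records fall off without a separate cleanup job.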

5. Relation List/Message Queue

I list these two together because both ran into some difficulties in practice. They did solve many of our problems at a certain stage, though, so I only describe them briefly here.

The message queue is written and consumed through the list's lpush and lpop interfaces; because performance is good, this solves most problems.

6. Fast transaction with Lua

Redis's Lua extension brings many more application scenarios to Redis. Several commands can be combined into a small atomic transaction or piece of update logic. For example, when a message push arrives, you can at the same time: 1. add an unread conversation to your own list; 2. add an unread message to your private messages; 3. finally, return a push-complete message to the sender. This whole layer of logic can run on the Redis server side.

Note, however, that Redis records the entire content of the Lua script in the AOF and sends it to the slaves, which is a significant overhead for disks and network cards.

7. Replacing memcache

Many tests and real applications have shown that:

Redis is not far behind memcache in performance, and its single-threaded model gives it strong scalability.

In many scenarios Redis uses less memory for the same data than memcache's slab allocation.

The data synchronization that Redis provides is in effect a powerful extension of what a cache can do.

Key points when using Redis

1. rdb/aof Backup!

More than 95% of our online Redis instances serve as back-end storage. We use Redis not only as a cache but also as a kind of k-v store that completely replaces the back-end storage service (MySQL), so its data is very important, and recovery is difficult after data corruption, loss, or operator error. Backups are therefore essential! To this end we use shared HDFS resources as our backup pool, so that the data a business needs can be restored at any time.

2. Small item & Small instance!

Because of Redis's single-threaded model (strictly speaking it is not single-threaded, but request processing is), a batch operation on a large list, sorted set, or hash set makes every other request wait, so for complex data structures the size of each single key-struct must be kept under control.

The memory capacity of a single Redis instance should also be strictly limited. The immediate problem with a large instance is that recovering from a failure, or rebuilding a slave, takes a long time. Worse, when Redis rewrites the AOF or saves an RDB it puts heavy, prolonged pressure on the system and uses extra memory, which can cause online failures such as the system running short of memory and performance suffering badly. On our online 96G/128G memory servers we do not recommend a single instance larger than 20G to 30G.

3. Be Available!

Redis Sentinel is the approach widely used in the industry.

http://www.huangz.me/en/latest/storage/redis_code_analysis/sentinel.html

http://qiita.com/wellflat/items/8935016fdee25d4866d9

About 2000 lines of C implement server status detection, automatic failover, and so on.

However, because our own architecture is often complex, or because we need to consider more angles, @Xu Qi eryk and I built the hypnos project.

Hypnos is the god of sleep in mythology; the name means that we engineers should not have to deal with any problems while we rest. :-)

Its working principle is as follows:

Talk is cheap, show me your code! Later I will write a separate blog post about the implementation of Hypnos in detail.

4. In Memory or not?

We find that when developers discuss back-end resource design, habit and a misunderstanding of the product's positioning often lead them to skip an assessment of what users actually do. The data may be historical, with only the most recent day ever accessed, and it is quite unreasonable to hand both the capacity of the historical data and the request volume of the last day over to an in-memory store.

So when deciding which data structure to store things in, measure the cost first: how much data must be held in memory, and how much of it really matters to users? This matters a great deal for back-end resource design, because a 1G dataset and a 1T dataset call for completely different designs.

Future plans

1. Slave sync transformation

We have modified the master-slave synchronization mechanism on all online instances. Borrowing the idea of MySQL Replication, we use rdb+aof+pos as the basis for synchronization. Here is a brief explanation of why the official psync does not meet our needs:

Suppose A has two slaves, B and C, that is A -> B & C. We then find that the master A needs a restart, or node A goes down outright, and B has to be switched to be the new master. If A, B, and C do not share rdb and aof information, C will still wipe its own data when it becomes a slave of B, because node C has only recorded its synchronization state with node A.

We therefore need a synchronization mechanism that can switch from the A -> B & C structure to the A -> B -> C structure. Although psync supports resuming a transfer from a breakpoint, it still cannot support a smooth handover when the master fails.

In fact, we already use synchronization with these features in our customized Redis counting service, and it works very well, easing the operations burden, but we still need to roll it out to all Redis services. If possible, we will also propose the slave sync improvements to official Redis.

2. A name system or proxy better suited to Redis

Careful readers will have noticed that besides using DNS as the naming system we also keep a record in ZooKeeper. Why not let users access a single system, choosing either zk or DNS?

The reason is simple. The naming system is a very important component, and DNS is a fairly complete naming system on which we have done a lot of improvement and trial and error, whereas zk's implementation is still relatively complex and we do not have fine-grained control over it. We are also still thinking about what to use as a naming system to better meet our needs.

3. Back-end data storage

With so much memory in use, cost optimization is certainly an important direction, and flash disks and distributed storage are also in our future plans.

II. Pinterest: Redis maintains tens of billions of relationships

Pinterest has become one of the craziest stories in Silicon Valley. In 2012 its PC-based business grew by 1047%, mobile adoption grew by 1698%, and the number of unique visits soared to 53.3 billion in March of that year. On Pinterest, people follow tens of billions of things, and every user interface has to ask whether a given board or user is followed, which makes for extraordinarily complex engineering problems. This gave Redis its opportunity to shine. After years of development, Pinterest has become a leader in media, social networking, and other fields. Its achievements include:

Its referral traffic is higher than that of Google+, YouTube, and LinkedIn combined.

Alongside Facebook and Twitter, it is one of the three most popular social networks.

Shoppers consult Pinterest before buying more often than they consult other sites.

As you might expect from the number of unique visits, Pinterest's scale places very high demands on its IT infrastructure.

1. Optimize the user experience through caching

Recently Abhi Khune, an engineering manager at Pinterest, shared his company's user-experience requirements and its experience with Redis. Even prolific app developers do not understand these features until they analyze the details of the site, so start with a general picture of the usage scenario: first, a pre-check is made for each fan mentioned; second, the UI must accurately display a user's follower and following list pages. Doing this efficiently requires a very high-performance architecture behind every click.

Inevitably, Pinterest's software engineers and architects were already using MySQL and memcache, but the caching solution still hit its limits, so the cache had to be extended to give a better user experience. In practice, the engineering team found that caching only helps if the user's sub-graph is already in the cache. Therefore everyone who uses the system needs to be cached, which amounts to caching the entire graph. Moreover, the answer to the most common query, "does user A follow user B", is usually no, yet a no is treated as a cache miss and triggers a database query, so they needed a new way to extend the cache. In the end the team decided to store the entire graph in Redis and serve the many lists from it.

2. Use Redis to store a large number of Pinterest lists

Pinterest chose Redis as the solution, pushing performance to the level of an in-memory database, and stores several types of lists for its users:

List of users you follow

List of boards you follow

List of your followers

List of users who follow your boards

List of a user's boards that you do not follow

Followers and non-followers of each board

Redis stores all of the above lists for Pinterest's 70 million users, in effect storing the entire follower graph, sharded by user ID. Because the data in these lists can be viewed by type, the analytical summaries are stored in and accessed through a system that looks more like a transactional one. Pinterest currently limits each user's likes to 100,000. A rough estimate: if each user follows 25 boards, there are 1.75 billion relationships between users and boards. More importantly, these relationships grow every day as the system is used.

3. Pinterest's Redis architecture and operations

According to one of Pinterest's founders, Pinterest wrote its application in Python with a customized Django, and kept doing so until it had 18 million users and 410 TB of user data. Although several stores hold the data, engineers split it into 8192 virtual shards by user ID; each shard runs on one Redis DB, and a single Redis instance runs multiple Redis DBs. To make full use of the CPU cores, multiple single-threaded Redis instances run on the same host.
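The shard layout described above can be sketched as follows. Only the figure of 8192 virtual shards comes from the article; the modulo placement rule, the even split of shards across instances, the instance addresses, and the function names are all assumptions made for illustration.

```python
from typing import List, Tuple

VIRTUAL_SHARDS = 8192  # figure quoted in the article


def shard_for_user(user_id: int) -> int:
    """Assumed placement rule: user ID modulo the number of virtual shards."""
    return user_id % VIRTUAL_SHARDS


def locate(shard: int, instances: List[str]) -> Tuple[str, int]:
    """Map a virtual shard to (instance address, Redis DB index).

    Assumes the instance count divides VIRTUAL_SHARDS evenly.
    """
    dbs_per_instance = VIRTUAL_SHARDS // len(instances)
    return instances[shard // dbs_per_instance], shard % dbs_per_instance


# Hypothetical instance addresses, for illustration only.
instances = ["redis-a:6379", "redis-b:6379", "redis-c:6379", "redis-d:6379"]
shard = shard_for_user(70_000_001)
instance, db = locate(shard, instances)
```

The point of a fixed number of virtual shards is that a user's shard never changes; when capacity is added, only the shard-to-instance mapping has to move.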

Because the entire dataset lives in memory, Redis persists incoming writes to Amazon EBS every second. Scaling happens in two ways. First, to hold utilization at 50%, half of the Redis instances running on a machine are moved to a new machine through a master-slave switch. Second, nodes and shards are added. The whole Redis cluster uses a master-slave configuration, with the slave treated as a hot backup. When a master fails, its slave is promoted to master immediately and a new slave is added, with ZooKeeper coordinating the whole process. In addition, they run BGSAVE every hour and keep the result on Amazon S3 for more durable storage. This Redis operation runs in the background, and Pinterest then uses the data for MapReduce and analytics jobs.

III. Viacom: an inventory of Redis use cases in the system

Viacom is one of the largest media groups in the world, and it faces one of the biggest data problems: how to handle ever-growing dynamic video content.

Looking at how this challenge is growing: the world's total data volume reached the ZB level in 2012, while in 2010 alone the data generated on the Internet grew by 2.8 ZB, most of it unstructured, including videos and pictures.

Viacom, which covers MVN (formerly MTV Networks), Paramount, and BET, is a true media giant that backs many popular sites, including The Daily Show, Tosh.0, South Park Studios, GameTrailers.com, and others. As a media company, the documents, pictures, and video clips on these sites are updated constantly. Without further ado, here is the Redis practice shared by Michael Venezia, a senior architect at Viacom:

1. Background of Viacom's website architecture

For Viacom, distributing content across multiple sites means it has to pay close attention to scale, and to get content quickly to the right users it must also pay attention to the relationships between pieces of content. Even individual sites such as The Daily Show, Nickelodeon, Spike, or VH1 can average 10 million page views a day, with peak traffic 20 to 30 times the average. With real-time requirements on top, dynamic scale and speed have become foundations of the architecture.

Beyond dynamic scale, the service must also infer a user's preferences from the video being watched or from the user's location. A page might, for example, link a single video clip to local promotions, to other parts of the video series, or to related videos. To keep users on the site longer, they built a software engine that assembles pages automatically from detailed metadata and can recommend additional content based on the user's current interests. Because interests keep changing, the data comes in many types and is graph-like, which in practice means a great many joins.

This helps reduce the number of copies of large files such as videos. For example, one record in the data store is the South Park episode "Cartman gets an Anal Probe", which may also appear on the German-language site. The video is the same, but English-speaking and German-speaking users search for different words. A copy of the metadata is turned into search results that point to the same video. So while American users search for the original title, German visitors may use the translated title, "Cartman und die Analsonde", on the German site.

This metadata can wrap other records or objects, and content can also vary with context: different rule sets restrict content for different geographic locations or requesting devices.

2. Viacom's implementation approach

Although many organizations solve this problem with an ORM and a traditional relational database, Viacom takes a very different approach.

In essence, they cannot afford to hit the database directly at all. First, most of what they handle is streaming data, and they tend to use Akamai to distribute content geographically. Second, depending on the complexity of the page, tens of thousands of objects may have to be fetched. Fetching that much data obviously hurts performance, so a data service serves the objects as JSON. Naturally, caching these JSON objects directly affects site performance, and the cache has to be updated dynamically whenever content or the relationships between content change.

Viacom solves this with object primitives and superclasses. Taking South Park again as the example: a primitive "episode" class holds all the information about the episode, and a "super object" helps locate the actual video object. The superclass idea is really useful for automatically building low-latency pages, and superclasses help map and save primitive objects to the cache.

3. Why does Viacom use Redis?

Every time Viacom uploads a video clip, the system creates a primitive object and associates it with a superclass. With every change, they need to re-evaluate each change to the primitive object and update all composite objects. The system also has to invalidate the corresponding URL requests in Akamai. The existing architecture combined with the need for more agile management pushed Viacom toward Redis.

Viacom's stack is mainly PHP, so the solution had to support PHP. They first chose memcached as the object store, but it does not support hashmaps very well, and they also needed more efficient re-evaluation in the invalidation step, that is, a better understanding of content dependencies. In essence, they need to keep up with dependency changes during invalidation. So they chose the combination of Redis and Predis.

The team used Redis to build the dependency graphs for southparkstudios.com and thedailyshow.com, and after that success they began looking for other scenarios where Redis fits.

Other usage scenarios for Redis

Clearly, if Redis is used to build the dependency graph, it makes sense to use it for object handling too, and this became the second Redis use case the architecture team chose. Redis's replication and persistence features also won over Viacom's operations team, so after several development cycles Redis became the main data and dependency store for their sites.

The last two use cases are a buffer for behavior tracking and view counts. In the changed architecture, Redis data is flushed to MySQL every few minutes, while view counts are stored and tallied in Redis. Redis is also used to compute popularity, a score based on the number of views and when they happened: the more recent views a video has, the more popular it is. Running this calculation every 10 to 15 minutes over so much content is certainly not a strength of a traditional relational database such as MySQL, and Viacom's reason for using Redis is simple: run a Lua batch job on the Redis instance that stores the view data and compute all the score tables. The results are copied to another Redis instance to serve related product queries. Another backup is kept in MySQL for later analysis. This combination cuts the time the process takes by a factor of 60.
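To make the idea of a recency-weighted popularity score concrete, here is a small sketch. The article does not give Viacom's formula and says the real job runs as a Lua batch on Redis, so everything here is an assumption: the exponential half-life decay, the one-hour default, and the function name `popularity` are illustrative only, and the sketch is plain Python rather than Lua.

```python
from typing import Iterable


def popularity(view_timestamps: Iterable[float], now: float,
               half_life_seconds: float = 3600.0) -> float:
    """Sum of per-view weights, where a view's weight halves every half-life."""
    return sum(0.5 ** ((now - ts) / half_life_seconds) for ts in view_timestamps)


now = 10_000.0
fresh_video = popularity([now, now, now - 3600.0], now)          # mostly recent views
stale_video = popularity([now - 7200.0] * 3, now)                # same count, older views
```

Both videos have three views, but the one viewed more recently scores higher, which is the behavior the paragraph above describes.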

Viacom also uses Redis to store step-by-step job information: jobs are inserted into a list, and workers use the BLPOP command to grab the task at the head of the queue. In addition, zsets are used to merge content from many social networks, such as Twitter and Tumblr, and Viacom synchronizes multiple content management systems through Brightcove video players.
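A list-backed job queue of this kind can be sketched as follows. The helpers `enqueue` and `drain`, the queue name `jobs`, and the JSON payloads are hypothetical; using RPUSH on the producer side (so that BLPOP returns the oldest job first) is an assumption, since the article only says jobs are inserted into a list. `rpush` and `blpop` are called the way a redis-py client exposes them, and `InMemoryQueue` is a non-blocking stand-in so the example runs without a server.

```python
import json


class InMemoryQueue:
    """Stand-in for rpush/blpop; returns None instead of actually blocking."""

    def __init__(self):
        self._lists = {}

    def rpush(self, name, *values):
        items = self._lists.setdefault(name, [])
        items.extend(values)
        return len(items)

    def blpop(self, keys, timeout=0):
        name = keys if isinstance(keys, str) else keys[0]
        items = self._lists.get(name)
        if not items:
            return None  # a real client would wait up to `timeout` seconds
        return name, items.pop(0)


def enqueue(client, queue, job):
    """Append a job to the tail of the list."""
    client.rpush(queue, json.dumps(job))


def drain(client, queue, handle, timeout=1):
    """Pop jobs from the head until the queue stays empty; return how many ran."""
    processed = 0
    while True:
        item = client.blpop(queue, timeout=timeout)
        if item is None:
            break
        _key, payload = item
        handle(json.loads(payload))
        processed += 1
    return processed


client = InMemoryQueue()
for job_id in (1, 2, 3):
    enqueue(client, "jobs", {"id": job_id})
handled = []
count = drain(client, "jobs", handled.append)
```

Against a real server, BLPOP lets idle workers wait on the list instead of polling it.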

Across these use cases, almost every kind of Redis command is used: sets, lists, zsets, hashmaps, scripts, counters, and so on. Redis has become an indispensable part of Viacom's scalable architecture.
