
How the Twitter Server Processes Data Requests


This article focuses on how the Twitter server processes data requests. The approach described here is simple, fast, and practical; interested readers may wish to take a look and follow along.

I. The core business of Twitter

The core business of Twitter lies in following and being followed.

(1) Following: go to your personal home page and you will see the messages (no more than 140 characters) posted by the people you follow. This is the process of following.

(2) Followed (being followed): you post a message, and the people who follow you will see it. This is the process of being followed.

II. The business logic of Twitter

The business logic of Twitter is not complex.

For the following business: look up whom the user follows and the messages those people have posted.

For the followed business: the front-end JS polls the back end to check whether the users you follow have any new messages or updates (the timeliness of updates depends on the polling interval).
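As a rough illustration of this polling loop, here is a minimal Python sketch (the real front end uses JavaScript); the endpoint URL, parameter names, and response fields are hypothetical.

```python
import time
import requests  # third-party HTTP client

# Hypothetical endpoint and fields -- the real front end polls Twitter's own
# backend from JavaScript; this only illustrates the polling idea.
POLL_URL = "https://example.com/api/home_timeline"
POLL_INTERVAL_SECONDS = 10

def poll_for_updates(since_id=None):
    """Repeatedly ask the back end for messages newer than since_id."""
    while True:
        params = {"since_id": since_id} if since_id else {}
        resp = requests.get(POLL_URL, params=params, timeout=5)
        resp.raise_for_status()
        messages = resp.json()                  # assume a JSON list, newest first
        if messages:
            since_id = messages[0]["msgid"]     # remember the newest id seen
            for msg in messages:
                print(msg["author_id"], msg["msg"])
        time.sleep(POLL_INTERVAL_SECONDS)       # update timeliness depends on this
```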

III. Three-tier architecture

In website architecture design, the traditional approach is a three-tier structure. "Traditional" does not mean "outdated": trendy technologies are not yet mature, and the traditional way is more robust.

(1) Presentation layer (presentation tier): the Apache web server. Its main task is to parse the HTTP protocol and dispatch requests to the logic layer.

(2) Logic layer (logic tier): Mongrel Rails servers, using off-the-shelf Rails modules to reduce the workload.

(3) Data layer (data tier): MySQL.

Presentation layer: the presentation layer has two main functions: (1) HTTP protocol handling (HTTP processor); (2) dispatching (dispatcher). Of course, it is not only browsers that access Twitter but also mobile phones, and because other protocols may be involved, there may be other processors as well.

Logic layer: when a user publishes a message, the following steps run in turn: (1) save the message to the msg table; (2) check the user relation table and find the user's followed_ids; (3) get the status of the users in followed_ids; (4) for the ids that are online, push the message into a queue; (5) take the msg off the queue and update the home pages of those ids. The queue here can be implemented in many ways, for example with Apache MINA; the Twitter team implemented its own, called Kestrel.
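Here is a minimal in-memory sketch of those five steps, with Python stand-ins for the msg table, the relation table, and the queue; the real system uses MySQL, memcached, and Kestrel as described above.

```python
from collections import deque

# In-memory stand-ins for the msg table, the relation table and the queue.
msg_table = {}          # msgid -> (author_id, text)
relation_table = {}     # author_id -> list of follower ids (followed_ids)
online_users = set()
delivery_queue = deque()
home_pages = {}         # user_id -> list of msgids, newest first
_next_msgid = 0

def publish(author_id, text):
    """Steps (1)-(4) of the logic layer for publishing a message."""
    global _next_msgid
    _next_msgid += 1
    msgid = _next_msgid
    msg_table[msgid] = (author_id, text)                      # (1) save the message
    followed_ids = relation_table.get(author_id, [])          # (2) who follows the author
    online_ids = [u for u in followed_ids if u in online_users]  # (3) who is online
    for uid in online_ids:
        delivery_queue.append((uid, msgid))                   # (4) push into the queue

def drain_queue():
    """Step (5): take messages off the queue and update followers' home pages."""
    while delivery_queue:
        uid, msgid = delivery_queue.popleft()
        home_pages.setdefault(uid, []).insert(0, msgid)
```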

Data layer: the core of Twitter is users, messages, and user relationships. Around these cores, the schema of its core data is: (1) a user table: userid, name, pass, status, …; (2) a message table: msgid, author_id, msg, time, …; (3) a user relation table: relationid, following_ids, followed_ids.
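A sketch of the three core tables, using SQLite purely for illustration; Twitter's actual data layer is MySQL and its real schema is not public.

```python
import sqlite3

# Illustrative versions of the three core tables described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    userid   INTEGER PRIMARY KEY,
    name     TEXT,
    pass     TEXT,
    status   TEXT
);
CREATE TABLE messages (
    msgid     INTEGER PRIMARY KEY,
    author_id INTEGER REFERENCES users(userid),
    msg       TEXT,            -- no more than 140 characters
    time      TIMESTAMP
);
CREATE TABLE relations (
    relationid    INTEGER PRIMARY KEY,
    following_ids TEXT,        -- ids this user follows
    followed_ids  TEXT         -- ids that follow this user
);
""")
```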

In any case, the architectural framework is clear.

IV. Cache = cash: caching equals revenue

The use of cache is very important to the architecture of a large website. The response speed of the site is the most obvious factor affecting user experience, and the biggest enemy of response speed is disk I/O. Twitter engineers believe the average response time of a site with a good experience should be around 500 ms, and the ideal is 200-300 ms. The use of cache is a highlight of the Twitter architecture.

Where is cache needed? The more frequent the I/O, the more cache is needed, and the database is where I/O is most frequent. Is it necessary to put the three core tables entirely in memory? What Twitter does is split the tables and load the most frequently accessed fields into the cache.

(1) Vector cache and row cache (i.e., an id-array cache and a message-row cache)

Vector cache: it holds the msgids of newly published messages and the ids of the related authors. These ids are accessed very frequently, and the cache that stores them is called the vector cache.

Row cache: it caches the rows of the message bodies. When memory is limited, the vector cache is given priority. In practice, the hit rate of the vector cache is 99%, and that of the row cache is 95%.
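A minimal sketch of the vector cache and row cache using pymemcache against a local memcached instance; the key naming and JSON layout are illustrative assumptions, not Twitter's actual scheme.

```python
import json
from pymemcache.client.base import Client  # pip install pymemcache

# Assumes a memcached instance on localhost:11211; keys and layout are illustrative.
cache = Client(("localhost", 11211))

def cache_vector(user_id, msgids):
    """Vector cache: store only the ids of a user's recent messages."""
    cache.set(f"vector:{user_id}", json.dumps(msgids))

def cache_row(msgid, author_id, text):
    """Row cache: store the message body itself."""
    cache.set(f"row:{msgid}", json.dumps({"author_id": author_id, "msg": text}))

def read_timeline(user_id):
    """Read ids from the vector cache, then fetch bodies from the row cache."""
    raw = cache.get(f"vector:{user_id}")
    if raw is None:
        return []                                   # cache miss: fall back to MySQL
    msgids = json.loads(raw)
    rows = cache.get_many([f"row:{m}" for m in msgids])
    return [json.loads(v) for v in rows.values()]
```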

(2) fragment cache and page cache

Besides web pages (the web channel), Twitter is also accessed from mobile phones (the API channel), which accounts for 80% of total traffic. So in addition to the MySQL cache, the center of gravity of caching is on the API channel. The main body of a phone screen is a screenful of messages, so you might as well divide the whole page into several parts, each corresponding to one or a few messages; these are fragments. For popular authors, caching the fragments of their pages improves the efficiency of reading their messages; this is the mission of the fragment cache. People also visit popular authors' home pages; this is the mission of the page cache. Actual results show the hit rate of the fragment cache is 95% and that of the page cache is 40%. Although the page cache's hit rate is lower, because it caches whole home pages it takes up a lot of space, so to prevent the two caches from affecting each other, they are deployed on different physical machines. Both the fragment cache and the page cache use memcached.

(3) HTTP accelerator

The caching problem of the web channel also needs to be solved. Analysis showed that the pressure on the web channel mainly comes from search: in the face of an emergency, readers search for relevant information, regardless of whether the authors of that information are people they follow. To reduce search pressure, the search keywords can be cached together with the search results. Here Twitter's engineers used Varnish. Interestingly, Varnish is usually deployed in front of the web server, so that requests hit Varnish first and only then the web server; Twitter's engineers instead put Varnish behind the Apache web server, because they considered Varnish complex to operate and were afraid a Varnish crash would paralyze the system, so they adopted this conservative deployment. Twitter has not disclosed Varnish's hit rate, but claims the load of the entire site dropped by 50% after Varnish was introduced.
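The idea of caching search keywords together with their results can be sketched in a few lines of Python; this is only an in-process stand-in for what Varnish does at the HTTP layer, and the TTL and function names are assumptions.

```python
import time

# keyword -> (expires_at, results): a tiny stand-in for keyword/result caching.
_search_cache = {}
SEARCH_TTL_SECONDS = 60

def search_db(keyword):
    """Stand-in for the expensive backend search."""
    return [f"result for {keyword}"]

def cached_search(keyword):
    now = time.time()
    hit = _search_cache.get(keyword)
    if hit and hit[0] > now:
        return hit[1]                            # serve from cache
    results = search_db(keyword)                 # miss: do the expensive search
    _search_cache[keyword] = (now + SEARCH_TTL_SECONDS, results)
    return results
```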

V. Flood fighting needs isolation

Another highlight of the Twitter architecture is its message queue: isolating user operations and smoothing traffic peaks.

When a restaurant is full, new customers cannot be served immediately, but they are not shut out; they are asked to wait in the lounge for now.

When a user visits Twitter, the request is received by the Apache web server, but Apache cannot accept an unlimited number of users. What do you do when, as on January 20, 2009 during Obama's inaugural address, Twitter traffic soars?

In the face of a flood peak, how do you keep the website from collapsing? Accept quickly, but postpone service.

Apache receives the request and forwards it to Mongrel, which actually handles it, while Apache frees itself up for the next user. However, the number of users Apache can accept is always limited; its concurrency is bounded by the number of worker processes Apache can accommodate. The internals of Apache are not examined here.

VI. Data flow and control flow

Accepting quickly and postponing service is only a delaying tactic that prevents users from receiving a 503 (Service Unavailable).

The real flood control capacity is reflected in two aspects: flood storage and flood discharge:

(1) Twitter has a huge memcached cluster, which provides large-capacity flood storage.

(2) Twitter's own Kestrel message queue serves as the means of drainage and flood discharge, transmitting control instructions (drainage and channeling). When the flood peak arrives, Twitter controls the data flow and disperses the data to multiple machines in time, avoiding concentrated pressure that would paralyze the system.
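Kestrel is reported to expose a memcached-compatible text protocol, where set enqueues an item and get dequeues one, so an ordinary memcached client can drive it. The host, port (22133 is the commonly cited default), and queue names below are assumptions.

```python
from pymemcache.client.base import Client  # pip install pymemcache

# Assumes a Kestrel-style queue server on localhost:22133 speaking the
# memcached text protocol: "set" enqueues an item, "get" dequeues one.
queue = Client(("localhost", 22133))

def enqueue(queue_name, msgid):
    queue.set(queue_name, str(msgid))     # append msgid to the named queue

def dequeue(queue_name):
    item = queue.get(queue_name)          # remove and return the next item, or None
    return int(item) if item is not None else None
```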

The following example illustrates the internal process of Twitter. Assume there are two authors who send messages through their browsers, and one reader who also reads their messages through a browser.

(1) The author logs in to the Apache web server; Apache assigns a worker process to serve him: log in, check the id, write the cookie, and so on.

(2) The author uploads the newly written message. Apache forwards the author id, the message, and so on to Mongrel, then waits for Mongrel's reply so it can update the author's home page with the new message.

(3) After Mongrel receives the message, it allocates a msgid and caches the msgid together with the author id in the vector memcached. At the same time, Mongrel asks the vector memcached who follows the author; if the cache does not hit, it goes to the backend MySQL to look it up and caches the result. The reader ids are returned to Mongrel, which then caches the msgid and the message text in the row memcached.

(4) Mongrel notifies the Kestrel message-queue server; each author and reader has a queue (created if it does not already exist). Mongrel puts the msgid into each reader's queue and into the author's own queue.

(5) Some Mongrel instance, while processing the queue of a given id, adds this message to that user's home page. (6) Mongrel sends the author's updated home page to the waiting front-end Apache, and Apache returns it to the browser.
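A sketch of step (5), assuming the queue is a Kestrel-style memcached-compatible server and the cached home pages live in memcached; hosts, ports, key formats, and the cap are illustrative.

```python
import json
from pymemcache.client.base import Client  # pip install pymemcache

cache = Client(("localhost", 11211))     # the vector/row memcached (assumed local)
kestrel = Client(("localhost", 22133))   # Kestrel's memcached-compatible port (assumed)

def update_home_page(user_id, msgid, limit=800):
    """Prepend msgid to the user's cached home page; the cap is arbitrary here."""
    key = f"home:{user_id}"
    raw = cache.get(key)
    timeline = json.loads(raw) if raw else []
    timeline.insert(0, msgid)
    cache.set(key, json.dumps(timeline[:limit]))

def process_user_queue(user_id):
    """Step (5): drain this user's queue and fold each msgid into their home page."""
    while True:
        item = kestrel.get(f"queue:{user_id}")   # Kestrel "get" dequeues one item
        if item is None:
            break
        update_home_page(user_id, int(item))
```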

VII. Flood peak and cloud computing

Needless to say, when the flood peak cannot be carried, the only option is to add machines. Where do the machines come from? Rent them from cloud-computing platform companies. Of course, the equipment only needs to be rented during the flood peak, which saves money.

VIII. The compromise between push and pull

As you can see, Mongrel's workflow is:

(1) Once the relevant ids are put into the vector memcached and row memcached, the message counts as published; Mongrel is not responsible for storing it in the MySQL database.

(2) Once the relevant msgid is put into the Kestrel message queue, the message counts as pushed; Mongrel does not notify the author or reader in any other way, leaving them to pull the message themselves.

The above working mode reflects the "split" concept in Twitter's architecture design:

(1) split a complete process into independent sub-processes, so that each piece of work can be handled by a separate service (the three-tier architecture is itself a split);

(2) have multiple machines collaborate, refining the data flow and control flow and emphasizing their separation.

This separation of Twitter's business processes is an event-driven design, mainly reflected in two aspects:

(1) the separation of Mongrel and MySQL: Mongrel does not operate on MySQL directly, but entrusts that entirely to memcached;

(2) the logical separation of upload and download: instructions are transmitted only through the Kestrel queue.

Users post content on Twitter all the time, and Twitter's job is to plan how to organize the content and send it to users' fans.

Real-time is the real challenge, and presenting the message to fans within 5 seconds is the goal at this stage.

Delivery means taking content that is put onto the Internet and sending and receiving it as quickly as possible.

Delivery puts persistent data into the storage stack, pushes notifications, and triggers emails; iOS, BlackBerry, and Android phones can be notified, as can SMS.

Twitter is the largest active information transmitter in the world.

Recommendation is a great driving force for content generation and rapid dissemination.

There are two main timelines: the user timeline and the home timeline.

The user timeline is the content posted by a specific user.

The home timeline is all the content posted over a period of time by the users you follow.

Business rules are applied online: an @reply to someone you do not follow is filtered out, and retweets can be filtered out as well.

Doing this at scale is a challenge to Twitter's systems.

1. Pull mode

Targeted timelines, such as the twitter.com home page and the home_timeline API: you ask for them and get the data. A lot of the traffic is pull requests: data is fetched from Twitter through REST API requests.

Queried timelines: the search API. A query returns all matching tweets as quickly as possible.
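A sketch of the pull model from a client's point of view, written in the style of the old v1.1 REST endpoints; the endpoint path, authentication scheme, and token below are assumptions, and the live API differs.

```python
import requests  # third-party HTTP client

# Pull model: the client asks for its home timeline over REST.
BASE = "https://api.twitter.com/1.1"     # old-style base URL, for illustration only
TOKEN = "YOUR_BEARER_TOKEN"              # placeholder credential

def pull_home_timeline(count=20):
    resp = requests.get(
        f"{BASE}/statuses/home_timeline.json",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"count": count},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()                   # a list of tweet objects
```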

2. Push mode

Twitter runs the largest real-time event system, with an egress bandwidth of 22 MB/s.

Establish a connection with Twitter and it will push all messages to you within 150 ms.

At almost any given time, there are about a million connections open to the push service cluster.

Messages are sent out through this egress just as with search; all public messages are sent this way.

Can all of it be pushed to every client? No; it actually cannot handle that much.

User stream connections: both TweetDeck and the Mac version of Twitter go through here. When you log in, Twitter checks your social graph and pushes only messages from the people you follow, recreating the home timeline over the persistent connection rather than reusing the same precomputed timeline.

For the query API, Twitter holds a continuous query; if a new tweet is posted and it meets the query criteria, the system sends that tweet down the appropriate connection.

3. The pull-based timeline, from a high-level view:

A short message (tweet) comes in through the write API, passing through load balancing and TFE (the Twitter front end), along with some other facilities not covered here.

This is a very direct path: the home timeline is fully precomputed, and all the business logic is executed by the time the short message comes in.

Then comes the fanout (distribution of the short message) process. Incoming short messages are placed on a large Redis cluster, with each short message replicated on three different machines; a large number of machine failures occur at Twitter every day.

Fanout queries the Flock-based social graph service. Flock maintains the follower and following lists.

Flock returns the recipients' social graph, and the process then starts traversing all the timelines stored in the Redis cluster.

The Redis cluster has several terabytes of memory.

Writes are pipelined to 4K destinations at a time.

Use the native linked list structure in Redis.

Suppose you send a short message and you have 20K fans. What the fanout background process has to do is find where those 20K users live in the Redis cluster, and then start injecting the short message's ID into all of their lists. So for each write of one short message, there are 20K writes across the Redis cluster.

What is stored is the ID of the short message, the ID of the user who wrote the original, and 4 bytes of flags indicating whether it is a retweet, a reply, or something else.

Your home timeline lives in the Redis cluster and is 800 entries long; if you page back far enough, you hit that upper limit. Memory is the resource constraint that determines how long the current set of short messages can be.
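A sketch of this fanout using redis-py: one pipelined pass pushes the tweet id onto every follower's list and trims each list to the 800-entry cap; the connection details, key names, and entry format are illustrative assumptions.

```python
import redis  # pip install redis

# Fanout into Redis native lists, as described above.
r = redis.Redis(host="localhost", port=6379)

HOME_TIMELINE_MAX = 800   # the cap on cached home-timeline entries

def fan_out(msgid, author_id, follower_ids, flags=0):
    """Push one tweet id onto every follower's home-timeline list."""
    entry = f"{msgid}:{author_id}:{flags}"         # tweet id, author id, small flag field
    pipe = r.pipeline()
    for uid in follower_ids:
        key = f"home_timeline:{uid}"
        pipe.lpush(key, entry)                     # prepend to the Redis linked list
        pipe.ltrim(key, 0, HOME_TIMELINE_MAX - 1)  # keep only the newest 800 entries
    pipe.execute()                                 # one pipelined round trip
```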

Each active user is stored in memory to reduce latency.

Active users are Twitter users who have logged in within the last 30 days, and this criterion can change depending on the usage of Twitter's cache.

Only the timeline of your home page will be stored on disk.

If you fall out of the Redis cluster, you go through a process called rebuilding.

Rebuilding queries the social graph service to find out whom you follow, hits the disks for each of them, and puts the results back into Redis.

MySQL handles the disk storage, via Gizzard; Gizzard abstracts SQL transactions and provides global replication.

Because everything is replicated three times, when a machine has a problem there is no need to rebuild the timelines on that machine in every data center.

If one short message is a retweet of another, a pointer to the original short message is stored.

When you query your home timeline, the Timeline Service is called; it only needs to hit one of the machines your timeline is stored on.

Effectively, three different hash rings are run, because your timeline is stored in three places.

Whichever replica answers first wins, and the result is returned as fast as possible.
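A sketch of "ask all the replicas, return the fastest answer" using redis-py and a thread pool; the replica hosts and key naming are assumptions.

```python
import concurrent.futures
import redis  # pip install redis

# Three assumed replicas holding copies of the same home-timeline lists.
REPLICAS = [redis.Redis(host=h, port=6379) for h in ("redis-a", "redis-b", "redis-c")]

def read_home_timeline(user_id, limit=20):
    key = f"home_timeline:{user_id}"
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=3)
    futures = [pool.submit(r.lrange, key, 0, limit - 1) for r in REPLICAS]
    try:
        # Return whichever replica answers first; failed replicas are skipped.
        for fut in concurrent.futures.as_completed(futures):
            try:
                return fut.result()
            except redis.RedisError:
                continue
        return []
    finally:
        pool.shutdown(wait=False)   # don't block on the slower replicas
```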

The trade-off is that fanout takes more time, but the read path is fast: about 2 seconds from a cold cache to the browser, and roughly 400 ms for an API call.

Because the timeline contains only short-message IDs, the messages must be "synthesized" (hydrated): the text of each one has to be looked up. Since a whole group of IDs can be fetched with a single multiget, the short messages can be retrieved from T-bird in parallel.

Gizmoduck is the user service and Tweetypie is the short-message object service. Each service has its own cache: the user cache is a memcached cluster holding the basic information of all users, and Tweetypie stores roughly the last month and a half of short messages in its memcached cluster. These are exposed to internal users.
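A sketch of the hydration step: given timeline entries that hold only ids, multi-fetch the tweet bodies and author profiles from their respective caches. The hosts, key formats, and entry layout are assumptions, not Tweetypie's or Gizmoduck's real interfaces.

```python
import json
from pymemcache.client.base import Client  # pip install pymemcache

tweet_cache = Client(("localhost", 11211))   # stand-in for the Tweetypie cache
user_cache = Client(("localhost", 11212))    # stand-in for the Gizmoduck user cache

def hydrate(timeline_entries):
    """timeline_entries: decoded 'msgid:author_id:flags' strings from the timeline."""
    msgids = [e.split(":")[0] for e in timeline_entries]
    author_ids = [e.split(":")[1] for e in timeline_entries]
    tweets = tweet_cache.get_many([f"tweet:{m}" for m in msgids])       # one multiget
    users = user_cache.get_many([f"user:{u}" for u in set(author_ids)]) # one multiget
    result = []
    for m, u in zip(msgids, author_ids):
        t, usr = tweets.get(f"tweet:{m}"), users.get(f"user:{u}")
        if t and usr:
            result.append({"tweet": json.loads(t), "author": json.loads(usr)})
    return result
```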

There is also some read-time filtering at the edges. For example, Nazi content is filtered out in France, so content is stripped at read time before it is sent out.

At this point, I believe you have a deeper understanding of how the Twitter server processes data requests; you might as well put it into practice! More related content is available in the relevant channels on this site. Follow us and keep learning!
