This article explains Facebook's big data processing architecture and the application software it uses. The material is straightforward and easy to follow.
The Evolution of Facebook's Big Data Technology Architecture
Facebook has always been one of the most active users of big data technology, because it has an enormous amount of data. One source reports that by 2011 it already held 25 PB of compressed data and 150 PB of uncompressed data, with 400 TB of new uncompressed data generated every day. At Facebook, big data technology is widely used in advertising, News Feed, messaging/chat, search, site security, ad-hoc analysis, reporting, and other areas. Facebook is also one of the biggest contributors to Apache big data open source projects. Facebook officially adopted the Hadoop computing framework around 2007, and it went on to contribute well-known open source tools such as Hive, ZooKeeper, Scribe, and Cassandra to the Apache Foundation. Its open-sourcing efforts are still actively advancing today. Facebook's big data technical architecture has gone through three stages of evolution.
Facebook's early big data architecture was based on open source tools such as Hadoop, HBase, Hive, and Scribe. Log data streams were generated by the HTTP servers; the Scribe log collection system delivered them within seconds to shared storage on an NFS file system, and hourly Copier/Loader jobs (that is, MapReduce jobs) then uploaded the data files to Hadoop. Data summaries were produced by daily pipeline jobs written in a Hive-like SQL language, and the results were periodically pushed to front-end MySQL servers, from which reports were generated with OLTP tools. The Hadoop cluster had 3,000 nodes and handled scalability and fault tolerance well, but the main problem with the early system was its large end-to-end latency: the final report was available only 1 to 2 days after the logs were generated.
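As a concrete illustration of the daily-summary stage described above, here is a minimal sketch in Python: run a Hive-style aggregation over one day of logs via the Hive CLI, then push the result rows to a front-end MySQL server for reporting. The table names, column names, partition key, and connection details are all hypothetical, and the sketch assumes the third-party PyMySQL package.

```python
# Sketch of a daily pipeline step: aggregate logs with Hive, load the
# summary into MySQL for reporting. All names here are hypothetical.
import subprocess
import pymysql  # third-party: pip install pymysql

HIVE_QUERY = """
SELECT page_id, COUNT(*) AS views
FROM web_logs
WHERE ds = '2011-06-01'
GROUP BY page_id;
"""

# 'hive -e' runs a single HiveQL statement; results go to stdout
# as tab-separated rows.
result = subprocess.run(
    ["hive", "-e", HIVE_QUERY],
    capture_output=True, text=True, check=True,
)
rows = [line.split("\t") for line in result.stdout.strip().splitlines()]

# Push the summary to the front-end MySQL server (REPLACE assumes a
# unique key on page_id in this hypothetical reporting table).
conn = pymysql.connect(host="report-db", user="etl",
                       password="secret", database="reports")
with conn.cursor() as cur:
    cur.executemany(
        "REPLACE INTO daily_page_views (page_id, views) VALUES (%s, %s)",
        rows,
    )
conn.commit()
conn.close()
```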
Facebook's current big data architecture optimizes the data transmission channel and the data processing system of the early architecture. It is mainly divided into the distributed log system Scribe, the distributed storage systems HDFS and HBase, and the distributed computing and analysis systems (MapReduce, Puma, and Hive).
The Scribe log system aggregates log data from a large number of HTTP servers. Thrift is a software framework developed by Facebook for cross-language service development; it provides seamless support across C++, Java, PHP, Python, and Ruby, and Thrift RPC is used to call the Scribe log collection service to aggregate log data. Scribe Policy is the node that manages log traffic and log schemas; it transmits metadata to the Scribe clients and to Scribe HDFS, where the collected log data is stored.
The optimized data channel that Facebook built to replace the early system is called Data Freeway. It can handle a peak of 9 GB/s with end-to-end latency under 10 seconds and supports more than 2,500 log categories. Data Freeway consists of four main components: Scribe, Calligraphus, Continuous Copier, and PTail. Scribe runs on the client side and sends data via Thrift RPC; Calligraphus sorts the data in the middle tier and writes it to HDFS, managing log categories with the help of ZooKeeper; Continuous Copier copies files from one HDFS cluster to another; and PTail tails directories on multiple HDFS clusters in parallel, writing file data to standard output. In the current architecture, part of the data is still batch-processed with MapReduce, stored in the central HDFS, and analyzed daily with Hive, while near-real-time data streams are processed at minute-level latency through Puma. Facebook also provides the Peregrine (Hipal) tool for ad-hoc analysis and the Nocron tool for periodic analysis.
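To make the client side of this pipeline concrete, below is a minimal sketch of a Scribe client sending one log entry over Thrift RPC. It assumes Python bindings generated from Scribe's scribe.thrift definition (the scribe module) plus the Thrift Python runtime, and a Scribe server on its conventional port 1463; the log category and message are examples.

```python
# Minimal Scribe client sketch: send one log entry over Thrift RPC.
# Assumes generated scribe.thrift Python bindings; names are examples.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from scribe import scribe            # generated Thrift bindings
from scribe.ttypes import LogEntry

socket = TSocket.TSocket("localhost", 1463)
transport = TTransport.TFramedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(transport,
                                           strictRead=False,
                                           strictWrite=False)
client = scribe.Client(protocol)

transport.open()
# Scribe accepts batches of LogEntry records; each record carries a
# category (used for routing/storage) and a message payload.
client.Log([LogEntry(category="web_clicks",
                     message="user=42 action=click\n")])
transport.close()
```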
The outline of Facebook's future big data architecture is already taking shape. First is Corona, already open-sourced, which may replace MapReduce in the Hadoop system and is similar to YARN, which Yahoo proposed. One of Corona's biggest advances is that its cluster manager allocates resources based on the CPU, memory, and other requirements of each job, which lets Corona run both MapReduce and non-MapReduce jobs and makes Hadoop clusters more broadly useful. Second is Facebook's interactive big data query system Presto, similar to Cloudera's Impala and Hortonworks's Stinger, which addresses the fast-query needs of Facebook's rapidly expanding massive data warehouse. According to Facebook, a simple query using Presto takes only a few hundred milliseconds, and even a very complex query completes within a few minutes; it runs in memory and does not write intermediate results to disk. Third is the Wormhole stream computing system, which is similar to Twitter's Storm and Yahoo's Storm-YARN. The fourth important project is Prism, which can run a single large Hadoop cluster connecting data centers around the world and can immediately redistribute data when a data center fails, similar to Google's Spanner.
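For a sense of what an interactive Presto query looks like from client code, here is a minimal sketch assuming the community presto-python-client package (imported as prestodb) and a Presto coordinator reachable on port 8080; the host, catalog, schema, and table are hypothetical.

```python
# Minimal sketch of an interactive query against a Presto coordinator.
# Assumes pip install presto-python-client; all names are examples.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
# A simple aggregation like this is the kind of query the article says
# Presto answers in a few hundred milliseconds.
cur.execute("SELECT ds, COUNT(*) FROM page_views GROUP BY ds")
for row in cur.fetchall():
    print(row)
```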
The evolution of Facebook's big data architecture mirrors the development of big data technology as a whole. It is commendable that open source has been Facebook's consistent policy; together with Yahoo and other companies, it has contributed greatly to the advancement of big data technology.
The software used by Facebook
In some ways, Facebook is still a LAMP-style site, but to work with a large number of other components and services, Facebook has made the necessary changes, extensions, and modifications to the standard stack.
For example:
Facebook still uses PHP, but it has built a new compiler so that its PHP runs as native code on its web servers, improving performance.
Facebook uses Linux, but has optimized it for its own purposes (especially in terms of network throughput).
Facebook uses MySQL, but has optimized it as well.
There are also custom-built systems, such as Haystack, a highly scalable object store used to serve Facebook's vast number of photos, and Scribe, Facebook's logging system.
Here's a look at the software used by Facebook, the world's largest social networking site.
Memcached
Memcached is a well-known piece of software: a distributed in-memory caching system that Facebook (like a great many other websites) uses as a caching layer between its web servers and its MySQL servers. Over the years, Facebook has done a lot of optimization work on Memcached and the software around it, such as the network stack.
Facebook runs thousands of Memcached servers to serve terabytes of cached data at any given moment. It can fairly be said that Facebook has the largest Memcached installation in the world.
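The typical pattern for a cache layer like this is cache-aside: read from Memcached first and fall back to MySQL on a miss, then populate the cache. Below is a minimal sketch assuming the third-party pymemcache and PyMySQL packages; the table, key scheme, and TTL are hypothetical.

```python
# Cache-aside read: try Memcached, fall back to MySQL on a miss,
# then populate the cache. All names here are examples.
import json
import pymysql
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))
db = pymysql.connect(host="db", user="app", password="secret",
                     database="app",
                     cursorclass=pymysql.cursors.DictCursor)

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                 # cache hit
    with db.cursor() as cur:                      # miss: go to MySQL
        cur.execute("SELECT id, name FROM users WHERE id = %s",
                    (user_id,))
        user = cur.fetchone()
    cache.set(key, json.dumps(user), expire=300)  # cache for 5 minutes
    return user
```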
HipHop for PHP
As a scripting language, PHP runs relatively slowly compared to native code. HipHop converts PHP code into C++ code, which is then compiled, improving runtime performance. Because Facebook relies so heavily on PHP, HipHop lets Facebook get far more out of its web servers.
The birth of HipHop: a team of Facebook engineers (initially three people) spent 18 months developing it.
Haystack
Haystack is Facebook's high-performance photo storage/retrieval system. (Strictly speaking, Haystack is an object store, so it doesn't have to store pictures.) Haystack's workload is huge: there are more than 20 billion photos on Facebook, each saved at four different resolutions, which means Facebook stores more than 80 billion image files.
Haystack's role is not just handling a large volume of images; its performance is the highlight. Facebook serves about 1.2 million images per second through it, and that figure does not include images served by its CDN. An astonishing number!
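The core idea behind Haystack is to avoid one-file-per-photo filesystem metadata: many photos ("needles") are appended to one large store file, and an in-memory index maps each photo ID to its offset and size, so a read costs at most one disk seek. The sketch below illustrates that idea only; it is not Facebook's implementation, and all names are hypothetical.

```python
# Conceptual sketch of the Haystack idea: append photos ("needles") to
# one large store file and keep an in-memory index of (offset, size),
# so lookups need no per-file filesystem metadata. Illustration only.
import os

class MiniHaystack:
    def __init__(self, path: str):
        self.store = open(path, "a+b")
        self.index: dict[int, tuple[int, int]] = {}  # id -> (offset, size)

    def put(self, photo_id: int, data: bytes) -> None:
        self.store.seek(0, os.SEEK_END)
        offset = self.store.tell()
        self.store.write(data)
        self.store.flush()
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id: int) -> bytes:
        offset, size = self.index[photo_id]  # in-memory lookup, no metadata I/O
        self.store.seek(offset)
        return self.store.read(size)

store = MiniHaystack("photos.store")
store.put(1, b"\x89PNG...image bytes...")
print(len(store.get(1)))
```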
BigPipe
BigPipe is a dynamic web-page serving system developed by Facebook. As an optimization, Facebook uses it to break each web page into fragments known as "pagelets" and serve them separately.
For example, the chat window is retrieved independently, and the news feed is retrieved independently. These pagelets can be retrieved concurrently, which improves performance; it also means that even if one part of the site fails or crashes, users can still use the rest.
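To illustrate the concurrency idea (this is not BigPipe's actual implementation, which streams pagelets progressively within a single HTTP response), here is a minimal sketch that fetches hypothetical pagelet endpoints concurrently, assuming the third-party aiohttp package.

```python
# Conceptual sketch: fetch page fragments ("pagelets") concurrently
# rather than serially. Endpoints below are hypothetical.
import asyncio
import aiohttp  # third-party: pip install aiohttp

PAGELETS = [
    "https://example.com/pagelet/chat",
    "https://example.com/pagelet/newsfeed",
    "https://example.com/pagelet/ads",
]

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        return await resp.text()

async def render_page() -> list:
    async with aiohttp.ClientSession() as session:
        # One slow or failing pagelet does not block the others.
        return await asyncio.gather(
            *(fetch(session, u) for u in PAGELETS),
            return_exceptions=True,
        )

fragments = asyncio.run(render_page())
```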
Cassandra
Cassandra is a distributed storage system with no single point of failure. A product of the NoSQL movement, it is now open source (it has joined the Apache project). Facebook uses it for inbox search.
Beyond Facebook, Cassandra also suits many other services and is used by sites such as Digg.
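For a sense of how a Cassandra client looks, below is a minimal sketch using the DataStax cassandra-driver package; the keyspace and table are hypothetical and this is not Facebook's inbox-search schema.

```python
# Minimal Cassandra client sketch. Assumes the third-party DataStax
# driver (pip install cassandra-driver); keyspace/table are examples.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")  # hypothetical keyspace

session.execute(
    "INSERT INTO messages (user_id, msg_id, body) VALUES (%s, %s, %s)",
    (42, 1, "hello"),
)
rows = session.execute("SELECT body FROM messages WHERE user_id = 42")
for row in rows:
    print(row.body)

cluster.shutdown()
```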
Scribe
Scribe is a flexible logging system that Facebook uses for a multitude of internal purposes. Its job is to handle logging at Facebook's scale: when a new log category appears, Scribe handles it automatically, and Facebook has hundreds of log categories.
Hadoop and Hive
Hadoop is an open source MapReduce framework that makes it easy to process large amounts of data, and Facebook uses it for data analysis; as noted above, the amount of data at Facebook is huge. Hive originated at Facebook, and it makes Hadoop accessible to non-programmers through SQL queries. (Note: Hive is a data warehouse tool built on Hadoop. It maps structured data files onto database tables and provides full SQL query capability, translating SQL statements into MapReduce jobs for execution.)
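To show that SQL-to-MapReduce translation from the user's side, here is a minimal sketch of running a HiveQL query from Python, assuming the third-party PyHive package and a HiveServer2 instance on its default port 10000; the host and table are hypothetical.

```python
# Minimal sketch of querying Hive from Python via HiveServer2.
# Assumes pip install pyhive[hive]; host and table are examples.
from pyhive import hive

conn = hive.connect(host="hive-server", port=10000, username="analyst")
cursor = conn.cursor()
# Hive compiles this SQL into one or more MapReduce jobs behind the scenes.
cursor.execute("SELECT country, COUNT(*) FROM users GROUP BY country")
for country, n in cursor.fetchall():
    print(country, n)
```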
Varnish
Varnish is an HTTP accelerator that can act as a load balancer and serve cached content at high speed.
Facebook uses Varnish to serve photos and profile pictures, handling billions of requests every day. Like much of the rest of Facebook's stack, Varnish is open source.
Facebook's smooth operation also depends on more than software. Although some of the software that makes up the Facebook system has been covered above, operating a system of this scale is a complex task in itself. Here are a few other things that keep Facebook running smoothly.
Hardware can't be covered in depth here, but it is certainly a key factor in Facebook's unprecedented scale. For example, like other large websites, Facebook uses a CDN to serve static content, and it has a huge data center in the western US state of Oregon where servers can be added at any time.
That concludes this look at Facebook's big data processing architecture and application software; the specifics, of course, still need to be verified in practice. Thanks for reading.