
How to Understand eBay's Hadoop Cluster Application and Big Data Management


This article explains how to understand eBay's Hadoop cluster application and big data management. The content is simple, clear and easy to learn; please follow the editor's train of thought as you read.

eBay, the world's largest online trading platform, was founded in the United States by programmer Pierre Omidyar over the Labor Day weekend of 1995. Originally called AuctionWeb, it was officially renamed eBay in July 1997, and it will celebrate its 20th anniversary this September.

eBay's first sale was a broken laser pointer, which sold for $14.83. Pierre contacted the buyer to make sure he knew the pointer was broken, and the buyer replied, "I am a collector of broken laser pointers." Since then, eBay's 20 years of development have led the rapid growth of the e-commerce industry. Today eBay is the world's largest online trading site, with buyers in more than 190 countries, more than 25 million active sellers, 157 million active buyers and 800 million active listings, connecting buyers and sellers around the world through Connected Commerce and generating more than $255 billion in GMV in 2014, of which more than $28 billion came from mobile. According to eBay's statistics, a handbag is sold every five seconds in the United States, a pair of shoes is sold every minute via mobile in Australia, and a car or truck is sold every 10 minutes via mobile in Germany.

With so many users and transactions, data has become eBay's top priority, from clickstreams to searches, item views, transactions and wish lists. More than 100PB of data is stored on the eBay data platform; the key is how to acquire, store, process and analyze that data and release its value so that it becomes a guide to action. The big data platform provides a solid foundation for tens of thousands of analysts and business users, and it keeps innovating to meet constantly changing needs.

eBay's current big data platform is divided into three layers.

Data integration layer: responsible for data acquisition, processing, cleaning and other ETL work, with both batch and real-time processing capability, built from a mix of commercial and open source products.

Data platform layer: made up of the traditional enterprise data warehouse (EDW), a Teradata-based cluster with a total capacity of more than 10PB; Singularity, which stores semi-structured and deeply structured data and has a total capacity of more than 36PB; and the Hadoop clusters, with a total capacity of more than 100PB.

Data access layer: through a variety of tools, gives business users and analysts the ability to access and analyze the relevant data, including various business tools, open source products and self-developed platforms.

This article focuses on eBay's development, platforms and future trends in these areas.

Connect with Hadoop

1. The development history of Hadoop at eBay

eBay's earliest Hadoop application was built in the eBay Research Lab (eRL), mainly for log analysis, in order to speed up the daily processing of logs. The original deployment ran version 0.18.2 on 4 nodes, storing and processing hundreds of GB of logs with a maximum capacity of 44 map tasks.

Subsequently, the eBay search team built a 10-node cluster to begin developing Hadoop for eBay search, and in 2012 launched Cassini, an HBase-based search platform.

In 2010, eBay launched a CDH2-based cluster with 532 nodes and more than 5PB of storage; in 2012 it launched an HDP-based cluster with more than 3,000 nodes and more than 50PB of capacity. By 2014 the total number of nodes exceeded 10,000, storage capacity exceeded 170PB, and the number of active users exceeded 2,000. The clusters are still growing, and the challenges of management, monitoring, analysis and storage are becoming more and more serious.

Infrastructure innovation has driven the evolution of Hadoop. The first generation, built around HDFS and MapReduce batch applications, provided flexible and scalable data structures and processing capability, and democratized data processing at every scale just as big data was taking off. It was only a first step, however, and had various limitations. If you compare it to an operating system, first-generation Hadoop is like an OS that ships with a single application, such as Notepad: the only application is MapReduce. A large number of tasks then turned scheduling into a bottleneck, which led to the YARN (Yet Another Resource Negotiator) project. YARN solves the problem of the JobTracker becoming a bottleneck at very large scale and lets all kinds of applications schedule and manage resources through it, bringing Hadoop into its next era.

The next generation of Hadoop made a huge leap from batch-only to interactive processing. It also made the strategic decision to support independent execution models, so that MapReduce, for example, can run as just another application on YARN. From then on, through YARN, Hadoop became a real data operating system.

Now, data from transactional, document and graph databases can all be stored on Hadoop and accessed through YARN-based applications such as MapReduce, Hive, HBase and Spark, without copying or moving data between applications. This provides very rich data processing and innovation capability; unified data storage and a shared platform are clearly the trend.
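As a hedged illustration (not eBay's code) of what "MapReduce as just another YARN application" looks like in practice, here is a minimal log-token-count job driver; the input and output paths and class names are assumptions. On a Hadoop 2.x cluster the same jar is simply submitted to YARN, whose ResourceManager, rather than a JobTracker, negotiates the resources.

```java
// Minimal MapReduce driver, illustrative only: on Hadoop 2.x the job is
// scheduled by YARN (mapreduce.framework.name=yarn) rather than a JobTracker.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogCount {

  // Emits (token, 1) for every whitespace-separated token in a log line.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Sums the counts emitted for each token.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // With YARN, resource negotiation is handled by the ResourceManager.
    conf.set("mapreduce.framework.name", "yarn");
    Job job = Job.getInstance(conf, "log token count");
    job.setJarByClass(LogCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /logs/raw (assumed path)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /logs/counts (assumed path)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The same cluster can run Hive, HBase, Spark and other engines side by side, because each of them asks YARN for containers instead of owning the cluster.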

2. Hierarchical storage

The common perception today is to build Hadoop clusters out of cheap hardware to store very large volumes of data and provide computing power. For example, if each node in a 1,000-node cluster has 20TB of storage, the whole cluster can store 20PB of data, and every machine has enough computing power to live up to Hadoop's famous saying: "Moving Computation is Cheaper than Moving Data".

Different types of datasets are usually stored in the same cluster and shared by different teams running various applications to meet business needs. A common characteristic of this data is that its utilization decreases over time: the newer the data, the more heavily it is used; the older the data, the less often it is accessed. Newly generated data has the highest utilization, and we label it Hot. Based on our analysis, data whose access declines within a week is called Warm, data that receives only a small number of visits over the following three months is called Cold, and a dataset whose access drops to a few times a year, or to zero, is called Frozen.

Given this analysis, storing data of different temperatures in the same cluster, on the same computing and storage resources, becomes more and more problematic. As time goes on, more and more cold data occupies valuable storage and compute; when more hot data needs to come in, or heavy computation is required, storage becomes the bottleneck, and some companies have even resorted to deleting low-value data. In the management and operation of very large Hadoop clusters, how to handle data of different temperatures has become a pressing need and a practical challenge.

A strategy that stores low-heat datasets differently from high-heat datasets is therefore imperative. Since Hadoop 2.3, HDFS has supported hierarchical storage: by adding archive storage to the cluster it provides deep storage capacity for cold data while remaining transparent to upper-layer applications. Because the data stays in the same cluster, it can still be served promptly when a request needs to access cold data. Continuing the example above, we could add 100 nodes, each with 200TB of storage but only limited computing resources, so the total capacity of the cluster becomes 40PB (20PB on disk plus 20PB in archive). Data of different temperatures is then distributed to different storage through the relevant data policies. For example, assuming the Hadoop default of three replicas per block: for Hot data all three replicas are kept on fast disk; for Warm data only one replica stays on fast disk and the other two go to archive storage; and all Cold and Frozen data is stored entirely in the archive.
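As a hedged sketch of how such placement can be expressed: newer Hadoop releases (2.6 and later) expose storage policies named HOT, WARM and COLD that map block replicas onto DISK and ARCHIVE volumes, which matches the replica layout described above. The paths and configuration below are illustrative assumptions, not eBay's actual setup.

```java
// Illustrative sketch (not eBay's configuration): assigning HDFS storage
// policies so that replica placement follows data "temperature".
// Requires Hadoop 2.6+ with ARCHIVE volumes configured on the DataNodes,
// e.g. dfs.datanode.data.dir = [DISK]/data/1,[ARCHIVE]/archive/1.
// The /data/* paths below are assumptions for the example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class TieredStoragePolicies {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    // HOT: all replicas on fast DISK (the HDFS default).
    dfs.setStoragePolicy(new Path("/data/hot"), "HOT");

    // WARM: one replica on DISK, the remaining replicas on ARCHIVE.
    dfs.setStoragePolicy(new Path("/data/warm"), "WARM");

    // COLD and frozen datasets: every replica on ARCHIVE storage.
    dfs.setStoragePolicy(new Path("/data/cold"), "COLD");
    dfs.setStoragePolicy(new Path("/data/frozen"), "COLD");

    // The same can be done from the command line:
    //   hdfs storagepolicies -setStoragePolicy -path /data/warm -policy WARM
    // after which the mover tool relocates existing block replicas:
    //   hdfs mover -p /data/warm
  }
}
```

Because the policy only changes where replicas live, upper-layer applications keep reading the same paths, which is what makes the tiering transparent.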

Tiered storage is already in use on eBay's largest Hadoop cluster, which has 40PB of storage; we have added another 10PB of archive storage, with 220TB attached to each of the new nodes, and are gradually migrating the Warm, Cold and Frozen datasets onto it. Because these nodes have limited computing power, their cost per GB is roughly a quarter of that of the other nodes. eBay will continue to research and invest in this area, for example in SSD storage.

3. Monitoring, alerting and automated operations

When clusters reach a scale of tens of thousands of nodes, monitoring, alerting and automated operations are the foundation for keeping data highly available and for providing continuous service to upper-layer applications. In eBay's daily work the management and maintenance of Hadoop clusters is very onerous, and the existing management and monitoring tools could not meet the needs of multi-cluster, large-scale, distributed log collection and monitoring. eBay therefore developed its own cluster monitoring and alerting platform, called Eagle.

Eagle consists of a core framework plus many apps for different application domains, and it focuses on the complex problem of self-monitoring large-scale distributed systems in the big data era. It is highly extensible, highly real-time and highly available, and it supports machine learning to provide predictive analysis in complex situations. Its main components are as follows.

Lightweight distributed stream processing framework: abstracts a general DAG-based stream processing paradigm. At development time, users only need to define the monitoring program's streaming logic with the DSL API and then choose the physical execution environment at run time. Single-process mode and Storm are supported by default, and other execution environments, such as Spark Streaming or Flink, can be added as extensions.

Real-time stream aggregation engine: provides an easy-to-use syntax for defining real-time stream aggregation rules; it is metadata-driven, supports dynamic deployment, and scales the aggregation of real-time monitoring data streams linearly.

Distributed policy engine: a distributed real-time alerting rule execution engine that offers a SQL-like descriptive rule syntax, machine learning automation and other extensions, and supports dynamic loading and partitioning of alerting rules.

Storage and query framework: a general framework for storing and querying monitoring data such as logs, metrics, alerts and events. HBase is supported by default, with a variety of HBase optimizations and extensions such as coprocessors, secondary indexes and partitioning; other storage types such as RDBMS are also supported, along with a general ORM, a REST API and an easy-to-use SQL-like query syntax.

Customizable monitoring reports: provide Notebook-style interactive real-time visual analysis; users can also select charts and define a layout to be saved as a dashboard for sharing or continuous monitoring.

In addition to monitoring routine cluster metrics, Eagle integrates a Job Performance Analyzer (JPA), which monitors the current and historical execution status of jobs on the Hadoop platform in real time, provides performance analysis across multiple dimensions and granularities, and supports a variety of anomaly and performance warnings, such as jobs running too long, reads and writes that are too slow, data skew and too many failed tasks. It can effectively give early warning and performance advice before a job misses its SLA.
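As a rough illustration of the kind of threshold rules a job performance analyzer evaluates, here is a generic sketch; it is not Apache Eagle's actual API, and the metric names and thresholds are assumptions chosen for the example.

```java
// Generic sketch of job-level health rules of the kind JPA evaluates.
// Not Apache Eagle's API; field names and thresholds are illustrative.
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class JobHealthCheck {

  // A few metrics one might collect for a running or finished job.
  record JobMetrics(String jobId,
                    Duration elapsed,
                    long failedTasks,
                    long totalTasks,
                    long maxTaskInputBytes,
                    long medianTaskInputBytes) {}

  static List<String> evaluate(JobMetrics m) {
    List<String> warnings = new ArrayList<>();
    if (m.elapsed().compareTo(Duration.ofHours(4)) > 0) {
      warnings.add(m.jobId() + ": running longer than 4h, may miss its SLA");
    }
    if (m.totalTasks() > 0 && m.failedTasks() * 10 > m.totalTasks()) {
      warnings.add(m.jobId() + ": more than 10% of tasks failed");
    }
    // Crude data-skew signal: one task reads far more input than the median task.
    if (m.medianTaskInputBytes() > 0
        && m.maxTaskInputBytes() > 10 * m.medianTaskInputBytes()) {
      warnings.add(m.jobId() + ": possible data skew (max input >> median input)");
    }
    return warnings;
  }

  public static void main(String[] args) {
    JobMetrics sample = new JobMetrics("job_0001", Duration.ofHours(5),
        120, 1000, 8_000_000_000L, 200_000_000L);
    evaluate(sample).forEach(System.out::println);
  }
}
```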

Combined with machine learning models, potential anomalies in tasks or server nodes can be predicted from task distribution or metric changes, and the integrated Remediation system can then repair the system automatically. For abnormal user behavior and dangerous operations, a security monitoring application, Eagle DAM (Data Activities Monitoring), has been developed; it monitors and raises alerts on key data and operations through user-defined policies and machine learning models.

4. Online interactive analysis

As data sizes grow and the user base diversifies, our users, such as analysts and business units, want to keep using the tools and methods they are familiar with to access and analyze very large datasets stored on Hadoop with minimal latency, and they want data acquisition, processing, storage and analysis to happen on the Hadoop cluster itself, without migrating data from one data source to another. After studying and evaluating a variety of open source and commercial products, the eBay China R&D Center formally launched the OLAP-on-Hadoop project in mid-2013, open-sourced it in October 2014, and then contributed it to the Apache Foundation, where it is now in incubation.

Apache Kylin maps star-schema tables in Hive; the modeler defines the relevant dimensions, measures and other settings to generate metadata. Based on that metadata, the build engine automatically generates the necessary Hive queries, a series of MapReduce jobs and HBase operations, so that the data is read out of Hive, pre-computed, and the results stored in HBase. Subsequent queries against the same data model read the pre-computed data directly from HBase, achieving second-level or even sub-second query latency.
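To give a feel for how such a pre-computed model is queried, here is a hedged sketch using Kylin's JDBC driver; the server address, project name, credentials and table and column names are illustrative assumptions.

```java
// Hedged sketch: querying a Kylin cube through the Kylin JDBC driver.
// Host, project, credentials, table and column names are assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class KylinQueryExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.kylin.jdbc.Driver");

    Properties props = new Properties();
    props.put("user", "ANALYST");     // placeholder credentials
    props.put("password", "secret");

    // URL format: jdbc:kylin://<kylin-server>:<port>/<project>
    try (Connection conn = DriverManager.getConnection(
             "jdbc:kylin://kylin.example.com:7070/sales_project", props);
         Statement stmt = conn.createStatement();
         // The query is answered from the pre-computed cuboids in HBase
         // instead of scanning the raw fact table in Hive.
         ResultSet rs = stmt.executeQuery(
             "SELECT site_country, SUM(gmv) AS total_gmv "
           + "FROM fact_transactions "
           + "GROUP BY site_country")) {
      while (rs.next()) {
        System.out.println(rs.getString("site_country") + "\t" + rs.getDouble("total_gmv"));
      }
    }
  }
}
```

From the analyst's point of view this is ordinary SQL over JDBC; the space-for-time trade-off happens entirely at cube build time.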

In the initial phase of the project, we investigated and evaluated a variety of open source and commercial options, including Impala, Stinger, Phoenix on HBase, Teradata and MicroStrategy, and found that none of them could meet eBay's actual business needs by providing second-level interactive query capability over very large datasets. After studying many technologies, papers and reference implementations, the development team finally chose MOLAP, that is, pre-computing the data model and trading space for time, to give front-end business users and analysts interactive query capability over TB- and even PB-scale datasets.

In the cuboid topology, the lowest node is the full-detail data, while each node above it represents a combination of dimensions. In theory every SQL query over the model can be covered by this topology, so once the pre-computation is done, the engine only needs to parse the query correctly and access the right pre-computed storage to return results in a very short time. In practice, Kylin reduces the effective number of dimensions, avoids computing unnecessary combinations, and applies a variety of compression and encoding techniques, such as Trie dictionary encoding, partial cube computation and grouped aggregation. In the production environment, 90th-percentile query latency is 1.5 seconds and 95th-percentile latency is under 5 seconds (over the last 30 days).
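To see why this dimension pruning matters, the sketch below enumerates the full cuboid lattice for a handful of dimensions: with n dimensions there are 2^n cuboids, which is exactly the combinatorial growth that Kylin's optimizations cut down. The dimension names are assumptions for the example.

```java
// Illustrative only: enumerating the 2^n cuboids (dimension combinations)
// that a naive full MOLAP pre-computation would materialize. At realistic
// dimension counts this explodes, which is why Kylin prunes the lattice
// (aggregation groups, mandatory and hierarchy dimensions, and so on).
import java.util.ArrayList;
import java.util.List;

public class CuboidLattice {
  public static void main(String[] args) {
    List<String> dims = List.of("site", "category", "seller_segment", "day");

    int n = dims.size();
    List<List<String>> cuboids = new ArrayList<>();
    // Each bitmask over the dimensions is one cuboid; the full mask is the
    // base cuboid (the "lowest node"), the empty mask is the grand total.
    for (int mask = (1 << n) - 1; mask >= 0; mask--) {
      List<String> cuboid = new ArrayList<>();
      for (int i = 0; i < n; i++) {
        if ((mask & (1 << i)) != 0) cuboid.add(dims.get(i));
      }
      cuboids.add(cuboid);
    }

    System.out.println(n + " dimensions -> " + cuboids.size() + " cuboids");
    cuboids.forEach(c ->
        System.out.println(c.isEmpty() ? "(grand total)" : String.join(", ", c)));
  }
}
```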

Although the MOLAP-based system already gives business users query capability over large-scale datasets, building the Cube consumes a lot of system resources and time; this puts heavy pressure on the cluster and also makes it hard to satisfy strong real-time requirements. The Kylin team has therefore developed a next-generation architecture that supports streaming data in micro-batch mode: data is read from the upstream data stream at regular intervals, aggregated, and finally loaded into the target Cube. Relevant use cases have already gone live within eBay and have received good feedback.

In addition, a new algorithm has been introduced for the Cube engine; measured results show that it more than doubles build speed and greatly reduces the demand on system resources. We have also invested in related research on Spark, and the first Spark Cubing engine in the industry has been completed and is ready for online testing.

5. Data ecosystem

The sections above briefly introduce eBay's development and main practices around its big data platform in recent years. Building the underlying platform is inseparable from the help and guidance of users, partners and management. In the process, a data ecosystem based on Hadoop and the enterprise data warehouse has gradually been built; every business unit and analysis team uses these platforms and data to provide rich analysis and decision support for rapidly changing businesses and fast-growing data, jointly building eBay's big data ecosystem.

Connect everyone.

Through its big data platform and applications, eBay can provide buyers and sellers with a better user experience and better services, continuously adapt to a changing market and environment, and reduce its impact on and dependence on the environment through innovative technology. Today eBay knows you; tomorrow eBay will understand you and connect you to the future.

eBay's secret weapon: stimulating the desire to buy with big data

1.8 million buyers and sellers are active on eBay, and the site generates a huge amount of data every day. About 3.5 million items are listed at any given point in time, and eBay's auction search engine handles more than 2.5 million queries a day. Hugh Williams, vice president of eBay's search platform, said that eBay's Hadoop clusters and Teradata appliances typically hold about 10PB of raw data. The online auction site uses big data in many ways, such as measuring site performance and detecting fraud, but one of the more interesting uses of all that data is to encourage users to buy more goods on the site.

Although eBay cannot force users to buy every product they encounter, it makes full use of big data to promote effectively. One way is to optimize the search engine and search results, analyzing users' behavior patterns from the collected data and adjusting the results accordingly.

"if you go back a few years and use a search engine on eBay, you may find it too literal," Williams said. "there are things you can say to the search engine that it literally finds the information the user needs, but it doesn't really understand the user's intention."

"We have been trying to make our search engine more intuitive." For example, by using big data, eBay found that if users want to buy a Pilzlampe, which is a collectible German mushroom lamp, they are more likely to buy it when they type "pilz lampe" into the eBay search engine, because the input will produce more results.

Simply by adding a space in the middle of a word in the search engine, eBay can improve sales opportunities on the site. With this information, eBay changes and rewrites users' search queries in its search engine, adding synonyms and alternative terms to bring back more relevant results.
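As a toy illustration of the compound-splitting idea (not eBay's actual query rewriter; the vocabulary and rewrite rule are assumptions), the sketch below splits an unrecognized token such as "pilzlampe" into two known terms and keeps both forms as query alternatives.

```java
// Toy query rewriter, illustrative only: splits an unknown compound token
// such as "pilzlampe" into known terms ("pilz" + "lampe") and keeps both
// forms as alternatives, so that more relevant listings can match.
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class QueryRewriter {
  // Assumed vocabulary of terms seen in listing titles.
  private static final Set<String> KNOWN_TERMS = Set.of("pilz", "lampe", "geelong", "cats");

  static Set<String> rewrite(String query) {
    Set<String> variants = new LinkedHashSet<>();
    variants.add(query);
    List<String> tokens = List.of(query.toLowerCase().trim().split("\\s+"));

    List<String> rewritten = new ArrayList<>();
    for (String token : tokens) {
      String split = trySplit(token);
      rewritten.add(split != null ? split : token);
    }
    variants.add(String.join(" ", rewritten));
    return variants;
  }

  // Returns "left right" if the token splits into two known terms, else null.
  private static String trySplit(String token) {
    if (KNOWN_TERMS.contains(token)) return null;
    for (int i = 2; i < token.length() - 1; i++) {
      String left = token.substring(0, i);
      String right = token.substring(i);
      if (KNOWN_TERMS.contains(left) && KNOWN_TERMS.contains(right)) {
        return left + " " + right;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(rewrite("pilzlampe"));   // [pilzlampe, pilz lampe]
    System.out.println(rewrite("geelongcats")); // [geelongcats, geelong cats]
  }
}
```

A production rewriter would of course weigh many more signals, such as synonyms, category context and measured click-through, before changing a query.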

Beyond that, eBay uses big data to predict whether listed products will sell, at what price they will sell, and how much impact this will have on the auction site's search engine.

All of this increases the likelihood that users will buy.

Williams believes that implementing the factors that shape search queries is risky. "It takes months of engineering to implement a factor, and it is very risky because we don't know if it will really help our customers find an item," he said. This is why eBay usually runs tests on the website with a sample group of users to measure the response.

Another challenge is taking the context of a search query into account. For example, if a user searches for "Geelong Cats", eBay's search engine might simply take "cat" as a keyword and search in pet categories, which is not much use when the user is looking for sporting goods.

"very subtle problems can arise under our control, so we need data for scientists to study," Williams said.

Thank you for reading. The above is the content of "How to Understand eBay's Hadoop Cluster Application and Big Data Management". After studying this article, I believe you have a deeper understanding of the topic, and the specific use still needs to be verified in practice. The editor will continue to push more related articles for you; welcome to follow!
