What is the application status of Hadoop at home and abroad?

2025-03-09 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article introduces the application status of Hadoop in China and abroad. Many readers have doubts about where and how widely Hadoop is actually deployed, so the survey below collects publicly reported figures and use cases; we hope it helps answer those doubts.

Application status of Hadoop at home and abroad

2015-04-23

Abstract: Hadoop is an open-source, efficient cloud computing infrastructure platform. It is widely used in the cloud computing field and serves as the underlying infrastructure of search engines, and it is increasingly popular in massive data processing, data mining, machine learning, scientific computing, and other fields. This paper describes the main applications of Hadoop in China and abroad.

Application status of Hadoop abroad

1.Yahoo

Yahoo is the biggest supporter of Hadoop. As of 2012, Yahoo had more than 42,000 Hadoop machines, with over 100,000 CPU cores running Hadoop. Its largest single-master cluster has 4,500 nodes (each node with two quad-core CPUs, 4 × 1TB disks, and 16GB RAM). Total cluster storage exceeds 350PB, more than 10 million jobs are submitted each month, and more than 60% of Hadoop jobs are written and submitted in Pig.

The Hadoop application of Yahoo mainly includes the following aspects:

Support advertising system

User behavior analysis

Support for Web search

Anti-spam system

Membership anti-abuse

Content agility

Personalized recommendation

At the same time, Yahoo uses Pig to research and test Hadoop systems that support very large node clusters.

2.Facebook

Facebook uses Hadoop to store internal logs and multidimensional data, as a data source for reporting, analytics, and machine learning. Its Hadoop cluster has more than 1,400 machine nodes with a total of 11,200 CPU cores and raw storage exceeding 15PB; each commodity node is equipped with 8 CPU cores and 12TB of storage, and jobs are written mainly against the Streaming API and the Java API. On top of Hadoop, Facebook built an advanced data warehouse framework called Hive, which has since become an Apache top-level project. Facebook has also developed a FUSE implementation over HDFS.
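The Streaming API mentioned above lets jobs be written as plain scripts that read records on stdin and emit tab-separated key-value pairs on stdout, with Hadoop handling the sort-and-shuffle between map and reduce. A minimal Python sketch of that pattern (the log format and field names here are hypothetical, not Facebook's actual schema), with the map-sort-reduce flow simulated locally:

```python
import itertools

def mapper(lines):
    """Map phase: emit (action, 1) for each tab-separated log line.
    Assumes a hypothetical format: timestamp<TAB>user<TAB>action."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            yield parts[2], 1

def reducer(pairs):
    """Reduce phase: sum the counts for each action key.
    Hadoop delivers pairs sorted by key; groupby relies on that order."""
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

if __name__ == "__main__":
    # Local stand-in for the shuffle: map, sort by key, then reduce.
    logs = [
        "2012-01-01\talice\tclick",
        "2012-01-01\tbob\tview",
        "2012-01-02\talice\tclick",
    ]
    for action, total in reducer(sorted(mapper(logs))):
        print(f"{action}\t{total}")
```

On a real cluster the mapper and reducer would run as separate scripts launched through the Hadoop Streaming jar; the `sorted()` call above only mimics the shuffle that the framework performs between the two phases.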

3.A9.com

A9.com builds Amazon's product search index with Hadoop, using the Streaming API together with C++, Perl, and Python tools, and uses Java and the Streaming API to analyze and process millions of sessions daily. The indexing service A9.com runs for Amazon uses a Hadoop cluster of about 100 nodes.

4.Adobe

Adobe mainly uses Hadoop and HBase to support social services computing and structured data storage and processing, on a Hadoop-HBase production cluster of more than 30 nodes. Adobe writes data directly and continuously into HBase, runs MapReduce jobs with HBase as the data source, and then stores the results back into HBase or external systems. Adobe has run Hadoop and HBase on its production cluster since October 2008.

5.CbIR

Since April 2008, CbIR (Content-based Information Retrieval) of Japan has used Hadoop on Amazon EC2 to build an image-processing environment for an image product recommendation system. It uses the Hadoop environment to generate the source database, making it convenient for Web applications to access quickly, and uses Hadoop to analyze the similarity of user behavior.

6.Datagraph

Datagraph mainly uses Hadoop to batch-process large RDF data sets, especially to index RDF data. Datagraph also uses Hadoop to execute long-running offline SPARQL queries for customers. It uses Amazon S3 and Cassandra to store the input and output files of RDF data, and has developed RDFgrid, a Ruby framework for processing RDF data with MapReduce.

Datagraph mainly uses Ruby, RDF.rb, and its self-developed RDFgrid framework to process RDF data, primarily through the Hadoop Streaming interface.
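Datagraph's RDFgrid code is in Ruby and not reproduced here, but the general shape of indexing RDF through Hadoop Streaming is easy to sketch. The hypothetical Python mapper below emits one (predicate, 1) pair per N-Triples line; a summing reducer would then turn these into a predicate-frequency index. The loose regex is an illustrative simplification, not a full N-Triples parser:

```python
import re

# Loose N-Triples pattern: <subject> <predicate> object .
NTRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+(.+)\s*\.$")

def predicate_of(line):
    """Return the predicate IRI of an N-Triples line, or None if malformed."""
    m = NTRIPLE.match(line.strip())
    return m.group(2) if m else None

def map_predicates(lines):
    """Map phase: emit (predicate, 1) for each well-formed triple;
    malformed lines are silently skipped, as a streaming mapper would."""
    for line in lines:
        pred = predicate_of(line)
        if pred is not None:
            yield pred, 1

if __name__ == "__main__":
    triples = [
        '<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Ada" .',
        '<http://example.org/a> <http://xmlns.com/foaf/0.1/knows> <http://example.org/b> .',
    ]
    for pred, one in map_predicates(triples):
        print(f"{pred}\t{one}")
```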

7.EBay

eBay runs a single cluster of more than 532 nodes; each node has 8 CPU cores, and total storage exceeds 5.3PB. MapReduce's Java interface, Pig, and Hive are widely used to process large-scale data, and HBase is also used for search optimization and research.

8.IBM

IBM's Blue Cloud also uses Hadoop to build its cloud infrastructure. The technologies used by Blue Cloud include Xen- and PowerVM-virtualized Linux operating system images together with Hadoop parallel workload scheduling, and IBM has released its own Hadoop distribution and big data solutions.

9.Last.Fm

Last.fm mainly uses Hadoop for chart calculation, royalty reporting, log analysis, A/B testing, data set merging, and so on, and also uses Hadoop for large-scale audio feature analysis of more than a million tracks.

The cluster has more than 100 machines; nodes are equipped with dual quad-core Xeon L5520 @ 2.27GHz or L5630 @ 2.13GHz CPUs, 24GB memory, and 8TB (4 × 2TB) storage.

10.LinkedIn

LinkedIn has Hadoop clusters with multiple hardware configurations. The main cluster configurations are as follows:

An 800-node cluster, based on Westmere HP SL170x, with 2 × 4 cores, 24GB memory, 6 × 2TB SATA.

A 1,900-node cluster, based on Westmere Supermicro HX8DTT, with 2 × 6 cores, 24GB memory, 6 × 2TB SATA.

A 1,400-node cluster, based on Sandy Bridge Supermicro, with 2 × 6 cores, 32GB memory, 6 × 2TB SATA.

The software used is as follows:

The operating system is RHEL 6.3.

The JDK is Sun JDK 1.6.0_32.

Apache Hadoop 0.20.2 with patches and Apache Hadoop 1.0.4 with patches.

Azkaban for job scheduling.

Hive, Avro, Kafka, and others.

11.MobileAnalytic.TV

MobileAnalytic.TV mainly uses Hadoop in the field of parallel algorithms; the MapReduce algorithms involved include the following.

Information retrieval and analysis.

Machine-generated content: documents, text, audio, and video.

Natural language processing.

The portfolio includes:

Mobile social networks.

Web crawler.

Text to speech conversion.

Automatic audio and video generation.

12.Openstat

Openstat mainly uses Hadoop for custom web log analysis and report generation. Its production environment runs a cluster of more than 50 nodes (dual quad-core Xeon processors, 16GB RAM, 4-6 hard drives) plus two smaller clusters for personalized analysis, handling about 5 million events per day and US $1.5 billion of transaction data per month. The cluster generates about 25GB of reports every day.

The main technologies used include CDH, Cascading, and Janino.

13.Quantcast

Quantcast runs 3,000 CPU cores and 3,500TB of storage, processing more than 1PB of data per day with a fully customized Hadoop scheduler, data path, and sorter, and has made outstanding contributions to the KFS file system.

14.Rapleaf

Rapleaf runs a cluster of more than 80 nodes (each node with 2 dual-core CPUs, 8 × 2TB storage, and 16GB RAM). It mainly uses Hadoop and Hive to process personal data from the Web, and introduced Cascading to simplify moving data through the various processing stages.

15.WorldLingo

WorldLingo runs more than 44 servers (each with 2 dual-core CPUs, 2TB storage, and 8GB memory). Each server runs Xen with one virtual machine instance for Hadoop/HBase and another for a Web or application server, giving 88 available virtual machines organized as two separate 22-node Hadoop/HBase clusters. Hadoop is mainly used to run HBase and MapReduce jobs, scanning HBase data tables to perform specific tasks; HBase serves as a scalable, fast storage back end for millions of documents. 12 million documents are currently stored, and the near-term goal is to store 450 million.
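WorldLingo's HBase schema is not described here, but at hundreds of millions of documents a standard design concern is that monotonically increasing row keys funnel all writes into a single region. A common remedy is key salting; the sketch below (the bucket count and key layout are assumptions for illustration, not WorldLingo's design) prefixes each document id with a stable hash-derived bucket:

```python
import hashlib

NUM_BUCKETS = 22  # e.g. one salt bucket per node in a 22-node cluster

def salted_row_key(doc_id: str) -> bytes:
    """Prefix the document id with a hash-derived bucket number so that
    sequential ids do not all land on one HBase region (hotspotting)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    bucket = digest[0] % NUM_BUCKETS
    return b"%02d-%s" % (bucket, doc_id.encode("utf-8"))

if __name__ == "__main__":
    for doc in ("doc-000001", "doc-000002", "doc-000003"):
        print(salted_row_key(doc))
```

The trade-off is that any scan by document id must now fan out over all NUM_BUCKETS key prefixes, which is acceptable for point reads but costs extra round trips for range queries.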

16. TerrierTeam of the University of Glasgow

The TerrierTeam runs an experimental cluster of more than 30 nodes (each node configured with a quad-core Xeon at 2.4GHz, 4GB memory, and 1TB storage). It uses Hadoop to facilitate information retrieval research and experimentation, especially for TREC and for the Terrier IR platform. The open-source distribution of Terrier includes large-scale distributed indexing based on Hadoop MapReduce.

17. HollandComputingCenter of the University of Nebraska

The center runs a medium-sized Hadoop cluster (1.6PB total storage) to store and serve physics data in support of Compact Muon Solenoid (CMS) experiments. This requires a file system that can download data at several Gbps and process it at even higher speeds.

18.VisibleMeasures

Visible Measures uses Hadoop as a component of a scalable data pipeline, ultimately used in products such as VisibleSuite. It uses Hadoop to aggregate, store, and analyze data streams describing the viewing behavior of online video audiences. The current grid includes more than 128 CPU cores and more than 100TB of storage, with plans for significant expansion.

Application status of Hadoop in China

In China, Hadoop is used mainly by Internet companies. The following introduces the companies that use Hadoop at scale or study it.

1. Baidu

Baidu began to pay attention to Hadoop in 2006 and started investigating and using it that year. By 2012 it had nearly 10 clusters in total, the largest single cluster with more than 2,800 machine nodes and tens of thousands of Hadoop machines overall. Total storage capacity exceeds 100PB, of which more than 74PB is used; thousands of jobs are submitted every day, with daily input exceeding 7,500TB and daily output exceeding 1,700TB.

Baidu's Hadoop clusters provide unified computing and storage services for the company's data teams, the search team, community product teams, the advertising team, and the LBS group. The main applications include:

Data mining and analysis.

Log analysis platform.

Data warehouse system.

Recommendation engine system.

User behavior analysis system.

On top of Hadoop, Baidu also developed its own log analysis platform, data warehouse system, and unified C++ programming interface, and deeply modified Hadoop to create the HCE (Hadoop C++ Extension) system.

2. Alibaba

As of 2012, Alibaba's Hadoop cluster had about 3,200 servers, with about 30,000 physical CPU cores, 100TB of total memory, and more than 60PB of total storage. It runs more than 150,000 jobs and more than 6,000 Hive queries per day, scanning about 7.5PB of data and about 400 million files daily, with storage utilization of about 80% and average CPU utilization of 65% (80% at peak). The cluster serves 150 user groups and 4,500 cluster users, providing basic computing and storage services for Taobao, Tmall, Yitao, Juhuasuan, CBU, and Alipay. The main applications include:

Data platform system.

Search support.

Advertising system.

Data Cube (data Rubik's cube).

Quantum statistics.

Tao Data.

Recommendation engine system.

Search rankings.

To facilitate development, Alibaba also built the WebIDE integrated development environment; related systems in use include Hive, Pig, Mahout, HBase, and so on.

3. Tencent

Tencent is also one of the earliest Chinese Internet companies to use Hadoop. By the end of 2012, Tencent had more than 5,000 Hadoop cluster machines in total, with the largest single cluster at about 2,000 nodes, and had built its own data warehouse system, TDW, using Hadoop and Hive, along with its own TDW-IDE development environment. Tencent's Hadoop provides basic cloud computing and cloud storage services for Tencent's various product lines, supporting the following products:

Tencent social advertising platform.

SOSO (search).

Paipai.com.

Tencent Microblog.

Tencent compass.

QQ VIP.

Tencent games support.

QZone.

Pengyou.com (friends network).

Tencent open platform.

Tenpay.

Mobile QQ.

QQ Music.

4. Qihoo 360

Qihoo 360 mainly uses Hadoop and HBase as the underlying web page storage architecture for its search engine so.com, which indexes on the order of 100 billion web pages, with data at the PB level. By the end of 2012, its HBase cluster exceeded 300 nodes and 100,000 regions. The platform versions used are as follows.

HBase version: Facebook 0.89-fb.

HDFS version: Facebook Hadoop-20.

Qihoo 360's work on Hadoop and HBase mainly focuses on optimizing and reducing the start/stop time of the HBase cluster, and on reducing the recovery time after a RegionServer (RS) exits abnormally.

5. Huawei

Huawei is also one of the major contributors to Hadoop, ranking ahead of Google and Cisco in contributions. Huawei has researched Hadoop's HA scheme and HBase in depth, and has launched its own Hadoop-based big data solution for the industry.

6. China Mobile

China Mobile officially launched Big Cloud (BigCloud) 1.0 in May 2010, with 1,024 cluster nodes. China Mobile's Big Cloud implements distributed computing based on Hadoop MapReduce and distributed storage on HDFS. On this basis it developed the Hadoop-based data warehouse system HugeTable, the parallel data mining tool set BC-PDM, the parallel data extraction and transformation tool BC-ETL, the object storage system BC-ONestd, and other systems, and open-sourced its own BC-Hadoop version.

China Mobile mainly applies Hadoop in the field of telecommunications, and its planned application areas include:

Centralized KPI computation for business analysis.

ETL/DM for the business analysis subsystem.

Settlement system.

Signaling system.

Cloud computing resource pool system.

Internet of things application system.

E-mail.

IDC services, and so on.
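Tools like the BC-ETL system mentioned above exist to move records through extract-transform-load steps at scale. As a toy illustration only (the CSV layout and field names are invented, not China Mobile's formats), one transform step might normalize call records and drop malformed rows rather than abort the batch:

```python
import csv
import io

def transform(raw_csv: str):
    """One hypothetical ETL step: parse CSV call records, normalize
    the call duration from minutes to seconds, and skip malformed rows."""
    out = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        try:
            minutes = float(row["duration_min"])
        except (KeyError, ValueError):
            continue  # malformed row: skip it, keep processing the batch
        out.append({
            "caller": row["caller"].strip(),
            "duration_s": int(minutes * 60),
        })
    return out

if __name__ == "__main__":
    sample = "caller,duration_min\n13800000000,1.5\nbadrow,\n"
    print(transform(sample))
```

At production scale this per-record logic would run inside a MapReduce job, but the transform itself is the same pure function either way, which is what makes it easy to test.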

7. Pangu search

Pangu Search (since merged with Instant Search into China Search) mainly uses its Hadoop cluster as the infrastructure supporting its search engine. As of early 2013, the cluster had more than 380 machines in total and 3.66PB of total storage. The main applications are as follows.

Web page storage.

Web page parsing.

Build an index.

PageRank calculation.

Log statistical analysis.

Recommendation engine, etc.

8. Instant Search (People's Search)

Instant Search (since merged with Pangu Search into China Search) also uses Hadoop as the supporting system for its search engine. As of 2013, its Hadoop cluster totaled more than 500 nodes, each configured with dual 6-core CPUs, 48GB memory, and 11 × 2TB storage. Total cluster capacity exceeds 10PB, with utilization of about 78%. The amount of data read per day is about 500TB, with a peak of more than 1PB and an average of about 300TB.

Instant Search stores the search engine's web pages in the sstable format, keeping the sstable files directly on HDFS. It mainly uses the Hadoop Pipes programming interface for subsequent processing, and also uses the Streaming interface to process data. The main applications include:
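The sstable format referred to above is, at its core, a file of key-sorted records with an index that allows lookups by binary search instead of a full scan. The toy in-memory class below illustrates just that access pattern; it is not the actual on-disk format Instant Search used, which would add data blocks, compression, and a sparse index:

```python
import bisect

class TinySSTable:
    """Toy sorted string table: immutable, key-sorted records with
    binary-search lookup. Illustrates the access pattern only."""

    def __init__(self, records):
        # Sort once at build time; the table is immutable afterwards.
        items = sorted(records.items())
        self._keys = [k for k, _ in items]
        self._values = [v for _, v in items]

    def get(self, key, default=None):
        # Binary search over the sorted keys: O(log n) per lookup.
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default

if __name__ == "__main__":
    table = TinySSTable({"url/b": "page B", "url/a": "page A"})
    print(table.get("url/a"))  # page A
    print(table.get("url/z"))  # None
```

Immutability is the key design choice: because sstable files are written once and never updated in place, they fit naturally on HDFS, which only supports append-style writes.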

Web page storage.

Parsing.

Build an index.

Recommendation engine.

This concludes the study of the application status of Hadoop in China and abroad. We hope it has resolved your doubts; pairing theory with practice is the best way to learn, so go and try it. To continue learning more related knowledge, please keep following this site.
