This article mainly introduces how Hadoop cluster technology has advanced big data processing in recent years. Many people have questions about this topic, so the editor has consulted a variety of sources and put together a simple, practical explanation. I hope it helps resolve your doubts; please follow along.
1. Introduction
What is big data? The McKinsey report "Big Data: The Next Frontier for Innovation, Competition, and Productivity" defines it as follows: big data refers to datasets whose size exceeds the ability of typical database software tools to capture, store, manage, and analyze. The report also stresses that a dataset does not have to exceed some particular size threshold to count as big data.
International Data Corporation (IDC) defines big data along four dimensions: the size of the dataset (Volume), the speed of the data flow (Velocity), the number of data types (Variety), and the magnitude of the data's value (Value).
Amazon's big data scientist John Rauser puts it more bluntly: "The amount of data that exceeds the processing capacity of a single computer is big data."
Finally, let's look at Wikipedia's definition: "Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications."
The word "big" is highlighted in all of the above big data's concepts. On the face of it, the increase in the scale of data does bring great problems to the processing of data. Specifically, it becomes impossible to obtain data of the same value as before at the same time. In other words, the essential problem is that the value density of data becomes lower and the rate of data exchange slows down, so it gives birth to many new data processing technologies and tools, such as Google's GFS and MapReduce,Apache Hadoop ecosystem, Berkeley AMPLab's Spark, etc.; there are computing models with different degrees of time sensitivity, such as batch computing model, interactive computing model, streaming computing model, real-time computing model and so on. The differences in computing models are only different in the technologies that determine the acquisition of value, depending on the needs of the upper-level business. In fact, the essence of the so-called big data problem should be the capitalization and service of data, and mining the intrinsic value of data is the ultimate goal of big data.
2. Big data technology originated at Google
Google's great success in search is due in large part to its advanced big data management and processing technology, which was designed to address the search engine's ever-growing problems of mass data storage and mass data processing.
Google proposed a set of infrastructure technologies built on distributed, parallel clusters, using software to tolerate the node failures that occur frequently in such clusters. Google's big data platform consists mainly of five systems that are independent yet tightly integrated: the distributed resource management system Borg, the Google File System (GFS), the MapReduce programming model designed around the characteristics of Google's applications, the distributed lock service Chubby, and the large-scale distributed database BigTable.
Borg is the most mysterious of the five; it was not until 2015 that Google published the paper "Large-scale cluster management at Google with Borg" at EuroSys 2015. It is said that not only do computing applications such as MapReduce and Pregel run on Borg, but storage services such as GFS, BigTable, and Megastore do as well, truly achieving mixed deployment of batch jobs and long-running services with dynamic resource scheduling. Thanks to this technology, average resource utilization reaches 30% to 75%, far higher than the industry average of 6% to 12%.
GFS is a large-scale distributed file system that provides mass storage for Google's cloud computing and sits at the bottom of the stack, tightly integrated with Chubby, MapReduce, and BigTable. Its design was shaped by Google's particular application workloads and technical environment; compared with traditional distributed file systems, GFS is simplified in many respects to strike the best balance among cost, reliability, and performance.
MapReduce is a parallel programming model for processing massive data, used for parallel computation over large-scale datasets. MapReduce expresses computation through two simple operations, "Map" and "Reduce": users need only supply their own Map function and Reduce function to perform large-scale distributed data processing on a cluster. This programming environment lets programmers write large-scale parallel applications without worrying about the reliability or scalability of the cluster; the application writer focuses only on the application itself, while cluster concerns are left to the platform. Compared with traditional distributed programming, MapReduce encapsulates details such as parallel processing, fault tolerance, data-local computation, and load balancing behind a simple yet powerful interface. Because MapReduce shares traits with functional and vector programming languages, this programming model is especially suitable for search, mining, analysis, and similar applications over unstructured and structured massive data.
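To make the division of labor between Map and Reduce concrete, here is a minimal word-count sketch written against the open-source Hadoop MapReduce API discussed later in this article (Google's internal implementation is not public); the class names and command-line paths are illustrative only.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every token in a line of input.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts collected for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The programmer supplies only the tokenizing map logic and the summing reduce logic; splitting the input, shuffling intermediate pairs, retrying failed tasks, and balancing load are all handled by the framework, which is exactly the simplification described above.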
Chubby provides a coarse-grained lock service through a simple file-system-like interface; built on a loosely coupled distributed system, it solves the consistency problem in distributed systems. Its locks are advisory rather than mandatory. By using Chubby's lock service, users can guarantee consistency during data manipulation. GFS uses Chubby to elect the GFS master server, and BigTable uses Chubby to appoint a master server and to discover and control the associated tablet servers.
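Chubby itself is proprietary, but ZooKeeper, its open-source counterpart listed among the Hadoop ecosystem projects below, exposes the same coarse-grained, advisory coordination idea. The following is a hypothetical sketch of master election via an ephemeral node, using only the standard ZooKeeper client API; the ensemble address, lock path, and node name are made up for illustration.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MasterElectionSketch {
  public static void main(String[] args) throws Exception {
    // Connect to a ZooKeeper ensemble (the address is a placeholder).
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {});

    String lockPath = "/example-master-lock"; // hypothetical lock node
    try {
      // An ephemeral node disappears when its creator's session ends,
      // so the "lock" is released automatically if the master fails.
      zk.create(lockPath, "candidate-1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      System.out.println("This process is now the master.");
    } catch (KeeperException.NodeExistsException e) {
      // Someone else holds the lock; the lock is advisory, nothing is forced.
      System.out.println("Another process is already the master.");
    } finally {
      zk.close();
    }
  }
}
```

This mirrors, in miniature, how GFS and BigTable rely on Chubby to pick a single master while tolerating node failures.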
The large-scale distributed database BigTable is a distributed storage system built on GFS and Chubby. Many applications organize their data in quite regular ways, and conventional databases are very convenient for handling such formatted data; however, because relational databases demand strong consistency, they are difficult to scale out to very large clusters. To handle the huge volumes of formatted and semi-formatted data inside Google, Google built BigTable, a large-scale database system with weak consistency requirements. BigTable resembles a database in many respects, but it is not a true relational database. A great deal of Google's structured and semi-structured data, including the Web index and satellite imagery, is stored in BigTable.
3. Hadoop opened the door to the big data era
Google's technology is excellent, but it is not open source; without Doug Cutting and his open source Hadoop software, we would not see today's rapid development of big data technologies and applications. Hadoop grew out of the Apache Nutch project led by Doug Cutting, which began in 2002 as a sub-project of Apache Lucene. At the time, Nutch's architecture could not scale to store and process the billions of pages on the Web. Google's 2003 SOSP paper "The Google File System", describing its distributed file system, came as timely help, and in 2004 development of the Nutch Distributed File System (NDFS) began. In the same year Google published "MapReduce: Simplified Data Processing on Large Clusters" at OSDI; inspired by it, Doug Cutting and others set about implementing a MapReduce computing framework and combining it with NDFS to support Nutch's core algorithms. By 2006 it had gradually grown into a complete, independent piece of software, and Doug Cutting, who had by then joined Yahoo!, named this big data processing software Hadoop. In early 2008, Hadoop became a top-level Apache project and, beyond Yahoo!, was being applied at many other Internet companies.
Early Hadoop, meaning Hadoop v1 and earlier, consisted mainly of two core components: HDFS and MapReduce. HDFS is the open source counterpart of Google's GFS, and the MapReduce framework implements the MapReduce programming model proposed by Google's engineers. A number of open source projects have grown up around Hadoop to support and supplement the full life cycle of big data processing; commonly used ones include ZooKeeper, Hive, Pig, HBase, Storm, Kafka, Flume, Sqoop, Oozie, and Mahout. The alpha version of Hadoop v2 was released in May 2012, its most important change being the addition of YARN (Yet Another Resource Negotiator) to Hadoop's core components. YARN was introduced to separate the computing frameworks from resource management entirely, and to fix the poor scalability, single point of failure, and inability to support multiple computing frameworks that plagued Hadoop v1. YARN's counterpart at Google is precisely the Borg system. With YARN, Hadoop could finally stand up against Google's big data platform.
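As a small, hedged illustration of what the HDFS component looks like from an application's point of view, the sketch below uses Hadoop's Java FileSystem API to write a file and read it back; the NameNode address and file path are placeholders invented for this example, not values taken from the article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hdfs-hello.txt");            // illustrative path
    try (FSDataOutputStream out = fs.create(file, true)) {  // overwrite if present
      out.writeUTF("hello, HDFS");                          // blocks are replicated across DataNodes
    }
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());                     // read back through the same API
    }
    fs.close();
  }
}
```

The application sees a single, ordinary file system; replication, block placement, and DataNode failures are hidden behind this interface, much as GFS hides them inside Google.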
A healthy, viable open source ecosystem needs a differentiated and non-trivial core, a wide range of applications, and an active community, and Hadoop happens to have all three. A big data open source ecosystem with Hadoop at its core has gradually taken shape, and Hadoop has become the most successful open source software since Linux. At the invitation of Du Xiaoyong, dean of the School of Information at Renmin University, I organized a forum called "The Big Data Open Source Ecosystem" at CNCC 2015. The forum brought together colleagues from Internet companies, hardware vendors, system integrators, and academia to share their work and experience with big data open source. In the closing panel we discussed why, and how, to do open source. The answers were fairly scattered: some said open source was the only choice, some use it to open up the industry chain, some see it as a new business model, and some believe it drives technological progress. In short, organizations at different links of the industry chain naturally have different motivations and goals for going open source, but only in this way can the different roles in the chain each find their place in the ecosystem, and such an ecosystem is robust and viable.
4. The development history and application of Hadoop
The first to take the plunge in big data was the Internet industry, not least because big data's concepts and technologies came from Google, the Internet heavyweight. Judging from how Hadoop has been applied in practice:
2006 to 2008 was Hadoop's birth stage. Only a few foreign Internet giants were experimenting with it, while the domestic Internet industry was still learning the new technology. In 2006, Yahoo! built a 100-node Hadoop cluster for its WebMap service. In 2007, Yahoo! built a 1,000-node Hadoop cluster. In 2008, Yahoo!'s Hadoop cluster expanded to 2,000 nodes, and Facebook contributed the Hive project to the open source community.
2008 to 2010 was Hadoop's juvenile stage. Practical applications appeared across the Internet industry, concentrated on web page storage and retrieval, log processing, and user behavior analysis. In 2009, Yahoo! ran Hadoop on a 4,000-node cluster to support its advertising systems and Web search research; Facebook ran Hadoop on a 600-node cluster to store internal log data for data analysis and machine learning; Baidu used Hadoop to process 200 TB of data per week for search log analysis and web data mining. In 2010, Facebook's Hadoop cluster grew to 1,000 nodes; Baidu used Hadoop to process 1 PB of data per day; China Mobile Research Institute developed its Hadoop-based "BigCloud" system, used not only for internal data analysis but also to provide services; and Taobao's Hadoop system reached 1,000 machines, storing and processing data related to e-commerce transactions.
2010 to 2015 was Hadoop's youth stage. In the Internet industry, Hadoop came to be regarded as the standard kit for big data computing and its applications grew increasingly diverse; in enterprise computing, Hadoop-based big data applications began to be put into practice; and alongside the pursuit of raw processing power, people began to think about system fit and efficiency. A large number of data analysis applications emerged in the Internet industry, such as Alipay's offline analysis system for transaction data, and Hadoop was combined with other ecosystem software to build more complex systems, such as Tencent's Guangdiantong precision advertising system and telecom operators' precision marketing systems based on user profiles. Beyond the Internet industry there emerged big data for network communications, finance, transport and tourism, industrial manufacturing, healthcare, social governance, education, and more; big data thinking and technology have permeated every industry. Hadoop originated in the Internet industry and needs adaptation when used for enterprise computing, because Internet applications and enterprise applications differ fundamentally in requirements, service models, and R&D and operations practices. Internet applications have simple business logic, serve huge numbers of non-fixed users, put user experience first, deliver continuously, and rely on professional operations teams that can respond quickly; enterprise applications have complex business logic, a limited number of fixed users, emphasize stability and reliability, ship in versioned releases, and rely on tiered technical support. For a time, many Hadoop distributions aimed at enterprise users appeared on the market, attracting enterprise customers with easy deployment, good default configuration, and convenient use and management.
5. Development trends in big data technology
Specialization of system architecture. Looking at how IT technology is developing today, the answer at the system-architecture level is "application-driven big data architectures and technologies": innovate in system architecture and key technologies according to the needs of specific classes of applications. To cut costs and achieve better energy efficiency, big data systems are moving toward flatter, more specialized architectures and data processing techniques, gradually breaking away from the traditional general-purpose technology stack. For example, parallel databases are splitting more clearly into transaction-oriented OLTP databases and analysis-oriented OLAP databases, and the traditional three-tier architecture of application server, database server, and storage server has been heavily disrupted. Application developers are gaining a deeper understanding of computer system structure, and "program = algorithm + data structure" will gradually evolve into "program = algorithm + data structure + system structure".
Big data has broadened the scope of the ecosystem. Apache Hadoop, which cloned Google's GFS and MapReduce, has been adopted by Internet companies since 2008 and became the de facto standard in big data processing. But the dark-horse emergence of Spark in 2013 put an end to that monopoly: big data technology is no longer a one-player game. Because applications differ, no single Hadoop software stack can satisfy every requirement; while remaining fully compatible with Hadoop, Spark greatly improves system performance by making far more use of in-memory processing. Likewise, the appearance of Scribe, Flume, Kafka, Storm, Drill, Impala, Tez/Stinger, Presto, and Spark/Spark SQL is not meant to replace Hadoop but to expand big data's technical ecosystem and push it toward healthy, well-rounded development. In the future we will see more, better, and more specialized software systems at the non-volatile storage layer, the network communication layer, the volatile storage layer, and the computing framework layer.
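To illustrate the point about heavier use of memory, here is a minimal Spark word-count sketch in Java that caches an intermediate RDD so a second action can reuse it without recomputing from disk; the input and output paths are assumptions for the example, and this is only one of many ways to use Spark.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SparkWordCount");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile(args[0]); // e.g. an HDFS input path
    JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());
    JavaPairRDD<String, Integer> counts = words
        .mapToPair(w -> new Tuple2<>(w, 1))
        .reduceByKey(Integer::sum);

    counts.cache();                                // keep the result in memory for reuse
    counts.saveAsTextFile(args[1]);                // first action: write the counts out
    System.out.println("distinct words: " + counts.count()); // second action reuses the cached RDD

    sc.stop();
  }
}
```

Because the counts RDD is cached, the second action reuses the in-memory result instead of rereading and reshuffling the input, which is one reason Spark outperforms disk-based MapReduce on iterative and multi-step workloads.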
Users care more about whole-system efficiency. Thanks to the efforts of Internet companies worldwide, Hadoop can already handle data at the 100-petabyte scale and, leaving the time dimension aside, can process data of low value density. Now that the industry has moved past the point where traditional relational database technology simply could not handle data at this scale, it is asking about the value side of system efficiency. Efficiency shows up first as performance: Internet services emphasize user experience, and applications that were not real-time are edging toward real-time; for example, the delay between front-end systems and business logs being generated and being collected has shrunk from 1-2 days to under 10 seconds. Traditional enterprises can no longer tolerate relational databases taking tens of minutes for query and analysis, and are turning to more cost-effective technologies and products. These demands have made interactive query and analysis over big data, streaming computing, and in-memory computing new directions for industrial R&D and application. The other side of efficiency is power consumption and cost. The dedicated neural network processor technology led by Chen Yunji, a researcher at the Institute of Computing Technology of the Chinese Academy of Sciences, can greatly accelerate machine learning workloads: compared with general-purpose chips and GPUs, computing speed rises by tens of times, power consumption drops to one tenth, and overall energy efficiency improves by a factor of 450. Baidu's cloud storage deployment of 10,000 customized ARM servers saves roughly 25% in power, raises storage density by 70%, raises computing capacity per watt 34-fold (by computing on GPUs instead of CPUs), and cuts storage cost per GB by 50%.
Demand for personalized services keeps growing. Personalization corresponds to the long tail of Internet services, the part that traditional system designs discard because of its complexity; yet it is precisely this part that embodies the demand for personalized service. Personalized service means the system can deliver differentiated service to meet the needs of different individuals, for example personalized recommendation and precision advertising. Take personalized recommendation: it has begun to move from simple product recommendation toward complex content recommendation, providing specific users with personalized content based on the users' characteristics and preferences, the characteristics of the candidate content, and contextual data at the moment (client device type, the user's spatio-temporal data, and so on). The content covers goods (in e-commerce and retail), advertising, news, and other information. In an era of rapidly developing mobile devices and mobile Internet, personalized recommendation will become one of the most direct channels through which users obtain information.
Theory and technology for value mining urgently need development. The theory and technology for shallow data analysis are mainly embodied in the combination and re-innovation of distributed systems and relational database theory; what is still lacking is theory and technology for extracting implicit information or knowledge from data, that is, value mining. First, mature data mining modeling methods and tools are scarce, experience weighs heavily in whether valuable information gets mined, and there is a technology gap between raw data and the hidden information, which is why "beer and diapers" stories do not happen every day. Second, machine learning and deep learning still face application problems. Combined with big data, they have seen initial use in scenarios such as speech recognition, image recognition, ad recommendation, and risk control, but the technology and software tools are not yet mature and there is still much room for improvement; moreover, the range of application scenarios for machine learning and deep learning is not yet wide, which is both an opportunity and a challenge.
6. Conclusion
Hadoop open source software has been around for ten years since 2006, a long life for any piece of software, yet Hadoop is also feeling the impact of other open source dark horses. In its early days Spark leaned on Hadoop's mature ecosystem through full compatibility; today Spark is challenging Hadoop's authority, having set its development goal as replacing Hadoop. Hadoop is getting old; does it still have what it takes? Hadoop's nearly one hundred committers are actively planning its future, so let's wait and see. We have entered an era in which data covers everything: social life and every industry are being transformed by data. In recent years big data has become a basic strategic resource at the national level, exerting an ever greater influence on global production, circulation, distribution, and consumption, on the workings of the economy, on ways of life, and on national governance capacity. Promoting the development of big data has become an international consensus.
At this point, this study of how Hadoop cluster technology has advanced big data processing in recent years comes to an end. I hope it has resolved your doubts; pairing theory with practice is the best way to learn, so go and try it! If you want to keep learning more, please keep following this site; the editor will keep working to bring you more practical articles.