
What are the characteristics of Hadoop?

2025-03-28 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/02 Report --

This article introduces the basics of "What are the characteristics of Hadoop?". Many people run into these questions in real-world work, so let the editor walk you through how to handle them. I hope you read carefully and come away with something!

I. The reason for the emergence of Hadoop

We now live in the era of the data explosion. IDC predicted that the total amount of global data would reach 44 ZB by 2020; after unit conversion that is at least 44 billion TB, which is to say that even giving every person in the world a 1 TB hard disk would not be enough to hold it all.

Moreover, some individual datasets are already much larger than 1 TB, so data storage is a problem that must be solved. At the same time, hard disk technology faces a bottleneck of its own: transfer speed (the rate at which data can be read) has improved far more slowly than capacity.

Capacity has grown nearly 1,000-fold, while transfer speed has only increased about 20-fold, so the time required to read a full drive keeps getting longer (which undermines the timeliness of the data's value). And if reading takes that long, writing is even worse.

To improve read efficiency, one solution is to spread a dataset across multiple hard drives and read from them in parallel. For example, for 1 TB of data, we store it evenly across 100 drives of 1 TB each, with each drive holding 1% of the data, and read all of them at the same time; the entire dataset can then be read in under two minutes. As for the remaining 99% of each drive's capacity, we can use it to store other datasets, so nothing is wasted. In solving the read-efficiency problem, we also solve the storage problem of big data.
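To make the arithmetic above concrete, here is a minimal back-of-the-envelope sketch in Java; the 100 MB/s per-drive transfer rate is an assumed figure for illustration, not one from the text:

    public class ParallelReadEstimate {
        public static void main(String[] args) {
            double datasetMb = 1_000_000.0; // 1 TB expressed in MB
            double mbPerSec = 100.0;        // assumed per-drive transfer rate
            int drives = 100;               // dataset spread evenly across 100 drives

            double oneDrive = datasetMb / mbPerSec;            // sequential read from one drive
            double parallel = datasetMb / (mbPerSec * drives); // all 100 drives read at once

            System.out.printf("one drive: %.0f s; %d drives in parallel: %.0f s%n",
                    oneDrive, drives, parallel);
        }
    }

Under these assumptions, a single drive needs about 10,000 seconds, while 100 drives in parallel need about 100 seconds, matching the "under two minutes" estimate above.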

However, when we read from and write to multiple hard drives at the same time, new problems arise that need to be solved:

1. Hardware failure. Once many pieces of hardware are in use, it becomes relatively likely that some individual piece will fail. To avoid losing data, the most common approach is replication: the file system keeps multiple copies of the data, and if one copy is lost to a failure, another copy can be used.

2. The correctness of reading data. An analysis task in the big data era typically needs to combine most of the data to complete, so the data read from one hard disk has to be combined correctly with the data read from the other 99 drives. Guaranteeing the correctness of the data throughout that reading process is a great challenge.

Some people may wonder: since multiple hard drives are being used anyway, why not use a relational database spread over many disks for storage and analysis? This mainly comes down to a technical limitation of hard disks: the need for seek (addressing) operations. Reading from a relational database involves a large number of seeks, so the time spent addressing grows substantially, and adding the transfer time on top makes it longer still. Another reason is that relational databases are not well suited to semi-structured and unstructured data, which account for about 90% of all data, while structured data accounts for only about 10%.

For the above problems, Hadoop provides us with a reliable and scalable platform for storage and analysis. In addition, because Hadoop runs on commodity hardware and is open source, the cost of using it is relatively low, well within what users can afford.

II. A brief introduction to Hadoop

Hadoop is an open-source distributed computing platform under the Apache Foundation, developed in the Java language. It has good cross-platform support and can be deployed on clusters of inexpensive computers. Users can develop distributed programs without knowing the underlying details of the distributed system, making full use of the cluster's power for high-speed computation and storage.

Initially, the core technologies of Hadoop were HDFS and MapReduce.

HDFS is the abbreviation of the Hadoop Distributed File System. It offers high read and write throughput, good fault tolerance and scalability, and provides distributed storage for massive data; its redundant storage of data keeps the data safe.
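As a flavor of how an application talks to HDFS, here is a minimal sketch against the HDFS Java client API; the NameNode address hdfs://localhost:9000 and the file path are placeholders chosen for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder NameNode address

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/demo/hello.txt"); // placeholder path

            // Write a small file; true = overwrite if it already exists.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello, HDFS!");
            }

            // The replication factor shows how many redundant copies HDFS keeps.
            System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
            fs.close();
        }
    }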

MapReduce is a software framework (programming model) for processing large data sets in parallel. Users can write MapReduce programs to analyze and process the data on the distributed file system without knowing the underlying details, and MapReduce keeps that analysis and processing efficient.
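The canonical first MapReduce program is word count: map emits a (word, 1) pair for every word it sees, and reduce sums those counts per word. The sketch below mirrors the well-known example that ships with Hadoop; the input and output paths are taken from the command line:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: split each input line into words and emit (word, 1).
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts emitted for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // combiner pre-aggregates map output locally
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Such a program would typically be packaged into a jar and submitted with the hadoop jar command against input and output directories on HDFS.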

Because of its efficient processing of massive data in a distributed environment, Hadoop is recognized as the standard open-source software of the big data industry. Almost all mainstream vendors, such as Google, Yahoo, Microsoft and Taobao, provide development tools, open-source software, commercial tools or technical services around Hadoop.

Hadoop 2.0 introduced another core technology: YARN (Yet Another Resource Negotiator), a task scheduling and cluster resource management system. Two kinds of long-running daemons provide its core services: the ResourceManager, which manages the resources across the cluster, and the NodeManager, which runs on every node in the cluster and launches and monitors containers.
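As a small illustration of that split, the YARN Java client can ask the ResourceManager for a report on every NodeManager in the cluster. A minimal sketch, assuming a running cluster whose yarn-site.xml is on the classpath:

    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListNodes {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration()); // picks up yarn-site.xml from the classpath
            yarn.start();

            // One NodeReport per NodeManager, supplied by the ResourceManager.
            for (NodeReport node : yarn.getNodeReports()) {
                System.out.println(node.getNodeId()
                        + " state=" + node.getNodeState()
                        + " containers=" + node.getNumContainers());
            }
            yarn.stop();
        }
    }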

The current Hadoop 3.x can be summarized into the following five modules:

Hadoop Common: renamed from the original Hadoop Core. In early versions, Core included HDFS, MapReduce and the other common parts; later, HDFS and MapReduce were split out as independent subprojects, and the remaining common part was renamed Common. It mainly includes the configuration tool Configuration, the remote procedure call (RPC) mechanism, the serialization mechanism, and Hadoop's abstract file system FileSystem, among others. These provide basic services for building a cloud computing environment on commodity hardware and supply the APIs needed by software running on the platform (a small Configuration sketch follows the module list below).

Hadoop HDFS: one of Hadoop's core technologies, the distributed file system.

Hadoop YARN: a core technology added in Hadoop 2.0, the resource management system.

Hadoop MapReduce: one of Hadoop's core technologies, a programming model for the parallel computation of large-scale data sets.

Hadoop Ozone: an extension of HDFS, object storage technology.
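To give a feel for the Configuration utility that Hadoop Common provides, here is a minimal sketch; the demo.* keys are made up purely for illustration:

    import org.apache.hadoop.conf.Configuration;

    public class ConfigDemo {
        public static void main(String[] args) {
            // Loads core-default.xml and core-site.xml from the classpath, if present.
            Configuration conf = new Configuration();

            conf.set("demo.greeting", "hello");           // hypothetical key, set in code
            int retries = conf.getInt("demo.retries", 3); // unset key falls back to the default 3

            System.out.println(conf.get("demo.greeting") + ", retries=" + retries);
        }
    }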

On the origin of the name Hadoop, Doug Cutting, the founder of the project, explained: "My kid gave this name to a stuffed yellow elephant toy. My naming criteria are short, easy to pronounce and spell, without much meaning, and not used anywhere else. Kids are good at this."

The pronunciation of Hadoop is [hædu:p].

III. History of Apache Hadoop

Hadoop was created by Doug Cutting, the founder of the well-known project Apache Lucene.

It originated from the Apache Nutch project, a web crawler and search engine system that later ran into the problem of storing the large volumes of web page data it collected.

In 2003, Google published a paper describing the Google File System (GFS), which inspired the developers of the Apache Nutch project.

In 2004, Nutch developers started working on NDFS, Nutch's distributed file system.

In 2004, Google published another paper introducing MapReduce.

In 2005, the Nutch project implemented a MapReduce system.

In 2006, the developers moved NDFS and MapReduce out of the Nutch project to form an independent subproject, named Hadoop.

In 2008, Hadoop became a top-level Apache project.

In April 2008, Hadoop broke a world record, becoming the fastest system to sort 1 TB of data, with a sorting time of 209 seconds.

In 2009, Hadoop reduced the sorting time for 1 TB of data to 62 seconds.

Since then it has gained great fame, and many companies now use it, such as Yahoo, Last.fm, Facebook, The New York Times and so on.

Hadoop's version development to date: Hadoop 1.x → Hadoop 2.x → Hadoop 3.x.

IV. Characteristics of Hadoop

Because Hadoop is developed in the Java language, its ideal running platform is Linux. Programs in a variety of other languages, such as C++, Python and PHP, are also supported.

Its advantages can be summarized as follows:

High reliability. Hadoop's ability to store and process data bit by bit can be trusted.

High efficiency. Hadoop can move data dynamically between nodes and keep each node dynamically balanced, so processing is very fast and can handle PB-scale data.

High scalability. Hadoop is designed to run efficiently and stably on clusters of inexpensive computers and can scale out to thousands of nodes.

High fault tolerance. With redundant storage, multiple copies of the data are saved automatically, and failed tasks are automatically reassigned.

Low cost. Hadoop runs on clusters of cheap computers, so its cost is relatively low; ordinary users can even build a Hadoop environment on their own machines.

This is the end of "What are the characteristics of Hadoop?". Thank you for reading. If you want to learn more about the industry, you can follow the site, where the editor will keep putting out practical, high-quality articles for you!
