The reason for writing this article: there is plenty of in-depth material out there, but to beginners much of it feels overwhelming and dry, so here is a concise summary of Hadoop. I hope it helps those just getting started.
Hadoop can fairly be called the pioneer of big data storage and computing; most open-source big data frameworks today either build on Hadoop or aim to be compatible with it.
The origin of Hadoop:
Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and designed to run on low-cost hardware; it provides high-throughput access to application data and suits applications with very large data sets. HDFS relaxes some POSIX requirements to allow streaming access to file system data.
Hadoop has two cores: HDFS (storage for massive data) and MapReduce (computation over massive data).
The advantages of Hadoop: it is a distributed software framework for big data that processes data in a reliable, efficient, and highly scalable way.
Reliable: it assumes that compute elements and storage can fail, so it keeps multiple copies of working data and can redistribute processing away from failed nodes.
Efficient: it works in parallel, speeding up processing by spreading it across many machines.
Scalable: it can grow to handle petabyte-scale data.
So, with all of that said, what exactly does Hadoop do, and what can it be used for?
Hadoop is a framework suited to big data storage and big data analysis; it runs on clusters of thousands or even tens of thousands of servers and supports petabyte-scale storage. What does Hadoop provide? Using a server cluster, it performs distributed processing of massive data according to user-defined business logic. What scenarios does it fit? Currently, the most typical are text, logs, video, images, and geolocation data: large volumes with complex data types that traditional databases cannot store or process.
Technical introduction:
HDFS:
As the name implies, big data first needs somewhere to put the data. HDFS is designed to store large amounts of data across thousands of servers.
For example, when you access the data at /hdfs/tmp/a1, you see only a single path, but the data behind it is likely stored on many different machines.
As a user, you simply do not need to care where, or on how many machines, your data is stored; you focus on using and processing it, and leave the management to HDFS.
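To make this concrete, here is a minimal sketch using the standard Hadoop Java client that reads the example path above. It assumes a configured cluster whose address comes from core-site.xml on the classpath; the client never needs to know which machines actually hold the blocks.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/hdfs/tmp/a1"); // the example path from the text
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Blocks are located and fetched transparently by HDFS.
                System.out.println(line);
            }
        }
    }
}
```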
MapReduce:
Once we can store data, we have to consider how to process it. A single computer might take days to process terabytes or petabytes of data, which is obviously unacceptable to any company. But if we use many computers, we face the problems of how to divide the work among them and how they communicate and exchange data. This is what MapReduce (and later Spark) addresses: it provides a reliable computing model that runs on the cluster.
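The classic illustration is WordCount, sketched here against the standard org.apache.hadoop.mapreduce API. The framework handles splitting the input, scheduling parallel tasks, and shuffling intermediate data; the programmer writes only the map and reduce steps.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel on every block of input, emitting (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework groups pairs by word (the "shuffle");
    // we just sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```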
Hive:
Put simply, programmers find writing raw MapReduce programs tedious; Hive solves this problem.
Hive automatically translates SQL (or SQL-like scripts) into MapReduce programs and hands them to the compute engine. Because SQL is easy to write and easy to modify, one or two lines of SQL can replace dozens or even hundreds of lines of MapReduce code. This layering is the basic structure of a data warehouse: HDFS at the bottom, MapReduce/Spark running on top of it, and Hive encapsulating them above.
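As a sketch of how this looks in practice, the snippet below sends one line of SQL through HiveServer2's standard JDBC interface. The host name and the logs table are assumptions invented for this example; compare its single query against the dozens of lines of the WordCount job above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (hive-jdbc jar must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint and database for this sketch.
        String url = "jdbc:hive2://hiveserver-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             // One line of SQL; Hive compiles it into MapReduce (or Tez/Spark)
             // jobs behind the scenes. "logs" is a made-up table.
             ResultSet rs = stmt.executeQuery(
                     "SELECT level, COUNT(*) AS cnt FROM logs GROUP BY level")) {
            while (rs.next()) {
                System.out.println(rs.getString("level") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```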
Storm:
Want faster results? Storm is among the most popular stream-computing platforms. The idea of stream processing is to handle data the moment it enters the system, with essentially no delay. The drawback is inflexibility: you must know in advance what you want to compute, because once the data has flowed past it is gone and cannot be recomputed. So Storm is a good complement, but it is no substitute for the systems above.
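A minimal topology sketch, written against the Storm 2.x Java API, makes the model visible: a spout feeds tuples in, and a bolt processes each tuple the instant it arrives. NumberSpout is a toy source invented for this example (a real deployment would use something like a Kafka spout), and the in-memory count illustrates the limitation above: it covers only data seen since the topology started.

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class StreamingCountTopology {

    // Toy spout standing in for a real source such as a Kafka spout.
    public static class NumberSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private long n = 0;

        @Override
        public void open(Map<String, Object> conf, TopologyContext context,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values(n++)); // one tuple per call
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("number"));
        }
    }

    // Bolt that processes each tuple the instant it arrives; the running
    // count exists only for data seen after the topology started.
    public static class CounterBolt extends BaseBasicBolt {
        private long count = 0;

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            count++;
            if (count % 100_000 == 0) {
                System.out.println("seen so far: " + count);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: nothing emitted downstream
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("numbers", new NumberSpout(), 1);
        builder.setBolt("counter", new CounterBolt(), 2)
               .shuffleGrouping("numbers");

        // Runs in-process for demonstration; a real deployment would use
        // StormSubmitter.submitTopology(...) against a cluster.
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("streaming-count", new Config(),
                    builder.createTopology());
            Thread.sleep(10_000);
        }
    }
}
```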
HBase:
HBase is a distributed, column-oriented storage system built on HDFS. Data is stored as key-value pairs, and access is optimized so that the data bound to a key can be retrieved quickly. For example, looking up a single record by its key in several petabytes of data can take only a few tenths of a second.
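A point lookup in the standard HBase Java client looks roughly like the sketch below; the users table, the info:name column, and the row key are assumptions made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseGetExample {
    public static void main(String[] args) throws Exception {
        // Reads cluster location from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // A Get is a point lookup by row key; HBase routes it straight
            // to the region holding that key, which keeps it fast at scale.
            Get get = new Get(Bytes.toBytes("user#12345"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"),
                                          Bytes.toBytes("name"));
            System.out.println(name == null ? "not found" : Bytes.toString(name));
        }
    }
}
```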