2025-02-28 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
"Big data" may have sounded unfamiliar a few years ago, but by now the word Hadoop probably rings a bell: more and more people are developing with Hadoop or learning it. For a complete beginner, which part is hardest? Often just setting up the runtime environment is enough to cause a headache. If every Hadoop distribution integrated all the required components and installed them in one pass, as DKHadoop does, it would be a great help for beginners.
But enough digression; back to the topic. This article shares some basic knowledge of Hadoop, namely the Hadoop family of products, for readers who are new to it. Getting an overview of these products is a good first step toward learning Hadoop. Suggestions and corrections are welcome.
I. Definition of Hadoop
Hadoop is a large family: an open-source ecosystem built on the Java programming language. Its two core technologies are HDFS and MapReduce, which together let it store and process massive amounts of data in a distributed way.
II. Hadoop products
HDFS (distributed file system):
HDFS differs from ordinary file systems in several ways: a high degree of fault tolerance (processing can continue even when individual nodes fail), support for streaming data access, efficient access to very large data sets, strict data consistency, and lower deployment cost and higher deployment efficiency on commodity hardware. (The original article included a figure of the HDFS infrastructure here.)
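To make the storage model concrete: HDFS splits each file into fixed-size blocks (128 MB by default in recent versions) and replicates every block across machines (3 copies by default); both values are cluster-configurable, so treat the numbers below as illustrative assumptions, not fixed facts. A minimal sketch of the resulting footprint:

```python
import math

def hdfs_storage(file_bytes, block_size=128 * 1024 * 1024, replication=3):
    """Estimate block count and raw cluster storage for one file.

    Assumes the common defaults: 128 MB blocks and replication factor 3
    (dfs.blocksize and dfs.replication in a real cluster's config).
    """
    blocks = math.ceil(file_bytes / block_size)   # last block may be partial
    raw_bytes = file_bytes * replication          # every byte stored `replication` times
    return blocks, raw_bytes

# A 1 GB file occupies 8 blocks and 3 GB of raw cluster storage.
blocks, raw = hdfs_storage(1024 ** 3)
print(blocks, raw // 1024 ** 3)  # 8 3
```

The replication is what buys the fault tolerance described above: losing a node loses at most one copy of each block it held.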
MapReduce/Spark/Storm (parallel computing frameworks):
1. In terms of data processing, offline computation versus online computation:
MapReduce: typically used for offline, complex big-data computation.
Storm: used for online, real-time big-data computation; it processes data one record at a time as it arrives.
Spark: can serve both offline and near-real-time computation; Spark's streaming model processes the data that falls within a small time window, which makes it the more flexible of the three.
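The MapReduce model behind the first of these frameworks is simple to state: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. A minimal local simulation of the classic word count (a real job runs these phases distributed across a cluster; this sketch only illustrates the data flow):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values; for word count, just sum them.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big ideas", "big data tools"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 3, 'data': 2, 'ideas': 1, 'tools': 1}
```

Storm and Spark Streaming run conceptually similar transformations, but over an unbounded stream (per record, or per micro-batch) instead of a fixed input set.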
2. In terms of where data lives during computation, disk computing versus memory computing:
MapReduce: intermediate data is written to disk.
Spark and Storm: data is kept in memory, which is a key reason they are faster.
Pig/Hive (Hadoop programming):
Pig: a high-level data-flow language that performs very well on semi-structured data and can help shorten the development cycle.
Hive: a data analysis and query tool that shines when you analyze data with SQL-like queries; work that a hand-written ETL job needs a whole night for can often be done in minutes. That is its big advantage.
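Hive's query language (HiveQL) closely resembles standard SQL, with Hive compiling each statement into distributed jobs over data in HDFS. As a rough local analogy only (using Python's built-in SQLite instead of a real Hive warehouse, with a made-up page-views table), the kind of aggregation Hive handles looks like this:

```python
import sqlite3

# Hypothetical page-view data; in Hive, this table's files would live in HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("home", 120), ("docs", 45), ("home", 80)])

# HiveQL accepts essentially the same GROUP BY statement; Hive would
# execute it as one or more distributed jobs behind the scenes.
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('docs', 45), ('home', 200)]
```

The point is not SQLite itself but the programming model: analysts write declarative queries and the engine, whether SQLite locally or Hive on a cluster, plans the execution.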
HBase/Sqoop/Flume (data storage, import, and export):
HBase: a column-oriented database that runs on top of HDFS, integrates well with Pig and Hive, and can be used almost seamlessly through its Java API.
Sqoop: designed to make it easy to import data from traditional relational databases into Hadoop data sets (HDFS/Hive), and to export results back out.
Flume: designed to make it easy to move data, typically from log files, directly into a Hadoop data set (HDFS).
These data transfer tools save users a great deal of manual work, improve efficiency, and let teams focus on the business analysis itself.
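Conceptually, a Flume agent reads events from a source, buffers them in a channel, and a sink drains the channel in batches toward a destination such as HDFS. This is a toy in-memory sketch of that source/channel/sink flow, not the real Flume API (real agents are configured declaratively in properties files, and batch sizes here are made up):

```python
def flume_like_pipeline(source_events, batch_size=3):
    """Toy source -> channel -> sink flow, batching like a Flume channel.

    `source_events` stands in for lines tailed from a log file; the
    returned list of batches stands in for batched writes to an HDFS sink.
    """
    channel, sink_batches = [], []
    for event in source_events:
        channel.append(event)            # source puts each event on the channel
        if len(channel) >= batch_size:   # sink takes a full batch off the channel
            sink_batches.append(channel[:])
            channel.clear()
    if channel:                          # flush any final partial batch
        sink_batches.append(channel[:])
    return sink_batches

logs = [f"GET /page/{i}" for i in range(7)]
print([len(b) for b in flume_like_pipeline(logs)])  # [3, 3, 1]
```

Batching is the design point worth noticing: writing many small files to HDFS is inefficient, so log collectors accumulate events and deliver them in larger units.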
ZooKeeper/Oozie (system management and coordination):
ZooKeeper: a coordination service for distributed systems, used to manage the basic configuration of a distributed architecture. It provides simple interfaces that greatly ease configuration-management tasks.
Oozie: a workflow scheduler. It chains jobs into workflows so that each piece of work has a defined beginning and end. Together, these tools give us lightweight management of a distributed big-data computing stack.
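ZooKeeper stores shared configuration as a tree of small nodes ("znodes") addressed by slash-separated paths, much like a filesystem. The sketch below is a toy stand-in for that data model only, not the real client API (real clients such as Curator for Java or kazoo for Python offer create/get/watch operations against a live ensemble, plus ephemeral nodes and watches that this toy omits):

```python
class TinyZnodeTree:
    """Toy stand-in for ZooKeeper's hierarchical key-value data model."""

    def __init__(self):
        self._nodes = {"/": b""}  # the root znode always exists

    def create(self, path, data):
        # Like ZooKeeper, refuse to create a node whose parent is missing.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._nodes:
            raise KeyError(f"parent znode {parent!r} does not exist")
        self._nodes[path] = data

    def get(self, path):
        return self._nodes[path]

    def children(self, path):
        # Direct children only: one more path segment under `path`.
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self._nodes
                      if p.startswith(prefix) and "/" not in p[len(prefix):])

tree = TinyZnodeTree()
tree.create("/config", b"")
tree.create("/config/db_host", b"10.0.0.5")
print(tree.children("/config"), tree.get("/config/db_host"))
# ['db_host'] b'10.0.0.5'
```

Because every service in the cluster reads the same small tree, changing one znode reconfigures all of them consistently; that is the coordination role described above.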
Ambari/Whirr (system deployment management):
Ambari: helps operators quickly deploy and provision an entire big-data analysis stack, and monitor the running system in real time.
Whirr: its main role is to make it easy to launch Hadoop services quickly on cloud infrastructure.
Mahout (machine learning):
Mahout is designed to help us build intelligent systems quickly: it provides ready-made implementations of common machine-learning algorithms, so this framework lets us integrate machine learning into applications without writing the algorithms from scratch.
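Mahout's classic use case is collaborative filtering: recommending items by comparing rating vectors. A minimal cosine-similarity sketch of that idea in pure Python (the movie names and ratings are made-up example data; Mahout itself runs such computations at scale on Hadoop):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length rating vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Rows: items; columns: ratings by the same four users (hypothetical data).
ratings = {
    "movie_a": [5, 4, 0, 1],
    "movie_b": [4, 5, 0, 2],
    "movie_c": [0, 1, 5, 4],
}

def most_similar(item):
    # The other item whose rating vector is closest to `item`'s.
    return max((other for other in ratings if other != item),
               key=lambda other: cosine(ratings[item], ratings[other]))

print(most_similar("movie_a"))  # movie_b
```

Users who liked movie_a get movie_b recommended because the two were rated alike; the big-data part is doing this over millions of items, which is where a framework like Mahout comes in.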
© 2024 shulou.com SLNews company. All rights reserved.