
Hadoop Big Data Development Learning Roadmap: Stage 1


Hadoop has grown into a very rich family of products, able to meet big data processing needs across many different scenarios. As the mainstream big data processing technology, it is the foundation many companies build on, with very mature solutions for a wide range of use cases.

For a developer, mastering Hadoop and its ecosystem is the way into the big data world.

Here is a detailed roadmap for learning Hadoop development technology.

Hadoop itself is developed in Java, so its Java support is excellent, but other languages can be used as well.

The following route focuses on data mining; because Python allows faster development, we use Python for the tasks.

Because Hadoop runs on Linux, you also need to know Linux.

Phase 1: Hadoop Ecosystem Technology

Language Foundation

Java: Master Java SE; understand and practice JVM memory management, multithreading, thread pools, design patterns, and parallelization. In-depth mastery of these is not required.

Linux: System installation (command-line and graphical interfaces), basic commands, network configuration, the Vim editor, process management, Shell scripting, familiarity with virtual machine menus, etc.

Python: Basic syntax, data structures, functions, conditionals, loops, and other fundamentals.
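As a minimal sketch of those Python basics, the toy function below uses a list, a dict, a conditional, and a loop; all names are illustrative.

```python
# A toy example covering the basics above: a function, a dict and list
# (data structures), a conditional judgment, and a loop.
def word_lengths(words):
    """Map each non-empty word to its length."""
    lengths = {}                      # dict: a core data structure
    for word in words:                # loop over a list
        if word:                      # conditional judgment
            lengths[word] = len(word)
    return lengths

print(word_lengths(["hadoop", "hdfs", "yarn"]))
# -> {'hadoop': 6, 'hdfs': 4, 'yarn': 4}
```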

Environmental preparation

Here we build a fully distributed cluster on a single Windows PC, with 1 master node and 2 slave nodes.

With VMware virtual machines, a Linux system (CentOS 6.5), and the Hadoop installation package, a fully distributed Hadoop cluster environment is prepared.

MapReduce

MapReduce is a distributed offline computing framework and the core programming model of Hadoop. It is mainly suited to large-scale cluster tasks; because jobs execute in batches, latency is high.
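As a minimal sketch of the programming model, here is the classic word count written in Python for Hadoop Streaming (which lets MapReduce jobs be written in any language that reads stdin). The file name and layout are illustrative.

```python
#!/usr/bin/env python
# wordcount.py -- a Hadoop Streaming word count; run as the mapper with
# "wordcount.py map" and as the reducer with "wordcount.py reduce".
import sys

def mapper():
    # Map phase: emit "word<TAB>1" for every word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

def reducer():
    # Reduce phase: Hadoop sorts by key, so equal words arrive together
    # and can be summed in a single pass.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(current + "\t" + str(count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

It would be submitted with the hadoop-streaming JAR, roughly: hadoop jar .../hadoop-streaming-*.jar -input /data/in -output /data/out -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py (paths are assumptions).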

HDFS 1.0/2.0

Hadoop Distributed File System (HDFS) is a highly fault-tolerant system suitable for deployment on inexpensive machines. HDFS provides high-throughput data access and is ideal for applications on large-scale datasets.
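HDFS can also be driven from Python through WebHDFS. This sketch uses the third-party hdfs package; the NameNode address, user, and paths are assumptions (port 9870 is the Hadoop 3 default; Hadoop 2 uses 50070).

```python
# A sketch with the third-party "hdfs" package (a WebHDFS client).
from hdfs import InsecureClient

# NameNode web address and user are assumptions for illustration.
client = InsecureClient("http://master:9870", user="hadoop")

# Write a small file to HDFS, read it back, then list the directory
# (the equivalent of `hdfs dfs -ls /tmp`).
client.write("/tmp/hello.txt", data=b"hello hdfs", overwrite=True)
with client.read("/tmp/hello.txt") as reader:
    print(reader.read())
print(client.list("/tmp"))
```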

Yarn (Hadoop 2.0)

Yarn is a resource scheduling platform, mainly responsible for allocating resources to tasks. It is a general-purpose scheduler: any framework that meets its requirements can use Yarn for resource scheduling.
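As an illustration of Yarn as a shared platform, the sketch below queries the ResourceManager's REST API for running applications, the same list that `yarn application -list` prints; the host is an assumption and 8088 is the ResourceManager web port default.

```python
# A sketch that lists running applications via the YARN ResourceManager
# REST API; any framework scheduled by Yarn shows up here.
import requests

resp = requests.get("http://master:8088/ws/v1/cluster/apps",
                    params={"states": "RUNNING"})
# "apps" is null in the JSON when nothing is running, hence the `or {}`.
for app in (resp.json().get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["state"])
```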

Hive

Hive is a data warehouse whose data is all stored on HDFS. Hive is used by writing HQL, a query language very similar to MySQL's SQL. When an HQL statement is executed, the underlying work is still done by MapReduce programs.
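A minimal sketch of issuing HQL from Python through HiveServer2, using the PyHive package; the host, port (10000 is the HiveServer2 default), and table name are assumptions.

```python
# A sketch using the PyHive package to run HQL against HiveServer2.
from pyhive import hive

conn = hive.Connection(host="master", port=10000, username="hadoop")
cursor = conn.cursor()

# This reads like MySQL SQL, but Hive compiles it into MapReduce jobs
# that run over files stored on HDFS.
cursor.execute("SELECT word, COUNT(*) FROM word_log GROUP BY word")
for word, cnt in cursor.fetchall():
    print(word, cnt)
```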

Spark

Spark is a fast, general-purpose computation engine designed for large-scale data processing, built on memory-based iterative computation. Spark retains the advantages of MapReduce while greatly improving timeliness.
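For comparison with the MapReduce example, here is the same word count in PySpark; the input path is an assumption, and cache() is what keeps the data in memory for iterative reuse.

```python
# A minimal PySpark sketch: word count as in-memory RDD transformations.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")
counts = (sc.textFile("hdfs:///data/in")          # input path is an assumption
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
counts.cache()        # keep the result in memory for further iterations
print(counts.take(10))
sc.stop()
```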

Spark Streaming

Spark Streaming is a near-real-time processing framework in which data is processed in small batches (micro-batches).
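A sketch of that batch-by-batch model: lines arriving on a socket are collected into five-second micro-batches and counted; the host and port are assumptions.

```python
# A Spark Streaming sketch: each 5-second micro-batch is counted like
# a small word-count job.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-wordcount")
ssc = StreamingContext(sc, batchDuration=5)

lines = ssc.socketTextStream("master", 9999)   # source is an assumption
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```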

Spark Hive

Fast SQL retrieval based on Spark. With Spark as Hive's computation engine, Hive queries are submitted to the Spark cluster as Spark tasks, which can greatly improve Hive's query performance.
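Hive on Spark itself is switched on in Hive's configuration (hive.execution.engine=spark); the related approach sketched below queries Hive tables directly from Spark via a Hive-enabled SparkSession. The table name is an assumption.

```python
# A sketch of running SQL over Hive tables with Spark as the engine.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-on-hive")
         .enableHiveSupport()   # use Hive's metastore and tables
         .getOrCreate())

# The same kind of HQL as before, but executed by Spark, not MapReduce.
spark.sql("SELECT word, COUNT(*) AS cnt FROM word_log GROUP BY word").show()
spark.stop()
```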

Storm

Storm is a real-time computing framework. The difference between MapReduce and Storm is that MR processes massive offline data in batches, while Storm processes each new piece of data one by one, in real time as it arrives, guaranteeing the timeliness of data processing.

Zookeeper

Zookeeper is the foundation of many big data frameworks: it is the manager of the cluster. It monitors the status of each node in the cluster and takes the next appropriate action based on the feedback nodes submit.

In the end, it offers users a simple, easy-to-use interface and an efficient, stable system.
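The sketch below shows the coordination primitive behind that monitoring, using the kazoo Python client: an ephemeral znode that disappears if its owner dies, plus a watch that fires when its data changes. The connection string and paths are assumptions.

```python
# A sketch with the kazoo client: ephemeral nodes + watches are how
# a manager tracks the live status of cluster nodes.
from kazoo.client import KazooClient

zk = KazooClient(hosts="master:2181")
zk.start()

zk.ensure_path("/demo")
# An ephemeral node vanishes automatically if this client disconnects.
zk.create("/demo/worker-1", b"alive", ephemeral=True)

@zk.DataWatch("/demo/worker-1")
def on_change(data, stat):
    # Fires on every data change (and once immediately on registration).
    print("worker-1 state:", data)

zk.set("/demo/worker-1", b"busy")
zk.stop()
```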

Hbase

Hbase is a NoSQL, Key-Value database: highly reliable, column-oriented, scalable, and distributed.

It is used to store unstructured data, and its underlying data lives on HDFS.
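A sketch of that Key-Value access pattern from Python, using the happybase package over HBase's Thrift server; the host, table, and column names are assumptions.

```python
# A happybase sketch: put and get a row by key. Columns live inside
# column families, reflecting HBase's column-oriented design.
import happybase

connection = happybase.Connection("master")   # needs the HBase Thrift server
table = connection.table("user_profiles")     # table name is an assumption

table.put(b"user-001", {b"info:city": b"Beijing", b"info:age": b"30"})
row = table.row(b"user-001")
print(row[b"info:city"])

connection.close()
```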

Kafka

Kafka is message-oriented middleware, commonly used in real-time processing scenarios as an intermediate buffer layer.
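A sketch of that buffer layer with the kafka-python package: a producer pushes a log line into a topic and a consumer reads it back; the broker address and topic name are assumptions.

```python
# A kafka-python sketch: Kafka as the buffer between data collectors
# and real-time processors.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="master:9092")
producer.send("app-logs", b"2024-01-01 12:00:00 INFO request handled")
producer.flush()

consumer = KafkaConsumer("app-logs",
                         bootstrap_servers="master:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)  # stop polling after 5s idle
for message in consumer:
    print(message.offset, message.value)
```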

Flume

Flume is a log collection tool, commonly used to collect data from the log files that applications generate. There are generally two pipelines.

In one, Flume collects data into Kafka, making it easy for Storm or Spark Streaming to process it in real time.

In the other, Flume stores the collected data on HDFS for later offline processing with Hadoop or Spark.
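Flume itself is driven by a properties-style agent configuration rather than Python, so here is a sketch of one agent feeding both pipelines at once: an exec source tails an application log and fans events out to a Kafka sink and an HDFS sink. All hosts, paths, and names are assumptions.

```
# flume.conf -- one agent, two pipelines (names/hosts are assumptions)
a1.sources  = r1
a1.channels = c1 c2
a1.sinks    = k1 k2

# Tail the application log and fan events out to both channels.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1 c2

a1.channels.c1.type = memory
a1.channels.c2.type = memory

# Pipeline 1: Kafka, for Storm / Spark Streaming to consume.
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = app-logs
a1.sinks.k1.kafka.bootstrap.servers = master:9092
a1.sinks.k1.channel = c1

# Pipeline 2: HDFS, for later offline processing with Hadoop or Spark.
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = hdfs://master:9000/flume/app-logs/%Y-%m-%d
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.useLocalTimeStamp = true
a1.sinks.k2.channel = c2
```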
