Hadoop Learning Road (1)-- Hadoop Family Learning Roadmap

2025-04-16 Update From: SLTechnology News&Howtos

This article introduces the Hadoop family of products. Commonly used projects include Hadoop, Hive, Pig, HBase, Sqoop, Mahout, ZooKeeper, Avro, Ambari, and Chukwa; newer additions include YARN, HCatalog, Oozie, Cassandra, Hama, Whirr, Flume, Bigtop, Crunch, Hue, and others.

Since 2011, China has entered the era of big data, and the family of software represented by Hadoop has come to dominate big data processing. Open source projects and vendors alike are aligning their data software with Hadoop. Hadoop has also gone from a niche, high-end field to the de facto standard for big data development. Building on the original Hadoop technology, a whole family of Hadoop products has emerged, and continued innovation around the concept of "big data" keeps driving the technology forward.

As developers in the IT world, we should keep pace, seize the opportunity, and rise with Hadoop!

Preface

I have been using Hadoop for some time, going from initial confusion, through all kinds of experiments, to today's combined applications. Gradually, anything involving data processing has become inseparable from Hadoop. Hadoop's success in the big data field has in turn accelerated its own development, and there are now as many as 20 products in the Hadoop family.

It is worth organizing what I know and tying the products and technologies together. Doing so not only reinforces the knowledge but also lays the groundwork for future technical direction and technology selection.

This article opens the "Hadoop Family" series with a learning roadmap for the Hadoop family.

Catalogue

Hadoop family products

Hadoop Family Learning Roadmap

1. Hadoop family products

As of 2013, according to Cloudera, the number of Hadoop family products has reached 20!

http://blog.cloudera.com/blog/2013/01/apache-hadoop-in-2013-the-state-of-the-platform/

Below, I divide these 20 products into two categories.

The first category is what I have already mastered.

The second category is what I still plan to learn (TODO).

A one-sentence introduction to each product:

Apache Hadoop: an open source distributed computing framework from the Apache Software Foundation. It provides a distributed file system subproject (HDFS) and a software architecture that supports MapReduce distributed computing.
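To make the MapReduce model concrete, here is a minimal in-process sketch of the map, shuffle, and reduce phases, using word counting as the example. It is plain Python and does not use the Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are my own and only illustrate the idea. On a real cluster the framework distributes these phases across machines and reads the input from HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as happens between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in one process over an iterable of text lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines of a file stored in HDFS.
    print(word_count(["hadoop hive pig", "hadoop hbase", "hive hadoop"]))
```

The point of the split is that map_phase and reduce_phase only ever see one line or one key at a time, which is what allows a framework to run them in parallel on many nodes.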

Apache Hive: a data warehouse tool built on Hadoop. It maps structured data files to database tables and lets you run simple MapReduce statistics quickly through SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.

Apache Pig: a large-scale data analysis tool built on Hadoop. It provides a SQL-like language called Pig Latin, whose compiler converts data analysis requests into a series of optimized MapReduce operations.

Apache HBase: a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase, large-scale structured storage clusters can be built on inexpensive PC servers.
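As a rough picture of what "column-oriented" means here, the sketch below models a table as a sorted map from row key to (column family, qualifier) to timestamped versions. It is an in-memory toy, not the HBase client API; the class name TinyColumnStore and its methods are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterator, Optional, Tuple

Column = Tuple[str, str]  # (column family, qualifier)


class TinyColumnStore:
    """Toy model of a column-oriented table: row key -> column -> versions."""

    def __init__(self) -> None:
        # row key -> (family, qualifier) -> {timestamp: value}
        self._rows: Dict[str, Dict[Column, Dict[int, bytes]]] = defaultdict(
            lambda: defaultdict(dict)
        )

    def put(self, row: str, family: str, qualifier: str,
            value: bytes, timestamp: int) -> None:
        """Store one cell version; older versions of the cell are kept."""
        self._rows[row][(family, qualifier)][timestamp] = value

    def get(self, row: str, family: str, qualifier: str) -> Optional[bytes]:
        """Return the newest version of a cell, or None if it does not exist."""
        versions = self._rows.get(row, {}).get((family, qualifier))
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row: str, stop_row: str) -> Iterator[str]:
        """Yield row keys in sorted order within [start_row, stop_row)."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row


if __name__ == "__main__":
    store = TinyColumnStore()
    store.put("user1", "info", "name", b"alice", timestamp=1)
    store.put("user1", "info", "name", b"alice2", timestamp=2)
    print(store.get("user1", "info", "name"))
```

Because rows are kept sorted by key, range scans over adjacent keys are cheap, which is why row key design matters so much in this kind of store.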

Apache Sqoop: a tool for transferring data between Hadoop and relational databases. Data from a relational database (MySQL, Oracle, Postgres, etc.) can be imported into Hadoop's HDFS, and HDFS data can be exported to relational databases.

Apache ZooKeeper: a distributed, open source coordination service designed for distributed applications. It mainly addresses data management problems commonly encountered in distributed applications, simplifies their coordination and management, and provides high-performance distributed services.

Apache Mahout: a distributed machine learning and data mining framework built on Hadoop. Mahout implements a number of data mining algorithms with MapReduce, which addresses the problem of parallel mining.

Apache Cassandra: an open source distributed NoSQL database system. Originally developed by Facebook to store simply formatted data, it combines Google BigTable's data model with Amazon Dynamo's fully distributed architecture.
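The Dynamo-style "fully distributed" part can be sketched with a consistent hash ring: every node owns a position on a ring, and a key is stored on the next nodes clockwise from its own hash. The code below is a simplified illustration under my own assumptions (MD5 as the hash, one token per node, hypothetical node names); a real cluster uses its own partitioner and typically many tokens per node.

```python
import bisect
import hashlib
from typing import Iterable, List


def token(key: str) -> int:
    """Map a string to a position on the ring (MD5 is an assumption of this sketch)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent hash ring with one token per node."""

    def __init__(self, nodes: Iterable[str]) -> None:
        self._ring = sorted((token(node), node) for node in nodes)
        if not self._ring:
            raise ValueError("the ring needs at least one node")
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key: str, n: int = 2) -> List[str]:
        """Return the n nodes that follow the key's token clockwise on the ring."""
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        count = min(n, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.replicas("user:42", n=2))
```

The useful property is that removing a node only moves the keys that node owned; every other key keeps its primary node, so no central coordinator has to reshuffle the whole data set.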

Apache Avro: a data serialization system designed to support data-intensive applications that exchange large volumes of data. Avro is a new data serialization format and transmission tool, which was expected to gradually replace Hadoop's original IPC mechanism.

Apache Ambari: a Web-based tool that supports provisioning, management, and monitoring of Hadoop clusters.

Apache Chukwa: an open source data collection system for monitoring large distributed systems. It collects various types of data into files suitable for Hadoop processing and stores them in HDFS for Hadoop to run MapReduce operations on.

Apache Hama: an HDFS-based BSP (Bulk Synchronous Parallel) computing framework. Hama can be used for large-scale big data computation, including graph, matrix, and network algorithms.

Apache Flume: a distributed, reliable, highly available system for aggregating massive volumes of logs. It can be used to collect, process, and transmit log data.

Apache Giraph: a scalable, distributed iterative graph processing system based on the Hadoop platform, inspired by BSP (bulk synchronous parallel) and Google's Pregel.
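The BSP model mentioned for Hama and Giraph runs in supersteps: each vertex processes the messages it received, optionally sends new ones, and then all vertices wait at a barrier before the next superstep. Below is a single-process sketch of that loop, using the classic "propagate the maximum value" example. It does not use the Giraph or Hama APIs; the function name bsp_max_value and the graph layout are assumptions of this sketch.

```python
from typing import Dict, List


def bsp_max_value(graph: Dict[str, List[str]],
                  values: Dict[str, int]) -> Dict[str, int]:
    """Spread the largest value along outgoing edges, one superstep at a time.

    graph maps every vertex to its outgoing neighbors; every neighbor must
    also appear as a key of graph.
    """
    state = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbors.
    inbox: Dict[str, List[int]] = {vertex: [] for vertex in graph}
    for vertex, neighbors in graph.items():
        for neighbor in neighbors:
            inbox[neighbor].append(state[vertex])

    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox: Dict[str, List[int]] = {vertex: [] for vertex in graph}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # a vertex without messages stays halted
            best = max(messages)
            if best > state[vertex]:
                state[vertex] = best
                for neighbor in graph[vertex]:
                    outbox[neighbor].append(best)
        # Barrier: messages sent in this superstep are only visible in the next.
        inbox = outbox
    return state


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(bsp_max_value(chain, {"a": 3, "b": 6, "c": 2}))
```

A vertex only sends messages when its value changed, so the computation stops on its own once every vertex has settled.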

Apache Oozie: a workflow engine server for managing and coordinating tasks running on the Hadoop platform (HDFS, Pig, and MapReduce).
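A workflow of this kind can be thought of as a directed acyclic graph of actions, where an action may only start after the actions it depends on have finished. The sketch below computes one valid run order for such a graph. It is not Oozie's workflow definition format or API; the function name run_order and the action names in the example are hypothetical.

```python
from collections import deque
from typing import Dict, List


def run_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return an order in which every action runs after its dependencies.

    dependencies maps each action to the actions that must finish first.
    """
    remaining = {action: set(deps) for action, deps in dependencies.items()}
    ready = deque(sorted(a for a, deps in remaining.items() if not deps))
    order: List[str] = []

    while ready:
        action = ready.popleft()
        order.append(action)
        # Release every action that was only waiting for this one.
        for other in sorted(remaining):
            deps = remaining[other]
            if action in deps:
                deps.remove(action)
                if not deps:
                    ready.append(other)

    if len(order) != len(remaining):
        raise ValueError("workflow has a cycle or an unknown dependency")
    return order


if __name__ == "__main__":
    # Hypothetical pipeline: import data, clean it, aggregate it, then report.
    workflow = {
        "import": [],
        "clean": ["import"],
        "aggregate": ["clean"],
        "report": ["aggregate", "clean"],
    }
    print(run_order(workflow))
```

Rejecting cycles up front is the important design choice: a workflow with a circular dependency could never finish, so it is better to fail before anything is submitted to the cluster.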

Apache Crunch: a Java library, based on Google's FlumeJava library, for creating MapReduce programs. Like Hive and Pig, Crunch provides a library of patterns for common tasks such as joining data, performing aggregations, and sorting records.

Apache Whirr: a set of class libraries for running cloud services (including Hadoop), offering a high degree of complementarity. Whirr supports Amazon EC2 and Rackspace services.

Apache Bigtop: a tool for packaging, distributing, and testing Hadoop and its surrounding ecosystem.

Apache HCatalog: a table and storage management layer for Hadoop data. It provides central metadata and schema management across Hadoop and RDBMS, and offers a relational view of the data through Pig and Hive.

Cloudera Hue: a web-based monitoring and management system that provides web operation and management of HDFS, MapReduce/YARN, HBase, Hive, and Pig.

2. Hadoop Family Learning Roadmap

Below, I introduce the installation and use of each product and summarize my learning route based on my own experience.

Hadoop

Hadoop learning roadmap

Yarn learning roadmap

Using Maven to build Hadoop Project

Installing historical versions of Hadoop

Hadoop programmatically calls HDFS

Extracting KPI statistics with Hadoop in massive web log analysis

Constructing Movie recommendation system with Hadoop

Create a Hadoop parent virtual machine

Clone virtual machine and add Hadoop node

R language injects statistical blood into Hadoop

Building the Hadoop environment, part 1 of the RHadoop practice series

Hive

Hive learning roadmap

Introduction to installation and use of Hive

Test of importing 10G data into Hive

Hive of R Lijian NoSQL series of articles

Using RHive to extract reverse repurchase information from historical data

Pig

Pig learning roadmap

Zookeeper

Zookeeper learning roadmap

Installation and use of a ZooKeeper pseudo-distributed cluster

Implementing a distributed queue with ZooKeeper

Implementing distributed FIFO queues with ZooKeeper

HBase

HBase learning roadmap

Installation and use of rhbase, part 4 of RHadoop practice series

Mahout

Mahout learning roadmap

Parsing Mahout user recommendation Collaborative filtering algorithm (UserCF) with R

Implementing a MapReduce collaborative filtering algorithm in R, part 3 of the RHadoop practice series

Using Maven to build Mahout Project

Detailed explanation of Mahout recommendation algorithm API

Analysis of Mahout recommendation engine from source code

Mahout distributed program development: item-based collaborative filtering (ItemCF)

Mahout distributed program development: K-means clustering

Building a Job recommendation engine with Mahout

Sqoop

Sqoop learning roadmap

Cassandra

Cassandra learning roadmap

Cassandra single-cluster experiment with 2 nodes

Cassandra of R Lijian NoSQL series of articles
