This article introduces the Hadoop family of products. Commonly used members include Hadoop, Hive, Pig, HBase, Sqoop, Mahout, Zookeeper, Avro, Ambari, and Chukwa; newer additions include YARN, HCatalog, Oozie, Cassandra, Hama, Whirr, Flume, Bigtop, Crunch, Hue, and more.
Since 2011, China has entered the era of big data, and the family of software led by Hadoop has come to dominate the territory of big data processing. Open source projects and commercial vendors alike are aligning their data software with Hadoop. Hadoop itself has grown from a niche technology for the privileged few into the de facto standard for big data development, and on the foundation of the original Hadoop technology, a whole family of Hadoop products has emerged, driving technological progress through continuous innovation around the concept of "big data".
As developers in the IT industry, we should keep pace, seize the opportunity, and rise along with Hadoop!
Preface
I have been using Hadoop for some time now: from initial confusion, through all kinds of experiments, to today's combined applications. Gradually I have found that data processing is inseparable from Hadoop. Its success in the big data field has in turn accelerated its own development, and there are now as many as 20 products in the Hadoop family.
It is worth sorting out this knowledge and stringing the products and technologies together. Doing so not only deepens the impression they leave, but also lays the groundwork for future technology directions and technology selection.
This article is the opening piece of the "Hadoop Family" series: a learning roadmap for the Hadoop family.
Contents
Hadoop family products
Hadoop Family Learning Roadmap
1. Hadoop family products
As of 2013, according to Cloudera's statistics, the Hadoop family already numbered 20 products!
http://blog.cloudera.com/blog/2013/01/apache-hadoop-in-2013-the-state-of-the-platform/
Next, I divide these 20 products into two categories.
The first category: products I have already mastered.
The second category: a TODO list of products I plan to continue learning.
Here is a one-sentence introduction to each product:
Apache Hadoop: the Apache open source organization's distributed computing framework. It provides a distributed file system subproject (HDFS) and a software architecture supporting MapReduce distributed computing.
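To make the MapReduce programming model concrete, here is a minimal WordCount mapper sketch against the org.apache.hadoop.mapreduce API; the class name is illustrative, and the surrounding job setup (Job configuration, input/output paths) is omitted:

```java
// Minimal WordCount mapper sketch using the org.apache.hadoop.mapreduce API.
// Illustrative only; job setup (Job, input/output paths) is not shown.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each input line into tokens and emit (word, 1) pairs;
        // a reducer would then sum the counts per word.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}
```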
Apache Hive: a data warehouse tool built on Hadoop. It maps structured data files to database tables and lets you run simple MapReduce statistics through SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis of data warehouses.
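As an illustration of the "SQL-like statements" point, the following sketch queries HiveServer2 over JDBC from Java; the connection URL, the empty credentials, and the page_views table are assumptions, not from the original article:

```java
// Sketch: querying Hive through the HiveServer2 JDBC driver.
// The URL, table name ("page_views"), and empty credentials are assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this SQL-like statement into MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")) {
            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```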
Apache Pig: a large-scale data analysis tool built on Hadoop. It provides a SQL-like language called Pig Latin, whose compiler converts data analysis requests into a series of optimized MapReduce operations.
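A small sketch of driving Pig Latin from Java with the PigServer API in local mode; the access.log input file and its field layout are assumptions:

```java
// Sketch: running Pig Latin statements from Java via PigServer in local mode.
// The input file name and field layout are assumptions.
import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigLatinExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Each registerQuery call adds one Pig Latin statement; the compiler
        // turns the final plan into a series of (MapReduce) operations.
        pig.registerQuery("logs = LOAD 'access.log' AS (url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH grouped GENERATE group, COUNT(logs);");
        Iterator<Tuple> it = pig.openIterator("hits");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```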
Apache HBase: a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase, large structured storage clusters can be built on inexpensive PC servers.
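A minimal sketch of the HBase Java client (the 1.0+ Connection API) writing and reading one cell; the users table and info column family are assumptions and must already exist:

```java
// Sketch: writing and reading one cell with the HBase 1.0+ Java client.
// Table "users" and column family "info" are assumptions; the table
// must already exist with that family.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write: row key "row1", column info:name.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("alice"));
            table.put(put);
            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```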
Apache Sqoop: a tool for transferring data between Hadoop and relational databases. Data from a relational database (MySQL, Oracle, Postgres, etc.) can be imported into Hadoop's HDFS, and HDFS data can be exported back into a relational database.
Apache Zookeeper: a distributed, open source coordination service designed for distributed applications. It solves the data management problems that distributed applications frequently encounter, reducing the difficulty of coordinating and managing them while providing high-performance distributed services.
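A minimal sketch of the ZooKeeper Java client creating and reading a znode; the localhost:2181 address and the /app-config path are assumptions, and production code would wait for the connection event before issuing requests:

```java
// Sketch: creating and reading a znode with the ZooKeeper Java client.
// Assumes a ZooKeeper server on localhost:2181; production code would wait
// for the connected event (e.g. via a CountDownLatch in the watcher).
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
        // Znodes form a small, replicated tree that distributed applications
        // use for configuration, naming, and coordination.
        zk.create("/app-config", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```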
Apache Mahout: a distributed machine learning and data mining framework based on Hadoop. Mahout implements a number of data mining algorithms on MapReduce, solving the problem of parallel mining.
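As a sketch of the kind of collaborative filtering Mahout offers (the UserCF algorithm revisited in the roadmap below), here is user-based CF with Mahout's single-machine Taste API; the ratings.csv input (userID,itemID,rating lines) is an assumption, and Mahout's MapReduce variants follow the same concepts:

```java
// Sketch: user-based collaborative filtering (UserCF) with Mahout's Taste API.
// Single-machine API shown for compactness; ratings.csv is an assumed input.
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserCfExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Find the 10 most similar users, then recommend 3 items for user 1.
        Recommender recommender = new GenericUserBasedRecommender(
            model, new NearestNUserNeighborhood(10, similarity, model), similarity);
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```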
Apache Cassandra: an open source distributed NoSQL database system. Originally developed by Facebook to store data in a simple format, it combines Google BigTable's data model with Amazon Dynamo's fully distributed architecture.
Apache Avro: a data serialization system designed for data-intensive applications that exchange large volumes of data. Avro is a new serialization format and transmission tool that is gradually replacing Hadoop's original IPC mechanism.
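A hedged sketch of Avro's generic API round-tripping one record through its compact binary format; the User schema here is invented purely for illustration:

```java
// Sketch: defining an Avro schema at runtime and round-tripping one record.
// The "User" schema is an assumption for illustration.
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.io.*;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":" +
            "[{\"name\":\"name\",\"type\":\"string\"}]}");
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        // Serialize to Avro's compact binary format.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();
        // Deserialize using the same schema.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord back = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(back.get("name"));
    }
}
```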
Apache Ambari: a Web-based tool that supports provisioning, management, and monitoring of Hadoop clusters.
Apache Chukwa: an open source data collection system for monitoring large distributed systems. It collects various types of data into files suitable for Hadoop processing and stores them in HDFS, where Hadoop can run MapReduce jobs over them.
Apache Hama: a BSP (Bulk Synchronous Parallel) computing framework based on HDFS. Hama targets large-scale big data computations, including graph, matrix, and network algorithms.
Apache Flume: a distributed, reliable, highly available system for aggregating massive volumes of logs; it can be used for log data collection, processing, and transmission.
Apache Giraph: a scalable, distributed iterative graph processing system based on the Hadoop platform, inspired by BSP (bulk synchronous parallel) and Google's Pregel.
Apache Oozie: a workflow engine server for managing and coordinating tasks running on the Hadoop platform (HDFS, Pig, and MapReduce).
Apache Crunch: a Java library, based on Google's FlumeJava library, for creating MapReduce programs. Like Hive and Pig, Crunch provides a library of patterns for common tasks such as joining data, performing aggregations, and sorting records.
Apache Whirr: a set of libraries for running services (including Hadoop) on cloud platforms, offering a high degree of portability. Whirr currently supports Amazon EC2 and Rackspace services.
Apache Bigtop: a tool for packaging, distributing, and testing Hadoop and its surrounding ecosystem.
Apache HCatalog: table and storage management for Hadoop data. It provides centralized metadata and schema management spanning Hadoop and the RDBMS, exposing relational views through Pig and Hive.
Cloudera Hue: a web-based monitoring and management system for operating and managing HDFS, MapReduce/YARN, HBase, Hive, and Pig through the browser.
2. Hadoop Family Learning Roadmap
Below, I introduce each product's installation and use, and summarize my learning route based on my own experience.
Hadoop
Hadoop learning roadmap
YARN learning roadmap
Building a Hadoop project with Maven
Installing historical versions of Hadoop
Calling HDFS programmatically from Hadoop (see the sketch after this list)
Extracting KPI statistics from massive web logs with Hadoop
Building a movie recommendation system with Hadoop
Creating a Hadoop parent virtual machine
Cloning virtual machines and adding Hadoop nodes
The R language injects statistical blood into Hadoop
Building a Hadoop environment, part 1 of the RHadoop practice series
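As referenced in the list above, here is a minimal sketch of calling HDFS programmatically through Hadoop's FileSystem API; the /tmp/hello.txt path is an assumption, and fs.defaultFS is read from the core-site.xml on the classpath:

```java
// Sketch: basic HDFS access through Hadoop's FileSystem API.
// The file path is an assumption; fs.defaultFS comes from core-site.xml.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Picks up fs.defaultFS from the classpath configuration.
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/hello.txt");
        // Write a small file, overwriting if it already exists.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }
        System.out.println("exists: " + fs.exists(file) +
                           ", size: " + fs.getFileStatus(file).getLen());
        fs.close();
    }
}
```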
Hive
Hive learning roadmap
An introduction to installing and using Hive
Testing a 10GB data import into Hive
Hive, from the "R as a sharp sword for NoSQL" series of articles
Using RHive to extract reverse repo information from historical data
Pig
Pig learning roadmap
Zookeeper
Zookeeper learning roadmap
Installing and using a ZooKeeper pseudo-distributed cluster
Implementing a distributed queue (Queue) with ZooKeeper
Implementing a distributed FIFO queue with ZooKeeper
HBase
HBase learning roadmap
Installing and using rhbase, part 4 of the RHadoop practice series
Mahout
Mahout learning roadmap
Analyzing Mahout's user-based collaborative filtering algorithm (UserCF) with R
A collaborative filtering algorithm on MapReduce implemented in R, part 3 of the RHadoop practice series
Building a Mahout project with Maven
A detailed explanation of the Mahout recommendation algorithm API
Analyzing the Mahout recommendation engine from its source code
Mahout distributed program development: item-based collaborative filtering (ItemCF)
Mahout distributed program development: KMeans clustering
Building a job recommendation engine with Mahout
Sqoop
Sqoop learning roadmap
Cassandra
Cassandra learning roadmap
A single Cassandra cluster experiment with 2 nodes
Cassandra, from the "R as a sharp sword for NoSQL" series of articles