Big Data Processing Architecture: Hadoop
1. Overview
1.1 Introduction to Hadoop
Hadoop is an open-source distributed computing platform under the Apache Software Foundation; it gives users a distributed infrastructure whose low-level system details stay transparent. Hadoop is developed in Java, is highly cross-platform, and can be deployed on clusters of cheap machines. Its core is HDFS (the Hadoop Distributed File System, which solves the storage of massive data) and MapReduce (which solves the processing of massive data). Hadoop is recognized as the industry-standard open-source software for big data, providing massive data processing capability in a distributed environment.
In April 2008, Hadoop broke the then world record as the fastest system to sort 1 TB of data, using a cluster of 910 nodes and a sorting time of only 209 seconds.
1.2 A brief history of Hadoop
Hadoop began with Doug Cutting, the founder of Apache Lucene, a widely used text-indexing library. In 2002, the Nutch project (a subproject of Lucene) ran into the problem that its framework could not scale to the billions of pages on the web. In 2003 Google published its paper on the distributed file system GFS, and in 2004 the Nutch project developed NDFS (the predecessor of HDFS) modeled on GFS. Also in 2004, Google published the paper describing the MapReduce distributed programming model, and in 2005 the Nutch project built an open-source implementation of Google's MapReduce. Hadoop's two cores, HDFS and MapReduce, both grew out of Google's papers and have made Hadoop the de facto leader in the field of massive data processing. In January 2008, Hadoop officially became a top-level Apache project.
1.3 Characteristics of Hadoop
High availability: redundant data storage keeps data and services usable even when individual nodes fail.
High efficiency: distributed storage and distributed processing allow PB-scale data to be processed quickly.
High scalability: Hadoop is designed to run efficiently and stably on clusters of cheap machines, growing by adding nodes.
High fault tolerance: redundant copies of data are kept, so failed work can be re-run elsewhere.
Low cost: it uses clusters of inexpensive machines.
Runs on the Linux platform.
Supports multiple programming languages.
1.4 Versions of Hadoop
Apache Hadoop versions fall into two generations. The first generation includes the 0.20.x, 0.21.x, and 0.22.x branches; 0.20.x eventually evolved into 1.0.x and became the stable line, while 0.21.x and 0.22.x added important features such as HDFS HA (high availability). The second generation includes the 0.23.x and 2.x branches.
The second generation is a major improvement over the first. Its main change is splitting resource management out of MapReduce to lighten the system's load: the new YARN framework now handles the scheduling of cluster resources, so MapReduce runs on top of YARN, is no longer responsible for resource scheduling, and focuses only on distributed computation. HDFS, the other core of Hadoop, gained federation and HA (high availability with hot standby), since the NameNode in HDFS needs to be highly available.
Comparison of the major Hadoop distributions:
Besides the free, open-source Apache Hadoop that sets the standard, a number of commercial companies ship their own Hadoop distributions.
Cloudera, which entered Shanghai in 2014, keeps its features in sync with Apache Hadoop and is partially open source, with self-developed products such as Impala and Navigator.
Hortonworks also keeps its features in sync with Apache and is completely open source (it is the biggest contributor to the Apache Hadoop platform; products include the Tez architecture, a next-generation Hadoop query-processing framework).
MapR has modified and optimized Apache Hadoop extensively, turning it into a product of its own.
In China there is Transwarp (Star Ring): its core components are kept in sync with Apache, it has many low-level modifications, it is completely closed source, and it ships its own Hadoop products, Inceptor and Hyperbase.
1.4.1 The application architecture of Hadoop in the enterprise has three layers; data flows from the data source through the big data layer to the access layer.
The big data layer is built on HDFS distributed storage and is itself divided into three parts, which serve the three functions of the access layer: offline data analysis, real-time data query, and data mining.
MapReduce (with Hive and Pig) in the big data layer provides offline data analysis.
HBase (with Solr and Redis) provides real-time data query.
Mahout and BI analysis complete the data mining.
1.5 The Hadoop ecosystem
1.5.1 HDFS
The Hadoop Distributed File System (HDFS) is at the core of the Hadoop project; it is an open-source implementation of Google's file system GFS. Its strengths include handling very large data sets and streaming access while running on cheap servers. From the start, its design treats hardware failure as the norm, so the system as a whole remains available and reliable even when part of the hardware fails. In addition, HDFS relaxes some POSIX (Portable Operating System Interface) requirements so that file data can be accessed as a stream, which improves the system's throughput.
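As a rough illustration of how applications talk to HDFS, the Java sketch below writes a small file and streams it back through Hadoop's FileSystem API. The path /tmp/demo.txt is an arbitrary example, and the cluster address is assumed to come from core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml; falls back to the local FS
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/demo.txt");   // illustrative path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back as a stream, the access pattern HDFS is built for
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}

Because the same create/open calls work against a local file system when no cluster is configured, code written this way can be tested without a running cluster.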
1.5.2 HBase
An open-source implementation of Google's BigTable, with strong support for storing unstructured data. It uses HDFS as its underlying data store. Unlike the row-based storage of traditional relational databases, HBase stores data by column and can be scaled out horizontally. It is a distributed column-oriented database offering high reliability, high performance, scalability, and real-time reads and writes.
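A minimal sketch with the HBase Java client is shown below, assuming a table named user with a column family info already exists; it writes a single cell and reads the same row back, the real-time point access HBase is built for.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum and other settings from hbase-site.xml
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {
            // Write one cell: row key "row1", column info:name
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("alice"));
            table.put(put);

            // Real-time point read of the same row
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}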
1.5.3 MapReduce
An open-source implementation of Google's MapReduce. It is a programming model that abstracts computation into two functions, Map and Reduce, so parallel applications can be developed without knowing the underlying distributed details and run on low-cost clusters to process massive data (more than 1 TB). The core idea is to split the input into independent blocks, distribute them to the nodes under the master's management to be processed in parallel, and finally merge the intermediate results from each node into the final result.
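The canonical word-count job below sketches this model in Java: the Map function emits a (word, 1) pair for every token in its input split, and the Reduce function sums the counts for each word. Input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: split each input line into (word, 1) pairs
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}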
1.5.4 Hive
A data-warehouse tool built on Hadoop. It organizes, runs ad-hoc queries over, analyzes, and stores the data sets kept in Hadoop files. With HiveQL, a SQL-like language, simple MapReduce statistics can be implemented quickly.
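As a sketch of what that looks like in practice, the Java example below sends a HiveQL aggregation through the HiveServer2 JDBC driver; the endpoint, credentials, and the words table are placeholders, and Hive compiles the query into MapReduce jobs behind the scenes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDemo {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (needs hive-jdbc on the classpath)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Endpoint and credentials are placeholders for a local HiveServer2
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // A SQL-like HiveQL aggregation; Hive turns it into MapReduce jobs
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}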
1.5.5 Pig
Simplifies common Hadoop tasks by giving Hadoop programs an interface closer to SQL. To search a large data set for records that match a given condition, Pig needs only a short script, which it automatically parallelizes and distributes across the cluster, whereas plain MapReduce would require writing a separate program.
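To give a sense of how short such a script is, the sketch below drives Pig from Java through the PigServer API; local mode, the file input.txt, and the field name word are all assumptions for illustration. The three registered lines of Pig Latin replace the whole word-count MapReduce program shown earlier.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigDemo {
    public static void main(String[] args) throws Exception {
        // LOCAL mode for illustration; use ExecType.MAPREDUCE on a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);
        // A three-line Pig Latin script instead of a hand-written MapReduce job
        pig.registerQuery("records = LOAD 'input.txt' AS (word:chararray);");
        pig.registerQuery("grouped = GROUP records BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(records);");
        pig.store("counts", "word_counts");   // writes results to word_counts/
    }
}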
1.5.6 Mahout
Provides scalable implementations of classic machine-learning algorithms; used for data mining.
1.5.7 ZooKeeper
An open-source implementation of Google's Chubby: an efficient and reliable coordination system used to build distributed applications, relieving them of the coordination tasks they would otherwise have to handle themselves.
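A minimal sketch with the ZooKeeper Java client is shown below; it connects to a standalone server (localhost:2181, the default, is assumed), creates a znode, and reads it back. Such small, reliably replicated znodes are the primitive on which locks, leader election, and configuration sharing are built.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // localhost:2181 is the default standalone address; use the full ensemble in production
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
        // Create a persistent znode that cooperating processes can read or watch
        zk.create("/app-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}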
1.5.8 Flume
A highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive logs. It supports customizing data senders of various kinds in the logging system to collect data, and it can also do simple processing on the data and write it out to various data receivers.
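As an illustration, the agent configuration below follows the introductory pattern from Flume's documentation (the names a1, r1, c1, and k1 are arbitrary): a netcat source feeds events through an in-memory channel to a logger sink.

# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source: listen on a TCP port for incoming lines
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Sink: write events to the log (useful for demos)
a1.sinks.k1.type = logger

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory

# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Saved as example.conf, the agent is started along the lines of flume-ng agent --conf-file example.conf --name a1; anything typed into the port becomes a Flume event. Swapping the logger sink for an HDFS sink is how logs typically land in Hadoop.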
1.5.9 Sqoop
Short for SQL-to-Hadoop; used to exchange data between Hadoop and relational databases. With Sqoop, data can be imported from MySQL, Oracle, PostgreSQL, and similar databases into Hadoop (into HDFS, HBase, or Hive), or exported from Hadoop back into a relational database, making data migration convenient.
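A sketch of both directions on the Sqoop command line follows; the database shop, the tables, and the credentials are placeholders (in real use, -P prompts for the password instead of putting it on the command line).

sqoop import \
  --connect jdbc:mysql://localhost/shop \
  --username dbuser --password dbpass \
  --table orders \
  --target-dir /data/orders

sqoop export \
  --connect jdbc:mysql://localhost/shop \
  --username dbuser --password dbpass \
  --table order_summary \
  --export-dir /data/summary

The import runs MapReduce jobs that read the table in parallel and write it under /data/orders on HDFS; the export reads files back out of HDFS into the relational table.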
1.5.10 Ambari
A web-based tool that supports installing, deploying, configuring, and managing Apache Hadoop clusters.