
Introduction to Apache Hadoop Tutorial, Chapter 1


Apache Hadoop is a distributed system infrastructure developed by the Apache Foundation. It allows users to develop reliable, scalable distributed computing applications without having to understand the low-level details of how the work is distributed.

The Apache Hadoop framework lets users apply a simple programming model to process large datasets in a distributed fashion across clusters of computers. It is designed to scale from a single server up to thousands of machines, each contributing local computation and storage. Rather than relying on hardware to provide high availability, the library itself is designed to detect and handle failures at the application layer, so that a highly available service can be delivered on top of a cluster of machines, each of which may fail.

The core of Apache Hadoop's design is HDFS and MapReduce: HDFS provides storage for massive amounts of data, while MapReduce provides the computation over that data.
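To make this division of labor concrete, below is a minimal word-count sketch written against Hadoop's Java MapReduce API (essentially the classic introductory example): the map step emits a (word, 1) pair for every token, the reduce step sums the counts per word, and HDFS holds the input and output. The class name and the paths passed on the command line are illustrative placeholders, not something prescribed by this article.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in an input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged into a jar and submitted with something like hadoop jar wordcount.jar WordCount /input /output, where both paths refer to HDFS locations and the output directory must not already exist.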

Introduction to Apache Hadoop

As mentioned in the previous section on MapReduce, Apache Hadoop was inspired by Google's GFS, which gave rise to NDFS (the Nutch Distributed File System), the forerunner of Hadoop's distributed file system, and by Google's MapReduce, which was incorporated into Apache Hadoop as one of its core components.

The prototype of Apache Hadoop began with Apache Nutch in 2002. Nutch is an open-source search engine implemented in Java; it provides all the tools needed to run your own search engine, including full-text search and a Web crawler.

Then, in 2003, Google published a technical paper on the Google File System (GFS), a special-purpose file system Google designed to store its massive volumes of search data.

In 2004, Doug Cutting, the founder of Nutch (and of Apache Lucene), implemented a distributed file storage system called NDFS based on Google's GFS paper.

Also in 2004, Google published another technical paper, this one introducing MapReduce to the world. In 2005, Doug Cutting implemented MapReduce in the Nutch search engine.

In 2006, Yahoo! hired Doug Cutting, who upgraded NDFS and MapReduce and named the result Hadoop. Yahoo! set up an independent team for Doug Cutting dedicated to Hadoop research and development.

In January 2008, Hadoop became a top-level Apache project. Since then, Hadoop has been successfully adopted by other companies, including Last.fm, Facebook, and the New York Times.

In February 2008, Yahoo! announced that its search engine product was deployed on a Hadoop cluster with 10,000 cores.

In April 2008, Hadoop broke the world record as the fastest system to sort 1 TB of data. For the full report, see "Apache Hadoop Wins Terabyte Sort Benchmark" (https://developer.yahoo.com/blogs/hadoop/apache-hadoop-wins-terabyte-sort-benchmark-408.html).

At the time of writing, the latest version of Apache Hadoop is 2.7.3.

Apache Hadoop has the following main advantages:

High reliability. Hadoop's bit-by-bit approach to storing and processing data is dependable.

High scalability. Hadoop distributes data and computation across the available machines in a cluster, which can easily be extended to thousands of nodes.

High efficiency. Hadoop can move data dynamically between nodes and keep the load on each node balanced, so processing is very fast.

High fault tolerance. Hadoop can automatically save multiple copies of data and automatically reassign failed tasks.

Low cost. Hadoop is open source, so the software cost of the project will be greatly reduced.

Apache Hadoop core components

Apache Hadoop includes the following modules:

Hadoop Common: common utilities to support other Hadoop modules.

Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data (a short read example follows this list).

Hadoop YARN: a framework for job scheduling and cluster resource management.

Hadoop MapReduce: a parallel processing system for large datasets based on YARN.
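To give a sense of how an application actually accesses HDFS, here is a minimal sketch that opens and prints a file through Hadoop's FileSystem Java API. The NameNode address and file path are assumptions made up for this example; in a real deployment fs.defaultFS is normally picked up from core-site.xml rather than set in code.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address for illustration; usually supplied by core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);  // stream the file's contents to stdout
            }
        }
    }
}

The same FileSystem abstraction also exposes operations such as create, delete, and listStatus, and it is the interface through which higher-level components like MapReduce read from and write to HDFS.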

Other projects related to Apache Hadoop include:

Ambari: a Web-based tool for configuring, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (for example, heat maps) and for inspecting MapReduce, Pig, and Hive applications in a user-friendly way, making it easy to diagnose their performance.

Avro: data serialization system.

Cassandra: scalable, multi-master database without a single point of failure.

Chukwa: data acquisition system for managing large distributed systems.

HBase: a scalable, distributed database that supports structured data storage for large tables. (HBase will be covered in later chapters.)

Hive: data warehouse infrastructure that provides data summarization and ad hoc querying.

Mahout: an extensible machine learning and data mining library.

Pig: a high-level data-flow language and execution framework for parallel computation.

Spark: a fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. (Spark will be covered in later chapters.)

Tez: a generalized data-flow programming framework built on Hadoop YARN. It provides a powerful and flexible engine for executing arbitrary DAGs of tasks, covering both batch and interactive data processing. Tez is being adopted by Hive, Pig, and other frameworks in the Hadoop ecosystem, as well as by other commercial software (for example, ETL tools), to replace Hadoop MapReduce as the underlying execution engine.

ZooKeeper: a high-performance coordination service for distributed applications. (ZooKeeper will be covered in later chapters.)

