This article mainly explains what the concept of HADOOP is. The ideas introduced here are simple, quick, and practical, so interested readers may want to take a look and follow along.
Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. They are massive, fast-growing, and diverse information assets that require new processing models in order to deliver stronger decision-making power, insight, and process-optimization capability.
The smallest basic unit is the bit. In order, the units are: bit, Byte, KB, MB, GB, TB, PB, EB, ZB, YB, BB, NB, DB.
1 Byte = 8 bit
1 KB = 1024 Byte = 8192 bit
1 MB = 1024 KB = 1048576 Byte
1 GB = 1024 MB = 1048576 KB
1 TB = 1024 GB = 1048576 MB
1 PB = 1024 TB = 1048576 GB
1 EB = 1024 PB = 1048576 TB
1 ZB = 1024 EB = 1048576 PB
1 YB = 1024 ZB = 1048576 EB
1 BB = 1024 YB = 1048576 ZB
1 NB = 1024 BB = 1048576 YB
1 DB = 1024 NB = 1048576 BB
Big data technology mainly solves two problems: the storage of massive data, and its analysis and computation.
2 The characteristics of big data
1. Large volume. Big data's characteristics are reflected first of all in size. In the early MP3 era, a file of a few MB met most people's needs, but over time storage units have grown from GB to TB and even to today's PB and EB levels. With the rapid development of information technology, data has begun to grow explosively. Social networks (Weibo, Twitter, Facebook), mobile networks, and all kinds of smart devices and service tools have become sources of data. Taobao's nearly 400 million members generate about 20 TB of merchandise transaction data every day; Facebook's roughly 1 billion users generate more than 300 TB of log data. There is an urgent need for intelligent algorithms, powerful data-processing platforms, and new data-processing technologies that can count, analyze, predict, and process such large-scale data in real time.
2. Diversity. The wide range of data sources determines the diversity of big data's forms, and any form of data can be useful. The most widely used applications at present are recommendation systems, such as those of Taobao, NetEase Cloud Music, and Jinri Toutiao; these platforms analyze users' log data and then recommend things the users like. Log data is clearly structured; other data, such as pictures, audio, and video, have little obvious structure and weak causal relationships, so they often need to be labeled manually.
3. High speed. Big data is generated very quickly, mainly through the Internet. Everyone's daily life is bound to the Internet, which means individuals supply big data with a large amount of information every day. This data also needs to be processed promptly, because spending large sums to store historical data of little use is very uneconomical; for a given platform, the data kept may cover only the past few days or the past month, and anything older has to be cleaned up in time, otherwise the cost is too high. For these reasons, big data places very strict demands on processing speed: a large share of server resources is devoted to processing and computing data, and many platforms need real-time analysis. Data is generated all the time, and whoever processes it faster has the advantage.
4. Value. This is the core characteristic of big data. Among the data generated in the real world, the proportion of valuable data is very small. Compared with traditional small data, big data's greatest value lies in mining, out of large volumes of varied and seemingly unrelated data, the data that reveals future trends and patterns, and then, through in-depth analysis with machine learning, artificial intelligence, or data-mining methods, discovering new laws and new knowledge that can be applied to agriculture, finance, medical care, and other fields, so as to improve social governance, raise production efficiency, and advance scientific research.
Background introduction of HADOOP
1.1 What is HADOOP
Start with the official website, hadoop.apache.org (you can use Baidu Translate if the English is a problem).
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows distributed processing of large data sets (massive amounts of data) across clusters of computers using a simple programming model. It includes these modules:
Hadoop Common: a common tool that supports other Hadoop modules.
Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data.
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets (a minimal example is sketched just below).
Each of the above modules has its own independent function, and the modules are related to each other.
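To make the MapReduce programming model concrete, here is a minimal word-count job written against the Hadoop Java API (the hadoop-client dependency is assumed to be on the classpath). The input and output paths are placeholders supplied on the command line; this is a sketch for illustration, not the exact code of any system described in this article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The job is submitted to YARN, which schedules the map and reduce tasks across the cluster while HDFS supplies the input splits.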
In a broad sense, HADOOP usually refers to a broader concept: the HADOOP ecosystem.
1.2 The background of HADOOP's creation
Hadoop grew out of Apache Nutch. Nutch, started in 2002, is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler. Nutch's design goal was to build a large-scale, whole-web search engine covering web-page crawling, indexing, and querying, but as the number of crawled pages grew it ran into a serious scalability problem: how to store and index billions of web pages.
In 2003, Google published a technical paper on the Google File System (GFS). GFS is a proprietary file system that Google designed to store its massive search data.
In 2004, Doug Cutting, founder of Nutch, implemented a distributed file storage system called NDFS based on Google's GFS paper.
P.S. In 2003-2004, Google released some details of its GFS and MapReduce ideas. Based on them, Doug Cutting and others spent two years of spare time implementing DFS and MapReduce mechanisms in miniature inside Nutch.
In 2004, Google published another technical academic paper, MapReduce. MapReduce is a programming model for parallel analytical operations on large datasets (larger than 1TB).
In 2005, Doug Cutting implemented this capability in the Nutch search engine on the basis of MapReduce.
Introduction of HADOOP application cases at home and abroad
Log analysis for the web servers of large websites: the web-server cluster of a large site collects about 800 GB of click data every 5 minutes, with a peak of 9 million clicks per second. Every 5 minutes the data is loaded into memory, the site's hotspot URLs are computed at high speed, and the results are fed back to the front-end cache servers to improve the cache hit rate.
Carrier traffic management analysis: daily traffic data of about 2 TB to 5 TB is copied to HDFS. Through an interactive analysis engine framework, hundreds of complex data-cleaning and reporting jobs can be run, and the total time is two to three times faster than minicomputer clusters and DB2 on similar hardware configurations.
1.5 Analysis of the domestic HADOOP employment situation
You can check Zhaopin online.
There are three main directions for big data employment, each with a corresponding type of position:
System research and development: big data system R & D engineer
Application development: big data application development engineer
Data analysis: big data analyst
Big data's technological ecosystem
The main technologies in this ecosystem are explained as follows:
1) Sqoop: Sqoop is an open-source tool mainly used to transfer data between Hadoop (Hive) and traditional databases such as MySQL. It can import data from a relational database (MySQL, Oracle, etc.) into Hadoop's HDFS, and it can also export data from HDFS into a relational database.
2) Flume: Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission, provided by Cloudera. Flume supports customizing various data senders in a logging system to collect data; at the same time, it can do simple processing of the data and write it to various (customizable) data receivers.
3) Kafka: Kafka is a high-throughput distributed publish-subscribe messaging system with the following features:
(1) Message persistence is provided through an O(1) disk data structure, which maintains stable performance over long periods even with terabytes of stored messages.
(2) High throughput: even on very ordinary hardware, Kafka can support millions of messages per second.
(3) Messages can be partitioned across Kafka servers and consumed by clusters of consumer machines.
(4) Parallel data loading into Hadoop is supported.
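As a small illustration of the publish-subscribe model, here is a minimal sketch of a producer using Kafka's official Java client (the kafka-clients dependency is assumed). The broker address localhost:9092 and the topic name web-logs are hypothetical placeholders.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // Records with the same key are routed to the same partition.
                producer.send(new ProducerRecord<>("web-logs", "host-1", "click event " + i));
            }
        }
    }
}
```

Consumers subscribed to the web-logs topic would then pull these messages in parallel, one partition per consumer in a consumer group.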
4) Storm: Storm provides a set of general primitives for distributed real-time computation. It can be used for "stream processing", processing messages and updating databases in real time; this is an alternative to managing queues and worker clusters yourself. Storm can also be used for "continuous computation", running standing queries over data streams and emitting the results to users as a stream while the computation proceeds.
5) Spark: Spark is currently the most popular open-source in-memory computing framework for big data. Its computations can be based on big data stored in Hadoop.
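To show what in-memory computation looks like in practice, here is a minimal sketch using Spark's Java API (the spark-core dependency is assumed). The HDFS path is a hypothetical placeholder, and local[*] simply runs Spark inside the current JVM for testing.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkErrorCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("error-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");   // assumed input path
            // cache() keeps the filtered RDD in memory, so repeated queries avoid re-reading HDFS.
            JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR")).cache();
            System.out.println("error lines: " + errors.count());
            errors.take(5).forEach(System.out::println);   // a second action reuses the cached data
        }
    }
}
```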
6) Oozie: Oozie is a workflow scheduling and management system for managing Hadoop jobs. Oozie coordinator jobs trigger Oozie workflows based on time (frequency) and the availability of the required data.
7) HBase: HBase is a distributed, column-oriented open-source database. Unlike a typical relational database, HBase is suited to storing unstructured data.
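A minimal sketch of writing and reading a single row with the HBase Java client (the hbase-client dependency is assumed) might look like the following; the table user_profile and the column family info are hypothetical and would have to be created beforehand.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Cluster settings (ZooKeeper quorum etc.) are read from hbase-site.xml on the classpath.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user_profile"))) {

            // Write one cell: row key "user-1001", column info:city.
            Put put = new Put(Bytes.toBytes("user-1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Hangzhou"));
            table.put(put);

            // Read the same cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user-1001")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}
```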
8) Hive: Hive is a data-warehouse tool built on Hadoop. It can map structured data files to database tables, provides a simple SQL query capability, and translates SQL statements into MapReduce jobs to run. Its advantage is the low learning cost: simple MapReduce statistics can be produced quickly through SQL-like statements without developing dedicated MapReduce applications, which makes it very suitable for statistical analysis in a data warehouse.
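One common way to run such SQL-like queries from a program is through HiveServer2's JDBC interface (the hive-jdbc dependency is assumed). The connection URL, the user_logs table, and the credentials below are hypothetical placeholders; older setups may also need an explicit Class.forName("org.apache.hive.jdbc.HiveDriver").

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 listens on port 10000 by default; "default" is the database name.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Hive translates this aggregation into MapReduce (or Tez/Spark) jobs behind the scenes.
             ResultSet rs = stmt.executeQuery(
                     "SELECT city, COUNT(*) AS cnt FROM user_logs GROUP BY city")) {
            while (rs.next()) {
                System.out.println(rs.getString("city") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```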
9) Mahout: Apache Mahout is a scalable machine-learning and data-mining library. Mahout currently supports four main use cases:
Recommendation mining: collects user actions and recommends things users might like.
Clustering: collects documents and groups related ones.
Classification: learns from existing categorized documents which features documents of a category share, and uses them to classify unlabeled documents correctly.
Frequent itemset mining: takes groups of items and identifies which individual items often appear together.
10) ZooKeeper: ZooKeeper is an open-source implementation of Google's Chubby. It is a reliable coordination system for large distributed systems, providing configuration maintenance, naming services, distributed synchronization, group services, and so on. ZooKeeper's goal is to encapsulate complex and error-prone key services and expose simple, easy-to-use interfaces and an efficient, stable system to users.
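As a small illustration of the kind of coordination ZooKeeper provides, here is a minimal sketch using its Java client (the zookeeper dependency is assumed): it stores a piece of configuration in a znode and reads it back. The connect string and the /demo-config path are hypothetical placeholders.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();   // wait until the session to the ensemble is established

        // Store a small piece of shared configuration under a znode.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "replicas=3".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData(path, false, null);
        System.out.println("config = " + new String(data));
        zk.close();
    }
}
```

Other processes reading the same znode (optionally with a watch set) see the same value, which is how distributed components keep configuration and membership information consistent.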
At this point, I believe you have a deeper understanding of what the concept of HADOOP is. You might as well try things out in practice, and keep following along to continue learning!