Hadoop 04/25 Update SLTechnology News&Howtos

Hadoop

2025-04-25 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/02 Report--

Hadoop:

Big data clusters built on Hadoop are normally run on Linux, the platform on which Hadoop is primarily developed and deployed.

RDBMS: tables

Fields, data types, constraints

Structured data

Relational databases play an important role in storing and managing structured data.

But not all data can be structured.

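Structured data: fits a fixed schema of tables and fields

Unstructured data: has no predefined schema, for example free text, images, or video

Semi-structured data: self-describing, with no fixed schema

Semi-structured data is usually saved as XML or JSON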
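
The difference between structured and semi-structured data can be sketched with two JSON records. The records and field names below are hypothetical; the point is only that each record is valid JSON while the set of fields varies from one record to the next.

```python
import json

# Two hypothetical log records: each line is valid JSON, but the fields differ,
# so they would not fit one fixed relational table without padding with NULLs.
raw_lines = [
    '{"user": "alice", "action": "login", "ip": "10.0.0.5"}',
    '{"user": "bob", "action": "upload", "file": {"name": "a.log", "bytes": 2048}}',
]

records = [json.loads(line) for line in raw_lines]

# Collect the union of top-level field names seen across all records.
all_fields = sorted({key for record in records for key in record})

for record in records:
    missing = [field for field in all_fields if field not in record]
    print(record["user"], "is missing:", missing)
```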

Google: PageRank, the page-ranking algorithm

Break it up into parts and process it in parallel

Cut a big problem into multiple small problems

OLAP: data mining

Machine Learning: deep learning

Multi-node parallel processing

MapReduce:

Functional programming API

Execution framework
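
The map and reduce idea can be sketched in plain Python, without Hadoop. This is a single-process toy, not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for illustration, and the grouping step that a real framework performs across nodes is simulated with a dictionary.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: fold all counts for one word into a single total."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    # Each line could be mapped on a different node; here it is a plain loop.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data big cluster", "data node"]
print(word_count(lines))
```

Because each mapper call only sees one line and each reducer call only sees one key, both phases can be split across many nodes, which is the "cut a big problem into multiple small problems" idea noted above.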

HDFS + MapReduce = Hadoop

HDFS:

NameNode: NN node

DataNode: DN node

MapReduce:

JobTracker: JT node

TaskTracker: TT node

Hadoop is developed in Java, and mappers and reducers are normally written in Java as well.

Hadoop ecosystem:

A MapReduce job can run without a reducer, but not without a mapper.
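
A map-only job can be sketched as follows. This is a toy runner in plain Python, not Hadoop code: `run_job` and `error_filter` are hypothetical names. In Hadoop itself the same effect is obtained by setting the number of reduce tasks to zero.

```python
def run_job(records, mapper, reducer=None):
    """Run a toy job: the mapper is required, the reducer is optional."""
    if mapper is None:
        raise ValueError("a job needs a mapper")
    mapped = [pair for record in records for pair in mapper(record)]
    if reducer is None:
        # Map-only job: the mapper output is the final output.
        return mapped
    groups = {}
    for key, value in mapped:
        groups.setdefault(key, []).append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


def error_filter(line):
    """Mapper that keeps only lines containing ERROR (a grep-like filter)."""
    return [(line, None)] if "ERROR" in line else []


log = ["INFO start", "ERROR disk full", "INFO stop", "ERROR timeout"]
matches = [line for line, _ in run_job(log, error_filter)]
print(matches)
```

Filtering needs no aggregation across records, so there is nothing for a reducer to do; counting or summing, by contrast, needs the grouping step and therefore a reducer.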

HDFS:

1. HDFS is designed to store large files; it is not suitable for large numbers of small files.

2. It is a file system implemented in user space.

3. HDFS does not support in-place modification of files; newer versions support appending.

4. It cannot be mounted and cannot be accessed through ordinary system calls. You can only use dedicated access interfaces, such as the dedicated command-line tools and API.
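
Point 1 can be made concrete with some arithmetic. The NameNode keeps metadata for every file and every block in memory, so the number of namespace objects matters more than the total number of bytes. The sketch below assumes a 128 MiB block size and roughly 150 bytes of NameNode memory per object; both figures are assumptions for illustration (the per-object figure is a commonly quoted rule of thumb, not a measured value), and `namenode_objects` is a hypothetical helper.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # assumed block size (128 MiB)
BYTES_PER_OBJECT = 150           # assumed rule-of-thumb NameNode memory per object


def namenode_objects(file_sizes):
    """Count namespace objects: one per file plus one per block."""
    blocks = sum(math.ceil(size / BLOCK_SIZE) for size in file_sizes)
    return len(file_sizes) + blocks


GIB = 1024 ** 3
MIB = 1024 ** 2

one_large_file = [GIB]              # 1 GiB stored as a single file
many_small_files = [MIB] * 1024     # the same 1 GiB as 1024 files of 1 MiB

for label, sizes in [("large", one_large_file), ("small", many_small_files)]:
    objects = namenode_objects(sizes)
    print(label, objects, "objects,", objects * BYTES_PER_OBJECT, "bytes of metadata")
```

Under these assumptions the same gigabyte of data costs 9 objects as one file and 2048 objects as small files, which is why many small files strain the NameNode long before disk space runs out.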

Scribe (Facebook)

Flume

Hadoop peripheral components

Hadoop cluster ecosystem

Hive: intermediate (middleware) component

Each technology targets particular usage scenarios.

Data modification can be done with HBase

HBase is a NoSQL store that uses a sparse storage format
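
The idea of sparse storage and cell-level modification can be sketched with a dictionary. This is a toy model, not the HBase client API: `put` and `get_row` are illustrative names, the `family:qualifier` column naming only imitates HBase's convention, and real HBase additionally keeps timestamped versions of each cell, which this sketch leaves out.

```python
# Toy model of a sparse table: only cells that exist are stored.
# The (row key, "family:qualifier") key layout is an illustrative simplification.
table = {}


def put(row, column, value):
    """Store one cell; writing again to the same cell replaces its value."""
    table[(row, column)] = value


def get_row(row):
    """Return only the columns that actually hold a value for this row."""
    return {col: val for (r, col), val in table.items() if r == row}


put("user1", "info:name", "alice")
put("user1", "info:email", "alice@example.com")
put("user2", "info:name", "bob")       # user2 has no email cell at all
put("user2", "info:name", "robert")    # modification: overwrite the cell

print(get_row("user1"))
print(get_row("user2"))
```

A missing value takes no space at all here, unlike a NULL column in a fixed relational row, and a single cell can be rewritten without touching the rest of the data.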

Cloudera (CDH) is a well-known Hadoop technology service provider, playing a role similar to Red Hat's for Linux

Flow for importing relational database data into Hadoop:

RDBMS --> Sqoop --> HBase --> HDFS

Avro: serializing data

How to learn Hadoop

1. Install and configure HDFS

2. Install and configure MapReduce

3. HBase

4. Hive

5. Sqoop

6. Flume/Scribe/Chukwa

Typical number of nodes for a normal HDFS deployment: four

Local mode (debugging mode)

Pseudo-distributed (using one node)

Fully distributed (more than 4 nodes)

Hadoop is a parallel processing system that keeps multiple copies (replicas) of the data

MapReduce

Processing logic

Relational database:

Row-oriented database, tables

HBase:

Column-oriented database

Key-value pairs
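
The contrast between row-oriented and column-oriented layouts can be sketched generically. The records below are made up, and the layouts are a conceptual illustration only, not the on-disk format of any particular relational database or of HBase.

```python
rows = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob", "age": 25},
    {"id": 3, "name": "carol", "age": 35},
]

# Row-oriented layout: the values of one record are stored together.
row_store = [tuple(record.values()) for record in rows]

# Column-oriented layout: the values of one column are stored together.
column_store = {key: [record[key] for record in rows] for key in rows[0]}

# Reading a single column touches one list in the column layout,
# but every record in the row layout.
average_age = sum(column_store["age"]) / len(column_store["age"])
print(row_store[0])
print(column_store["age"], average_age)
```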

Tools for collecting logs

Flume (ASF)

Chukwa (ASF)

Scribe (Facebook)

Higher-level programming interfaces than raw Hadoop MapReduce:

Hive: SQL

Pig

Crunch: Java API

Avro: serialization tool

Hadoop has a strong ecosystem.

Sqoop:

Transfers data between relational databases (Oracle, MySQL, SQL Server, DB2) and HDFS so that Hadoop can analyze it

ZooKeeper: coordination and management component

Ecosystem map

Hadoop core components:

MapReduce

HDFS

R language

R is a language and environment for statistical analysis and graphics. It is free, open source software that is part of the GNU project, and an excellent tool for statistical computing and statistical graphics.

There are five basic processes in pseudo-distributed system:

JobTracker

TaskTracker

NameNode

SecondaryNameNode

DataNode

Compatibility between the components of the Hadoop ecosystem is not very good, because the components come from various open source projects.

Cloudera's CDH is an integrated distribution based on Hadoop, and one of the better-known ones.

Various .xml configuration files

Addresses and ports on which the Hadoop processes listen
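
The shape of these configuration files can be sketched by generating one. Hadoop's .xml files consist of a `configuration` element holding `property` entries with a `name` and a `value`. The helper `build_configuration` is hypothetical, and the host names and ports are example values for a single machine, not values taken from these notes. The property names are the Hadoop 1.x ones that match the JobTracker-era setup described here (newer releases use `fs.defaultFS` in place of `fs.default.name`).

```python
import xml.etree.ElementTree as ET


def build_configuration(properties):
    """Build a Hadoop-style <configuration> element from a dict."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


# Example values for a single-machine setup; host names and ports are assumptions.
core_site = build_configuration({"fs.default.name": "hdfs://localhost:9000"})
mapred_site = build_configuration({"mapred.job.tracker": "localhost:9001"})

print(ET.tostring(core_site, encoding="unicode"))
print(ET.tostring(mapred_site, encoding="unicode"))
```

The first property tells clients where the NameNode listens, and the second where the JobTracker listens.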
