
Big data knowledge points a beginner needs to master

2025-04-20 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

I. Foundations needed to learn big data

Java SE, EE (SSM)

About 90% of big data frameworks are written in Java.

MySQL

SQL on Hadoop

Linux

Big data frameworks are installed on the Linux operating system.

II. What you need to learn

The first aspect: offline analysis of big data

Generally processes T+1 data (data from the previous day)

Hadoop 2.x: (Common, HDFS, MapReduce, YARN)

Setting up the environment and the approach to processing data

Hive:

A data warehouse for big data

Manipulate data by writing SQL, similar to SQL in a MySQL database

HBase

NoSQL database built on HDFS

Column-oriented storage
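To make "column-oriented" concrete, here is a minimal sketch in plain Python. It is not the HBase client API; the class ToyColumnStore, its methods, and the "family:qualifier" column names are hypothetical and only illustrate the idea that values are grouped by column rather than by row, so a row can simply have no value for a column.

```python
# Toy model of column-oriented storage. NOT the HBase API:
# ToyColumnStore and its methods are made up for illustration.
class ToyColumnStore:
    def __init__(self):
        # column name -> {row key -> value}
        self._columns = {}

    def put(self, row_key, column, value):
        # Values that belong to the same column are kept together
        self._columns.setdefault(column, {})[row_key] = value

    def get(self, row_key, column):
        # Returns None when the row has no value for this column
        return self._columns.get(column, {}).get(row_key)

    def scan_column(self, column):
        # Reading one column touches only that column's data
        return sorted(self._columns.get(column, {}).items())


store = ToyColumnStore()
store.put("user1", "info:name", "Ann")
store.put("user1", "info:city", "Oslo")
store.put("user2", "info:name", "Bob")  # user2 has no info:city value

names = store.scan_column("info:name")
```

Scanning "info:name" never reads the "info:city" values, which is the main reason analytical workloads tend to favor this layout.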

Collaboration Framework:

Sqoop (bridge: HDFS <=> RDBMS)

Flume: collects information from log files

Scheduling framework: Azkaban; also be aware of: crontab (included with Linux), Zeus (Alibaba), Oozie (Cloudera)

Extended, cutting-edge frameworks:

Kylin, Impala, Elasticsearch (ES)

Note: my other blog post on the first aspect has a detailed summary, compiled from a lot of online material, which can save you a lot of time.

The second aspect: real-time analysis of big data

Mainly based on the Spark framework

Scala: OOP + FP

Spark Core: analogous to MapReduce

Spark SQL: analogous to Hive

Spark Streaming: real-time data processing

Kafka: message queue

Cutting-edge framework extensions: Flink, Alibaba Blink

The third aspect: big data machine learning (extension)

Spark MLlib: machine learning library

PySpark programming: combining Python and Spark

Recommendation system

Python data analysis

Python machine learning

Big data frameworks divided by function

Massive data storage:

HDFS, Hive (the data is essentially still stored in HDFS), HBase, ES

Massive data analysis:

MapReduce, Spark, SQL

The original Hadoop framework

Data storage: HDFS (Hadoop Distributed File System)

Data analysis: MapReduce

The origin of Hadoop: three papers by Google

Although Google did not release the source code of these three products, it published detailed design papers for them, which laid the foundation for the big data technology now popular all over the world.

Google FS -> HDFS

MapReduce -> MapReduce

BigTable -> HBase

A task is decomposed and processed simultaneously on multiple computing nodes, each with modest processing power, and the results are then merged to complete the big data processing.
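The decompose, process in parallel, then merge idea can be sketched on a single machine. This is only an illustration under assumptions: threads from Python's standard library stand in for the computing nodes, and the function names split, process_chunk and parallel_sum are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut a list into at most `parts` roughly equal chunks."""
    if not data:
        return []
    size = (len(data) + parts - 1) // parts  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_chunk(chunk):
    # The work each "node" does on its own piece of the data
    return sum(chunk)


def parallel_sum(data, workers=4):
    chunks = split(data, workers)
    # Threads stand in for separate machines in this sketch
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Merge step: combine the partial results into the final answer
    return sum(partial_results)


total = parallel_sum(list(range(1, 101)))  # sum of 1..100
```

Each worker only ever sees its own chunk, so no single worker needs the capacity to handle the full data set.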

Google: Android, search, big data frameworks, artificial intelligence frameworks

PageRank

Introduction to Hadoop

Most big data frameworks are Apache top-level projects.

http://apache.org/

Hadoop official website:

http://hadoop.apache.org/

Distributed system

As opposed to a centralized system

Multiple machines cooperate to complete the work.

Metadata: data that describes other data

Architecture:

Master node (Master): the boss, the manager

Handles administration and management

Slave node (Slave): the subordinate, the one being managed

Does the actual work

Hadoop also has a distributed (master/slave) architecture

Common

HDFS:

Master node: NameNode

Determines which DataNode the data is stored on.

Slave node: DataNode

Store data
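The split of duties between NameNode and DataNode can be illustrated with a toy model. This is a sketch under loud assumptions, not how HDFS is implemented: ToyNameNode and ToyDataNode are hypothetical classes, blocks are 4 bytes, placement is plain round-robin, there is no replication, and the data passes through the toy NameNode for brevity (in real HDFS the NameNode holds only metadata and clients exchange block data with DataNodes directly).

```python
class ToyDataNode:
    """Slave: only stores the blocks it is given."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data


class ToyNameNode:
    """Master: decides where each block goes and remembers it (metadata)."""

    def __init__(self, datanodes, block_size=4):
        self.datanodes = datanodes
        self.block_size = block_size
        self.metadata = {}  # file name -> list of (block id, datanode name)

    def write(self, filename, data):
        placements = []
        for n, start in enumerate(range(0, len(data), self.block_size)):
            block_id = f"{filename}#{n}"
            # Assumption: simple round-robin placement, no replication
            node = self.datanodes[n % len(self.datanodes)]
            node.store(block_id, data[start:start + self.block_size])
            placements.append((block_id, node.name))
        self.metadata[filename] = placements

    def read(self, filename):
        by_name = {node.name: node for node in self.datanodes}
        # Use the metadata to fetch the blocks back in order
        return b"".join(
            by_name[node_name].blocks[block_id]
            for block_id, node_name in self.metadata[filename]
        )


nodes = [ToyDataNode("dn1"), ToyDataNode("dn2")]
namenode = ToyNameNode(nodes)
namenode.write("a.txt", b"hello world")
restored = namenode.read("a.txt")
```

The metadata dictionary is the "data that describes data": it contains no file contents, only where each block lives.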

MapReduce:

The idea of divide and conquer

A huge amount of data is divided into multiple parts, each part is processed separately, and finally all the results are merged.

Map task

Processes each part of the data separately

Reduce task

Merges the output of the map tasks
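The classic word count shows the map and reduce roles described above. The sketch below runs in a single Python process and does not use the Hadoop API; the function names are made up for illustration, and the intermediate shuffle step (grouping map output by key before reducing) is added here because the reduce side needs it, although the notes above do not mention it.

```python
from collections import defaultdict


def map_task(chunk):
    # Map: handle one part of the data, emit (word, 1) pairs
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_outputs):
    # Group all values that share a key, across every map output
    groups = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    # Reduce: merge the values collected for one key
    return key, sum(values)


def word_count(lines, num_splits=2):
    # Divide the input into parts (ceiling division, at least 1 line each)
    size = max(1, (len(lines) + num_splits - 1) // num_splits)
    splits = [lines[i:i + size] for i in range(0, len(lines), size)]
    mapped = [map_task(part) for part in splits]  # would run in parallel
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


counts = word_count(["big data big", "data frameworks", "big"])
```

On a real cluster each map_task call would run on the node that holds that split of the data, and the framework would perform the shuffle over the network.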

YARN:

Distributed cluster resource management framework; manages cluster resources (memory, CPU cores)

Schedules them sensibly and allocates them to each program (for example MapReduce)

Master node: ResourceManager

Takes charge of the resources of the whole cluster

Slave node: NodeManager

Manages the resources of its own node (machine)
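A toy model can show how the two roles fit together. This is only a sketch: ToyResourceManager and ToyNodeManager are hypothetical classes, the memory and core figures are invented, and the first-fit placement rule is an assumption chosen for brevity rather than a description of YARN's real schedulers.

```python
class ToyNodeManager:
    """Slave: tracks the free memory and CPU cores of one machine."""

    def __init__(self, name, memory_mb, cores):
        self.name = name
        self.free_memory_mb = memory_mb
        self.free_cores = cores

    def can_fit(self, memory_mb, cores):
        return self.free_memory_mb >= memory_mb and self.free_cores >= cores

    def allocate(self, memory_mb, cores):
        self.free_memory_mb -= memory_mb
        self.free_cores -= cores


class ToyResourceManager:
    """Master: decides which node serves each resource request."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def request_container(self, memory_mb, cores):
        # Assumption: first-fit, i.e. take the first node with enough room
        for node in self.node_managers:
            if node.can_fit(memory_mb, cores):
                node.allocate(memory_mb, cores)
                return node.name
        return None  # no node in the cluster can satisfy the request


rm = ToyResourceManager([
    ToyNodeManager("node1", memory_mb=4096, cores=2),
    ToyNodeManager("node2", memory_mb=8192, cores=4),
])
placements = [rm.request_container(2048, 1) for _ in range(3)]
too_big = rm.request_container(16384, 1)
```

The first two requests use up node1, so the third lands on node2, and a request larger than any single node is refused.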

Summary: installing and deploying Hadoop

All components are Java processes: each one starts a JVM process and runs as a service.

HDFS: stores data and provides data for analysis

NameNode/DataNode

YARN: provides the resources that programs run on

ResourceManager/NodeManager
