
The learning route every big data developer must take

2025-01-17 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/03 Report --

Introduction:

Chapter 1: Getting to know Hadoop

Chapter 2: A more efficient WordCount

Chapter 3: Getting data from elsewhere into Hadoop

Chapter 4: Getting data from Hadoop to elsewhere

Chapter 5: Hurry up, my SQL

Chapter 6: Polygamy

Chapter 7: More and more analysis tasks

Chapter 8: My data needs to be real-time

Chapter 9: My data needs to go external

Chapter 10: Awesome machine learning

Beginners often ask me on my blog and on QQ which technologies they should learn and what the learning route is if they want to move into big data. They think big data is hot, the jobs are good, and the salaries are high. If you are confused and want to move into big data for those reasons, then let me ask: what is your background, and what actually interests you about computers and software? Are you a computer science major, interested in operating systems, hardware, networking, and servers? A software major, interested in software development, programming, and writing code? Or did you major in mathematics or statistics, with a particular interest in data and numbers?

These questions point to the three career directions in big data: platform building / optimization / operations / monitoring; big data development / design / architecture; and data analysis / mining. Please don't ask me which one is easiest, which has the best prospects, or which pays the most.

First, let's talk about the 4V characteristics of big data:

Large data volume: from TB up to PB scale.

Many data types: structured data, unstructured text, logs, videos, pictures, geographic locations, and so on.

High commercial value, but that value has to be mined quickly, through data analysis and machine learning, out of massive amounts of data.

High timeliness requirements: processing massive data is no longer limited to offline computing.

To cope with these characteristics of big data, more and more open source big data frameworks have appeared. Let's first list some common ones:

File storage: Hadoop HDFS, Tachyon, KFS

Offline computing: Hadoop MapReduce, Spark

Streaming, real-time computing: Storm, Spark Streaming, S4, Heron

K-V and NoSQL databases: HBase, Redis, MongoDB

Resource management: YARN, Mesos

Log collection: Flume, Scribe, Logstash, Kibana

Message system: Kafka, StormMQ, ZeroMQ, RabbitMQ

Query analysis: Hive, Impala, Pig, Presto, Phoenix, SparkSQL, Drill, Flink, Kylin, Druid

Distributed Coordination Service: Zookeeper

Cluster management and monitoring: Ambari, Ganglia, Nagios, Cloudera Manager

Data mining, machine learning: Mahout, Spark MLlib

Data synchronization: Sqoop

Task scheduling: Oozie

……

Dizzying, isn't it? More than 30 are listed above, and never mind mastering them all: I suspect only a handful of people have even used them all.

My own experience is mainly in the second direction (development / design / architecture), so that is the perspective behind the advice that follows.

Chapter 1: Getting to know Hadoop

1.1 Learn to use Baidu and Google

No matter what problem you encounter, try to search and solve it yourself.

Google is the first choice; if you cannot get over the wall, use Baidu.

1.2 For reference material, prefer the official documentation

Especially for starters, official documents are always the first choice.

I assume most people working in this field are educated enough to make do with the English documentation; if you really cannot get through it, go back to step 1.1.

1.3 Get Hadoop running first

Hadoop can be regarded as the forefather of big data storage and computing; most open source big data frameworks today either depend on Hadoop or integrate well with it.

With regard to Hadoop, you need to at least figure out what the following are:

Hadoop 1.0, Hadoop 2.0

MapReduce, HDFS

NameNode, DataNode

JobTracker, TaskTracker

YARN, ResourceManager, NodeManager

Build Hadoop yourself; use steps 1.1 and 1.2 to get it running.

It is recommended to install from the release package on the command line first; do not use management tools to install it.

One more note: you only need to know about Hadoop 1.0; Hadoop 2.0 is what is used now.

1.4 try using Hadoop

HDFS directory operation commands

Commands for uploading and downloading files

Submit and run the MapReduce sample program

Open the Hadoop web interface, view the Job running status, and view the Job running logs.

Know where the Hadoop system logs are.

1.5 it's time for you to understand how they work

MapReduce: how to divide and conquer

HDFS: where is the data stored, and what is a replica?

What exactly is Yarn and what can it do?

What on earth is NameNode doing?

What on earth is ResourceManager doing?

1.6 write a MapReduce program by yourself

Please follow the example of WordCount and write a WordCount program yourself (or copy it).

Package and submit to Hadoop to run.

Don't know Java? Shell or Python will do; there is something called Hadoop Streaming. A minimal sketch follows.
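If you take the Hadoop Streaming route, here is a minimal sketch of WordCount as a single Python script that acts as both mapper and reducer. The streaming jar path and the HDFS input/output paths in the comment are assumptions and vary with your Hadoop installation.

#!/usr/bin/env python
# wordcount_streaming.py -- run as both mapper and reducer for Hadoop Streaming.
# Submit with something like (jar path varies by Hadoop version, adjust as needed):
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#     -files wordcount_streaming.py \
#     -mapper "python wordcount_streaming.py map" \
#     -reducer "python wordcount_streaming.py reduce" \
#     -input /user/you/wordcount/input -output /user/you/wordcount/output
import sys

def run_mapper():
    # Emit "word<TAB>1" for every word on stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word)

def run_reducer():
    # Streaming sorts mapper output by key, so counts for one word arrive together.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

if __name__ == "__main__":
    run_mapper() if sys.argv[1] == "map" else run_reducer()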

If you have completed the steps above seriously, congratulations: you already have a foot in the door.

Chapter 2: A more efficient WordCount

2.1 Learn some SQL

Do you know the database? Can you write SQL?

If not, please learn some SQL.

2.2 SQL version of WordCount

How many lines of code did the WordCount you wrote (or copied) in 1.6 take?

Let me show you mine:

SELECT word, COUNT(1) FROM wordcount GROUP BY word

That is the charm of SQL: programming it takes dozens or even hundreds of lines of code, while I am done in one statement. Using SQL to process and analyze data on Hadoop is convenient, efficient, easy to pick up, and clearly the trend; whether for offline or real-time computing, more and more big data frameworks are actively providing SQL interfaces.

2.3 SQL On Hadoop: Hive

What is Hive? The official explanation is:

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.

Why is Hive a data warehouse tool and not a database tool? Some readers may not know what a data warehouse is: it is a logical concept whose underlying storage is a database. Data in a warehouse has two characteristics: it is the most complete historical data (massive), and it is relatively stable. "Relatively stable" means that, unlike a business system database whose data is updated frequently, data that enters the warehouse is rarely updated or deleted and is mostly queried in bulk. Hive has both characteristics, so Hive is suited to being a data warehouse tool for massive data, not a database tool.

2.4 install and configure Hive

Please refer to 1.1 and 1.2 to complete the installation and configuration of Hive. You can enter the Hive command line normally.

2.5 try using Hive

Refer to 1.1 and 1.2 to create the wordcount table in Hive and run the SQL statement from 2.2.

Find the SQL task you just ran in the Hadoop WEB interface.

See if the SQL query results are consistent with the MapReduce results in 1.4.

2.6 how does Hive work

You clearly wrote SQL; why do you see MapReduce jobs in the Hadoop web interface?

2.7 learn the basic commands of Hive

Create, delete tables

Load data into a table

Download data from the Hive table

Please refer to 1.2 to learn more about Hive syntax and commands. A small Python-driven sketch of these basic statements follows.
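If you would rather drive Hive from Python than type statements at the Hive CLI, here is a minimal sketch using the PyHive package against HiveServer2. The host, port, username, and the local file path are assumptions; the same statements can be entered directly in the Hive command line.

# Minimal sketch of the basic Hive statements from 2.5-2.7, run through PyHive.
# Assumes HiveServer2 is reachable at localhost:10000 and /tmp/words.txt exists
# on the HiveServer2 host (one word per line).
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cur = conn.cursor()

# Create the table used by the SQL WordCount in 2.2 (one word per row).
cur.execute("CREATE TABLE IF NOT EXISTS wordcount (word STRING)")

# Load data into the table from a local file.
cur.execute("LOAD DATA LOCAL INPATH '/tmp/words.txt' INTO TABLE wordcount")

# Run the one-line WordCount and print the results.
cur.execute("SELECT word, COUNT(1) FROM wordcount GROUP BY word")
for word, cnt in cur.fetchall():
    print(word, cnt)

cur.close()
conn.close()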

If you have seriously followed the process in Chapters 1 and 2 of this guide for big data beginners, you should already have the following skills and knowledge:

The difference between Hadoop 1.0 and Hadoop 2.0

The principles of MapReduce (still the classic question: given a 10 GB file and only 1 GB of memory, how do you write a Java program to find the 10 most frequent words and their counts? A sketch of the divide-and-conquer idea follows this list.)

The process of HDFS reading and writing data; PUT data to HDFS; downloading data from HDFS

You can write a simple MapReduce program, run it, and when something goes wrong you know where to look in the logs.

You can write simple SQL statements using SELECT, WHERE, GROUP BY, and so on.

The general process of converting Hive SQL to MapReduce

Common statements in Hive: create tables, delete tables, load data into tables, partition, download data from tables to local
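For the classic question noted above (a 10 GB file, 1 GB of memory, top 10 words), the article poses it as a Java exercise; here is a hedged sketch of the same divide-and-conquer idea in Python: hash-partition the words into smaller spill files so that each partition's counts fit in memory, then keep a running top 10 with a heap. The file name and partition count are illustrative assumptions.

# Sketch: top 10 words in a 10 GB file with only 1 GB of RAM.
# Hash each word into one of N partition files so any single partition's distinct
# words fit in memory, count each partition separately, and merge with a min-heap.
import heapq
import os
from collections import Counter

NUM_PARTITIONS = 64  # chosen so each partition's word counts fit comfortably in memory

def partition_words(big_file, work_dir="./parts"):
    os.makedirs(work_dir, exist_ok=True)
    paths = ["%s/part-%02d" % (work_dir, i) for i in range(NUM_PARTITIONS)]
    outs = [open(p, "w") for p in paths]
    with open(big_file) as f:
        for line in f:                       # stream the big file, never hold it in memory
            for word in line.split():
                outs[hash(word) % NUM_PARTITIONS].write(word + "\n")
    for o in outs:
        o.close()
    return paths

def top10(big_file):
    heap = []                                # min-heap of (count, word), size <= 10
    for part in partition_words(big_file):
        counts = Counter()                   # all occurrences of one word land in the same partition
        with open(part) as f:
            for word in f:
                counts[word.strip()] += 1
        for word, cnt in counts.items():
            if len(heap) < 10:
                heapq.heappush(heap, (cnt, word))
            elif cnt > heap[0][0]:
                heapq.heapreplace(heap, (cnt, word))
    return sorted(heap, reverse=True)

if __name__ == "__main__":
    print(top10("big_10gb_file.txt"))        # hypothetical input path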

From the study above you have learned that HDFS is the distributed storage framework provided by Hadoop, able to store massive amounts of data; MapReduce is the distributed computing framework provided by Hadoop, used to count and analyze the massive data on HDFS; and Hive is SQL On Hadoop: Hive provides a SQL interface so that developers only need to write easy-to-use SQL statements, and Hive translates the SQL into MapReduce and submits it to run.

At this point, your "big data platform" goes like this:

So the question is, how do huge amounts of data get to HDFS?

Chapter 3: Getting data from elsewhere into Hadoop

This step can also be called data collection: gathering data from the various data sources into Hadoop.

3.1 HDFS PUT command

You should have used this before.

The put command is also commonly used in the real world, usually in conjunction with scripting languages such as shell and python.

Proficiency is recommended.

3.2 HDFS API

HDFS provides APIs for writing data, so you can write data to HDFS from a programming language; the put command itself uses these APIs.

In real environments it is rare to write your own program that calls the API directly to write to HDFS; you usually go through methods already wrapped by other frameworks, for example the INSERT statement in Hive or saveAsTextFile in Spark.

It is recommended to understand the principle and be able to write a demo; a small sketch follows.
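As a taste of the API, here is a minimal sketch that writes and reads a file on HDFS from Python using the hdfs package (a WebHDFS client). The NameNode URL, port, user, and paths are assumptions; the same thing can of course be done with the native Java API.

# Minimal sketch: write and read a file on HDFS through WebHDFS using the
# Python "hdfs" package (pip install hdfs). URL/user/paths are assumptions.
from hdfs import InsecureClient

client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Write a small text file to HDFS (equivalent in spirit to `hdfs dfs -put`).
client.write("/user/hadoop/demo/hello.txt", data=b"hello hdfs\n", overwrite=True)

# List the directory and read the file back.
print(client.list("/user/hadoop/demo"))
with client.read("/user/hadoop/demo/hello.txt") as reader:
    print(reader.read().decode("utf-8"))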

3.3 Sqoop

Sqoop is an open source framework mainly used for exchanging data between Hadoop/Hive and traditional relational databases such as Oracle, MySQL, and SQL Server.

Just like Hive translates SQL into MapReduce, Sqoop translates your specified parameters into MapReduce, submits them to Hadoop to run, and completes the data exchange between Hadoop and other databases.

Download and configure Sqoop yourself (it is recommended to start with Sqoop1 rather than Sqoop2).

Understand the configuration parameters and methods commonly used in Sqoop.

Use Sqoop to synchronize data from MySQL to HDFS

Use Sqoop to synchronize data from MySQL to Hive table

PS: if you later decide on Sqoop as your data exchange tool, it is worth becoming proficient with it; otherwise, understanding it and being able to run a demo is enough. A command sketch follows.
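As a concrete illustration of the two import tasks above, here is a hedged sketch that wraps the Sqoop command line in a small Python script. The JDBC URL, credentials, table names, and paths are placeholders; in practice you would usually just run these commands from a shell script.

# Sketch: invoke Sqoop from Python to pull a MySQL table to HDFS and into Hive.
import subprocess

# MySQL -> HDFS
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://mysql-host:3306/testdb",
    "--username", "test", "--password", "test123",
    "--table", "orders",
    "--target-dir", "/user/hadoop/sqoop/orders",
    "--num-mappers", "1",
], check=True)

# MySQL -> Hive table (Sqoop creates and loads the Hive table)
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://mysql-host:3306/testdb",
    "--username", "test", "--password", "test123",
    "--table", "orders",
    "--hive-import", "--hive-table", "default.orders",
    "--num-mappers", "1",
], check=True)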

3.4 Flume

Flume is a distributed framework for collecting and transporting logs at scale. Because it is a "collect and transport" framework, it is not suited to collecting and transporting relational database data.

Flume can collect logs from network protocol, message system and file system in real time, and transfer them to HDFS.

Therefore, if your business has data from these data sources and needs to collect them in real time, then you should consider using Flume.

Download and configure Flume.

Use Flume to monitor a file that constantly appends data and transfer the data to HDFS

PS: Flume's configuration and use are relatively complex; if you do not have enough interest and patience for it, you can skip Flume for now.

3.5 Alibaba's open source DataX

I mention this because the tool we currently use for exchanging data between Hadoop and relational databases is based on DataX, and it is very easy to use.

DataX is now at version 3.0 and supports many data sources.

You can also do secondary development on top of it.

PS: if you are interested, you can study and use it and compare it with Sqoop.

If you have carefully completed the above study and practice, at this time, your "big data platform" should be like this:

Chapter 4: Getting data from Hadoop to elsewhere

Earlier we covered how to collect data from your data sources into Hadoop; once the data is there, you can analyze it with Hive and MapReduce. The next question is: how do you synchronize the analysis results from Hadoop back to your other systems and applications?

In fact, the method here is basically the same as that of Chapter 3.

4.1 HDFS GET command

Use GET to fetch files from HDFS to the local filesystem. You need to be proficient with it.

4.2 HDFS API

Same as 3.2.

4.3 Sqoop

Same as 3.3.

Use Sqoop to synchronize files on HDFS to MySQL

Use Sqoop to synchronize data from a Hive table to MySQL (a command sketch follows)
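And the reverse direction, again as a hedged sketch: paths, credentials, and the field delimiter are placeholders, and the target MySQL table must already exist.

# Sketch: export an HDFS directory (e.g. a Hive table's warehouse directory) back to MySQL.
import subprocess

subprocess.run([
    "sqoop", "export",
    "--connect", "jdbc:mysql://mysql-host:3306/testdb",
    "--username", "test", "--password", "test123",
    "--table", "word_counts",                      # target MySQL table, created in advance
    "--export-dir", "/user/hive/warehouse/wordcount_result",
    "--input-fields-terminated-by", "\\t",
], check=True)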

4.4 DataX

Same as 3.5.

If you have carefully completed the above study and practice, at this time, your "big data platform" should be like this:

If you have followed this route once, then you should have the following skills and knowledge points:

Know how to collect existing data to HDFS, including offline acquisition and real-time acquisition

You already know that sqoop (or DataX) is a tool for data exchange between HDFS and other data sources

You already know that flume can be used for real-time log collection.

From the study so far you have mastered quite a few big data platform skills: building a Hadoop cluster, collecting data into Hadoop, analyzing data with Hive and MapReduce, and synchronizing the analysis results to other data sources.

Then the next problem appears: the more you use Hive, the more pain points you find, above all its slowness. In many cases, even though the data is obviously tiny, it still has to request resources and start a MapReduce job to execute.

Chapter 5: Hurry up, my SQL

In fact, everyone has noticed that Hive uses MapReduce as its execution engine in the background, and it really is a bit slow.

As a result, there are more and more frameworks for SQL On Hadoop. As far as I know, the most commonly used frameworks are SparkSQL, Impala and Presto.

These three frameworks are based on semi-memory or full memory and provide SQL interfaces to quickly query and analyze data on Hadoop.

We are currently using SparkSQL. As for why we use SparkSQL, there are probably the following reasons:

We already do other things with Spark, and I do not want to introduce too many frameworks.

Impala needs too much memory and does not have too many resources to deploy.

5.1 about Spark and SparkSQL

What is Spark and what is SparkSQL.

The core concepts and nouns of Spark.

What is the relationship between SparkSQL and Spark, what is the relationship between SparkSQL and Hive?

Why does SparkSQL run faster than Hive?

5.2 how to deploy and run SparkSQL

What deployment models does Spark have?

How do I run SparkSQL on Yarn?

Use SparkSQL to query tables in Hive (a minimal sketch appears below).

PS: Spark is not a technology that can be mastered in a short time, so it is recommended that after you understand Spark, you can start with SparkSQL and take it step by step.

For information about Spark and SparkSQL, please refer to http://lxw1234.com/archives/category/spark
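To make 5.2 concrete, here is a minimal PySpark sketch that queries the Hive table from Chapter 2 through SparkSQL. It assumes Spark is configured with Hive support and that spark-submit points at your Yarn cluster; the submit command in the comment is an assumption to adapt to your environment.

# query_hive.py -- minimal SparkSQL sketch that reads the Hive table from Chapter 2.
# Submit to Yarn with something like:
#   spark-submit --master yarn --deploy-mode client query_hive.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sparksql-wordcount")
         .enableHiveSupport()          # lets SparkSQL see the Hive metastore
         .getOrCreate())

# The same one-line WordCount from 2.2, now executed by Spark instead of MapReduce.
df = spark.sql("SELECT word, COUNT(1) AS cnt FROM wordcount GROUP BY word")
df.show(10)

spark.stop()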

If you have carefully completed the above study and practice, at this time, your "big data platform" should be like this:

Chapter 6: Polygamy

Please don't be misled by the title; what I really mean is: collect the data once, consume it many times.

In real business scenarios, especially with monitoring logs, you want to derive some metrics from the logs immediately (real-time computing is covered in a later chapter). Pulling the data back off HDFS for analysis is too slow; and although the logs are collected by Flume, you cannot have Flume roll files to HDFS at very short intervals, because that produces a huge number of small files.

To meet the need to collect data once and consume it many times, the tool to talk about here is Kafka.

6.1 about Kafka

What is Kafka?

Kafka's core concepts and terminology.

6.2 how to deploy and use Kafka

Deploy Kafka on a single machine and successfully run its example producer and consumer.

Write and run your own producer and consumer programs in Java (a Python alternative is sketched below).

Integrate Flume with Kafka: use Flume to monitor a log and send the log data to Kafka in real time.
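The exercise above asks for the producer and consumer in Java; as a hedged alternative in Python, here is a tiny sketch using the kafka-python package. The broker address and topic name are assumptions.

# Sketch: a tiny producer and consumer using kafka-python (pip install kafka-python).
# Assumes a broker at localhost:9092 and a topic named "test_logs".
from kafka import KafkaProducer, KafkaConsumer

# Producer: send a few messages.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(5):
    producer.send("test_logs", value=("message %d" % i).encode("utf-8"))
producer.flush()
producer.close()

# Consumer: read from the beginning of the topic and print what arrives.
consumer = KafkaConsumer(
    "test_logs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,          # stop iterating after 5s of silence so the demo exits
)
for msg in consumer:
    print(msg.topic, msg.partition, msg.offset, msg.value.decode("utf-8"))
consumer.close()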

If you have carefully completed the above study and practice, at this time, your "big data platform" should be like this:

Now the data collected by Flume is no longer sent directly to HDFS: it goes into Kafka first, and the data in Kafka can be consumed by multiple consumers at the same time, one of which synchronizes the data to HDFS.

Next, you should already have the following skills and knowledge:

Why Spark is faster than MapReduce.

Use SparkSQL instead of Hive to run SQL faster.

How to use Kafka to build a collect-once, consume-many-times architecture.

How to write your own Kafka producer and consumer programs.

From the study so far you have mastered most of the skills on a big data platform: data collection, data storage and computation, data exchange, and so on, and each of these steps is carried out by a task (program). There are dependencies between tasks; for example, the data computation task can only start after the data collection task has finished successfully. If a task fails, an alert needs to be sent to the developers and operations staff, and complete logs must be available to make troubleshooting easier.

Chapter 7: More and more analysis tasks

These are not only analysis tasks but also data collection and data exchange tasks. Some are triggered on a schedule, while others need to be triggered by other tasks. When the platform has hundreds of tasks to maintain and run, crontab alone is no longer enough, and a scheduling and monitoring system is needed. The scheduling and monitoring system is the hub of the whole data platform, similar to an AppMaster, responsible for dispatching and monitoring tasks.

7.1 Apache Oozie

1. What is Oozie? What are the functions?

2. What types of tasks (programs) can be scheduled by Oozie?

3. Which task triggers can be supported by Oozie?

4. Install and configure Oozie.

7.2 other open source task scheduling systems

Azkaban:

https://azkaban.github.io/

Light-task-scheduler:

https://github.com/ltsopensource/light-task-scheduler

Zeus:

https://github.com/alibaba/zeus

……

If you have carefully completed the above study and practice, at this time, your "big data platform" should be like this:

Chapter 8: My data needs to be real-time

In Chapter 6, when introducing Kafka, we mentioned business scenarios that need near-instant metrics. Real-time can be divided into absolute real-time and quasi-real-time: absolute real-time latency requirements are usually measured in milliseconds, while quasi-real-time requirements are in seconds or minutes. For scenarios that require absolute real-time, Storm is the usual choice; for other quasi-real-time scenarios, either Storm or Spark Streaming will do. Of course, if you are able to, you can also write your own program to do it.

8.1 Storm

1. What is Storm? What are the possible application scenarios?

2. What are the core components of Storm and what roles do they play?

3. Simple installation and deployment of Storm.

4. Write the Demo program and use Storm to complete the real-time data flow calculation.

8.2 Spark Streaming

1. What is Spark Streaming and what is its relationship with Spark?

2. What are the advantages and disadvantages of Spark Streaming and Storm?

3. Use Kafka + Spark Streaming to complete a real-time computation demo program (a sketch follows).
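Here is a hedged sketch of the Kafka + Spark demo. It uses the newer Structured Streaming API rather than the original DStream-based Spark Streaming named in this chapter, and it assumes the spark-sql-kafka connector matching your Spark version is passed to spark-submit; the broker address and topic name are placeholders.

# streaming_wordcount.py -- word count over messages arriving in a Kafka topic.
# Submit with the Kafka connector for your Spark version, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark-version> streaming_wordcount.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("kafka-streaming-wordcount").getOrCreate()

# Read the Kafka topic as an unbounded streaming DataFrame.
lines = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "test_logs")
         .load()
         .select(col("value").cast("string").alias("line")))

# Split lines into words and keep a running count per word.
counts = (lines
          .select(explode(split(col("line"), " ")).alias("word"))
          .groupBy("word").count())

# Print the updated counts to the console as each micro-batch completes.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()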

If you have carefully completed the above study and practice, at this time, your "big data platform" should be like this:

At this point, the underlying architecture of your big data platform has taken shape, covering data collection, data storage and computation (offline and real-time), data synchronization, and task scheduling and monitoring. Next, it is time to consider how to better provide data to the outside world.

Chapter 9: My data needs to go external

Providing data access to the outside (the business side) generally covers the following aspects:

Offline: for example, delivering the previous day's data each day to a specified destination (a database, files, FTP); offline data can be delivered with offline data exchange tools such as Sqoop and DataX.

Real-time: for example, the recommendation system of an online website needs to obtain recommendation data to users in real time from the data platform, which requires a very low latency (less than 50 milliseconds).

According to the delay requirements and real-time data query needs, the possible solutions are: HBase, Redis, MongoDB, ElasticSearch and so on.

OLAP analysis: OLAP not only requires the underlying data model to be more standardized, but also requires higher and higher response speed of the query. The possible solutions are: Impala, Presto, SparkSQL, Kylin. If your data model is large, then Kylin is the best choice.

Ad hoc queries: the data queried on the fly is fairly unpredictable, so it is usually hard to build a general data model in advance; possible solutions include Impala, Presto, and SparkSQL.

With so many mature frameworks and options, you have to choose based on your own business requirements and your data platform's technical architecture. There is only one principle: the simpler and more stable, the better.

If you have mastered how to provide external (business) data well, then your "big data platform" should look like this:

Chapter 10: Awesome machine learning

As a layman in this area, I can only give a brief introduction. Having majored in mathematics, I am rather ashamed of that, and I regret not studying math properly back then.

In our business, there are about three types of problems that can be solved by machine learning:

Classification problems: including binary and multi-class classification. Binary classification solves prediction problems such as predicting whether an email is spam; multi-class classification handles things like text classification.

Clustering problems: roughly grouping users based on the keywords they have searched for.

Recommendation problems: making recommendations based on users' browsing and click history.

In most industries, machine learning is used to solve these kinds of problems.

Entry-level learning route:

Mathematical foundation

"Machine Learning in Action"; Python is the best choice here.

Spark MLlib provides ready-made algorithms, as well as methods for feature processing and feature selection. A minimal sketch follows.
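As a first taste of Spark MLlib, here is a hedged sketch of the binary classification case mentioned above (spam-style text classification) using the pyspark.ml pipeline API; the tiny inline dataset is made up purely for illustration.

# Sketch: binary text classification with the pyspark.ml pipeline API
# (tokenize -> hash term frequencies -> logistic regression). The toy data is made up.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-spam-demo").getOrCreate()

train = spark.createDataFrame([
    ("win a free prize now", 1.0),
    ("limited offer click here", 1.0),
    ("meeting notes for tomorrow", 0.0),
    ("lunch at noon with the team", 0.0),
], ["text", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features", numFeatures=1000),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(train)

test = spark.createDataFrame([("free prize offer",), ("notes for the meeting",)], ["text"])
model.transform(test).select("text", "prediction").show()

spark.stop()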

Machine learning really is an impressive discipline, and it is also a goal of my own study.

Then, you can add the machine learning part to your "big data platform".

Well, that is the whole route. I hope it helps you; if there is anything along the way you do not understand, feel free to get in touch at any time!
