I. An introduction to big data-related work
II. Skill requirements for a big data engineer
III. The big data learning path
An introduction to big data
In view of big data's four main characteristics (volume, variety, velocity, and value), we need to consider the following questions:
With such a wide range of data sources, how do we collect and aggregate the data? Tools such as Sqoop, Camel, and DataX emerged in answer.
Once collected, how do we store the data? Hence distributed file storage systems such as GFS, HDFS, and TFS.
Because data grows so quickly, storage must scale horizontally.
Once the data is stored, how do we quickly transform it into a consistent format, and how do we quickly compute the results we want?
Distributed computing frameworks such as MapReduce solve this problem, but writing MapReduce requires a great deal of Java code, so engines such as Hive and Pig appeared to translate SQL into MapReduce.
Plain MapReduce can only process data in batches, and the latency is too high. To produce a result for every arriving record, low-latency stream computing frameworks such as Storm/JStorm emerged.
But if you need both batch and stream processing, you have to run two clusters, a Hadoop cluster (HDFS + MapReduce + YARN) and a Storm cluster, which is hard to manage. Hence one-stop computing frameworks such as Spark, which handle both batch and stream processing (the latter, in essence, as micro-batching).
Finally, the Lambda and Kappa architectures appeared, providing general-purpose blueprints for business processing.
To improve efficiency and speed things up, a number of auxiliary tools have emerged:
Oozie, Azkaban: scheduled task (workflow) scheduling tools.
Hue, Zeppelin: graphical tools for managing task execution and viewing results.
Scala: the best language for writing Spark programs, though you can also choose Python.
Python: used for writing scripts.
Alluxio, Kylin, etc.: tools that speed up computation by preprocessing the stored data.
The above roughly enumerates the problems solved by the tools in the big data ecosystem, so you know why each of them appeared and what problem it addresses.
Main text
I. An introduction to big data-related work
At present, big data work is divided into three main directions:
Big data engineer
Data analyst
Big data scientist
Other (data mining, etc.)
II. Skill requirements for a big data engineer
Attached is the skill map for a big data engineer:
11 skills that must be mastered
Advanced Java (JVM, concurrency)
Basic operation of Linux
Hadoop (HDFS+MapReduce+Yarn)
HBase (JavaAPI operation + Phoenix)
Hive (basic HQL operations and an understanding of its principles)
Kafka
Storm/JStorm
Scala
Python
Spark (Core + Spark SQL + Spark Streaming)
Auxiliary tools (Sqoop/Flume/Oozie/Hue, etc.)
6 advanced skills
Machine learning algorithms, plus the Mahout and MLlib libraries
R language
Lambda architecture
Kappa architecture
Kylin
Alluxio
III. The big data learning path
Suppose you can set aside 3 hours of effective study time each weekday, plus 10 hours on each weekend day.
That gives roughly (21 × 3 + 4 × 2 × 10) × 3 = 429 hours of study over 3 months.
The first stage (the foundation stage)
1) Linux study (Brother Bird's Linux guide is enough) - 20 hours
Introduction and installation of Linux operating system.
Linux common commands.
Linux commonly used software installation.
Linux network.
Firewall.
Shell programming and so on.
2) Advanced Java study ("In-Depth Understanding of the Java Virtual Machine", "Java High Concurrency in Action") - 30 hours
Master multithreading.
Master the queues under the concurrency package (java.util.concurrent); a short sketch follows this list.
Learn about JMS.
Master JVM technology.
Master reflection and dynamic proxy.
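To make the JUC queues concrete, here is a minimal producer/consumer sketch. It is written in Scala (the language used for all code sketches in this post), which calls the java.util.concurrent classes directly; the queue capacity and item count are arbitrary choices:

```scala
import java.util.concurrent.{ArrayBlockingQueue, Executors, TimeUnit}

object BlockingQueueDemo {
  def main(args: Array[String]): Unit = {
    // A bounded JUC queue: put() blocks when full, take() blocks when
    // empty, which gives natural backpressure between the two threads.
    val queue = new ArrayBlockingQueue[Int](10)
    val pool  = Executors.newFixedThreadPool(2)

    // Producer thread.
    pool.submit(new Runnable {
      def run(): Unit = (1 to 100).foreach(queue.put(_))
    })
    // Consumer thread.
    pool.submit(new Runnable {
      def run(): Unit = (1 to 100).foreach(_ => println(queue.take()))
    })

    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
  }
}
```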
3) ZooKeeper learning
An introduction to the ZooKeeper distributed coordination service.
Installation and deployment of a ZooKeeper cluster.
ZooKeeper data structures and commands.
ZooKeeper's principles and its election mechanism (a client sketch follows this list).
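As a small taste of the client API, here is a minimal sketch that connects, creates an ephemeral znode, and reads it back; the connect string node1:2181 is a placeholder for your own cluster:

```scala
import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

object ZkDemo {
  def main(args: Array[String]): Unit = {
    // Connect with a 30-second session timeout; the watcher is notified
    // of session and node events.
    val zk = new ZooKeeper("node1:2181", 30000, new Watcher {
      def process(event: WatchedEvent): Unit = println(s"zk event: ${event.getType}")
    })

    // An ephemeral znode disappears when the session ends, which is the
    // building block behind liveness checks and leader election.
    zk.create("/demo", "hello".getBytes("UTF-8"),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)

    println(new String(zk.getData("/demo", false, null), "UTF-8"))
    zk.close()
  }
}
```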
The second stage (attack the key stage)
4) Hadoop ("Hadoop: The Definitive Guide") - 80 hours
HDFS
The concept and characteristics of HDFS.
Shell operation of HDFS.
The working mechanism of HDFS.
Java application development of HDFS.
MapReduce
Run the WordCount sample program (a sketch follows this list).
Understand the inner workings of MapReduce.
An analysis of how a MapReduce program runs.
The mechanism that determines the number of concurrent MapTasks.
Application of combiner components in MapReduce.
Serialization framework and application in MapReduce.
Sort in MapReduce.
Custom partition implementation in MapReduce.
The shuffle mechanism of MapReduce.
MapReduce uses data compression for optimization.
The relationship between MapReduce program and YARN.
Optimization of MapReduce parameters.
Java application development with MapReduce.
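For orientation, below is a minimal WordCount sketch against the standard Hadoop MapReduce API. It is usually written in Java; this version uses Scala, like the other sketches in this post, and takes input/output paths from the command line:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Mapper: emit (word, 1) for each token in the input line.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      ctx.write(word, one)
    }
}

// Reducer (also used as the combiner): sum the counts for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    ctx.write(key, new IntWritable(sum))
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenizerMapper])
    job.setMapperClass(classOf[TokenizerMapper])
    job.setCombinerClass(classOf[SumReducer])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```

Package this into a jar and submit it with hadoop jar; the input and output HDFS paths on the command line are examples.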
5) Hive ("Hive Development Guide") - 20 hours
Basic concepts of Hive
Hive application scenarios.
The relationship between Hive and Hadoop.
Hive compared with traditional databases.
Data storage mechanism of Hive.
Basic operation of Hive
DDL operation in Hive.
How to implement efficient JOIN query in Hive.
Built-in function application of Hive.
Advanced usage of Hive shell.
Hive common parameter configuration.
Hive custom functions and the use of Transform.
A Hive UDF/UDAF development example (a sketch follows this list).
Hive execution process analysis and optimization strategies.
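As one illustration of UDF development, here is a minimal sketch using Hive's classic UDF base class, written in Scala like the rest of this post's sketches; the jar path, function name, and table in the comments are hypothetical:

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// A trivial UDF that upper-cases a string column.
// Hive finds the evaluate() method by reflection.
class ToUpper extends UDF {
  def evaluate(input: Text): Text =
    if (input == null) null else new Text(input.toString.toUpperCase)
}

// From the Hive shell (jar path, function and table names are hypothetical):
//   ADD JAR /path/to/udf.jar;
//   CREATE TEMPORARY FUNCTION to_upper AS 'ToUpper';
//   SELECT to_upper(name) FROM users;
```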
6) HBase ("HBase: The Definitive Guide") - 20 hours
An introduction to HBase.
HBase installation.
The HBase data model.
HBase commands.
HBase development (a client sketch follows this list).
HBase principles.
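A minimal sketch of the HBase Java client API, called from Scala; the table name, column family, and row key are made-up examples:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseDemo {
  def main(args: Array[String]): Unit = {
    // Reads hbase-site.xml from the classpath to find the ZooKeeper quorum.
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("user"))

    // Write one cell: row "row1", column family "info", qualifier "name".
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"))
    table.put(put)

    // Read it back.
    val result = table.get(new Get(Bytes.toBytes("row1")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))))

    table.close()
    conn.close()
  }
}
```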
7) Scala ("Scala for the Impatient") - 20 hours
Overview of Scala.
Scala compiler installation.
The basics of Scala.
Arrays, maps, tuples, and collections.
Classes, objects, inheritance, and traits.
Pattern matching and case classes (a sketch follows this list).
Understand Scala Actor concurrent programming.
Understand Akka.
Understand Scala higher-order functions.
Understand Scala implicit conversions.
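A short sketch of case classes, pattern matching, and higher-order functions, the features above you will meet most often when reading Spark code; the Event types are invented for illustration:

```scala
// Case classes give you immutable data plus pattern matching for free.
sealed trait Event
case class Click(page: String)      extends Event
case class Purchase(amount: Double) extends Event

object ScalaFeatures {
  // Pattern matching with extraction and a guard.
  def describe(e: Event): String = e match {
    case Click(p)                 => s"clicked $p"
    case Purchase(a) if a > 100.0 => f"big purchase: $a%.2f"
    case Purchase(a)              => f"purchase: $a%.2f"
  }

  def main(args: Array[String]): Unit = {
    val events = List(Click("/home"), Purchase(20.0), Purchase(150.0))
    events.map(describe).foreach(println)   // map/foreach are higher-order functions

    // collect combines filtering and mapping via a partial function.
    val revenue = events.collect { case Purchase(a) => a }.sum
    println(s"revenue: $revenue")
  }
}
```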
8) Spark ("Spark: The Definitive Guide") - 60 hours
Spark Core
Overview of Spark.
Spark cluster installation.
Run the first Spark example program (estimating Pi); a sketch follows.
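A minimal sketch of that first program, the Monte Carlo estimate of Pi; the sample count is arbitrary, and the master URL is expected to come from spark-submit:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object SparkPi {
  def main(args: Array[String]): Unit = {
    // The master URL is supplied by spark-submit (use local[*] for a quick test).
    val sc = new SparkContext(new SparkConf().setAppName("SparkPi"))
    val n  = 1000000
    // Throw n random darts at the unit square; the fraction landing inside
    // the circle approximates Pi/4.
    val inside = sc.parallelize(1 to n).map { _ =>
      val x = Random.nextDouble() * 2 - 1
      val y = Random.nextDouble() * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * inside / n}")
    sc.stop()
  }
}
```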
RDD
Overview of RDDs.
Creating an RDD.
The RDD programming API (transformations and actions).
RDD dependency relationships.
RDD caching.
The DAG (directed acyclic graph); a sketch tying these together follows.
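A short spark-shell sketch tying these pieces together: transformations build the DAG lazily, cache() marks an RDD for reuse, and the first action triggers execution; the HDFS path is a placeholder:

```scala
// In spark-shell, where sc (the SparkContext) is predefined.
val lines  = sc.textFile("hdfs:///data/input.txt")   // create an RDD (path is a placeholder)
val counts = lines
  .flatMap(_.split("\\s+"))                          // transformation (narrow)
  .map(word => (word, 1))                            // transformation (narrow)
  .reduceByKey(_ + _)                                // transformation (wide: causes a shuffle)
  .cache()                                           // mark the RDD for caching

counts.take(10).foreach(println)                     // action: triggers the whole DAG
counts.count()                                       // second action reuses the cached data
```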
Spark SQL and DataFrame/DataSet
Overview of Spark SQL.
DataFrames.
Common DataFrame operations.
Writing Spark SQL query programs (a sketch follows).
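A minimal sketch showing the DataFrame API next to the equivalent SQL query; the Person case class and sample rows are invented:

```scala
import org.apache.spark.sql.SparkSession

// Defined at top level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object SqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-demo").getOrCreate()
    import spark.implicits._

    val df = Seq(Person("alice", 30), Person("bob", 25)).toDF()

    // DataFrame API...
    df.filter($"age" > 26).select("name").show()

    // ...and the same query through SQL.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 26").show()

    spark.stop()
  }
}
```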
Spark Streaming
Overview of Spark Streaming.
Understand DStreams.
DStream operations (transformations and output operations); a sketch follows.
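A minimal DStream sketch: word count over 5-second micro-batches from a socket source (feed it with nc -lk 9999; host and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamDemo {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing.
    val conf = new SparkConf().setAppName("stream-demo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" "))
         .map((_, 1))
         .reduceByKey(_ + _)    // DStream transformation
         .print()               // output operation

    ssc.start()
    ssc.awaitTermination()
  }
}
```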
Structured Streaming
Other (MLlib and GraphX)
If your work is not data mining, machine learning is generally not needed; you can study this part in depth later, when you actually need it.
9) Python
10) Build a cluster of virtual machines, install all of the above tools, and develop a small demo - 30 hours
You can use VMware to create 4 virtual machines and install the software above to form a small cluster (personally tested: a 64-bit i7 with 16 GB of RAM can run it).
Big data's prospects are promising and many people are entering the profession; how to complete the transition quickly and how to enter the big data field quickly are questions that both career-changers and newcomers need to think about carefully.
Beginners learning big data need to pay attention to many points, but in any case, since you have chosen to enter the big data industry, press on through the ups and downs. As the saying goes, stay true to your original aspiration and you will see it through; what learning big data requires most is a persevering heart.