The purpose of this article is to lay out a clear learning route for big data beginners and help them start their big data journey. Given the dazzling array of technologies in the big data field, every beginner should shape a learning path that fits their own actual situation.
What is the hottest area in the IT industry right now? "ABC" tops the list. The so-called ABC is AI + Big Data + Cloud, that is, artificial intelligence, big data, and cloud computing (cloud platforms). Industry leaders are already out in front in each of these fields. Today we will look at the big data direction.
The concept of big data
Big data generally refers to data sets so large, so fast-growing, and so varied that traditional single-machine tools cannot collect, store, and process them within an acceptable time.
Roles
In my view, there are currently two kinds of roles in the big data industry:
Big data engineering
Big data analysis
What does it mean to say that these two roles are interdependent yet operate independently? Without big data engineering, there would be nothing for big data analysis to work on; yet without big data analysis, it is hard to see why big data engineering should exist at all. It is like love and marriage: the purpose of dating is marriage, and dating with no intention of marrying is just fooling around.
The big data engineering role focuses on building and operating the systems the data depends on: how data is collected, how it is computed or processed, and how it is stored reliably at scale.
The big data analysis role, by contrast, focuses on how to use the data: once data arrives from the engineering systems, how to produce analysis that is genuinely productive for a company or organization, analysis that actually helps it improve its business or its level of service. For big data analysts, the first problem to solve is discovering and exploiting the value in the data; this may include trend analysis, modeling, and predictive analysis.
To sum up briefly: the big data engineering role is concerned with collecting, computing (or processing), and storing data, while the big data analysis role performs higher-level computation on that data.
Which role do we belong to?
Positioning yourself
Now that we understand how roles in the big data field are divided, we naturally need to find the seat that fits us and determine our own position, so that we can begin studying big data in a targeted way. Two factors need to be weighed here:
Professional knowledge background
Industry experience
Professional knowledge background falls into two areas:
Computer science knowledge, such as operating systems, programming languages, and how computers work.
Mathematical knowledge, meaning higher mathematics such as calculus, probability and statistics, linear algebra, and discrete mathematics, not school math like x² + y² = 1.
Industry experience refers to your working experience in related fields and can be divided into three levels:
Newcomer
Experienced engineer
Senior expert, which now has a cooler name in the big data field: data scientist. Think of Dr. Andrew Ng (Wu Enda), former chief scientist of Baidu.
After determining our own position, we need to map it to a specific big data role. Here are some basic rules:
If you have a good programming foundation and a deep understanding of how computers interact and of the underlying technical principles of the Internet, but no deep grasp of mathematics or statistics, then big data engineering may be your future direction of study.
If you have some programming foundation (mastery of a high-level language such as Python) plus strong mathematical skills, then big data analysis is the direction to work toward.
The big data learning route
No matter which role you choose, there is some theoretical knowledge of big data that you must master, including but not limited to:
Data partitioning and routing: pick a typical partitioning algorithm to learn, such as consistent hashing (https://en.wikipedia.org/wiki/Consistent_hashing); a minimal Java sketch appears after this group of topics.
Replication and consistency: learn the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem), treated as something of a bible in China even though it is regarded as fairly ordinary abroad.
Idempotency (idempotent operations): the cornerstone of state management in many distributed systems (https://mortoray.com/2014/09/05/what-is-an-idempotent-function/).
The various consistency models: strong consistency, weak consistency, eventual consistency.
Replication mechanisms: the term "master-slave" has fallen out of favor; the fashionable name now is the Leader-Follower model.
Consensus protocols: in China usually translated as "consistency protocols". Learn a few common ones: Paxos and Raft.
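To make the partitioning idea concrete, here is a minimal consistent-hash ring in Java (a sketch for illustration only; the class and node names are invented). Adding or removing a node only remaps the keys that fall on that node's arc of the ring, which is exactly the property that matters when a cluster resizes:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// A minimal consistent-hash ring: nodes are hashed onto a circle, and each
// key is routed to the first node clockwise from the key's own hash.
// Virtual nodes smooth out the load when the node count is small.
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // Route a key: take the first ring position >= hash(key), wrapping around.
    public String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            // Use the first 8 bytes of the MD5 digest as the ring position.
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        ring.addNode("node-A");
        ring.addNode("node-B");
        ring.addNode("node-C");
        System.out.println("user-42 -> " + ring.nodeFor("user-42"));
        ring.removeNode("node-B"); // only keys on node-B move; A and C keep theirs
        System.out.println("user-42 -> " + ring.nodeFor("user-42"));
    }
}
```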
Algorithms and data structures
LSM tree: learn how it differs from the B+ tree and where its advantages lie.
Compression algorithms: pick a mainstream compression algorithm to understand, such as Snappy or LZ4. In addition, Facebook has open-sourced a newer-generation algorithm, Zstandard, which is said to outperform all of the mainstream ones.
Bloom filter: O(1) set-membership filtering for big data, at the cost of occasional false positives; a toy implementation follows this list.
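Here is a toy Bloom filter showing where the O(1) behavior comes from: k hash probes per element, regardless of how much data has been inserted. Real systems use better hash functions (e.g. MurmurHash); this sketch salts hashCode purely for illustration:

```java
import java.util.BitSet;

// A toy Bloom filter: k hash functions set/check k bits per element.
// add() and mightContain() are both O(k), i.e. O(1) in the data size;
// false positives are possible, false negatives are not.
public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    public BloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    public void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(indexFor(item, i));
        }
    }

    public boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(indexFor(item, i))) return false; // definitely absent
        }
        return true; // probably present
    }

    // Derive the i-th hash by salting the item's hashCode; illustration only.
    private int indexFor(String item, int i) {
        int h = item.hashCode() * 31 + i * 0x9E3779B9;
        return Math.floorMod(h, size);
    }

    public static void main(String[] args) {
        BloomFilter bf = new BloomFilter(1 << 20, 5);
        bf.add("alice");
        System.out.println(bf.mightContain("alice")); // true
        System.out.println(bf.mightContain("bob"));   // almost certainly false
    }
}
```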
Whether you study big data engineering or big data analysis, this theoretical knowledge is required: it underpins the design of many distributed systems. Below are learning routes for the different roles.
Big data engineer skills
As a big data engineer, you should master at least the following skills:
A JVM language:
Most mainstream big data frameworks (Hadoop, Spark, Flink, and Kafka among them) are written in Java or Scala and run on the JVM.
Therefore, I suggest becoming proficient in at least one JVM language. It is worth stressing the importance of understanding that language's multithreading model and memory model: the processing model of many big data frameworks closely resembles the language-level multithreading model, merely extended to distributed, multi-machine execution.
The author suggests: learn Java or Scala. A single-machine sketch of this "partition, process in parallel, merge" pattern follows.
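To make the point concrete, here is a plain-Java sketch (names and data invented for illustration): the input is split into partitions, each partition is counted on its own thread, and the partial results are merged. This is the same map/shuffle/reduce shape that frameworks scale out across a cluster:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

// A single-machine analogue of a framework's processing model.
public class ParallelWordCount {
    public static void main(String[] args) throws Exception {
        List<List<String>> partitions = List.of(
                List.of("a", "b", "a"),
                List.of("b", "c"),
                List.of("a", "c", "c"));

        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());

        // "Map" phase: one task per partition produces local counts.
        List<Future<Map<String, Long>>> futures = new ArrayList<>();
        for (List<String> p : partitions) {
            Callable<Map<String, Long>> task = () -> p.stream()
                    .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
            futures.add(pool.submit(task));
        }

        // "Reduce" phase: merge the per-partition counts.
        Map<String, Long> total = new ConcurrentHashMap<>();
        for (Future<Map<String, Long>> f : futures) {
            f.get().forEach((w, c) -> total.merge(w, c, Long::sum));
        }
        pool.shutdown();
        System.out.println(total); // {a=3, b=2, c=3}
    }
}
```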
Data processing frameworks:
Processing frameworks fall into two camps: offline batch processing, represented by MapReduce, and real-time stream processing.
In fact, Google has officially retired MapReduce-style offline processing internally. So if you want to learn big data engineering, mastering a real-time stream processing framework is essential. The current mainstream choices include Apache Samza, Apache Storm, Apache Spark Streaming, and Apache Flink, which has been in the limelight over the past year. Apache Kafka has also launched its own stream processing library: Kafka Streams.
The author suggests: learn one of Flink, Spark Streaming, or Kafka Streams.
Also be familiar with this article from Google: "The world beyond batch: Streaming 101" (https://www.oreilly.com/ideas/th … Batch-streaming-101). A minimal Kafka Streams word count is sketched below.
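As a taste of the streaming style, here is a minimal word count with Kafka Streams. This is a sketch only: the broker address localhost:9092, the application id, and the topic names text-input and counts-output are all assumptions for a local setup:

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

// Minimal Kafka Streams word count: consume lines, split into words,
// keep a running count per word, and emit updates downstream.
public class WordCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streaming-wordcount"); // assumed name
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input"); // assumed topic
        lines.flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
             .groupBy((key, word) -> word)  // re-key by word (triggers a shuffle)
             .count()                        // stateful running count per word
             .toStream()
             .to("counts-output", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```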
Distributed storage framework:
Although MapReduce is somewhat dated, HDFS, the other cornerstone of Hadoop, is still going strong and remains the most popular distributed storage in the open-source community; you should definitely take the time to learn it. If you want to go deeper, Google's GFS paper is required reading (https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf). Of course, the open-source world has plenty of other distributed storage systems, and Alibaba's OceanBase is an excellent one.
The author suggests: learn HDFS. A minimal example of the Java client follows.
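A first contact with HDFS can be as simple as writing a file and reading it back through the Java client. This is a hedged sketch: the hdfs://localhost:9000 address and the /demo/hello.txt path are assumptions for a local pseudo-distributed setup:

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Write a file to HDFS and read it back with the Java client.
public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/demo/hello.txt");        // assumed path

            // HDFS favors large, write-once, append-only files.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```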
Resource scheduling framework:
Docker has been extremely popular over the last year or two, and companies everywhere are building Docker-based container solutions. The best-known open-source container scheduling framework is Kubernetes (K8s), but there are also Hadoop's YARN and Apache Mesos. The latter two can schedule non-container clusters as well as container clusters, and are well worth learning.
The author suggests: learn YARN.
Distributed coordination frameworks:
All the major big data distributed frameworks need a set of common facilities, such as service discovery, leader election, distributed locks, and KV storage. These needs gave rise to distributed coordination frameworks. The oldest and most famous is Apache ZooKeeper; newer entrants include Consul and etcd. For big data engineering, a distributed coordination framework is a must to understand, and to some extent to understand in depth.
The author suggests: learn ZooKeeper; too many big data frameworks depend on it, such as Kafka, Storm, and HBase. A small registration example follows.
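To see why coordination frameworks matter, here is a sketch of service discovery with the plain ZooKeeper client: each worker registers an ephemeral node, and the live workers are simply the children of a parent znode. The /workers path and localhost:2181 address are assumptions, and the persistent parent node /workers must already exist:

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Service discovery in a few lines: each worker registers an EPHEMERAL node
// under /workers; if the worker dies, its session expires and the node
// vanishes, so the children of /workers are always the live workers.
public class WorkerRegistry {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Ephemeral + sequential: the numeric suffix doubles as an election order.
        String me = zk.create("/workers/worker-", "host:port".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("registered as " + me);

        List<String> live = zk.getChildren("/workers", false);
        System.out.println("live workers: " + live);
    }
}
```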
KV database:
Typical examples are Memcached and Redis; Redis in particular is developing very fast, and its concise API design and high throughput have won it an ever larger following. Even if you never touch big data, learning Redis is well worth it.
The author suggests: learn Redis, and if you have a solid C background, get familiar with its source code too; the codebase is not large anyway. A quick tour through a Java client follows.
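A short tour of Redis through Jedis, a common Java client, shows the flavor of its API. The key names here are made up, and a local redis-server on the default port 6379 is assumed:

```java
import redis.clients.jedis.Jedis;

// A taste of Redis: strings, atomic counters, TTLs, and lists.
public class RedisQuickstart {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.set("greeting", "hello");
            System.out.println(jedis.get("greeting"));    // hello

            jedis.incr("page:views");                     // atomic counter
            jedis.expire("page:views", 60);               // expire in 60 s

            jedis.lpush("queue:jobs", "job-1", "job-2");  // list as a simple queue
            System.out.println(jedis.rpop("queue:jobs")); // job-1
        }
    }
}
```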
Column-oriented databases:
The author spent a long time learning Oracle, but has to admit that relational databases have slowly faded from view; there are simply too many alternatives to the RDBMS today. Because row-oriented storage is ill-suited to ad-hoc queries over big data, column-oriented storage was developed. A typical column-oriented database is the open-source community's HBase. The idea of column-oriented storage also comes from a Google paper, Bigtable; if you are interested, it is well worth reading: https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf.
The author suggests: learn HBase, the most widely used open-source column-oriented store. A minimal read/write example with the Java client follows.
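Here is a minimal read/write round trip with the HBase Java client. The table and column names are made up; it assumes a table created beforehand in the HBase shell with create 'users', 'info':

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Write one row to HBase and read it back.
public class HBaseHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table users = conn.getTable(TableName.valueOf("users"))) {

            // Rows are keyed byte arrays; cells live inside column families.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            users.put(put);

            Result row = users.get(new Get(Bytes.toBytes("user-42")));
            byte[] name = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name)); // Ada
        }
    }
}
```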
Message queuing:
As the component that smooths out traffic peaks ("peak shaving and valley filling") in big data pipelines, the message queue is indispensable. There are many solutions in this space, including ActiveMQ and Kafka, and Alibaba has open-sourced RocketMQ. The leader among them is Apache Kafka: many of Kafka's design ideas dovetail with the philosophy of distributed stream processing. No wonder Jay Kreps, Kafka's original author, is one of today's leading figures in real-time streaming.
The author suggests: learn Kafka. Not only will it help you land a job (almost every big data job posting asks for Kafka :-), it will also deepen your understanding of the replicated-log-based data processing paradigm. A minimal producer sketch follows.
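Finally, a minimal Kafka producer shows the log-based paradigm in action: events are appended to a partitioned, replicated log, and consumers read them at their own pace, which is exactly the decoupling that smooths out traffic peaks. The topic name events and the localhost broker address are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Publish a few events to a Kafka topic; the broker appends them to a
// replicated log that downstream consumers read at their own pace.
public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // wait for the full in-sync replica set

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                // Records with the same key land in the same partition, in order.
                producer.send(new ProducerRecord<>("events", "user-42", "click-" + i));
            }
        }
    }
}
```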