A Big Data Learning Roadmap: a Mind Map from a Senior Oracle Technical Director

2025-03-26 Update | From: SLTechnology News&Howtos


Recently, many people have asked me how to get started with big data, and I thought about it for a long time before sitting down to write this article. On the one hand, I am still only a beginner in big data myself, and I am wary of embarrassing myself; on the other hand, the field is broad and deep, and it is genuinely hard to cover such a wide range of technologies in a single article. But as the joke goes, "spank the kids on a rainy day — you're idle anyway": I have always kept up my technical blogging, so today please allow me to share a few thoughts.

The purpose of this article is to lay out a clear learning route for big data beginners and help them start their journey. Given how dazzlingly varied the technologies in this field are, each beginner should adapt the path to his or her own actual situation.

What is the hottest topic in the IT industry right now? Nothing beats "ABC": AI + Big Data + Cloud, that is, artificial intelligence, big data, and cloud computing (cloud platforms). Each of these fields already has industry leaders blazing the trail; today we will discuss the big data direction.

Big data has many definitions, which I will not repeat here; the most authoritative is IBM's, which readers can look up for themselves. Since this article is about how to learn big data, we should first define the different roles in the field. Only then can you find your own position according to your actual situation and begin the learning process.

Roles

In my humble opinion, there are currently two kinds of roles in the big data industry:

Big data engineering

Big data analysis

These two roles are interdependent, yet they operate independently. What does that mean? Without big data engineering, big data analysis would have nothing to work with; but without big data analysis, I honestly cannot think of a reason for big data engineering to exist. It is like dating and marriage: the purpose of dating is marriage, and dating with no intention of marrying is just fooling around.

Specifically, big data engineering solves the problems of defining, collecting, computing, and storing data, so when designing and deploying such systems, big data engineers think first about high availability of the data: the engineering system must serve data to downstream business or analysis systems in real time. The analysis role, by contrast, focuses on how to use the data — how, after receiving data from the engineering system, to produce analysis that actually helps an enterprise or organization improve its business or its level of service. For big data analysts, then, the first problem is discovering and exploiting the value in the data, which may involve trend analysis, modeling, predictive analytics, and so on. To sum up briefly: the engineering role is concerned with collecting, computing (or processing), and storing data, while the analysis role performs higher-order computation on that data.

Which role do we belong to?

Now that we understand how roles in the big data field are classified, we naturally need to "take the right seat" and determine our own position, so that we can study in a targeted way. Two factors come into play here:

Professional knowledge background

Industry experience

"Professional knowledge background" here does not mean your degree or the school you attended, but your grasp of particular IT technologies. Even if you did not major in computer science, if you are passionate about the C language, even Dennis Ritchie, the father of C, would not dare look down on you. For our purposes, there are only two kinds of expertise:

Computer science knowledge, such as operating systems, programming languages, and how computers work.

Mathematical knowledge, meaning higher mathematics — calculus, probability and statistics, linear algebra, and discrete mathematics — not mathematics at the level of x² + y² = 1.

Industry experience means your work experience in related fields, which falls into three categories:

Newcomer (a "green hand")

An experienced engineer

Senior expert — who nowadays has a cooler title in the big data field: data scientist. Think of Andrew Ng (Wu Enda), former chief scientist at Baidu.

Okay, we can now define our own role according to the categories above. Taking myself as an example: "I am an engineer with a computer science background and a reasonable mathematical foundation (especially in calculus and linear algebra), but mathematical statistics and probability theory are my weak points." Also, it is best not to puff yourself up: if you do not have much prior experience, there is no shame in admitting you are a rookie — the key is to position yourself accurately. Once you have determined your position, map it to a specific big data role. Here are some basic rules:

If you have a solid programming foundation and a deep understanding of how computers interact and of the underlying technical principles of the Internet, but no deep grasp of mathematics and statistics, then big data engineering is probably your direction.

If you have some programming foundation (you know a high-level language such as Python) and strong mathematical skills, then big data analysis is the direction to work toward.

Learning route

Whichever role you belong to, there is a body of big data theory you must master, including but not limited to:

Data partitioning and routing: pick one typical partitioning algorithm to study, such as consistent hashing (https://en.wikipedia.org/wiki/Consistent_hashing).
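To get a feel for why consistent hashing limits data movement when nodes join or leave, here is a minimal Python sketch. It is an illustration of the general technique, not code from any particular system; the class and parameter names are my own.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []            # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Each physical node gets many positions on the ring,
        # which evens out the key distribution.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key):
        if not self._ring:
            raise KeyError("empty ring")
        h = self._hash(key)
        # First ring position clockwise from the key's hash (wraps around).
        idx = bisect.bisect_right(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```

The key property: when a node is removed, only the keys that were mapped to it move; every other key keeps its old owner.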

Replication and consistency:

Study the CAP theorem, treated almost as scripture in China though regarded as fairly ordinary abroad (https://en.wikipedia.org/wiki/CAP_theorem).

Idempotency (idempotent operations): the cornerstone of state management in many distributed systems (https://mortoray.com/2014/09/05/what-is-an-idempotent-function/).
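Why idempotency matters: in a distributed system, a network timeout forces the caller to retry, so the same request may be delivered more than once. An idempotent operation makes that safe. A minimal sketch, with an invented `PaymentLedger` as the example (the deduplication-by-request-id pattern is the point, not the names):

```python
class PaymentLedger:
    """Sketch of an idempotent operation: replaying the same request
    (e.g. after a retried network call) must not change state twice."""

    def __init__(self):
        self.balance = 0
        self._seen = set()        # request ids already applied

    def credit(self, request_id, amount):
        if request_id in self._seen:   # duplicate delivery: no-op
            return self.balance
        self._seen.add(request_id)
        self.balance += amount
        return self.balance
```

Calling `credit("r1", 10)` twice leaves the balance at 10, which is exactly the property a retrying client needs.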

The various consistency models: strong consistency, weak consistency, eventual consistency.

Replication: the term "master-slave" is out of fashion; the cooler term nowadays is the Leader-Follower model.

Consensus protocols (often translated in China as "consistency protocols"): learn a few common ones, namely Paxos and Raft.

Algorithms and data structures

LSM trees: learn how they differ from B+ trees and what their advantages are.
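The core LSM idea fits in a few lines: writes go to an in-memory table, full tables are flushed as immutable sorted runs, and reads check newest data first. A toy Python sketch under those assumptions (real LSM engines binary-search their runs, compact them in the background, and keep a write-ahead log — all omitted here):

```python
class TinyLSM:
    """Toy LSM tree: writes land in a memtable; when it fills, it is
    flushed as an immutable sorted run (an 'SSTable'). Reads check the
    memtable first, then runs from newest to oldest."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []        # newest run last; each is sorted (k, v)
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):   # newest run wins
            for k, v in run:                  # linear scan for brevity
                if k == key:
                    return v
        return None
```

Notice that writes never touch old data in place — this append-mostly pattern is what makes LSM trees so write-friendly compared with B+ trees.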

Compression algorithms: pick a mainstream one to understand, such as Snappy or LZ4. Facebook has also recently open-sourced a new-generation compression algorithm, Zstandard, which is said to outperform the mainstream alternatives.
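Snappy, LZ4, and Zstandard all need third-party Python bindings, so as a stand-in here is the same round-trip exercise with the standard library's zlib (DEFLATE). The shape of the API — compress, decompress, compare sizes — is what carries over:

```python
import zlib

def roundtrip_ratio(data: bytes, level: int = 6) -> float:
    """Compress with DEFLATE (zlib), verify the lossless round-trip,
    and return the compressed/original size ratio."""
    packed = zlib.compress(data, level)
    assert zlib.decompress(packed) == data   # lossless by construction
    return len(packed) / len(data)
```

Repetitive data (logs, columnar values) compresses dramatically, which is exactly why big data storage formats compress by default.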

Bloom filters: set-membership testing for big data in O(1).
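A Bloom filter answers "definitely not present" or "probably present" using a fixed bit array and k hash functions — no false negatives, a tunable false-positive rate. A compact sketch (sizes here are arbitrary illustration values; real deployments size m and k from the expected item count and target error rate):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: O(1) add/query, no false negatives,
    small tunable false-positive rate."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.m = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k hash positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Systems like HBase and Cassandra use exactly this trick to skip disk reads for keys that an SSTable definitely does not contain.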

Whether you are learning big data engineering or big data analysis, this theoretical knowledge is essential, because it underpins the design of many distributed systems. Now let us design learning routes for the two roles separately:

Big data engineer

As a big data engineer, you should master at least the following skills:

A JVM language: JVM languages account for such a large share of the big data ecosystem that calling it a near-monopoly would not be an exaggeration. I recommend learning Java or Scala; a language like Clojure is harder to pick up, and I do not really recommend it. Also, this is an era in which a hot framework can make its implementation language famous, as Docker did for Go and Kafka did for Scala. So I suggest becoming proficient in at least one JVM language. It is worth emphasizing that you should understand that language's multithreading model and memory model: the processing model of many big data frameworks resembles language-level multithreading, just extended to a multi-machine, distributed setting.

The author suggests: learn Java or Scala

Computational processing frameworks: strictly speaking, these split into offline batch processing and stream processing. Streaming is the future, and I strongly recommend learning it; offline batch processing, by contrast, is showing its age — its batch mindset cannot handle unbounded data sets, so its range of application narrows by the day. Google has in fact officially retired MapReduce-style offline processing internally. So if you want to learn big data engineering, mastering a real-time stream processing framework is essential. The current mainstream options are Apache Samza, Apache Storm, Apache Spark Streaming, and Apache Flink, which has been in the limelight over the past year. Apache Kafka has also launched its own streaming library, Kafka Streams.

The author suggests: learn one of Flink, Spark Streaming or Kafka Streams

Also read this article by Google's Tyler Akidau: "The world beyond batch: Streaming 101" at https://www.oreilly.com/ideas/th... Batch-streaming-101
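The central idea these frameworks share is windowed aggregation over an unbounded event stream. Here is a deliberately tiny pure-Python sketch of a tumbling-window word count — the kind of aggregation a Flink or Kafka Streams job expresses declaratively (the function name and event shape are my own; real engines also handle out-of-order events, watermarks, and state checkpointing):

```python
from collections import Counter, defaultdict

def tumbling_window_counts(events, window_ms):
    """Assign each (timestamp_ms, word) event to a fixed-size,
    non-overlapping time window and count words per window."""
    windows = defaultdict(Counter)
    for ts, word in events:
        window_start = ts // window_ms * window_ms   # bucket by window
        windows[window_start][word] += 1
    return dict(windows)
```

With 1-second windows, events at t=0 ms and t=500 ms land in window 0, while an event at t=1500 ms lands in window 1000.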

Distributed storage frameworks: although MapReduce is somewhat dated, HDFS, the other cornerstone of Hadoop, is still going strong and remains the most popular open-source distributed storage; you should definitely take the time to learn it. If you want to go deeper, Google's GFS paper is required reading (https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf). Of course, the open-source world has plenty of other distributed storage systems; Alibaba's OceanBase is an excellent one.

The author suggests: learn HDFS

Resource scheduling frameworks: Docker has been all the rage for the last year or two, and companies everywhere are building Docker-based container solutions. The most famous open-source container scheduler is Kubernetes (K8S), but there are also Hadoop's YARN and Apache Mesos. The latter two can schedule not only container clusters but non-container clusters as well, and are worth learning.

The author suggests: learn YARN

Distributed coordination frameworks: all mainstream big data distributed frameworks need certain common functions, such as service discovery, leader election, distributed locks, and KV storage, and these needs gave rise to distributed coordination frameworks. The oldest and best known is Apache ZooKeeper; newer ones include Consul and etcd. If you are learning big data engineering, a distributed coordination framework is something you must understand, and to some extent understand deeply.

The author suggests: learn ZooKeeper — too many big data frameworks depend on it, such as Kafka, Storm, and HBase.

KV databases: the typical examples are Memcached and Redis. Redis in particular is developing very fast; its clean API design and high TPS have won it an ever larger following. Even if you are not learning big data, learning Redis is well worth it.

The author suggests: learn Redis, and if your C is good, read the source code — there is not that much of it anyway.
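To make the KV + expiration idea concrete, here is a toy Python cache mimicking the semantics of Redis's `SET key value EX seconds` and `GET` — a rough approximation only (Redis combines this kind of lazy, on-read expiration with active background expiration, and of course persists and replicates; the class name and injectable clock are my own devices to keep the sketch testable):

```python
import time

class TTLCache:
    """Sketch of Redis-style SET ... EX / GET semantics."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}                    # key -> (value, expires_at | None)

    def set(self, key, value, ex=None):
        expires = self._clock() + ex if ex is not None else None
        self._data[key] = (value, expires)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and self._clock() >= expires:
            del self._data[key]            # lazy expiration on read
            return None
        return value
```

Injecting the clock makes expiry deterministic to exercise: set a key with `ex=5`, advance the fake clock past 5, and the key is gone.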

Column-oriented databases: I spent a long time learning Oracle, but I have to admit that relational databases have slowly faded from view, and there are now many alternatives to the traditional RDBMS. Because row-oriented storage is ill-suited to ad-hoc queries over big data, column-oriented storage was developed. A typical open-source column-oriented database is HBase. The column-family storage concept in fact also comes from a Google paper, on Bigtable; if you are interested, it is well worth reading: https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf

The author suggests: learn HBase, the most widely used open-source column-oriented store.

Message queues: as the main "peak-shaving" buffer in big data processing pipelines, a message queue is indispensable. There are many options in this space, including ActiveMQ, Kafka, and others; Alibaba has also open-sourced RocketMQ. The leader among them is Apache Kafka. Many of Kafka's design ideas fit the philosophy of distributed stream processing particularly well — no wonder Jay Kreps, Kafka's original author, is one of today's leading figures in real-time streaming.

The author suggests: learn Kafka. Not only will it help you find a job (almost every big data job posting asks for Kafka :-), it will also deepen your understanding of processing paradigms built on a replicated log.
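The "peak-shaving" role can be demonstrated in miniature with a bounded in-process buffer: a bursty producer and a steady consumer are decoupled, and when the buffer fills, the producer blocks instead of overwhelming the consumer. This is only the decoupling idea — real Kafka persists a partitioned, replicated log on disk rather than holding messages in memory:

```python
import queue
import threading

def run_producer_consumer(n_messages=100, buffer_size=10):
    """Bursty producer + steady consumer decoupled by a bounded queue."""
    buf = queue.Queue(maxsize=buffer_size)
    consumed = []

    def consumer():
        while True:
            msg = buf.get()
            if msg is None:               # sentinel: shut down
                break
            consumed.append(msg)

    t = threading.Thread(target=consumer)
    t.start()
    for i in range(n_messages):           # burst: put() blocks when full
        buf.put(i)
    buf.put(None)
    t.join()
    return consumed
```

Because there is a single consumer reading a single queue, ordering is preserved end to end — analogous to Kafka's per-partition ordering guarantee.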

Big data analyst or data scientist

To become a data scientist, you must master at least the following skills:

Mathematics: calculus must be solidly mastered. Multivariable calculus is not strictly required, but single-variable calculus must be fluent. Linear algebra must also be thoroughly mastered, especially matrix arithmetic, vector spaces, rank, and related concepts. Today's machine learning frameworks lean heavily on matrix multiplication, transposition, and inversion; although many frameworks provide these as ready-made tools, you should at least understand the principles underneath — for example, how to efficiently decide whether a matrix is invertible, and how to compute its inverse.

Review the Tongji University edition of Advanced Mathematics, or take the University of Pennsylvania's calculus course on Coursera.

For linear algebra, I recommend Gilbert Strang's "Introduction to Linear Algebra" — the classic textbook, bar none!
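On the invertibility point above: a square matrix is invertible iff its determinant is nonzero, and Gaussian elimination gives the determinant in O(n³) — far cheaper than computing the inverse outright. A self-contained sketch (pure Python lists stand in for a real linear algebra library; the zero tolerance is an illustrative choice):

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting.
    Returns 0.0 for (numerically) singular, i.e. non-invertible, matrices."""
    a = [row[:] for row in matrix]        # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        # Partial pivoting: pick the largest remaining entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0                    # singular: no inverse exists
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                    # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det
```

For example, [[1, 2], [3, 4]] has determinant -2 (invertible), while [[1, 2], [2, 4]] has determinant 0 because its rows are linearly dependent (rank 1).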

Mathematical statistics: probability theory and the standard statistical methods should be basically in hand. How do you compute a Bayesian posterior probability? What is a probability distribution? Deep expertise is not required, but you must understand the background and the terminology.

Find a probability textbook and work back through it.
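As a worked instance of the Bayesian probability question raised above, here is Bayes' rule for the classic diagnostic-test setup, P(D|+) = P(+|D)P(D) / [P(+|D)P(D) + P(+|¬D)P(¬D)] (the scenario and numbers are an illustration, not from the source):

```python
def bayes_posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' rule.

    prior               -- P(D), base rate of the condition
    sensitivity         -- P(+ | D), true-positive rate of the test
    false_positive_rate -- P(+ | not D)
    """
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence
```

With a 1% base rate, 99% sensitivity, and a 5% false-positive rate, the posterior is only about 1/6 — the counterintuitive result that makes this computation worth internalizing.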

Interactive data analysis frameworks: this does not mean SQL or database queries, but interactive analysis frameworks such as Apache Hive or Apache Kylin. The open-source community has many similar frameworks that let you apply traditional analysis methods to big data analysis and mining. I have hands-on experience with Hive and Kylin. Hive — especially Hive 1 — is built on MapReduce and does not perform particularly well. Kylin combines the data cube concept with a star schema to achieve very low-latency analysis. Kylin was also the first Apache incubator project whose core R&D team was mostly Chinese, so it has attracted more and more attention.

The author suggests: first learn Hive; if you have time, also look at Kylin and the data-mining ideas behind it.
