
What do you need to learn to work in big data?

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

What do you need to learn to work in big data? Many people have asked me this question. Every time I finished answering, I felt my answer was too one-sided, and I never had a suitable opportunity to summarize everything until I started writing this article. Big data is an industry that has sprung up over the past five years and is developing rapidly. Many of its technologies have matured through years of iteration, while new things keep emerging. The only way to stay competitive is to keep learning.

Mind map

The following is a mind map I put together. It is divided into several parts, including distributed computing and query, distributed scheduling and management, persistent storage, and the programming languages commonly used in big data. Each category contains many open source tools, which big data programmers both love and hate.

Big data development is fairly hard to learn. If you are starting from scratch, learn Java first to lay the foundation; covering Java SE and EE generally takes about three months. After that, move on to the big data technology stack itself, mainly Hadoop, Spark, Storm and so on.

The languages big data needs

Java

Java can be called the most basic programming language of big data. In my experience over the years, a large share of the big data developers I have met moved over from Java Web development (not all of them, of course; I have even seen product managers switch to big data development, which is quite a feat).

The first reason is that big data essentially comes down to computing over, querying and storing massive amounts of data, and back-end developers readily run into scenarios that involve massive data access.

The second reason is the natural advantage of Java skills. Many big data components are written in Java, such as HDFS, YARN, HBase, MapReduce and ZooKeeper. If you want to learn them in depth and fix the pitfalls you hit in production, you must first learn Java and then dig into the source code.

Speaking of digging into source code, it is bound to be very hard at the beginning. You need a deep understanding of both the component itself and the language it is written in. Practice makes perfect: once you get past this stage and become used to reading source code to solve problems, you will find that it is actually very satisfying.

Scala

Scala is very similar to Java: both run on the JVM, and code in one can call the other seamlessly during development. Most of Scala's influence in the big data field comes from the community stars Spark and Kafka, which you probably already know (I will cover them from several angles later). Their strong growth directly drove Scala's popularity in this field.

Python and Shell

Shell needs little introduction; it is a common skill that every programmer should have. Python is used more in data mining and for writing complex day-to-day scripts that are hard to implement in shell.

Distributed computing

What is distributed computing? Distributed computing studies how to divide a problem that requires a lot of computing power into many small parts, then assign these parts to many servers for processing, and finally synthesize these calculation results to get the final result.
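The divide, assign and combine steps described above can be sketched on a single machine. This is only an illustration, assuming a thread pool stands in for a cluster of servers; the function names and the sample text are made up for the example and do not come from any big data framework.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_chunks):
    """Divide the input into roughly equal parts, one per worker."""
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines):
    """The work done by one worker: count words in its own chunk only."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(lines, num_workers=3):
    """Split the job, process the parts in parallel, then merge the results."""
    chunks = split_into_chunks(lines, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # combine the partial results
    return total


lines = [
    "big data needs storage",
    "big data needs computing",
    "computing needs scheduling",
]
result = distributed_word_count(lines)
```

In a real cluster the chunks would live on different machines and the merge step would move data over the network, which is where most of the difficulty comes from.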

To give an example, it is like a team leader breaking up a big project, asking each team member to develop one part, and finally merging everyone's work to complete the project. It sounds simple, but anyone who has actually worked on a large project knows how much is involved.

For example, how should the big project be broken up? How are tasks assigned? What if everyone already has work on hand? What if people have different abilities? What if everyone progresses at a different pace? What happens to the work in hand if a team member falls sick during development and takes a long leave? What if the team leader, who directs everyone and keeps the work moving, takes leave? What if something goes wrong during the code merge? What if the project is delayed? What if the project fails in the end?

Think carefully about the ten questions above: each one corresponds to a problem that can arise in distributed computing. I won't spell out the mapping, since it is fairly obvious. Some people may feel these problems do not matter much in team development and need no special attention, but in a distributed computing system each one is a serious, fundamental problem that needs a good solution.
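One of these questions, what happens to the work when a team member goes on sick leave, maps onto worker failure and task reassignment. The following is a minimal sketch under stated assumptions: a worker signals failure by raising `RuntimeError`, and a failed task is simply handed to the next worker in line. The worker names and the function `run_with_reassignment` are hypothetical; real frameworks use far more elaborate strategies.

```python
def run_with_reassignment(tasks, workers, max_attempts=3):
    """Run each task, moving it to the next worker whenever one fails.

    `workers` maps a worker name to a callable. A worker signals failure
    by raising RuntimeError (a stand-in for a crashed or unreachable node).
    """
    names = list(workers)
    results = {}
    for index, task in enumerate(tasks):
        for attempt in range(max_attempts):
            name = names[(index + attempt) % len(names)]
            try:
                results[task] = (name, workers[name](task))
                break
            except RuntimeError:
                continue  # this worker is unavailable, try the next one
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return results


def healthy(task):
    return task * task


def on_sick_leave(task):
    raise RuntimeError("worker unavailable")


workers = {"alice": healthy, "bob": on_sick_leave, "carol": healthy}
results = run_with_reassignment([1, 2, 3, 4], workers)
```

Here task `2` is first offered to `bob`, fails, and is picked up by `carol`, so the job as a whole still completes.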

Finally, the popular tools for distributed computing are:

Offline tools: Spark, MapReduce, etc.

Real-time tools: Spark Streaming, Storm, Flink, etc.

We'll talk about the differences between these tools and their respective application scenarios later.
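As a rough preview of the offline versus real-time distinction, here is a minimal sketch in plain Python. It does not use Spark, Storm or Flink; the function and class names are made up for illustration. The offline version needs the complete dataset before it can answer, while the real-time version updates its answer as each record arrives.

```python
def batch_average(readings):
    """Offline style: the whole dataset is available before computing."""
    return sum(readings) / len(readings)


class StreamingAverage:
    """Real-time style: update the answer as each record arrives."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count


readings = [10, 20, 30, 40]

# Batch: one answer, computed after all the data has been collected.
final_answer = batch_average(readings)

# Streaming: an up-to-date answer after every single record.
stream = StreamingAverage()
running_answers = [stream.update(value) for value in readings]
```

Both approaches agree once all the data has been seen; the streaming version simply has a usable answer at every point along the way.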

Distributed storage

Traditional network storage systems use a centralized storage server to hold all the data. The I/O capacity of a single storage server is limited, so it becomes the bottleneck for system performance. At the same time, the reliability and security of a single server cannot meet requirements, especially for large-scale storage applications.

A distributed storage system stores data across multiple independent devices. It adopts a scalable architecture: multiple storage servers share the storage load, and location servers keep track of where the data is stored. This improves the reliability, availability and access efficiency of the system and also makes it easy to expand.
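The idea of storage servers sharing the load while a location server tracks where data lives can be sketched in memory. This is a toy model, assuming keys are spread by a hash of their name; the class `LocationServer` and the server names are hypothetical and do not correspond to any particular product.

```python
import hashlib


class LocationServer:
    """Remembers which storage server holds each key (toy model)."""

    def __init__(self, server_names):
        self.servers = {name: {} for name in server_names}  # name -> stored data
        self.locations = {}  # key -> name of the server holding it

    def _pick_server(self, key):
        # A stable hash spreads keys across servers to share the load.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        names = sorted(self.servers)
        return names[int(digest, 16) % len(names)]

    def put(self, key, value):
        name = self.locations.get(key) or self._pick_server(key)
        self.servers[name][key] = value
        self.locations[key] = name
        return name

    def get(self, key):
        name = self.locations[key]  # look up the location first
        return self.servers[name][key]

    def add_server(self, name):
        # Scaling out: existing data stays where the location index says it is.
        self.servers[name] = {}


cluster = LocationServer(["s1", "s2", "s3"])
for i in range(100):
    cluster.put(f"file-{i}", i)
cluster.add_server("s4")
```

Because every read goes through the location index, adding `s4` does not break access to data written earlier. The sketch keeps only one copy of each value, so it shows load sharing and expansion but not reliability.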

The figure above shows the storage architecture of HDFS. As a distributed file system, HDFS is both reliable and scalable. Three copies of each piece of data are stored on different machines (two in the same rack and one in another rack) so that data is not lost. The metadata is managed by the NameNode, and the cluster can be expanded at will.
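The "two in the same rack and one in another rack" rule can be illustrated with a short sketch. It is loosely modelled on the default placement policy described in the HDFS documentation (first copy on the writer's node, the other two on a different rack) and is a simplification, not HDFS's actual code, which also weighs factors such as node load and free space. The rack layout, node names and `place_replicas` function are hypothetical.

```python
def place_replicas(racks, writer_node):
    """Pick three nodes: one on the writer's rack, two together on another rack.

    `racks` maps a rack name to the list of node names it contains.
    """
    node_to_rack = {node: rack for rack, nodes in racks.items() for node in nodes}
    local_rack = node_to_rack[writer_node]

    for rack, nodes in racks.items():
        # The two remaining copies share one remote rack but sit on
        # different machines, so losing either rack still leaves a copy.
        if rack != local_rack and len(nodes) >= 2:
            return [writer_node, nodes[0], nodes[1]]
    raise ValueError("no remote rack with two nodes available")


racks = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4"],
    "rack3": ["n5"],
}

# The metadata a NameNode-like component would keep: block -> replica nodes.
block_locations = {"blk_0001": place_replicas(racks, "n1")}
```

With this layout the block written from `n1` ends up on `n1`, `n3` and `n4`: three different machines spread over two racks.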

There are many mainstream distributed databases, such as HBase, MongoDB, Greenplum and Redis. None of them is simply better or worse than the others; what matters is whether it fits. Each database has its own application scenarios, so comparing them directly is meaningless. I will write later articles explaining their application scenarios, principles, architecture and so on.

Distributed scheduling and management

Nowadays people seem very keen to talk about "decentralization", which may be a trend brought by blockchain. But "centralization" is still very important in the big data field, at least for now.

Distributed cluster management needs a component to allocate and schedule resources across the nodes; this component is YARN.

A component is needed to solve the "lock" problem in a distributed environment; this component is ZooKeeper.
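To make the "lock" problem concrete, here is a toy, in-memory sketch of the queue-style idea behind the lock recipe in the ZooKeeper documentation: each client takes a numbered ticket, and whoever holds the lowest number owns the lock. It is not a ZooKeeper client and talks to no server; the class `InMemoryLockService` and the job names are hypothetical.

```python
import itertools


class InMemoryLockService:
    """Toy stand-in for a coordination service; not a ZooKeeper client."""

    def __init__(self):
        self._sequence = itertools.count()
        self._waiting = {}  # ticket number -> client name

    def request(self, client):
        """Register the client in the queue and return its ticket number."""
        ticket = next(self._sequence)
        self._waiting[ticket] = client
        return ticket

    def holder(self):
        """The client with the lowest ticket holds the lock."""
        if not self._waiting:
            return None
        return self._waiting[min(self._waiting)]

    def release(self, ticket):
        """Remove a ticket, e.g. on unlock or when the client disappears."""
        self._waiting.pop(ticket, None)


service = InMemoryLockService()
ticket_a = service.request("job-a")
ticket_b = service.request("job-b")
first_holder = service.holder()   # job-a asked first, so it holds the lock
service.release(ticket_a)
second_holder = service.holder()  # the lock passes to the next in line
```

The hard part in a real cluster, which this sketch skips, is making every machine agree on the ticket order and noticing when a lock holder has died.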

A component is needed to record task dependencies and run tasks on a schedule; this component is Azkaban.
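Recording task dependencies and deriving a valid run order is, at its core, a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). It only illustrates the idea and is not how Azkaban is configured or implemented; the job names are invented.

```python
from graphlib import TopologicalSorter

# Each job lists the jobs that must finish before it can start.
dependencies = {
    "clean_logs": {"import_logs"},
    "clean_orders": {"import_orders"},
    "load_to_warehouse": {"clean_logs", "clean_orders"},
    "daily_report": {"load_to_warehouse"},
}


def execution_order(deps):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

Any order the sorter returns runs the import jobs before the cleaning jobs and `daily_report` last. A circular dependency makes `static_order()` raise `graphlib.CycleError`, which is the kind of mistake a scheduler has to reject up front.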

Of course, these are not the only options; there are many alternatives. I have only given a few commonly used examples here.

A few more words

Having answered this question, I want to say something else. Recently, after thinking it over for a long time, I decided to start writing a series of articles to record what I have learned over the years. Not knowing where to start, I drew the mind map at the beginning of this article to set the general direction. We all know that mainstream big data technology changes and iterates quickly and that new things keep being added, so the content of the picture will grow as needed. I will settle the details as I write, and you are welcome to give me suggestions. I will keep the picture and the directory below up to date with what I have written.

About grouping

Grouping the big data components above was actually quite a struggle, especially for a programmer with a touch of obsessive-compulsiveness. Some components would fit just as well in other groups, and I did not want so many groups that the map looked messy, so the grouping in the picture above is somewhat subjective and certainly not absolute.

For example, a message queue like Kafka is generally not grouped with databases or with file systems like HDFS, but it also provides distributed persistent storage, so I put them together. There is also the time series database OpenTSDB, which is actually an application built on HBase. I think it is more about querying and how the data is laid out than about storage itself, so I subjectively placed it under "distributed computing and query", together with the OLAP tools.

There are many similar cases, and you are welcome to raise any objections.

Purpose

Everyone knows that big data technology changes with each passing day, and a programmer must keep learning to remain competitive. The purpose of these articles is fairly simple: first, they serve as notes for organizing what I know; second, I hope they help some people understand and learn big data.
