
What programming foundation do you need to learn big data? What are the learning steps for big data?

2025-02-25 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

What is big data?

Many friends have asked me: what exactly is big data? To sum it up in one sentence:

For friends outside the software industry

Based on your everyday consumption behavior at supermarkets, gas stations, restaurants, and other places, big data technology can infer your approximate age, whether you are married, whether you have children and how old they are, whether you own a home, and the price range of your car.

For friends in the software industry

Ordinarily, the programs we write run on a single machine, whose processing power, and therefore the amount of data it can handle, is limited. Big data technology distributes our code across many machines to process huge volumes of data in parallel, and then extracts valuable, meaningful information from that massive data.

The basic skills needed to learn big data

A Linux foundation is essential: at a minimum, you need to master the basic commands on the Linux command line.

JavaSE fundamentals [including MySQL]. Note that this is JavaSE, not JavaEE: JavaWeb knowledge is not required for a big data engineer.

How the big data technology stack is divided

Data collection

Flume, Kafka, Logstash, Filebeat...

Data storage

MySQL, Redis, HBase, HDFS...

Although MySQL does not fall into the big data category, I list it here as well, because you cannot do without it at work.

Data query

Hive, Impala, Elasticsearch, Kylin...

Data computation

Real-time computing

Storm, Spark Streaming, Flink...

Offline computing

Hadoop, Spark...

Other frameworks

ZooKeeper...

In fact, learning big data means learning the various frameworks in the big data ecosystem.

Learning steps for big data

Although many frameworks are listed above, you do not need to learn them all at the start; even at work, you may never use all of them.

Here is a rough order in which to learn the various frameworks:

Note: the order below is only a personal suggestion; adjust it to your actual situation.

Linux foundation and JavaSE foundation [including MySQL]

These are the basic skills. You will not be proficient right away; at a minimum, you should be familiar with some basic Linux commands, since you will use them when learning the various frameworks later, and the more you use them, the more familiar they become. For JavaSE, I suggest focusing on object orientation, collections, IO, multithreading, and JDBC operations.

ZooKeeper

ZooKeeper underpins many big data frameworks. Its name means "zoo keeper": many current big data projects use animal-shaped logos, and ZooKeeper coordinates many of those frameworks. For this framework, focus on how to set up a single node and a cluster, and how to create, delete, update, and query znodes from the zkCli client.
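The basic znode operations look like the following illustrative zkCli session (it needs a running ZooKeeper server, and the paths and data are examples):

```
# inside bin/zkCli.sh, connected to a ZooKeeper server
ls /                    # list the children of the root znode
create /app "hello"     # create a znode with initial data
get /app                # read the znode's data
set /app "world"        # update the znode's data
delete /app             # remove the znode
```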

Hadoop

Enterprises generally use the Hadoop 2.x line at present, so there is no need to learn Hadoop 1.x. Hadoop 2.x mainly consists of three parts.

HDFS: early on, mainly learn the HDFS commands: upload, download, delete, move, view, and so on.
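The commands just mentioned look like this (the paths are examples, and the commands need a running cluster):

```
hdfs dfs -ls /data                       # view a directory listing
hdfs dfs -put local.txt /data/           # upload a local file
hdfs dfs -get /data/local.txt ./         # download a file
hdfs dfs -cat /data/local.txt            # view a file's contents
hdfs dfs -mv /data/local.txt /archive/   # move a file
hdfs dfs -rm /archive/local.txt          # delete a file
```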

MapReduce: this deserves focused study. Understand the principles of MR and how to implement it in code. Although you will rarely write raw MR code in real work, you should understand how it works.
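The map/shuffle/reduce flow can be sketched in plain Python with the classic word count. This mimics the three phases on one machine; the real framework runs them distributed across many nodes:

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group all values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts emitted for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data hello", "hello big data", "hello hadoop"]
result = reduce_phase(shuffle(map_phase(lines)))
```

Here `result` is `{"big": 2, "data": 2, "hello": 3, "hadoop": 1}`; in a real job, each phase would run on a different set of machines.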

YARN: you only need to know that YARN is a resource scheduling platform, mainly responsible for allocating resources to tasks. YARN can schedule resources not only for MapReduce jobs but also for Spark jobs; it is a general-purpose resource scheduling platform, and any framework that meets its requirements can use YARN for resource scheduling.

Hive

Hive is a data warehouse; all of its data is stored on HDFS. For the specific differences between a data warehouse and a database, you can search online; there are many introductions. If you are already familiar with MySQL, Hive is much easier to pick up. The main task in using Hive is writing HQL, Hive's SQL dialect, which is very similar to MySQL's SQL; focus on Hive's particular syntax features as you learn it. Note that although Hive executes HQL, under the hood each query still runs as a MapReduce program.
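An illustrative HQL query shows how close it is to ordinary SQL (the table and column names here are invented for the example):

```sql
-- daily page views and unique visitors over one week
SELECT dt,
       COUNT(*)               AS pv,
       COUNT(DISTINCT user_id) AS uv
FROM user_visits
WHERE dt BETWEEN '2024-01-01' AND '2024-01-07'
GROUP BY dt
ORDER BY dt;
```

When Hive runs this, the GROUP BY and DISTINCT are translated into MapReduce shuffle and reduce steps behind the scenes.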

Note: Hive itself is very powerful, and data warehouse design matters a great deal at work, but early on you should focus on learning how to use it. You can come back to Hive in depth later.

HBase

HBase is a NoSQL, key-value database whose underlying data is stored on HDFS. When learning HBase, focus mainly on row-key design and column-family design. Note that queries by row key are very efficient in HBase and can return within seconds, but queries on columns within a column family, especially combined queries, perform very poorly when the data volume is large.
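A toy in-memory model shows why row-key scans are fast: rows in a region are kept sorted by row key, so a key prefix selects one contiguous slice instead of requiring a full scan. The row keys below are invented for illustration:

```python
import bisect

# toy model of a region: row keys are kept in sorted order, as HBase does
row_keys = sorted([
    "user001#2024-01-01",
    "user001#2024-01-02",
    "user002#2024-01-01",
    "user003#2024-01-05",
])

def prefix_scan(keys, prefix):
    # a prefix maps to one contiguous slice of the sorted keys,
    # so binary search finds the range without scanning everything
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix + "\xff")
    return keys[lo:hi]

matches = prefix_scan(row_keys, "user001")
```

This is why designs like `userId#date` work well: all rows for one user are adjacent. Filtering on a column value, by contrast, has no such ordering to exploit.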

Redis

Redis is also a NoSQL key-value database, but it is purely memory-based: the data in a Redis database is stored in memory. One of its characteristics is therefore suitability for fast read/write scenarios; reads and writes can reach 100,000 operations per second. It is not suitable for storing massive data, since a machine's memory is, after all, limited.

Of course, Redis also supports clustering, which lets it store larger amounts of data. When learning Redis, focus mainly on the differences and uses of the string, list, set, sorted set, and hash data types, as well as pipelines, which are very useful when bulk-loading data, and the transaction feature.
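Why pipelines help with bulk loading can be sketched with a toy client (this is not the real redis-py API, just an invented model that counts network round trips):

```python
class ToyRedisClient:
    """Toy key-value client that counts round trips to a pretend server."""

    def __init__(self):
        self.store = {}
        self.round_trips = 0

    def set(self, key, value):
        self.round_trips += 1          # each command pays one network round trip
        self.store[key] = value

    def pipeline_set(self, items):
        self.round_trips += 1          # all commands travel in a single round trip
        for key, value in items:
            self.store[key] = value

# loading 100 keys one command at a time: 100 round trips
client = ToyRedisClient()
for i in range(100):
    client.set(f"key:{i}", i)
unbatched = client.round_trips

# loading the same 100 keys through a pipeline: 1 round trip
client2 = ToyRedisClient()
client2.pipeline_set([(f"key:{i}", i) for i in range(100)])
batched = client2.round_trips
```

The data ends up the same either way; the pipeline just removes the per-command network latency, which dominates when loading in bulk.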

Flume

Flume is a log collection tool, and quite a commonly used one; the most common case is collecting data from the log files that applications generate. There are generally two pipelines: in one, Flume collects data into Kafka for later real-time processing with Storm or Spark Streaming; in the other, Flume delivers the data to HDFS for later offline processing with Hadoop or Spark. When learning Flume, the main skill is reading the official Flume documentation and learning the configuration parameters of the various components, because using Flume mostly means writing configuration.
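A minimal agent configuration for the second pipeline (log file to HDFS) looks roughly like this; the agent name `a1`, the file paths, and the namenode URL are all placeholders:

```
# one source (tail a log file), one memory channel, one HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

Swapping the sink type is what turns this into the Kafka pipeline, which is why learning Flume really is mostly learning the component parameters.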

Kafka

Kafka is a message queue that acts as an intermediate buffer layer, for example flume -> kafka -> storm/sparkstreaming, a pattern often used in real-time processing. When learning Kafka, focus mainly on the concepts and principles of topics, partitions, replicas, and so on.
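The partition concept can be sketched as follows: a keyed message is assigned to a partition by hashing its key modulo the partition count, so every message with the same key lands in the same partition (and thus keeps its order). Kafka's real default partitioner uses murmur2; CRC32 stands in here for determinism:

```python
import zlib

NUM_PARTITIONS = 3  # example topic with three partitions

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # hash the message key and take it modulo the partition count,
    # so identical keys always map to the same partition
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("order-1001")
p2 = partition_for("order-1001")   # same key, guaranteed same partition
```

Ordering is only guaranteed within a partition, which is why choosing a good message key matters.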

Storm

Storm is a real-time computing framework. The difference from Hadoop is that Hadoop processes large volumes of data offline, while Storm processes each new piece of data in real time as it arrives, which keeps processing timely. When learning Storm, focus mainly on writing topologies, adjusting Storm's parallelism, and integrating Storm with Kafka to consume data in real time.

Spark

Spark is also developing very well and has grown into an ecosystem of its own. Spark contains many technologies: Spark Core, Spark Streaming, Spark MLlib, Spark GraphX.

The Spark ecosystem includes offline processing (Spark Core) and real-time processing (Spark Streaming). Note that Storm and Spark Streaming are both real-time processing frameworks, but the main difference is that Storm processes records one at a time as they arrive, while Spark Streaming processes data in small batches (micro-batches).
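The record-at-a-time versus micro-batch distinction can be sketched in plain Python (the function names are invented for illustration; neither framework's real API looks like this):

```python
def record_at_a_time(stream, process):
    # Storm-style: every event is handled the moment it arrives
    return [process(event) for event in stream]

def micro_batches(stream, process, batch_size):
    # Spark Streaming-style: events are buffered and handled a small batch at a time
    batches = [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]
    return [[process(event) for event in batch] for batch in batches]

events = [1, 2, 3, 4, 5]
per_record = record_at_a_time(events, lambda x: x * 10)
batched = micro_batches(events, lambda x: x * 10, batch_size=2)
```

Both produce the same processed values; micro-batching trades a little latency (events wait for their batch) for higher throughput per scheduling decision.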

Spark contains many components, so at the start focus mainly on Spark Core and Spark Streaming; these are what people working with big data typically use. Spark MLlib and Spark GraphX can wait until work requires them or you have spare time.

Elasticsearch

Elasticsearch is a full-text search engine suited to real-time queries over massive data, with support for distributed clusters; under the hood it is based on Lucene. It supports fast fuzzy queries and count, distinct, sum, and avg operations, but it does not support join operations.
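A query combining a fuzzy full-text match with an avg aggregation looks like this in the Elasticsearch query DSL (the index and field names `title` and `price` are invented; the body would be sent to an index's `_search` endpoint):

```json
{
  "query": {
    "match": { "title": "big data" }
  },
  "aggs": {
    "avg_price": { "avg": { "field": "price" } }
  },
  "size": 10
}
```

The `match` clause is the analyzed full-text search, while everything under `aggs` plays the role that count/sum/avg play in SQL.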

Elasticsearch also has its own ecosystem. ELK (Elasticsearch, Logstash, Kibana) is a classic solution for log collection, storage, and quick chart-based querying. When learning Elasticsearch, focus mainly on how to create, delete, update, and query documents, the concepts of index, type, and document, and mapping design.

That is all I will list for now. There are many other good technical frameworks in the big data ecosystem; you can expand into them later as your work requires.

Of the dozen or so frameworks listed above, pick one or two to study in depth, ideally down to the underlying principles, optimization, and source code; that is what lets you stand out in interviews. Do not try to master every framework; that is not realistic at this stage.

If you can generally use the frameworks above and have studied one or two of them more deeply, you will be able to find a satisfying big data job.




© 2024 shulou.com SLNews company. All rights reserved.