
How to analyze big data in Spark




Many newcomers are not very clear about how to carry out big data analysis with Spark. To help everyone with this problem, the editor explains it in detail below; anyone who needs it can come and learn, and I hope you gain something from it.

Background

The previous basic tutorial on Flask is over for the time being, because what it covers already meets my needs; the rest is just modification and optimization according to requirements, and nothing new is involved. So while I keep maintaining that, let's talk about Spark. Why Spark? Because our performance testing requires creating a billion or more records. The normal way certainly won't work; you have to use Spark and submit the job to YARN to run it. So now let's talk about big data. At the same time, big data is also the foundation of artificial intelligence, so doing something with big data now will also pave the way for discussing artificial intelligence testing later.

Origin

Everything is difficult at the beginning. When I first came into contact with big data, I was confused every day: I had only dealt with databases before, had never heard of the Hadoop ecosystem at all, and couldn't understand the materials I read. So let me first introduce the basic concepts.

Big data, first of all, means you must be able to store big data. Traditional databases evolved master-slave replication and sharding, but they still cannot cope with data volumes of TB or even PB in storage, and especially in computation they cannot break through the limits of a single machine, because our traditional file systems are single-machine and cannot span different machines. In recent years, with the rise of the Internet, we entered the era of data explosion. Traditional ways of storing data increasingly could not keep up with the growth of data, in either storage capacity or computing performance, and there was great demand in the industry for a new way to handle data. Then, in 2004, Google published the MapReduce paper, detailing Google's principles of distributed computing. The industry realized that data could be handled this way, but Google, rather unkindly, only published the paper and did not open-source the system, which left a lot of people scratching their heads in frustration. Later, Apache organized a group of people to piece together Hadoop based on Google's papers. By now the Hadoop ecosystem has been developing for more than 10 years (yes, the seemingly high-end Hadoop technology we study today is what Google had already played with and set aside).

The emergence of Hadoop

As mentioned earlier, our traditional file systems are single-machine and cannot span different machines. The emergence of HDFS (Hadoop Distributed File System) breaks this single-machine limitation. HDFS is a distributed file system developed by Apache, designed essentially so that a huge amount of data can span hundreds of machines while what you see is still one file system rather than many. For example, if I want to fetch the data at /hdfs/gaofei/file1, I refer to a single file path, but the actual data is stored on many different machines. As users, we don't know the actual physical storage layout; all we know is the logical path exposed to us. Once we have the ability to hold such a large amount of data, we start to think about how to process it. Although HDFS manages data across different machines and abstracts a unified interface for us, this still does not change the fact that the data is very large. If we still processed this huge amount of data on a single machine, the performance would be unacceptable. So if we want to process the data on many machines at the same time, we face the problem of how the machines communicate and are scheduled. This is what MapReduce/Spark does. MapReduce is the first-generation product, and the Hadoop developed by Apache is built on the MapReduce framework (based on Google's paper); Spark is the second generation. MapReduce adopts a very simplified model with only two computing stages, Map and Reduce (connected by a shuffle).
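As a minimal sketch of that abstraction (assuming a PySpark environment; the namenode address is a placeholder), reading the logical path looks the same as reading any single file, even though the blocks behind it live on many machines:

from pyspark import SparkContext

sc = SparkContext(appName="hdfs-read-sketch")

# One logical path; HDFS decides which machines actually hold the blocks.
# "namenode:8020" is a placeholder for your cluster's namenode address.
lines = sc.textFile("hdfs://namenode:8020/hdfs/gaofei/file1")
print(lines.count())

sc.stop()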

MapReduce

So what is MapReduce? Take the most common example, word count. Suppose you need to count the frequency of every word in a huge file. First, many machines read different parts of the file concurrently, and each computes over the part it read. If my machine reads part of the data, it produces results like (Hello, 100 times) and (word, 1000 times) for that data, and every machine reads its own portion and does the same thing. This is the Map stage in MapReduce (there are actually other operations in the middle, but I won't try to explain them here). Then we enter the Reduce stage, which also starts many machines concurrently, and the framework places the data from the Map machines onto these Reduce machines according to certain rules. For example, all occurrences of the word Hello go to reduce machine A, and all the data for the word word goes to reduce machine B. Machine A then aggregates the Hello results from all the Map outputs and computes that the word appears 1000 times in the data; machine B aggregates the word word and computes that it appears 1000 times. In this way we have counted the word frequencies of this huge document. That is MapReduce: it can be simply understood as the concurrent machines of the Map stage first reading different data blocks, and then the concurrent machines of the Reduce stage aggregating the Map output according to rules and doing the second step of processing. In between there is a very important process called shuffle, which for now can be understood as the rule deciding which Map output goes to which Reduce. Shuffle is a little complicated, so we'll leave the details for later.
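To make the two stages concrete, here is a rough Hadoop-Streaming-style sketch in Python (the file names mapper.py and reducer.py are just illustrative, not the author's code): the Map side emits (word, 1) pairs, and the Reduce side receives them grouped by word after the shuffle and sums them.

# mapper.py -- Map stage: read raw text from stdin, emit "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- Reduce stage: input arrives sorted/grouped by word (the shuffle),
# so we only need to sum the counts of each consecutive run of the same word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")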

Spark

MapReduce's model is simple and brute-force, but the programs are genuinely troublesome to write, because everything depends on the programmer's code: the framework only provides the Map and Reduce functions, and you have to write all the logic inside them yourself. So Pig and Hive appeared. I don't know much about Pig; Hive is based on SQL, and it translates SQL into MapReduce programs. With Hive, people found that SQL is very easy to write, much more convenient than writing Java code. For example, in our product there is a dedicated SQL operator that lets business staff also use SQL to work on tables. But people found that Hive running on MapReduce is very slow, which is really unacceptable, so after the evolution of several engines, Spark and SparkSQL came into being. Spark not only has a new-generation computing engine (it runs faster), it also has many built-in methods for manipulating data, so our programs are now both faster and easier to write. Suppose we have a requirement: count the number of words in a document that contain the letter a and the number that contain the letter b. You can write something like this:
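(The original code snippet did not survive here, so below is a minimal PySpark sketch of what such a program could look like; the input path is a placeholder, and counting at word granularity is an assumption based on the description above.)

from pyspark import SparkContext

sc = SparkContext(appName="letter-count-sketch")

# Placeholder input path; split the document into words first.
words = sc.textFile("hdfs:///input/some_document.txt").flatMap(lambda line: line.split())

# filter keeps only the matching words; count tallies how many are left.
numAs = words.filter(lambda w: "a" in w).count()
numBs = words.filter(lambda w: "b" in w).count()

print("words with a: %d, words with b: %d" % (numAs, numBs))

sc.stop()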

numAs and numBs are the results of our statistics. You can see that Spark provides a filtering function, filter, and a built-in counting function, count. We no longer have to write as much logic as we did with MapReduce before. At the same time, SparkSQL also lets us write SQL that gets translated into this kind of code.
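For example, a minimal SparkSQL sketch (the input file and column names here are hypothetical) lets you run plain SQL against the same distributed data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-sketch").getOrCreate()

# Hypothetical input file and columns, registered as a temporary SQL view.
df = spark.read.json("hdfs:///input/people.json")
df.createOrReplaceTempView("people")

# SparkSQL translates the SQL into the same distributed computation.
spark.sql("SELECT name, COUNT(*) AS n FROM people GROUP BY name").show()

spark.stop()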

Has reading the above content helped you? If you want to learn more about this topic or read more related articles, please follow the industry information channel. Thank you for your support.
