Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to use Apache Spark and MySQL to realize data Analysis

2025-01-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/02 Report--

This article focuses on "how to use Apache Spark and MySQL to achieve data analysis", interested friends may wish to take a look. The method introduced in this paper is simple, fast and practical. Let's let the editor take you to learn how to use Apache Spark and MySQL to analyze data.

Apache Spark

Contrary to popular belief, Spark does not need to store all its data in memory, but uses caching to speed up operations (like MySQL). Spark can also run independently without Hadoop, and can run on a single server (even on laptops or desktops), and make full use of all CPU cores. It's really easy to turn it on and use distributed mode. Open master first and run slave on the same node:

Then run Spark worker on any additional nodes (make sure you add hostname to / etc/hosts or use DNS):

Why use Spark instead of MySQL?

MySQL (out of the box) does not perform very well in many tasks. One of the limitations of MySQL is that 1 query = 1 CPU kernel. In other words, even if you have 48 fast cores and a large dataset available, you won't be able to make full use of all the computing power, while Spark can make full use of the CPU kernel.

Another difference between MySQL and Spark is:

L MySQL uses the so-called "write-time mode (schema on write)"-- you need to convert the data into MySQL, and if it's not included in MySQL, you can't query it using sql.

L Spark (and Hadoop/Hive) uses "schema on read"-- such as using a table structure (or other supported input format) at the top of a compressed txt file, as a table; then we can query the "table" with SQL.

In other words, MySQL is responsible for storage + processing, while Spark is only responsible for processing and can directly channel data to external data sets (Hadoop, Amazon S3, local files, JDBC MySQL, or other data sets). Spark supports txt files (compressed), SequenceFile, other Hadoop input formats, and Parquet column storage. Spark is more flexible than Hadoop in this respect: for example, Spark can read data directly from MySQL.

A typical pipeline (pipeline) for loading external data into MySQL is:

1. Decompress (especially the external data compressed into txt files)

2. Use the "LOAD DATA INFILE" command to load it into the storage table of MySQL

Only in this way can we filter / group and save the results to another table.

This leads to extra overhead; in many cases, we don't need "raw" data, but we still need to load it into MySQL.

Why use Spark with MySQL:

Instead, the results of our analysis, such as aggregate data, should be stored in MySQL. It is not necessary to store the analysis results in MySQL, but it is more convenient. Suppose you want to analyze a big data collection (that is, annual sales comparison), you need to show it in the form of a table or chart. Because of the aggregation, the result set will be much smaller, and it will be much easier to store it in MySQL and work with many standard programs.

Real case

An interesting free dataset is the number of pages in Wikipedia (since it was launched in 2008, it has been compressed larger than 1TB). This data can be downloaded (compressed space determines the txt file) and is also available on AWS (limited data sets). Data is aggregated on an hourly basis, including the following fields:

L project (for example, en,fr, etc., usually a language)

L header (uri), encoded in urlencode

L number of requests

L returns the size of the content

(the data field is compiled into the file name, 1 file per hour)

Our goal is to find the top 10 pages in the English version of wiki with the highest number of daily requests, but also to support the search for arbitrary words to facilitate interpretation and analysis. For example, compare the number of article requests for "Myspace" and "Facebook" between 2008 and 2015. If you use MySQL, you need to load it intact into MySQL. All files are distributed by built-in date code. The total size of decompression is larger than that of 10TB. Here are the optional steps (typical MySQL approach):

1. Extract the file and run the "LOAD DATA INFILE" command to enter it into the temporary form:

2. "insert" into the final table for aggregation:

3. Decode the title through url (UDF may be used).

Expensive: decompress and convert the data to MySQL format, most of which will be discarded as loss.

According to my statistics, it takes more than a month to sort out the data over the past 6 years, not including the decompression time, as the table gets larger and the index needs to be updated. Of course, there are many ways to speed up this process, such as loading different MySQL instances, first loading memory tables and then assembling them into InnoDB, and so on.

But the easiest way is to use Apache Spark and Python scripts (pyspark). Pyspark can read out the original compressed txt file, query it with SQL, use filtering, similar urldecode functions, etc., group by date, and then save the result set to MySQL.

The following is the Python script that performs the action:

Spark is used in the script to read the original compressed file (one day at a time). We can use directories as "input" or as a list of files. The format is then converted using resilient distributed datasets (RDD); Python contains lambda function mapping and filtering, which allows us to separate and filter "input files".

The next step is to apply the declare fields; we can also use other functions, such as urllib.unquote to decode the title (urldecode). Finally, we can register the temporary form and use the familiar SQL to complete the grouping.

This script can make full use of the CPU kernel. In addition, even if you don't use Hadoop, it's easy to run in a distributed environment: just copy the files to SparkNFS/ external storage.

The script took an hour and used three box to process a month's data and load the aggregated data onto the MySQL (single instance). We can estimate that it takes about 3 days to load all 6 years of (aggregated) data into MySQL.

You may ask why it is much faster now (and the same example has been implemented). The answer is: the pipes are different and more efficient. In our original MySQL pipeline, the raw data was loaded and took about several months to complete. In this case, we filter, group, and then write only what we need to MySQL at read time.

There is another question: do we really need the whole "pipeline"? Is it possible to simply run an analysis query on "raw" data? The answer is: it is possible, but it may take a 1000-node Spark cluster to work, because the amount of data that needs to be scanned is as high as 5TB (see "supplement" below).

Multithreading performance of MySQL Inserts

By using group_res.write.jdbc (url=mysql_url, table= "wikistats.wikistats_by_day_spark", mode= "append"), Spark initiates multithreaded insertion.

Monitor your work.

Spark provides a web interface to facilitate the monitoring and management of work. The example is as follows: run wikistats.py application:

Result: using Parquet separate format and MySQL InnoDB table

Spark supports Apache Parquet split format, so we can save RDD as a parquet file (which can be saved to a directory when you save it to HDFS):

We store the pipeline results (aggregate data) in Spark. This time, daily partitions ("mydate=20080101") are used, and Spark can automatically discover partitions in this format. Make a query after you get the results. Suppose we want to find the top 10 most frequently queried wiki pages in January 2018. You can query with MySQL (you need to get rid of the home page and search page):

Note that we have used aggregate (data summary) tables instead of "raw" data. We can see that the inquiry took 1 hour and 22 minutes. Since you saved the same result in Parquet (see script), you can now use it in Spark-SQL:

This will use the local version of spark-sql and only 1 host.

It takes about 20 minutes, which is faster than before.

Conclusion

Apache Spark is a good way to analyze and aggregate data, and it's very simple. The reason why I like Spark and other big data and analytical framework is:

L Open Source and active Development

L do not rely on tools, such as input data and output data do not have to rely on Hadoop

L stand-alone mode, quick startup and easy deployment

Massively parallel, easy to add nodes

Support multiple input and output formats; for example, read / write MySQL (Java database connection driver) and Parquet separate formats

However, there are many disadvantages:

L the technology is too new, there will be some bug and illegal behavior. Many mistakes are difficult to explain.

L requires Java;Spark 1.5 to support only Java 7 and above. This also means that extra memory is needed-- it makes sense.

You need to run the task through "spark-submit".

I think Apache Spark is very good as a tool, making up for MySQL's shortcomings in data analysis and business intelligence.

At this point, I believe you have a deeper understanding of "how to use Apache Spark and MySQL to achieve data analysis". You might as well do it in practice. Here is the website, more related content can enter the relevant channels to inquire, follow us, continue to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report