

Quick wit ~ using Spark Core to complete "ETL"!


Background introduction:

Today I received a small task from my boss: develop a program that extracts data from the database and generates reports (this is a feature of our database audit platform, which is about to go live). To generate a report you first need data, so the plan was to extract the business tables' data from the test environment of the business system and load it into the Mysql instance on my CVM (cloud host).

I thought the whole thing could be done with two commands, "select ... into outfile" and "load data infile ...", but there was a surprise: when the txt file exported from the test environment was loaded on the CVM, warnings such as "Row 1 doesn't contain data for all columns" were reported, and the data in the table was, unsurprisingly, messy and incomplete.

After careful analysis, the problem seems to lie in one of two places:

1. The network segment of the test environment is isolated, so to get the data produced by "select ... into outfile" I turned on CRT's session logging and ran "cat xxx.txt", capturing the data locally in a roundabout way before uploading it to the CVM.

2. The Mysql in the test environment and the Mysql on the cloud host differ in minor version.

Neither of these problems looks easy to fix. All I have now are the text files, so what to do? Just use Spark!

I had previously written a blog post on using Spark to analyze Mysql slow logs, and I am familiar with the various operators of Spark core, so I decided to give it a try.

Hands-on exercise:

The table structure is as follows:

mysql> desc claims_case_loss_document;
+---------------+-------------+------+-----+---------+----------------+
| Field         | Type        | Null | Key | Default | Extra          |
+---------------+-------------+------+-----+---------+----------------+
| id            | int(11)     | NO   | PRI | NULL    | auto_increment |
| case_id       | varchar(22) | NO   |     | NULL    |                |
| case_times    | varchar(2)  | NO   |     | NULL    |                |
| document_list | text        | NO   |     | NULL    |                |
| create_time   | timestamp   | YES  |     | NULL    |                |
| update_time   | timestamp   | YES  |     | NULL    |                |
+---------------+-------------+------+-----+---------+----------------+
6 rows in set (0.00 sec)

The text structure is as follows:

1147    90100002700021437455    1    100100_payee account information;001003_accident certificate;001001_driver's license;100000_payee ××× Ming;001002_claim form    2017-11-16 12:08:08    2017-11-16 12:08:08

Looking at the structure of the text, we can see that there are several spaces between fields and that the number of spaces between any two fields is not the same, so we have to use Spark core to extract the fields from the text for the subsequent inserts.
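Before the full program, here is a minimal standalone sketch (not part of the original program) of the splitting idea: splitting a line on a single space yields empty strings wherever several spaces sit between two fields, so either the empty strings are filtered out afterwards or the line is split directly on the regex "\s+".

object SplitSketch {
  def main(args: Array[String]): Unit = {
    // A shortened sample line with a varying number of spaces between fields.
    val line = "1147    90100002700021437455  1"
    // Split on a single space: consecutive spaces produce empty strings that must be dropped.
    val fields1 = line.split(" ").filter(_.trim.nonEmpty)
    // Equivalent: split directly on one-or-more whitespace characters.
    val fields2 = line.trim.split("\\s+")
    println(fields1.mkString("|"))  // 1147|90100002700021437455|1
    println(fields2.mkString("|"))  // 1147|90100002700021437455|1
  }
}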

Enough gossip, straight to the program! (The following programs are written in Scala and executed in Eclipse IDE for Scala.)

package cn.spark.study.sql

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import scala.collection.mutable.ArrayBuffer
import java.sql.DriverManager

object insert2Mysql {
  def main(args: Array[String]): Unit = {
    val t1 = System.nanoTime()
    val conf = new SparkConf().setAppName("insert2Mysql").setMaster("local")
    val sc = new SparkContext(conf)
    // The textFile method can only read files in the UTF-8 character set, otherwise Chinese
    // characters are garbled. Under Windows, choose UTF-8 in "Save As", or convert in code (tedious).
    val lines = sc.textFile("D://Users//GAOZHONGZHENG186//Desktop//text001.txt", 1)
    val words = lines.map { line => line.split(" ") }
    // Keep only the non-empty tokens produced by the split.
    // NOTE: the loop body and the insert logic below were truncated in the source; they are
    // reconstructed here from the surrounding description (one connection and one autocommitted
    // insert per record), with placeholder connection details.
    val wordsNotNull = words.map { word =>
      val wordArray_raw = new ArrayBuffer[String]()
      val wordArray = new ArrayBuffer[String]()
      for (i <- 0 until word.length) wordArray_raw += word(i)
      for (w <- wordArray_raw if w.trim.nonEmpty) wordArray += w
      wordArray
    }
    wordsNotNull.foreach { fields =>
      val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password")
      try {
        val sql = "insert into claims_case_loss_document values (" +
          fields.map("'" + _ + "'").mkString(",") + ")"
        println(sql)
        conn.createStatement().executeUpdate(sql)
      } catch {
        case e: Exception => e.printStackTrace()
      } finally {
        conn.close()
      }
    }
    val t2 = System.nanoTime()
    // Print program run time.
    println((t2 - t1) / 1000000000 + "s")
  }
}

During insertion, the first record always reports an error (subsequent statements are inserted normally). When the insert statement printed in Eclipse is manually pasted into mysql and executed, the same error is reported:

Judging from the error message, we have run into a bug: something is wrong with the value 1147. Putting the adjacent statements into Notepad for comparison:

As the comparison shows, the leading 1 of 1147 has indeed been corrupted, while 1148 in the second statement is normal. It is speculated that some unknown bug alters the first record. This conjecture was later confirmed: when the 1147 line was deleted from the text (so that 1148 became the first record), the 1148 line reported the same error, and subsequent statements could still be inserted normally.
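If one did want to dig further, a simple diagnostic (a hypothetical sketch, not part of the original troubleshooting) would be to print the code points of the first field of the first line: any invisible character that has been prepended to "1147" shows up immediately.

object InspectFirstToken {
  def main(args: Array[String]): Unit = {
    // Read the first line of the exported file and print every character of its first field
    // as a Unicode code point, so that invisible characters become visible.
    val source = scala.io.Source.fromFile("D://Users//GAOZHONGZHENG186//Desktop//text001.txt", "UTF-8")
    val firstField = source.getLines().next().trim.split("\\s+")(0)
    source.close()
    firstField.foreach(c => println(f"'$c' -> U+${c.toInt}%04X"))
  }
}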

Since the data is only for analysis, losing one record does no harm, and this bug is really weird, so I won't dig any further here.

Careful readers will ask after looking at the code: how efficient is the insertion? To be honest, very poor: nearly half an hour for 5,000 records. Even in an OLAP-style scenario like this, such efficiency is intolerable!

A careful look at the code shows that when the RDD calls foreach to insert, every record creates its own connection, and every insert triggers a commit in Mysql (the autocommit parameter is on by default). Both are expensive operations, so the insertion efficiency is naturally very poor.
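To make the remedy concrete, here is a minimal JDBC-only sketch of the pattern the revised program relies on (the connection URL, user, password and column list are placeholders for illustration): open one connection, switch autocommit off, accumulate the inserts in a batch, and commit once.

import java.sql.DriverManager

object BatchInsertSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder connection details -- adjust to the real CVM instance.
    val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password")
    conn.setAutoCommit(false)  // one commit for the whole batch instead of one per insert
    val ps = conn.prepareStatement(
      "insert into claims_case_loss_document (case_id, case_times, document_list) values (?, ?, ?)")
    try {
      val rows = Seq(("90100002700021437455", "1", "100100_payee account information"))
      for ((caseId, caseTimes, docList) <- rows) {
        ps.setString(1, caseId)
        ps.setString(2, caseTimes)
        ps.setString(3, docList)
        ps.addBatch()            // accumulate instead of executing immediately
      }
      ps.executeBatch()          // one round trip for all queued inserts
      conn.commit()              // one commit for all rows
    } catch {
      case e: Exception => conn.rollback(); e.printStackTrace()
    } finally {
      ps.close(); conn.close()
    }
  }
}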

After finding these problems, changes were made to the code:

package cn.spark.study.sql

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import java.sql.DriverManager
import scala.collection.mutable.ArrayBuffer

object insert2Mysql {
  def main(args: Array[String]): Unit = {
    val t1 = System.nanoTime()
    val conf = new SparkConf().setAppName("insert2Mysql").setMaster("local")
    val sc = new SparkContext(conf)
    // The textFile method can only read files in the UTF-8 character set, otherwise Chinese
    // characters are garbled. Under Windows, choose UTF-8 in "Save As", or convert in code (tedious).
    val lines = sc.textFile("D://Users//GAOZHONGZHENG186//Desktop//text01.txt", 1)
    val words = lines.map { line => line.split(" ") }
    val wordsNotNull = words.map { word =>
      val wordArray_raw = new ArrayBuffer[String]()
      val wordArray = new ArrayBuffer[String]()
      for (i <- 0 until word.length) wordArray_raw += word(i)
      for (w <- wordArray_raw if w.trim.nonEmpty) wordArray += w
      wordArray
    }
    // NOTE: the insert logic was truncated in the source; the reconstruction below follows the fix
    // described above -- one connection per partition, autocommit off, batched inserts, a single commit.
    wordsNotNull.foreachPartition { part =>
      val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password")
      conn.setAutoCommit(false)
      val stmt = conn.createStatement()
      try {
        part.foreach { fields =>
          val sql = "insert into claims_case_loss_document values (" +
            fields.map("'" + _ + "'").mkString(",") + ")"
          stmt.addBatch(sql)
        }
        stmt.executeBatch()
        conn.commit()
      } catch {
        case e: Exception => e.printStackTrace()
      } finally {
        conn.close()
      }
    }
    val t2 = System.nanoTime()
    // Print program run time.
    println((t2 - t1) / 1000000000 + "s")
  }
}

After the job finishes, verify the result in Mysql:

mysql> select count(*) from claims_case_loss_document;
+----------+
| count(*) |
+----------+
|     4999 |
+----------+
1 row in set (0.00 sec)    -- the first record was skipped during insertion, hence 4999

mysql> select * from claims_case_loss_document limit 1\G
*************************** 1. row ***************************
           id: 1148
      case_id: 90100002700021437450
   case_times: 1
document_list: 100100_payee account information;001003_accident certificate;001001_driver's license;100000_payee ××× Ming;001002_claim form
  create_time: 2017-11-16 12:08:08
  update_time: 2017-11-16 12:08:08
1 row in set (0.00 sec)

At this point, the problem has been satisfactorily solved! The whole process is very similar to ETL in the data warehouse field: extract, transform and load are all involved, only without using a tool like Kettle.

Summary:

In the era of big data, DBAs should actively embrace change and master some development skills in order to better adapt to the changing times.

Finally, let me advertise the database audit platform developed by our group in Shanghai.

The database audit platform is a product our group has built over the past two years. It works with Mysql, Oracle, Postgres and other databases and has the following core capabilities:

1. Audits non-compliant sql; the front end generates a report with one click.

2. Sql serving the same function point is classified automatically, which makes subsequent unified remediation easier.

3. Embeds the Percona toolkit, which can be invoked with one click from the front end.

4. Captures inefficient sql with one click and automatically gives optimization suggestions.

There are many more cool features that I won't introduce one by one; in short, everyone who uses it says it's good! Interested DBAs are welcome to leave a message for a free trial!
