
Using Spark to analyze MySQL slow logs


Anyone familiar with Oracle knows that Oracle provides many views recording the various metrics of SQL execution, and we can write scripts against them to obtain the performance cost of SQL as needed. MySQL, as an open source database, is in no way inferior to Oracle, but its SQL can only be analyzed through slow.log. slow.log is not very intuitive, and if the same slow SQL is executed multiple times, it is recorded multiple times in slow.log, which makes it even less readable.

Recently, the database audit platform developed by our department launched a MySQL audit module, which needs to give customers the ability to extract slow SQL from slow.log. Since I had studied Spark before, after analyzing the text structure of the slow log I used Scala and Spark Core to write a small program that deduplicates the SQL in slow.log and writes the top SQL, sorted by execution time, into a Hive table.
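For reference, a single entry in slow.log looks roughly like the following (the values here are purely illustrative). The program below drops the # Time:, # User@Host: and SET timestamp= lines, and pairs each # Query_time: line with the SQL statement on the line that follows it:

# Time: 150118 10:30:15
# User@Host: app_user[app_user] @ localhost []
# Query_time: 0.025280  Lock_time: 0.000000 Rows_sent: 10  Rows_examined: 100000
SET timestamp=1421548215;
select * from some_table where some_column = 1;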

Without further ado, here it is!

Development environment:

1. CentOS 6.5

2. JDK 1.7

3. Hadoop 2.4.1

4. Hive 0.13

5. Spark 1.5.1

6. Scala 2.11.4

I will not go into how to build the Hadoop and Spark cluster environments here; there is plenty of material online, and anyone interested in big data can try building them.

Step 1: Write the application using Scala IDE for Eclipse

analyzeSlowLog.scala:

package cn.spark.study.sql

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import scala.util.matching.Regex
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext

object SlowLogAnalyze {
  def main(args: Array[String]): Unit = {
    // Create SparkConf, SparkContext and HiveContext
    val conf = new SparkConf().setAppName("SlowLogAnalyze")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // Read the slow log from HDFS to get logRDD
    val logRDD = sc.textFile("hdfs://spark1:9000/files/slow.log", 5)

    // Regular expressions matching the header lines in slow.log that carry no SQL text
    val pattern1 = "# Time:".r
    val pattern2 = "# User@Host:".r
    val pattern3 = "SET timestamp=".r

    // Filter logRDD to drop those header lines
    val filteredLogRDD = logRDD.filter { str =>
      // findFirstIn returns an Option: Some(...) on a match, None otherwise
      if (pattern1.findFirstIn(str) != None) {
        false
      } else if (pattern2.findFirstIn(str) != None) {
        false
      } else if (pattern3.findFirstIn(str) != None) {
        false
      } else {
        true
      }
    }

    /**
     * Convert filteredLogRDD into tuples of the form (execute_time, sql_text)
     */
    // Collect filteredLogRDD to an array so each "# Query_time:" line can be
    // paired with the SQL statement on the line that follows it
    val logArray = filteredLogRDD.toArray()

    // Regular expression identifying the Query_time line
    val pattern = "# Query_time:".r

    // Array KV_Array stores (query_time line, sql_text) tuples built while walking the log
    val KV_Array = ArrayBuffer[(String, String)]()
    for (i <- 0 until logArray.length - 1) {
      if (pattern.findFirstIn(logArray(i)) != None) {
        KV_Array += ((logArray(i), logArray(i + 1)))
      }
    }

    // Parse the numeric query time out of lines like
    // "# Query_time: 0.025280  Lock_time: 0.000000 ...", giving (sql_text, execute_time)
    val KV_RDD = sc.parallelize(KV_Array, 5).map { tuple =>
      val queryTime = tuple._1.split(" ")(2).toDouble
      (tuple._2, queryTime)
    }

    // Deduplicate by sql_text, keeping the largest execution time for each statement,
    // then sort descending by execution time
    val sortedRowRDD = KV_RDD
      .reduceByKey((a, b) => if (a > b) a else b)
      .map(tuple => (tuple._2, tuple._1))   // (execute_time, sql_text) so sortByKey sorts on time
      .sortByKey(false)
      .map(tuple => Row(tuple._2, tuple._1))
    val top10Array = sortedRowRDD.take(10)
    val top10RDD = sc.parallelize(top10Array, 1)

    // Convert the top-10 RDD to a DataFrame and save it as a Hive table
    val structType = new StructType(Array(
      StructField("sql_text", StringType, true),
      StructField("executed_time", DoubleType, true)
    ))
    val top10DF = hiveContext.createDataFrame(top10RDD, structType)
    hiveContext.sql("drop table if exists sql_top10")
    top10DF.saveAsTable("sql_top10")
  }
}

Package the code into a jar and upload it to the Linux server.

Step 2: Write the execution script

analyzeSlowLog.sh:

/var/software/spark-1.5.1-bin-hadoop2.4/bin/spark-submit \
--class cn.spark.study.sql.SlowLogAnalyze \
--num-executors 3 \
--driver-memory 100m \
--executor-memory 100m \
--executor-cores 3 \
--files /var/software/hive/conf/hive-site.xml \
--driver-class-path /var/software/hive/lib/mysql-connector-java-5.1.17.jar \
/var/software/spark_study/scala/SlowLogAnalyze.jar

Step 3: Execute analyzeSlowLog.sh, then enter Hive to view the analysis results:

hive> show tables;

OK

daily_top3_keywords_uvs

good_students

sql_top10    -- this is the table name defined in the Scala program; the table is created in Hive when the program runs

student_infos

student_scores

Time taken: 0.042 seconds, Fetched: 5 row(s)

View the contents of sql_top10:

Because of the display width, the SQL text is truncated, so some entries look the same; they are actually different SQL statements (their where conditions differ).

hive> select substr(sql_text,1,50),executed_time from sql_top10;

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

...

Execution completed successfully

MapredLocal task succeeded

OK

select 'true' as QUERYID, ID_GARAG 0.0252804

select count() from pms_garage_vitri_info 0.0048902

select count() from information_schema.PROCESSLIS 3.626E-4

select 'true' as QUERYID, e_survey 2.39E-4

select 'true' as QUERYID, e_survey 2.34E-4

SELECT account_code AS um 2.2360000000000001E-4

select 'true' as QUERYID, e_survey 2.19E-4

select 'true' as QUERYID, e_survey 2.18E-4

select 'true' as QUERYID, e_survey 2.15E-4

SELECT account_code AS um 2.1419999999999998E-4

Time taken: 8.501 seconds, Fetched: 10 row(s)

At this point, the extraction of top SQL from the MySQL slow.log is complete!

Thoughts on creating a related view in MySQL:

Hadoop and Spark are generally used for processing big data, so using them here just to process MySQL slow logs is overkill. However, to provide a v$Topsql view of the database's top SQL inside MySQL, real-time analysis of slow.log is necessary, and that is where Spark Streaming comes in handy.

The idea is as follows:

1. Write a crontab scheduled task that periodically copies slow.log to HDFS.

2. Write a crontab scheduled task that invokes a Spark Streaming program to analyze the latest slow.log on HDFS, then output the top SQL to the v$Topsql view in the corresponding MySQL database through JDBC, overwriting the previous data (a rough sketch follows this list).
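A minimal sketch of step 2 is shown below. It is only an outline under several assumptions: the crontab job copies slow log snippets into hdfs://spark1:9000/files/slow_stream/, a table named v_topsql with columns sql_text and executed_time already exists in the target MySQL database, and the JDBC host and credentials are placeholders. The pairing of each # Query_time: line with its SQL text is done per micro-batch, the same way as in the batch program above.

package cn.spark.study.sql

import java.sql.DriverManager
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SlowLogStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SlowLogStreaming")
    // One micro-batch per minute; textFileStream only picks up files newly
    // copied into the directory by the crontab job
    val ssc = new StreamingContext(conf, Seconds(60))
    val lines = ssc.textFileStream("hdfs://spark1:9000/files/slow_stream/")

    // Same header patterns as in the batch program
    val timePattern = "# Time:".r
    val hostPattern = "# User@Host:".r
    val tsPattern = "SET timestamp=".r
    val queryTimePattern = "# Query_time:".r

    lines.foreachRDD { rdd =>
      // Drop the header lines, then pair each "# Query_time:" line with the
      // SQL statement that follows it, as in the batch program above
      val logArray = rdd.filter { line =>
        timePattern.findFirstIn(line) == None &&
        hostPattern.findFirstIn(line) == None &&
        tsPattern.findFirstIn(line) == None
      }.collect()

      val pairs = for {
        i <- 0 until logArray.length - 1
        if queryTimePattern.findFirstIn(logArray(i)) != None
      } yield (logArray(i + 1), logArray(i).split(" ")(2).toDouble)

      // Deduplicate by SQL text (keeping the largest time), sort and keep the top 10
      val top10 = pairs.groupBy(_._1).mapValues(_.map(_._2).max)
        .toSeq.sortBy(-_._2).take(10)

      // Overwrite the previous snapshot in MySQL through JDBC
      // (the URL, credentials and v_topsql table are assumptions to adapt)
      Class.forName("com.mysql.jdbc.Driver")
      val conn = DriverManager.getConnection(
        "jdbc:mysql://dbhost:3306/audit", "user", "password")
      try {
        conn.prepareStatement("truncate table v_topsql").execute()
        val ps = conn.prepareStatement(
          "insert into v_topsql (sql_text, executed_time) values (?, ?)")
        for ((sql, time) <- top10) {
          ps.setString(1, sql)
          ps.setDouble(2, time)
          ps.executeUpdate()
        }
      } finally {
        conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}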

PS: When parsing slow.log, fields such as the executing user and the timestamp (not extracted in this article) can also be parsed in the program to provide more detailed information.
