A Job in a Spark program is triggered by an action operator, such as count() or saveAsTextFile().
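As a minimal illustration (the input and output paths here are hypothetical), transformations such as filter() are lazy and only build up the lineage; each action then submits one Job to the scheduler:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class ActionTriggersJob {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("ActionTriggersJob"));

        // Transformations are lazy: no cluster work happens here.
        JavaRDD<String> nonEmpty = sc.textFile("hdfs:///tmp/input")   // hypothetical path
                .filter(new Function<String, Boolean>() {
                    @Override
                    public Boolean call(String line) {
                        return !line.isEmpty();
                    }
                });

        // Each action below triggers one Job.
        long n = nonEmpty.count();                      // Job 0
        nonEmpty.saveAsTextFile("hdfs:///tmp/out");     // Job 1 (hypothetical path)
        System.out.println("non-empty lines: " + n);
        sc.stop();
    }
}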
In this Spark optimization test, we read a table from Hive and save it out as four copies: two of the save Jobs run serially and the other two run in parallel. The task is submitted to Yarn for execution, and the performance difference between serial and parallel processing is clearly visible.
Execution time of each Job:
JobID    Start time    End time    Elapsed
Job 0    16:59:45      17:00:34    49s
Job 1    17:00:34      17:01:13    39s
Job 2    17:01:15      17:01:55    40s
Job 3    17:01:16      17:02:12    56s
All four Jobs perform the same operation; Job 0 and Job 1 run in serial mode, while Job 2 and Job 3 run in parallel mode.
The elapsed time of the serial group (Job 0, Job 1) is the sum of the two Jobs' times: 49s + 39s = 88s.
The elapsed time of the parallel group (Job 2, Job 3) is the difference between the earliest start time and the latest end time: 17:02:12 - 17:01:15 = 57s, roughly the duration of the longest single Job rather than the sum of both.
Code:
package com.cn.ctripotb;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

import java.util.ResourceBundle;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Created by Administrator on 2016-9-12.
 */
public class HotelTest {
    static ResourceBundle rb = ResourceBundle.getBundle("filepath");

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("MultiJobWithThread")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(sc.sc());

        // Read the real data when testing
        final DataFrame df = getHotelInfo(hiveContext);

        // Without multithreading: two actions executed one after another
        // produce two serial Jobs (Job 0 and Job 1)
        df.rdd().saveAsTextFile(rb.getString("hdfspath") + "/file1",
                com.hadoop.compression.lzo.LzopCodec.class);
        df.rdd().saveAsTextFile(rb.getString("hdfspath") + "/file2",
                com.hadoop.compression.lzo.LzopCodec.class);

        // Use an ExecutorService so the next two actions are submitted from
        // separate threads and their Jobs (Job 2 and Job 3) run in parallel
        ExecutorService executorService = Executors.newFixedThreadPool(2);
        executorService.submit(new Callable<Void>() {
            @Override
            public Void call() {
                df.rdd().saveAsTextFile(rb.getString("hdfspath") + "/file3",
                        com.hadoop.compression.lzo.LzopCodec.class);
                return null;
            }
        });
        executorService.submit(new Callable<Void>() {
            @Override
            public Void call() {
                df.rdd().saveAsTextFile(rb.getString("hdfspath") + "/file4",
                        com.hadoop.compression.lzo.LzopCodec.class);
                return null;
            }
        });
        executorService.shutdown();
    }

    public static DataFrame getHotelInfo(HiveContext hiveContext) {
        String sql = "select * from common.dict_hotel_ol";
        return hiveContext.sql(sql);
    }
}
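One caveat with the listing above: executorService.shutdown() only stops the pool from accepting new tasks; it does not wait for the submitted saves to finish. A minimal sketch of a safer ending (the timeout value is illustrative) that blocks the driver until both parallel Jobs complete before stopping the context:

import java.util.concurrent.TimeUnit;

// ...after submitting the two Callables as above:
executorService.shutdown();                    // no new tasks; does NOT block
try {
    // Wait for both parallel save Jobs to finish (timeout chosen arbitrarily).
    if (!executorService.awaitTermination(30, TimeUnit.MINUTES)) {
        executorService.shutdownNow();         // give up if they overrun
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
sc.stop();                                     // safe to stop the SparkContext now

Note also that within one application Spark schedules concurrent Jobs FIFO by default; if the first Job can occupy the whole cluster, setting spark.scheduler.mode to FAIR lets simultaneously submitted Jobs share executors more evenly.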