How to implement event log rolling in Spark 3.0

This article introduces how event log rolling is implemented in Spark 3.0. Many people run into questions in this area in practice, so let's walk through how to handle these situations. I hope you read it carefully and get something out of it!

Event log rolling

You must first use Spark 3.0 and set spark.eventLog.rolling.enabled to true (the default is false). Then, on each writeEvent, Spark checks whether the size of the event log file currently being written plus the size of the new event exceeds the value configured by spark.eventLog.rolling.maxFileSize; if it does, Spark rolls over to a new event log file.
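
As a minimal sketch (assuming a Spark 3.0+ application started through SparkSession; the app name and the 128m size shown here are just illustrative values), enabling rolling event logs could look like this:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: enable rolling event logs on a Spark 3.0+ application.
    val spark = SparkSession.builder()
      .appName("rolling-event-log-demo")                        // hypothetical app name
      .master("local[*]")
      .config("spark.eventLog.enabled", "true")                 // event logging must be on
      .config("spark.eventLog.dir", "/data/iteblog/eventlogs")  // directory used in the test later in this article
      .config("spark.eventLog.rolling.enabled", "true")         // default is false
      .config("spark.eventLog.rolling.maxFileSize", "128m")     // roll when the file would exceed this size
      .getOrCreate()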

Event log compaction

So-called event log compaction merges multiple rolled event log files into a single compact file. The parameters involved are spark.history.fs.eventLog.rolling.maxFilesToRetain and spark.history.fs.eventLog.rolling.compaction.score.threshold. The first parameter controls how many of the most recent event log files are kept in uncompacted form after compaction. Its default value is Int.MaxValue, which means event log compaction is disabled by default and event logs are never compacted into a single file.
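
The following is only a hand-written illustration of what maxFilesToRetain means, not Spark's internal code: of an application's rolled files, the newest maxFilesToRetain are kept as ordinary event logs, and the older ones are the candidates that compaction merges into the single .compact file (subject to the compaction score threshold):

    // Illustrative sketch only: which rolled event log files would be compacted
    // if spark.history.fs.eventLog.rolling.maxFilesToRetain were set to 2.
    val maxFilesToRetain = 2
    val rolledFiles = Seq(
      "events_1_local-1583735259373",
      "events_2_local-1583735259373",
      "events_3_local-1583735259373",
      "events_4_local-1583735259373")
    val (candidatesForCompaction, retained) =
      rolledFiles.splitAt(rolledFiles.length - maxFilesToRetain)
    println(s"compact: $candidatesForCompaction")   // the older files
    println(s"retain : $retained")                  // the newest maxFilesToRetain files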

Points to note:

The compaction of event logs is performed by the Spark history server: when the history server detects that new event log files have been written to the directory configured by spark.eventLog.dir, it may run a compact operation on the event logs of the corresponding Spark job.

Compaction may delete some event logs that are no longer useful (the deletion logic is described in the next section), so the newly generated compact file is smaller.

A Spark job has at most one compact file, with the .compact suffix. The next time compaction runs, the existing compact file is read back, merged with the files selected for compaction, and written out as a new compact file.

Event log files that have been selected for compaction are deleted once the compact operation finishes.

Core idea

The event log rolling work is roughly divided into two phases:

The first phase supports event log rolling and event log compaction; it was tracked in SPARK-28594 and has been merged into the Spark 3.0 code base.

The second phase reuses the approach of AppStatusListener to persist event log state into the underlying KVStore and to restore it from the KVStore; it is tracked in SPARK-28870 and is still under development.

Phase one

The implicit requirement behind supporting event log rolling is supporting the deletion of old event logs; rolling alone only limits the size of a single event log file and does nothing about the total event log size of the whole job.

To ensure that the event logs can still be replayed by the Spark history server after old files are deleted, we need to define which event logs are safe to delete.

Take streaming jobs as an example: each batch runs a different job. If we want to delete some event logs, in most cases we want to keep those of the most recent batches, because they are the ones that help us analyze problems that just occurred. In other words, it is safer and more practical to delete the event logs of older jobs. The same applies to SQL query jobs.

Currently, Spark maintains in-memory information such as liveExecutors, liveRDDs, liveJobs, liveStages, and liveTasks. When the Spark history server triggers a compact operation, it reads the event log files to be compacted and then decides, based on this live-entity information, which events need to be deleted and which need to be retained. Events that satisfy the EventFilter definition are retained; those that do not are deleted. For details, see the applyFilterToFile method of EventFilter.
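
To make the idea concrete, here is a self-contained sketch of the filtering principle; the Event classes below are made up for illustration and are not Spark's real EventFilter API:

    // Self-contained illustration of the compaction filter idea (not Spark's actual classes):
    // events that still refer to a "live" job are kept, the others are dropped.
    trait Event { def jobId: Int }
    case class JobStart(jobId: Int) extends Event
    case class TaskEnd(jobId: Int, stageId: Int) extends Event
    case class JobEnd(jobId: Int) extends Event

    val events   = Seq(JobStart(1), TaskEnd(1, 0), JobEnd(1), JobStart(2), TaskEnd(2, 3))
    val liveJobs = Set(2)                                    // job 1 already finished, job 2 is live
    val retained = events.filter(e => liveJobs.contains(e.jobId))
    println(retained)                                        // only job 2's events survive compaction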

Phase two

AppStatusListener uses an external KVStore to store job state, so the community recommends taking advantage of this existing capability to retain up to a maximum number of Jobs, Stages, and SQL executions in the underlying KVStore. To store objects into the KVStore and recover them from it, the community dumps the objects held in the KVStore into a file, called a snapshot.

From the point of view of space usage, this idea is very effective: in the POC, only about 5MB of memory was needed to dump the KVStore data to a file when replaying an 8.4GB event log. The result may look surprising, but it makes sense, and under this mechanism the amount of memory required to dump the data to a file is unlikely to change significantly in most cases.

It is important to note that the contents of a snapshot differ from those of the current event log files. Because a snapshot is dumped from the KVStore, its objects are not written in the order in which they were created. We can compress these objects to save space and IO. When analyzing a problem, the newly generated event log alone may not be useful, and we may need to read and process earlier event log files. To support this, Spark writes the basic listener events in the original format, rolls the event log files, and then folds the old event logs into a snapshot, so files in both formats coexist. This satisfies the needs described above, and thanks to the snapshots the overall size of the event logs does not grow without bound.
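
As a purely illustrative sketch of the dump/restore idea behind snapshots (the file path, the map contents, and the plain Java serialization used here are assumptions, not Spark's KVStore implementation):

    import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

    // Dump an in-memory "store" to a snapshot file, then read it back.
    val store: Map[String, String] = Map("job_1" -> "SUCCEEDED", "stage_3" -> "COMPLETE")

    val out = new ObjectOutputStream(new FileOutputStream("/tmp/appstatus.snapshot"))
    out.writeObject(store); out.close()

    val in = new ObjectInputStream(new FileInputStream("/tmp/appstatus.snapshot"))
    val restored = in.readObject().asInstanceOf[Map[String, String]]; in.close()
    println(restored)                                        // the state recovered from the snapshot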

How are event logs stored in the new scheme?

Previously, the event log of each Spark job was stored in a single file, and an incomplete event log was marked with the .inprogress suffix. The new scheme creates a directory for each Spark job, because a job may now generate multiple event log files. The event log directory is named in the form eventlog_v2_<appId> (with an optional attempt-id suffix). The event logs of the corresponding job are stored inside this directory, and the log files are named in the form events_<index>_<appId> (with an optional attempt-id suffix and an optional compression codec extension).

To illustrate, I ran a few tests. The /data/iteblog/eventlogs directory is the value set for the spark.eventLog.dir parameter. Here is what this directory contains:

iteblog@www.iteblog.com:/data/iteblog/eventlogs
| ⇒ ll
total 0
drwxrwx---  15 iteblog  wheel   480B  3  9 14:26 eventlog_v2_local-1583735123583
drwxrwx---   7 iteblog  wheel   224B  3  9 14:50 eventlog_v2_local-1583735259373

iteblog@www.iteblog.com:/data/iteblog/eventlogs/eventlog_v2_local-1583735259373
| ⇒ ll
total 416
-rw-r--r--   1 iteblog  wheel     0B  3  9 14:27 appstatus_local-1583735259373.inprogress
-rwxrwx---   1 iteblog  wheel    64K  3  9 14:50 events_2_local-1583735259373.compact
-rwxrwx---   1 iteblog  wheel   102K  3  9 14:50 events_3_local-1583735259373
-rwxrwx---   1 iteblog  wheel   374B  3  9 14:50 events_4_local-1583735259373

As you can see, while the Spark job has not yet finished there is an empty appstatus_local-1583735259373.inprogress file, and the actual events are written to the events_x_local-1583735259373 files.

This is the end of "How to implement event log rolling in Spark 3.0". Thank you for reading. If you want to learn more, you can follow this site for more practical articles.
