
A Comprehensive Analysis of the Big Data Batch Processing Framework Spring Batch


Microservice architecture is a hot topic these days, but enterprise architecture involves not only large volumes of OLTP transactions but also large volumes of batch transactions. In financial institutions such as banks, 30,000 to 40,000 batch jobs have to be processed every day. For OLTP, the industry offers plenty of open source frameworks and excellent architectural designs, but good frameworks in the batch processing field are rare. It is time to learn about the best frameworks and designs in the world of batch processing. Today I will take Spring Batch as an example to explore that world with you.

Typical batch processing business scenarios

Reconciliation is a typical batch business scenario: transactions across financial institutions and host systems all involve reconciliation processes, such as large- and small-value payments, UnionPay transactions, central bank transactions, cash management, POS business, ATM business, securities firms' capital accounts, and reconciliation between securities firms and securities clearing houses.

The following are some end-of-day batch-run requirements from one bank's online banking system.

The requirements involved include:

Each processing unit in the batch requires error handling and rollback

Each unit runs on a different platform

Branch selection is required

Each unit needs monitoring and log collection

A variety of trigger rules must be provided: triggering by date, by calendar, or by cycle

In addition, typical batch processing is suitable for the following business scenarios:

Submit batch tasks on a regular basis (day-end processing)

Parallel batch processing: parallel processing tasks

Enterprise message-driven processing

Large-scale parallel processing

Manual or scheduled restart

Process dependent tasks sequentially (expandable to workflow-driven batches)

Partial processing: ignore records (for example, on rollback)

Complete batch transaction

Unlike OLTP transactions, batch jobs are characterized by batch execution and automatic execution (unattended): the former handles the import, export, and business-logic computation of large volumes of data, while the latter runs batch tasks automatically without human intervention.

In addition to its basic functions, a batch job needs attention on a few other points:

Robustness: no program crashes due to invalid or incorrect data

Reliability: reliable execution of batch jobs through tracking, monitoring, logging and related processing strategies (retry, skip, restart)

Scalability: vertical and horizontal expansion of applications through concurrent or parallel technologies to meet the performance requirements of massive data processing

The industry has long suffered from a lack of good batch frameworks, and Spring Batch is one of the few excellent ones (developed in Java). SpringSource and Accenture contributed their combined expertise to it.

Accenture has extensive industry-level experience in batch architecture and contributed its previously proprietary batch architecture frameworks (developed and used for decades, providing a wealth of reference experience for Spring Batch).

SpringSource brought deep technical understanding and the Spring framework programming model, while also drawing on the language features of JCL (Job Control Language) and COBOL. In 2013, JSR-352 brought batch processing into the specification system as part of Java EE 7. This means every Java EE 7 application server has batch processing capability; the first application server to implement the specification was GlassFish 4. It can also be used in Java SE.

Most crucially, the JSR-352 specification draws on the design ideas of the Spring Batch framework; the core models and concepts of the two are exactly the same. The complete JSR-352 specification can be downloaded from https://jcp.org/aboutJava/communityprocess/final/jsr352/index.html.

Through the Spring Batch framework we can build lightweight, robust parallel-processing applications that support transactions, concurrency, monitoring, and vertical and horizontal scaling, with unified interface management and task management.

The framework provides core capabilities such as the following, so that developers can focus on business processing:

A clear separation between the batch execution environment and the application

Provide common core services in the form of interfaces

Simple, default implementations of the core execution interfaces, "out of the box"

Provide configuration, customization, and extension services in the Spring framework

All core services implemented by default can easily be extended or replaced without affecting the infrastructure layer

A simple deployment model, with batch jobs built using Maven

Key domain model and architecture of batch processing

Let's start with an example of Hello World, a typical batch job.

A typical job is divided into three parts: reading, processing, and writing, the classic three-step architecture. The whole batch framework essentially revolves around Read, Process, and Write. In addition, the framework provides a job launcher and a job repository (which stores the metadata of the Job and supports in-memory and database modes).
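As a minimal illustration, here is a sketch of such a three-step chunk job using Spring Batch's Java configuration (the bean and step names, such as helloJob, are illustrative assumptions, not the original article's code):

```java
import java.util.Arrays;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.support.ListItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class HelloJobConfig {

    // Read -> Process -> Write: the classic three-step chunk architecture.
    @Bean
    public Step helloStep(StepBuilderFactory steps) {
        return steps.get("helloStep")
                .<String, String>chunk(10)  // read/process 10 items, then write and commit
                .reader(new ListItemReader<>(Arrays.asList("Hello", "World")))
                .processor((ItemProcessor<String, String>) String::toUpperCase)
                .writer(items -> items.forEach(System.out::println))
                .build();
    }

    @Bean
    public Job helloJob(JobBuilderFactory jobs, Step helloStep) {
        return jobs.get("helloJob").start(helloStep).build();
    }
}
```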

The complete domain model includes the following concepts:

Job Launcher is the capability, provided by the Spring Batch infrastructure layer, to run a Job: given a Job name and Job Parameters, the Job can be executed through the Job Launcher.

Batch tasks can be invoked from Java programs through the Job Launcher, or from the command line or other frameworks (such as the scheduling framework Quartz).
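A minimal launch sketch, assuming the helloJob bean above and the JobLauncher that @EnableBatchProcessing auto-configures:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class LaunchExample {

    // Identical parameters identify the same Job Instance, so a date parameter
    // is a common way to get one fresh instance per business day.
    public static void launch(JobLauncher jobLauncher, Job helloJob) throws Exception {
        JobParameters params = new JobParametersBuilder()
                .addString("runDate", "2018-06-03")  // illustrative parameter
                .toJobParameters();
        JobExecution execution = jobLauncher.run(helloJob, params);
        System.out.println("Status: " + execution.getStatus());
    }
}
```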

Job Repository stores the metadata of Job execution (Job Instance, Job Execution, Job Parameters, Step Execution, Execution Context, and so on) and provides two default implementations.

One implementation stores the metadata in memory; the other stores it in a database. With the database implementation you can monitor the execution status of batch Jobs at any time, see whether a Job succeeded or failed, and restart a Job after a failure. Step represents a complete step within a job; a Job can consist of one or more Steps.

The run-time model of the batch framework is also very simple:

Job Instance is a run-time concept; each execution of a Job involves a Job Instance.

A Job Instance may come from two sources: either it is fetched from the Job Repository according to the given Job Parameters, or, if no matching Job Instance exists in the Job Repository, a new one is created.

Job Execution represents a handle to one Job execution, which may succeed or fail. Only when a Job execution succeeds is the corresponding Job Instance completed. Therefore, when Job executions fail, one Job Instance may correspond to multiple Job Executions.

The above summarizes the typical conceptual model of batch processing; the design is very concise and fully supports the whole framework.

The core capabilities provided at the Job level include abstraction and inheritance of jobs (similar to object-oriented concepts), and the ability to restart jobs that terminated abnormally.

At the Job level, the framework also provides the concept of job scheduling, including sequence, conditions, and parallel job scheduling.

Multiple Steps can be configured in one Job. Different Steps can be executed sequentially or selectively according to conditions (usually determined by a Step's exit status), with transition rules defined through next elements or decision elements, as in the sketch below.
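A hedged sketch of such conditional flow in the Java DSL (the XML next/decision elements have direct equivalents here; the step names are illustrative assumptions):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.context.annotation.Bean;

public class ConditionalFlowConfig {

    // stepA's exit status selects the branch: FAILED routes to the recovery
    // step, any other status continues to stepB.
    @Bean
    public Job conditionalJob(JobBuilderFactory jobs, Step stepA, Step stepB, Step recoveryStep) {
        return jobs.get("conditionalJob")
                .start(stepA)
                .on("FAILED").to(recoveryStep)
                .from(stepA).on("*").to(stepB)
                .end()
                .build();
    }
}
```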

To improve execution efficiency across multiple Steps, the framework also provides parallel Step execution (declared with split; this usually requires that the Steps have no dependencies on each other, otherwise business errors can easily result). A Step contains all the information needed by an actual batch run; its implementation can be a very simple piece of business logic or a very complex business process, and the complexity of a Step is usually determined by the business.

Each Step consists of an ItemReader, an ItemProcessor, and an ItemWriter; depending on business requirements, the ItemProcessor can be omitted. The framework provides a large number of ItemReader and ItemWriter implementations, supporting data types such as flat files, XML, JSON, databases, and messages.

The framework also gives Step capabilities such as restart, transactions, restart counts, concurrency, commit interval, exception skipping, retry, and completion policies. Flexible Step configuration can satisfy common business requirements. The three steps (Read, Process, Write) are the classic abstraction in batch processing.

As a batch-oriented framework, the Step layer reads, processes, and commits records in batches.

In a Chunk operation, the commit-interval property sets how many records are read and then committed at once. Raising the commit-interval reduces the commit frequency and lowers resource overhead. Each Step commit is a complete transaction; Spring's declarative transaction management is used by default, which makes transaction wiring very convenient. The following is a sketch of declaring a transaction:
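Since the original example is not available, here is a hedged Java DSL equivalent: a chunk step with commit-interval 100 and a fine-grained transaction attribute (the Trade item type and the bean names are illustrative assumptions):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.transaction.annotation.Isolation;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.interceptor.DefaultTransactionAttribute;

public class TransactionalStepConfig {

    // Trade is an illustrative domain type, assumed to exist elsewhere.
    @Bean
    public Step tradeStep(StepBuilderFactory steps,
                          ItemReader<Trade> reader, ItemWriter<Trade> writer) {
        // Fine-grained transaction settings: propagation, isolation, timeout.
        DefaultTransactionAttribute attribute = new DefaultTransactionAttribute();
        attribute.setPropagationBehavior(Propagation.REQUIRED.value());
        attribute.setIsolationLevel(Isolation.READ_COMMITTED.value());
        attribute.setTimeout(30);  // seconds

        return steps.get("tradeStep")
                .<Trade, Trade>chunk(100)  // commit-interval: one transaction per 100 records
                .reader(reader)
                .writer(writer)
                .transactionAttribute(attribute)
                .build();
    }
}
```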

The framework's ability to support transactions includes:

Chunk supports transaction management and sets the number of records per commit through commit-interval

Fine-grained transaction configuration for each Tasklet: isolation level, propagation behavior, timeout

Support for rollback and no-rollback, via skippable-exception-classes and no-rollback-exception-classes

Transaction-level configuration that supports JMS Queue

In addition, Spring Batch's abstraction of the framework's metadata model is also very concise.

Only six tables are needed to store all the metadata (Job and Step instances, contexts, and execution information), which makes monitoring, restart, retry, and state recovery possible:

BATCH_JOB_INSTANCE: job instance table, which is used to store instance information of Job

BATCH_JOB_EXECUTION_PARAMS: the job parameters table, which stores the parameters of each Job execution; together with the Job name, these parameters identify the Job Instance.

BATCH_JOB_EXECUTION: the job execution table, which stores execution information for each Job run, such as creation time, start time, end time, the Job instance executed, and execution status.

BATCH_JOB_EXECUTION_CONTEXT: the job execution context table, which stores the context of each job execution.

BATCH_STEP_EXECUTION: the step execution table, which stores information about each Step execution, such as start time, completion time, execution status, read/write counts, and skip counts.

BATCH_STEP_EXECUTION_CONTEXT: the step execution context table, which stores the context of each Step execution.

Achieving job robustness and scalability

Batch processing requires that Jobs be robust. Jobs usually process data in bulk, unattended, so they must be able to handle various exceptions and errors during execution and track execution effectively.

A robust Job usually requires the following features:

Fault tolerance

For non-fatal exceptions during Job execution, the framework should perform effective fault-tolerant handling rather than letting the entire Job fail; usually only fatal exceptions that would corrupt the business should terminate the Job.

Traceability

Any error during Job execution needs to be recorded effectively, so that the error point can be dealt with later. For example, any record rows skipped during execution need to be logged so that maintainers can follow up on the skipped records.

Restartability

If a Job fails due to an exception during execution, it should be possible to restart it at the point of failure instead of re-executing it from scratch.

The framework provides features supporting all of the above: Skip (skip a record), Retry (retry a given operation), and Restart (restart a failed Job from the error point):

Skip: during data processing, if some records do not meet format requirements, Skip lets you skip those rows so that the Processor can handle the remaining rows smoothly.

Retry: retries a given operation several times. In some cases an operation fails because of a temporary exception, such as a network glitch or a concurrency conflict; on the next attempt the network is back to normal and the conflict is gone. Retrying effectively absorbs such transient failures.

Restart: after a Job fails, it can be restarted to finish its work. On restart, the batch framework lets the Job resume from the point where the last execution failed rather than starting over, which greatly improves execution efficiency. A combined configuration sketch follows.
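A sketch of these three capabilities configured together in the Java DSL (the exception classes are just plausible examples for a file-reading, database-writing step; Trade is the same illustrative type as above; startLimit caps restart attempts):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileParseException;
import org.springframework.context.annotation.Bean;
import org.springframework.dao.DeadlockLoserDataAccessException;

public class RobustStepConfig {

    @Bean
    public Step robustStep(StepBuilderFactory steps,
                           ItemReader<Trade> reader, ItemWriter<Trade> writer) {
        return steps.get("robustStep")
                .<Trade, Trade>chunk(50)
                .reader(reader)
                .writer(writer)
                .faultTolerant()
                .skip(FlatFileParseException.class).skipLimit(10)             // Skip bad rows
                .retry(DeadlockLoserDataAccessException.class).retryLimit(3)  // Retry transient failures
                .startLimit(3)                                                // Restart: at most 3 attempts
                .build();
    }
}
```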

For extensibility, the extensibility provided by the framework includes the following four modes:

Multithreaded Step: execute a single Step with multiple threads

Parallel Step: execute multiple Steps in parallel on multiple threads

Remote Chunking: perform Chunk operations distributed across remote nodes

Partitioning Step: partition the data and process each partition separately

Let's first look at the first implementation, Multithreaded Step:

By default the batch framework uses a single thread to execute a Job. The framework also provides thread-pool support (the Multithreaded Step mode), in which chunks of the same Step are executed in parallel by a thread pool. You can turn an ordinary Step into a multithreaded Step simply by setting the tasklet's task-executor property.

Example of an implementation of Multithreaded Step:
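A hedged Java DSL equivalent of setting task-executor on a step (pool sizes and names are illustrative assumptions):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

public class MultithreadedStepConfig {

    @Bean
    public ThreadPoolTaskExecutor stepExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(4);
        executor.setMaxPoolSize(4);
        return executor;
    }

    @Bean
    public Step multithreadedStep(StepBuilderFactory steps, ThreadPoolTaskExecutor stepExecutor,
                                  ItemReader<Trade> reader, ItemWriter<Trade> writer) {
        return steps.get("multithreadedStep")
                .<Trade, Trade>chunk(100)
                .reader(reader)             // the reader must be thread-safe (see below)
                .writer(writer)
                .taskExecutor(stepExecutor) // chunks of this one Step run on pool threads
                .throttleLimit(4)           // cap on concurrent chunk executions
                .build();
    }
}
```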

Note that most of the ItemReader and ItemWriter implementations provided by the Spring Batch framework are not thread-safe.

Thread-safe Steps can be obtained by extension.

Here is an implementation of the extension:

Requirement: a thread-safe Step for batch-processing a database table, with restart capability, i.e., the batch state is recorded so that processing can resume from the point of failure.

For the database reading component JdbcCursorItemReader in the example, a Flag field is added to the table to mark whether the current record has been read and processed successfully (Flag=true on success). On a re-run, rows that were already processed successfully are skipped.
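A minimal sketch of the extension under the assumptions above (the TRADE table, FLAG column, and ID key are illustrative): a synchronized wrapper serializes access to the thread-unsafe cursor reader, and the writer sets the process-indicator flag. Spring Batch also ships a SynchronizedItemStreamReader that can serve the same wrapping purpose.

```java
import java.util.List;

import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.JdbcTemplate;

// Serializes read() so one JdbcCursorItemReader can be shared by many threads.
public class SynchronizedReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;

    public SynchronizedReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        return delegate.read();  // only one thread advances the cursor at a time
    }
}

// Writer side of the process indicator: mark each row once it is processed,
// so a restarted run (reading WHERE FLAG = 'N') skips completed rows.
class FlagWriter implements ItemWriter<Trade> {

    private final JdbcTemplate jdbcTemplate;

    FlagWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void write(List<? extends Trade> items) {
        for (Trade t : items) {
            jdbcTemplate.update("UPDATE TRADE SET FLAG = 'Y' WHERE ID = ?", t.getId());
        }
    }
}
```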

Multithreaded Step (multithreaded step) provides the ability of multiple threads to execute a Step, but this scenario is not used very much in real business.

More business scenarios are that different Step in Job do not have a clear sequence and can be executed in parallel during execution.

Parallel Step: provides the ability to scale out on a single node

Usage scenario: job steps A and B are executed by different threads, and Step C runs only after both of them have finished.

The framework provides parallel Step capability: you define parallel job flows through the split element and specify the thread pool to use, as in the sketch below.
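A hedged Java DSL sketch of the split (the flow and step names are illustrative assumptions):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.job.builder.FlowBuilder;
import org.springframework.batch.core.job.flow.Flow;
import org.springframework.batch.core.job.flow.support.SimpleFlow;
import org.springframework.context.annotation.Bean;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

public class ParallelStepConfig {

    @Bean
    public Job parallelJob(JobBuilderFactory jobs, Step stepA, Step stepB, Step stepC) {
        Flow flowA = new FlowBuilder<SimpleFlow>("flowA").start(stepA).build();
        Flow flowB = new FlowBuilder<SimpleFlow>("flowB").start(stepB).build();

        // stepA and stepB run on separate threads; stepC starts after both finish.
        Flow splitFlow = new FlowBuilder<SimpleFlow>("splitFlow")
                .split(new SimpleAsyncTaskExecutor())
                .add(flowA, flowB)
                .build();

        return jobs.get("parallelJob")
                .start(splitFlow)
                .next(stepC)
                .build()   // builds the flow
                .build();  // builds the Job
    }
}
```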

In Parallel Step mode, each job step processes different records in parallel; in the original example, three job steps process different data in the same table.

Parallel Step provides scaling on a single node, but as the workload grows, one node may no longer be enough for a Job. In that case we can use remote Steps, combining multiple machine nodes to complete one Job.

Remote Chunking: the remote Step technique essentially splits the read and write processing of Items; typically the reading logic runs on one node while the write operations are distributed to other nodes for execution.

Remote chunking is a technical partitioning of a Step that requires no particular knowledge of the structure of the data being processed.

A single process reads from any input source and, after dynamic segmentation, sends the data as "chunks" to remote worker processes.

The remote worker processes implement the listener pattern: they receive requests, process the data, and asynchronously return the results. Reliable delivery of requests and replies must be guaranteed between the sender and each individual consumer.

On the Master node, the job step is responsible for reading data and sending it via remoting to designated remote nodes for processing; after processing, the Master collects the execution results from the Remote side.

In the Spring Batch framework, the task of remote Step is accomplished through two core interfaces, ChunkProvider and ChunkProcessor.

ChunkProvider: produces Chunks of items from a given ItemReader operation

ChunkProcessor: takes the Chunks produced by the ChunkProvider and performs the actual processing and write logic

Spring Batch has no default implementation of remote Step, but the remote communication can be implemented with Spring Integration (SI) or AMQP.

An example of implementing the Remote Chunking pattern based on SI:
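The following hedged sketch uses the spring-batch-integration classes; the channel wiring to the actual middleware is omitted, and the Trade type is the same illustrative assumption as above:

```java
import org.springframework.batch.core.step.item.SimpleChunkProcessor;
import org.springframework.batch.integration.chunk.ChunkMessageChannelItemWriter;
import org.springframework.batch.integration.chunk.ChunkProcessorChunkHandler;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.integration.core.MessagingTemplate;
import org.springframework.messaging.PollableChannel;

public class RemoteChunkingConfig {

    // Master side: the step's "writer" does no writing itself; it ships each
    // chunk to the request channel and collects worker replies.
    @Bean
    public ChunkMessageChannelItemWriter<Trade> chunkWriter(MessagingTemplate requestTemplate,
                                                            PollableChannel replies) {
        ChunkMessageChannelItemWriter<Trade> writer = new ChunkMessageChannelItemWriter<>();
        writer.setMessagingOperations(requestTemplate);
        writer.setReplyChannel(replies);
        return writer;
    }

    // Worker side: listens on the request queue, runs the real processor and
    // writer, and returns the result.
    @Bean
    public ChunkProcessorChunkHandler<Trade> chunkHandler(ItemProcessor<Trade, Trade> processor,
                                                          ItemWriter<Trade> writer) {
        ChunkProcessorChunkHandler<Trade> handler = new ChunkProcessorChunkHandler<>();
        handler.setChunkProcessor(new SimpleChunkProcessor<>(processor, writer));
        return handler;
    }
}
```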

The local Step reads the data and sends requests to the remote Step through a MessagingGateway; the remote Step listens on the request queue, picks up messages as they arrive, and hands them to the ChunkHandler for processing.

Finally, let's look at the last mode. Partitioning Step: the partitioning mode requires some knowledge of the structure of the data, such as primary key ranges or the names of files to process.

The advantage of this mode is that each partition's processor runs like a single Step of an ordinary Spring Batch job, without needing any special or new patterns, which makes it easy to configure and test.

The following benefits can be achieved through partitioning:

Partitions enable finer-grained extensions

High-performance data segmentation can be achieved on the basis of partitioning

Partitioning usually scales better than remote chunking

The processing logic after partition supports both local and remote modes

Typically, a partition job can be divided into two processing phases, data partition and partition processing.

Data partitioning: the data is sliced according to specific rules (such as file name, a unique data identifier, or a hashing algorithm), and an Execution Context and a Step Execution are generated for each slice. Custom partitioning logic can be implemented through the Partitioner interface. Spring Batch provides a default multi-file implementation, org.springframework.batch.core.partition.support.MultiResourcePartitioner; you can also extend the Partitioner interface to implement custom partitioning logic.

Partition processing: after partitioning, each slice of data has been assigned to its own step execution, and a partition handler carries out the work, either locally or remotely. The PartitionHandler interface defines the partition-processing logic; Spring Batch provides a default local multi-threaded implementation, org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler, and you can also extend the PartitionHandler interface to implement custom partition handling.

The Spring Batch framework provides support for file partitioning: the implementation class org.springframework.batch.core.partition.support.MultiResourcePartitioner assigns files to partitions by file name to improve processing speed and efficiency. It suits scenarios with a large number of small files to process.

The sketch below assigns different files to different job steps, using MultiResourcePartitioner so that each file goes to its own partition. If you have other partitioning rules, you can customize them by implementing the Partitioner interface; interested readers can implement their own database-based partitioning.
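A hedged sketch of such a partitioned step (the file pattern, grid size, and step names are illustrative assumptions):

```java
import java.io.IOException;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.support.ResourcePatternResolver;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

public class PartitionedStepConfig {

    @Bean
    public Step partitionedStep(StepBuilderFactory steps, Step workerStep,
                                ResourcePatternResolver resolver) throws IOException {
        // One partition per matching file; each partition's context carries its file.
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(resolver.getResources("file:/data/input/trade-*.csv"));

        // Default local handler: runs the worker step for each partition on a thread.
        TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
        handler.setStep(workerStep);
        handler.setTaskExecutor(new SimpleAsyncTaskExecutor());
        handler.setGridSize(4);  // hint for the number of partitions to run concurrently

        return steps.get("partitionedStep")
                .partitioner("workerStep", partitioner)
                .partitionHandler(handler)
                .build();
    }
}
```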

To sum up, the batch framework provides four different scalability capabilities, each with its own usage scenarios; we can choose among them according to actual business needs.

Deficiencies and enhancement of the batch framework

Although the Spring Batch framework provides four different monitoring methods, none of them is very friendly in practice:

Querying the metadata tables directly in the database is something administrators can hardly bear to look at

Implementing custom queries through the API is paradise for programmers and hell for operators

A web console is provided to monitor and operate Jobs, but its current functionality is too bare to be used directly in production

A JMX query interface is provided, but it is too unfriendly for non-developers

In enterprise applications, however, a batch framework by itself only addresses the rapid development and execution of batch jobs.

Enterprises need a unified batch processing platform to handle complex batch applications; such a platform must provide unified job scheduling, centralized management and control of batch jobs, and unified monitoring of batch jobs.

So what's the perfect solution?

An enterprise batch platform needs to integrate a scheduling framework on top of Spring Batch, through which tasks can be executed periodically according to enterprise needs.

It should enrich the current Spring Batch Admin framework (Spring Batch's management and monitoring platform, whose capabilities are still relatively weak), provide unified Job management, and strengthen Job monitoring and alerting.

It should also integrate sensibly with the enterprise's organization, permission management, and authentication systems, strengthening the platform's access control and security management of Job operations.

Conclusion

Thank you for reading. If there are any deficiencies, criticism and corrections are welcome.

