This article introduces "what are the four major Apache open source big data and data lake systems". Many people run into this question in real projects, so the editor will walk you through how these systems handle it. I hope you read it carefully and come away with something!
Managing big data requires capabilities such as transactions, data mutation and correction, streaming support, and schema evolution. Apache offers four open source systems with ACID transaction capabilities that meet these big data management needs.
Apache ShardingSphere
It is a well-known database middleware ecosystem. It consists of three independent products, ShardingSphere-JDBC, ShardingSphere-Proxy, and ShardingSphere-Sidecar (planned), which can be deployed independently or mixed together. Apache ShardingSphere provides standardized data sharding, distributed transactions, and database governance capabilities for a variety of scenarios, such as Java-homogeneous applications, heterogeneous languages, and cloud-native environments.
Today's e-commerce systems depend heavily on relational databases running in distributed environments. Efficient queries over surging data volumes and rapid data access have become primary goals for enterprise relational databases. Apache ShardingSphere is a relational database middleware ecosystem that gives developers sensible compute and storage capabilities on top of existing relational databases.
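To make this concrete, here is a minimal sketch (not from the original article) of using ShardingSphere-JDBC as a drop-in JDBC driver in Java. The `config.yaml` rule file, the `t_order` table, and the column names are assumptions for illustration, and the `jdbc:shardingsphere:` URL scheme assumes a ShardingSphere 5.x driver on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ShardingSphereJdbcDemo {
    public static void main(String[] args) throws Exception {
        // ShardingSphere-JDBC reads its rules (actual data sources,
        // sharding algorithms, etc.) from a YAML file on the classpath.
        String url = "jdbc:shardingsphere:classpath:config.yaml";
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO t_order (order_id, user_id) VALUES (?, ?)")) {
            ps.setLong(1, 1001L);
            ps.setLong(2, 42L);
            // The statement is rewritten and routed to the proper shard
            // transparently; application code stays plain JDBC.
            ps.executeUpdate();
        }
    }
}
```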
Apache Iceberg
Apache Iceberg was originally designed and developed by Netflix and open-sourced in 2018. Rather than tracking a table as a directory tree of files, Iceberg tracks the file list itself: to read data you only need to find and read the relevant files, ignoring the files that do not matter to the current query. The core idea is to track all changes to the table over its timeline.
It is a data lake solution for managing very large tables: a lightweight solution designed to solve the problems of slow listing of large partitions and of time-consuming, inconsistent metadata relative to the data on HDFS. It supports three file formats: Parquet, Avro, and ORC. The Apache Iceberg table format works on top of collections of files and file formats, allowing data to be skipped even within a single file.
It is a new table format for tracking and managing very large tables at scale, designed with object storage (for example, S3) in mind. The most important concept in Iceberg is the snapshot: a snapshot represents the complete set of a table's data files, and each update operation generates a new snapshot.
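As an illustration of the snapshot concept, here is a minimal sketch using Iceberg's Java API; the Hadoop catalog, warehouse path, and `db.events` table name are assumptions for this example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class IcebergSnapshotsDemo {
    public static void main(String[] args) {
        // A HadoopCatalog keeps table metadata under a warehouse path
        // (local FS, HDFS, or object storage such as S3).
        HadoopCatalog catalog =
                new HadoopCatalog(new Configuration(), "file:///tmp/warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("db", "events"));

        // Every committed write produced a new snapshot; listing them
        // exposes the table's full history on a timeline.
        for (Snapshot s : table.snapshots()) {
            System.out.printf("snapshot %d at %d: %s%n",
                    s.snapshotId(), s.timestampMillis(), s.operation());
        }
        // The current snapshot is the complete set of live data files.
        System.out.println("current: " + table.currentSnapshot().snapshotId());
    }
}
```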
Apache Iceberg has the following characteristics:
ACID transaction capability: upstream data can be written as it arrives without affecting currently running data processing tasks, which greatly simplifies ETL. Iceberg provides better merge capability and can greatly reduce the latency before data becomes visible.
Support for multiple analysis engines: an excellent kernel abstraction keeps Iceberg from being bound to any specific computing engine. Currently it supports Spark, Flink, Presto, and Hive.
Unified and flexible data storage: Apache Iceberg supports both stream-based incremental computing models and batch-based full computing models over the same file storage and organization. Batch and streaming tasks can use the same storage model, so data is no longer siloed. Iceberg supports hidden partitioning and partition evolution, which makes it easy to update a table's partitioning strategy as the business changes (see the sketch after this list). Three storage formats are supported: Parquet, Avro, and ORC.
Incremental read processing capability: Iceberg supports reading incremental data in a streaming fashion, including streaming sources and streaming table sources.
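Here is a minimal sketch of hidden partitioning with Iceberg's Java API, deriving a daily partition from a timestamp column; the schema, table name, and warehouse path are assumptions for this example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

public class IcebergHiddenPartitionDemo {
    public static void main(String[] args) {
        Schema schema = new Schema(
                Types.NestedField.required(1, "id", Types.LongType.get()),
                Types.NestedField.required(2, "event_ts",
                        Types.TimestampType.withZone()));

        // Hidden partitioning: queries never reference the derived day
        // value; Iceberg computes it from event_ts automatically, so the
        // partition strategy can later evolve without rewriting queries.
        PartitionSpec spec = PartitionSpec.builderFor(schema)
                .day("event_ts")
                .build();

        HadoopCatalog catalog =
                new HadoopCatalog(new Configuration(), "file:///tmp/warehouse");
        catalog.createTable(TableIdentifier.of("db", "events"), schema, spec);
    }
}
```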
Apache Hudi
Apache Hudi is a big data incremental processing framework that addresses the efficiency of ingestion pipelines and of ETL pipelines that need to insert, update, and incrementally consume raw data. It is a data storage abstraction optimized for analytical scans that can apply changes to datasets on HDFS within minutes, and it supports multiple incremental processing systems. Through the integration of a custom InputFormat with the current Hadoop ecosystem, including Apache Hive, Apache Parquet, Presto, and Apache Spark, the framework is seamless to the end user.
Hudi is designed to update datasets on HDFS quickly and incrementally. There are two table types for updating data: copy-on-write and merge-on-read. In copy-on-write mode, when we update data we first locate the affected files through an index, then read those files and merge in the updated records, rewriting the files. This mode keeps reads simple, but is very inefficient when an update touches many files. In merge-on-read mode, updates are written to separate new files, which can later be merged with the base data synchronously or asynchronously. Because updates only write new files, this mode writes much faster.
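As a sketch of choosing between the two table types, the following uses Spark's Java API to upsert into a Hudi table; the input path, table name, and column names are assumptions, not part of the original article:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HudiWriteDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hudi-write-demo")
                .master("local[*]")
                // Hudi requires a Kryo-based serializer for Spark jobs.
                .config("spark.serializer",
                        "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();

        Dataset<Row> updates = spark.read().json("/tmp/incoming/orders.json");

        updates.write().format("hudi")
                .option("hoodie.table.name", "orders")
                // Record key and precombine field drive upsert semantics.
                .option("hoodie.datasource.write.recordkey.field", "order_id")
                .option("hoodie.datasource.write.precombine.field", "updated_at")
                // COPY_ON_WRITE rewrites files on update; MERGE_ON_READ
                // writes delta files and merges them later.
                .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
                .mode(SaveMode.Append)
                .save("/tmp/hudi/orders");
    }
}
```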
With the help of Hudi, it is easy to collect incremental data from MySQL, HBase, and Cassandra and save it to Hudi. Presto, Spark, and Hive can then quickly read this incrementally updated data.
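And a matching sketch of an incremental read with Spark, which returns only records committed after a given instant; the begin-instant value and table path are again assumptions:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HudiIncrementalReadDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hudi-incremental-read")
                .master("local[*]")
                .getOrCreate();

        // Incremental query mode returns only records written after the
        // given commit instant, instead of scanning the whole table.
        Dataset<Row> changes = spark.read().format("hudi")
                .option("hoodie.datasource.query.type", "incremental")
                .option("hoodie.datasource.read.begin.instanttime",
                        "20240101000000")
                .load("/tmp/hudi/orders");

        changes.show();
    }
}
```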
Apache IoTDB
It is a time series database for the industrial Internet of Things. Apache IoTDB is a software system that collects, stores, manages, and analyzes IoT time series data. It adopts a lightweight architecture with high performance and rich functionality, and it integrates deeply with Apache Hadoop, Spark, and Flink, meeting industrial needs for large-scale data storage, high-speed data reads, and complex data analysis.
The Apache IoTDB suite consists of several components that together cover a pipeline of functions: data collection, data writing, data storage, data query, data visualization, and data analysis. Its workflow is as follows:
Users can import time series data collected from sensors on devices, time series data from message queues (such as server load and CPU/memory metrics), time series data from applications, or time series data from other databases via JDBC into a local or remote IoTDB instance. Users can also write such data directly to local TSFile files, or to TSFiles on HDFS, where they can feed data processing tasks such as anomaly detection and machine learning. For TSFiles written to HDFS or locally, the TsFile-Hadoop and TsFile-Spark connectors allow Hadoop or Spark to process the data, and analysis results can be written back to TSFiles. IoTDB and TSFile also provide client-side tools so that users can view data via SQL, scripts, or graphical tools.
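As a small illustration, here is a sketch of writing and querying time series data through IoTDB's Java session API. Import paths vary across IoTDB versions (this assumes a 0.x-style layout), and the host, credentials, and device paths are made up for the example:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.iotdb.session.Session;
import org.apache.iotdb.session.SessionDataSet;
import org.apache.iotdb.tsfile.file.metadata.enums.TSDataType;

public class IoTDBSessionDemo {
    public static void main(String[] args) throws Exception {
        Session session = new Session("127.0.0.1", 6667, "root", "root");
        session.open();

        // Insert one record for a device: aligned measurement names,
        // data types, and values at a single timestamp.
        List<String> measurements = Arrays.asList("temperature", "status");
        List<TSDataType> types =
                Arrays.asList(TSDataType.FLOAT, TSDataType.BOOLEAN);
        List<Object> values = Arrays.asList(36.5f, true);
        session.insertRecord("root.factory.device1",
                System.currentTimeMillis(), measurements, types, values);

        // Query the data back with IoTDB SQL.
        SessionDataSet result = session.executeQueryStatement(
                "SELECT temperature FROM root.factory.device1");
        while (result.hasNext()) {
            System.out.println(result.next());
        }
        session.close();
    }
}
```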
This is the end of "what are the four major Apache open source big data and data lake systems". Thank you for reading. If you want to learn more about the industry, you can follow the site; the editor will keep producing high-quality, practical articles for you!