

Enumeration of storage systems and computing platforms in the data center

2025-01-15 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

Author: Xiang Shifu. Reproduced from the Alibaba Data Zhongtai official site: https://dp.alibaba.com

Collection & Transport Layer

Sqoop: a tool for transferring data between relational databases and Hadoop. During transfer, multiple MapReduce jobs are started to move data concurrently.

DataX: Alibaba's open-source data synchronization tool, used to synchronize data between heterogeneous data sources, such as RDBMS and Hadoop/MaxCompute, or RDBMS and HBase/FTP. Deployment and operation are very simple: copy the DataX jar package onto a Linux system and it is ready to run.

Flume: a distributed, highly available data collection and aggregation tool. It is usually used to collect data from other systems, such as logs generated by web servers; combined with Kafka's message-queue capability, it supports both real-time log processing and offline log delivery. Typical pipelines:

Offline: application syslog -> Flume -> Kafka -> HDFS -> MR job
Real-time: application syslog -> Flume -> Kafka -> Blink/JStorm/Storm/Spark Streaming

Logstash: a server-side data collection tool that can collect and transform data from multiple sources simultaneously. Its log collection role is similar to Flume's.

Kafka: a distributed messaging system based on the publish/subscribe model, commonly used for log delivery and distribution scenarios.

RocketMQ: Alibaba's open-source message queue, battle-tested by the Double 11 shopping festival.

Storage Layer

HDFS: the Hadoop Distributed File System, designed as a distributed file system that runs on commodity hardware. HDFS is highly fault-tolerant and suitable for deployment on inexpensive machines; it provides high-throughput data access and is well suited to applications with large-scale data sets.
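The collection-layer pattern described above, where Flume pushes events into a Kafka topic that fans out to both an offline sink and a real-time consumer, can be illustrated with a minimal sketch. This is a toy in-memory broker, not the real Kafka API; the class and topic names are invented for illustration.

```python
from collections import defaultdict, deque

class MiniBroker:
    """Toy publish/subscribe broker illustrating the Kafka fan-out pattern:
    one log topic feeds both an offline sink and a real-time consumer."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of consumer queues

    def subscribe(self, topic):
        q = deque()
        self.subscribers[topic].append(q)
        return q

    def publish(self, topic, message):
        # every subscriber of the topic receives its own copy of the message
        for q in self.subscribers[topic]:
            q.append(message)

broker = MiniBroker()
offline_sink = broker.subscribe("weblogs")   # stands in for HDFS + MR job
realtime_sink = broker.subscribe("weblogs")  # stands in for a stream processor

for line in ["GET /index 200", "GET /cart 500"]:
    broker.publish("weblogs", line)          # the "Flume" side pushes events

print(list(offline_sink))   # both consumers see every event
print(list(realtime_sink))
```

The key property mirrored here is that publish/subscribe decouples producers from consumers: the offline and real-time paths consume the same stream independently.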
HDFS relaxes some POSIX constraints to enable streaming reads of file data.

HBase: a distributed open-source database with key-value-style queries (column families, to be exact). HDFS provides reliable underlying storage for HBase, MapReduce provides high-performance computation, ZooKeeper provides coordination and the failover mechanism, and the LSM storage format gives HBase high-performance reads and writes.

Redis: a key-value storage system written in ANSI C and released under the BSD license. It works over the network, is memory-based with optional persistence, and provides APIs in many languages. It supports data structures such as hashes, lists, sets, and sorted sets.

Ceph: an open-source distributed storage system that provides three storage services: block storage (RBD), a distributed file system (CephFS), and distributed object storage (RadosGW). It is currently one of the few open-source storage systems offering all three capabilities.

Storage formats: Apache Parquet, Apache ORC, Huawei CarbonData, Kudu, Avro, and others. In the big data field, different storage formats suit different business scenarios; the differences mainly lie in row versus column layout and pre-computation.

Compute Layer

1. Offline computing

Hive: a data warehouse tool built on Hadoop that maps structured data files to database tables and provides simple SQL query capability; SQL statements are converted into MapReduce jobs. Its advantage is a low learning cost: simple MapReduce-style statistics can be produced quickly through SQL-like statements without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse. It is the de facto standard for offline data warehouses.

Spark: Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing.
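The LSM write path that gives HBase its high-performance writes, mentioned above, can be sketched in a few lines. This is a toy model under simplifying assumptions (a tiny memtable limit, no compaction, no write-ahead log); the class name and limit are invented for illustration.

```python
class MiniLSM:
    """Toy LSM tree: writes go to an in-memory memtable; when it fills,
    it is flushed as a sorted, immutable run, mimicking HBase's write path."""
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []            # flushed sorted runs, newest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # flush: sort the memtable and freeze it as an immutable run
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # freshest data lives in memory
            return self.memtable[key]
        for run in self.runs:             # then newest flushed run wins
            for k, v in run:
                if k == key:
                    return v
        return None

db = MiniLSM()
db.put("row1", "a"); db.put("row2", "b")  # second put triggers a flush
db.put("row1", "a2")                       # newer value stays in the memtable
print(db.get("row1"), db.get("row2"))      # a2 b
```

The point of the design is that all writes are sequential appends (memory, then sorted flushes), which is what makes LSM-based stores fast at ingesting data.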
Spark is a general parallel framework in the style of Hadoop MapReduce, open-sourced by UC Berkeley's AMP Lab. Spark keeps the advantages of Hadoop MapReduce, but unlike MapReduce, the intermediate output of a job can be kept in memory, so there is no need to read and write HDFS between stages. This makes Spark better suited to iterative algorithms such as those used in data mining and machine learning.

MaxCompute: a big data processing platform developed by Alibaba on MapReduce principles and offered through Aliyun. It is a fast, fully managed TB/PB-scale data warehouse solution.

CDH: Cloudera's software distribution, including Apache Hadoop and related projects. All components are 100% open source (Apache license).

2. Real-time computing

Storm/JStorm: distributed, highly fault-tolerant real-time computing systems. Widely used before 2014, they have largely been replaced by newer stream computing products in recent years.

Flink: a low-latency, high-throughput, unified big data computing engine. In Alibaba's production environment, the Flink computing platform processes hundreds of millions of messages or events per second with millisecond latency. Flink also provides exactly-once consistency semantics, guaranteeing data correctness and giving the engine financial-grade data processing capability.

Spark Streaming: similar to Apache Storm, Spark Streaming is a stream processing framework characterized by high throughput and strong fault tolerance. In Spark Streaming the unit of processing is a batch rather than a single record, while data collection happens record by record; the system therefore sets an interval over which data accumulates before being processed together. This interval is the batch interval.
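The batch-interval idea just described, turning a record-by-record stream into small batches processed as units, can be sketched as follows. This is a toy model, not the Spark Streaming API; the timestamps and interval value are invented for illustration.

```python
# Toy micro-batching: records carry arrival timestamps; with a batch
# interval of 2 time units, records are grouped and processed per batch,
# mirroring how Spark Streaming turns a stream into small batches.
BATCH_INTERVAL = 2

events = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (5, "e")]  # (time, record)

batches = {}
for t, record in events:
    batch_id = t // BATCH_INTERVAL       # which interval the record falls into
    batches.setdefault(batch_id, []).append(record)

for batch_id in sorted(batches):
    print(batch_id, batches[batch_id])   # each batch is processed as one unit
```

A smaller interval lowers latency but submits jobs more often; a larger one raises throughput per job at the cost of delay, which is exactly the trade-off the next paragraph describes.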
The batch interval is the core concept and key parameter of Spark Streaming: it determines the frequency of job submission and the latency of data processing, and it also affects throughput and performance.

Data Service Layer

Kylin: an open-source distributed analysis engine that provides a SQL query interface and multidimensional analysis (OLAP) capability over Hadoop/Spark, supporting extremely large data sets. Its core principle is data precomputation: it trades space for time to accelerate OLAP queries with fixed query patterns. The latest version also supports real-time data ingestion.

Druid: another very popular OLAP engine, built on an MPP architecture. It accelerates queries with four techniques: pre-aggregation, columnar storage, dictionary encoding, and bitmap indexes. As of September 22, 2019, Druid does not natively support exact deduplication. Kuaishou has deployed Druid in production.

Presto: an open-source distributed SQL query engine suited to interactive analytical queries, supporting data volumes from gigabytes to petabytes. Presto was designed and written specifically to solve interactive analysis at the speed of commercial data warehouses at Facebook's scale.

Lucene: a Java-based full-text information retrieval toolkit. The mainstream search systems Elasticsearch and Solr both build their indexing and search capabilities on Lucene.

Elasticsearch: a Lucene-based search server that provides a distributed, multi-tenant full-text search engine.

Solr: the open-source enterprise search platform of the Apache Lucene project. Its main features include full-text retrieval, hit highlighting, faceted search, dynamic clustering, database integration, and rich-text handling. Solr is highly scalable and provides distributed search and index replication.
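The inverted index that Lucene builds, and that Elasticsearch and Solr inherit, maps each term to the documents containing it, so a multi-term query becomes a set intersection. A minimal sketch, with invented document text for illustration:

```python
# Toy inverted index, the core data structure behind Lucene-based search:
# each term maps to the set of document ids containing it, so a
# multi-term query is an intersection of posting lists, not a full scan.
docs = {
    0: "distributed search engine",
    1: "enterprise search platform",
    2: "distributed message queue",
}

inverted = {}
for doc_id, text in docs.items():
    for term in text.split():
        inverted.setdefault(term, set()).add(doc_id)

def search(*terms):
    """Return ids of documents containing every query term."""
    postings = [inverted.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search("distributed", "search"))  # [0]
print(search("search"))                 # [0, 1]
```

Real engines add tokenization, relevance scoring, and compressed posting lists on top of this structure, but the lookup-then-intersect core is the same.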
Solr is among the most popular enterprise search engines, and Solr 4 added NoSQL support.

Palo: Baidu's open-source OLAP engine, widely used inside Baidu. It is built on an MPP architecture, drawing on Google Mesa and Cloudera Impala.

The Alibaba Data Zhongtai team is committed to sharing Aliyun's data intelligence best practices, helping each enterprise build its own data platform and achieve intelligent business in the new era. The core products of the Alibaba data platform solution are: Dataphin, driven by Alibaba's OneData big data methodology, which provides one-stop data construction and management; Quick BI, which distills Alibaba's data analysis experience into one-stop data analysis and presentation; and Quick Audience, which integrates Alibaba's consumer insight and marketing experience to provide one-stop audience selection, insight, and marketing, connecting Alibaba's commerce ecosystem to drive user growth.
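The bitmap-index technique that the OLAP engines above (Druid in particular) use to speed up filters can also be sketched briefly. The rows and column values here are invented for illustration; this is a conceptual model, not any engine's actual format.

```python
# Toy bitmap index: each distinct (column, value) pair maps to an integer
# whose bits mark the rows containing that value, so a conjunctive filter
# becomes a fast bitwise AND instead of a row-by-row scan.
rows = [
    {"city": "beijing",  "device": "ios"},
    {"city": "shanghai", "device": "android"},
    {"city": "beijing",  "device": "android"},
]

index = {}  # (column, value) -> bitmap over row ids
for i, row in enumerate(rows):
    for col, val in row.items():
        index[(col, val)] = index.get((col, val), 0) | (1 << i)

# rows where city == beijing AND device == android:
hits = index[("city", "beijing")] & index[("device", "android")]
matching = [i for i in range(len(rows)) if hits & (1 << i)]
print(matching)  # [2]
```

Combined with dictionary encoding (replacing string values with small integers) and columnar layout, this is why such engines answer selective slice-and-dice queries quickly.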

This article is original content from the Yunqi community and may not be reproduced without permission.





© 2024 shulou.com SLNews company. All rights reserved.
