2025-03-04 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/01 Report--
How do you handle the coexistence of old and new storage when separating storage from compute in a Hadoop big data cluster? This article analyzes the problem and presents a simple, workable solution for anyone facing the same issue.
In a traditional Apache Hadoop cluster, computing and storage resources are tightly coupled. While HDFS makes big data storage convenient, it also brings some challenges:
When either storage space or computing resources run short, both can only be expanded together. If a user's demand for storage far exceeds their demand for compute, then after such a joint expansion the newly added computing resources sit idle; in the opposite case, the added storage is wasted.
This makes capacity expansion economically inefficient and adds unnecessary cost. Scaling compute and storage independently is more flexible and significantly cheaper.
As a result, the trend toward storage-compute separation in Hadoop deployments is increasingly clear.
XSKY HDFS Client is a connector tailored for XEOS storage clusters and Hadoop computing clusters. Through the XSKY HDFS Client, Hadoop applications can access all data stored in XEOS.
However, once XEOS is introduced, the original HDFS and XEOS coexist, and how to make use of both storage clusters becomes a problem that must be solved.
01
Copy data across clusters
Generally, if the data a computing application needs to access is stored in a different cluster, the data must be copied from one cluster to the other. The usual tool for this is DistCp, which ships with Hadoop.
Although this approach solves data consolidation to some extent, the copy can take a long time when the data volume is large and inter-datacenter bandwidth is limited. Moreover, if the source data changes while the copy is in progress, incremental synchronization must also be considered.
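A typical DistCp invocation looks like the sketch below. The cluster hostnames and paths are placeholders, not values from the original article; `-update` copies only files that differ, which is the usual starting point for the incremental-sync problem mentioned above.

```shell
# Copy /data from the old HDFS cluster to the new one.
# -update: skip files that already exist and are unchanged (helps with re-runs)
# -p:      preserve file attributes (permissions, replication, etc.)
hadoop distcp -update -p \
  hdfs://nn-old.example.com:8020/data \
  hdfs://nn-new.example.com:8020/data
```

Each DistCp run is itself a MapReduce job, so it consumes compute resources on the cluster that launches it.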
02
HDFS Federation and ViewFS
HDFS Federation was introduced in the Hadoop 2.x releases and is intended to solve the NameNode memory bottleneck. Federation lets the system scale by adding multiple NameNodes, each managing a portion of the file system namespace.
In practice, however, system administrators must maintain multiple NameNodes (each of which needs high availability) plus a load-balancing service, which raises management costs. For this reason, HDFS Federation is rarely adopted in production environments.
Alongside Federation, Hadoop 2.x also provides ViewFS to manage a unified view over multiple namespaces.
Although HDFS Federation has not been applied on a large scale, ViewFS can be used to solve the problem of XEOS and HDFS coexisting.
03
How ViewFS works
ViewFS, short for ViewFileSystem, is not a new file system but a logical view file system that implements the standard Hadoop FileSystem interface; actual request processing still happens on the respective real storage clusters.
ViewFS maintains a mount table, essentially a mapping from logical viewfs directories to the actual underlying storage. When it receives a call from an application, ViewFS parses the request, looks up the corresponding underlying storage directory in the mount table, and forwards the request to that storage.
ViewFS passes all FileSystem calls from the application layer through to the real underlying file systems. Because ViewFS implements the Hadoop FileSystem interface, Hadoop tools run on it transparently: for example, all shell commands work with ViewFS just as they do with HDFS and the local file system.
In the cluster's core-site configuration, fs.defaultFS is set to the root of a ViewFS instance, that is, to a specific mount table.
Add the ViewFS mount-table configuration to the cluster configuration, as shown below:
Hadoop looks up the mount table named "ClusterX" in the Hadoop configuration files, so the "ClusterX" mount-table definition must be included in the configuration of every gateway and server, as in the example above.
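The article's original configuration listing was not preserved. The fragment below is a sketch using the standard Hadoop ViewFS properties (`fs.defaultFS` and `fs.viewfs.mounttable.<name>.link.<path>`); the mount-table name "ClusterX" comes from the article, while the NameNode hostnames and mounted paths are illustrative placeholders.

```xml
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://ClusterX</value>
</property>
<!-- Map logical /user to one underlying HDFS namespace -->
<property>
  <name>fs.viewfs.mounttable.ClusterX.link./user</name>
  <value>hdfs://nn1.example.com:8020/user</value>
</property>
<!-- Map logical /data to another underlying namespace -->
<property>
  <name>fs.viewfs.mounttable.ClusterX.link./data</name>
  <value>hdfs://nn2.example.com:8020/data</value>
</property>
```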
04
Application scenarios of ViewFS
ViewFS can be used in the following scenarios:
Unstructured raw data can be stored directly on XEOS via DistCp and similar tools, while structured data from business databases and application event-tracking data can be loaded into XEOS through ETL in the form of Hive external tables. HBase and Hive continue to run on the original HDFS; that is, HBase table data and Hive internal-table data are still stored in HDFS.
The advantage is that massive unstructured data, and even massive numbers of small files, can be carried by XEOS, reducing the pressure on HDFS. Meanwhile, all new Hive data is stored on XEOS, so subsequent capacity expansion only needs to grow the XEOS storage cluster.
05
Configuring ViewFS for XEOS
The big data platform is based on CDH 6.3.2. Add the following configuration to HDFS core-site.xml:
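The original CDH configuration listing was not preserved. The sketch below shows the general shape such a configuration could take, with one mount point kept on the existing HDFS and another pointing at XEOS. The URI scheme registered by the XSKY HDFS Client is not stated in the article, so `xeos://` here is purely a placeholder, as are the hostnames and paths.

```xml
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://ClusterX</value>
</property>
<!-- HBase data and Hive internal tables stay on the original HDFS -->
<property>
  <name>fs.viewfs.mounttable.ClusterX.link./hbase</name>
  <value>hdfs://namenode.example.com:8020/hbase</value>
</property>
<!-- New Hive external-table data lands on XEOS (scheme is hypothetical) -->
<property>
  <name>fs.viewfs.mounttable.ClusterX.link./warehouse</name>
  <value>xeos://xeos-cluster/warehouse</value>
</property>
```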
Hadoop FS command line:
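The original command-line screenshot was not preserved. Commands of roughly this form would exercise the mounts; the paths follow the placeholder mount table sketched above and are not from the article.

```shell
# List the ViewFS root: each entry is a mount point backed by HDFS or XEOS
hadoop fs -ls viewfs://ClusterX/

# A write under a mount point goes to whichever backing store it maps to
hadoop fs -put localfile.txt viewfs://ClusterX/warehouse/
hadoop fs -ls viewfs://ClusterX/warehouse/
```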
The results of performing the wordcount test are as follows:
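The original wordcount output was also not preserved. A run against ViewFS paths would look roughly like the following; the examples-jar path and the input/output directories are placeholders and vary by distribution.

```shell
# Run the stock MapReduce wordcount example over ViewFS paths
hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples.jar \
  wordcount viewfs://ClusterX/warehouse/input viewfs://ClusterX/warehouse/output
```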
Through ViewFS, XSKY bridges the existing HDFS data and the new XEOS data without changing users' habits, solving the coexistence of the original HDFS cluster and the new XEOS cluster: the original HDFS data remains usable, while XEOS hosts the newly generated data.
This approach makes full use of the old equipment, saving cost, while the horizontal scalability of XEOS allows storage capacity to be expanded independently.
This concludes the discussion of how to handle the coexistence of new and old storage under Hadoop storage-compute separation. Hopefully the above content is of some help.