2025-01-18 Update From: SLTechnology News&Howtos
With the advent of Web 2.0, Internet data is growing rapidly. How large-scale data is collected, processed, and applied directly affects the user experience and determines the development of an enterprise. Comparing how traditional relational databases and distributed non-relational databases handle big data makes clear that migrating local data to a distributed database is imperative. This paper analyzes the strengths and weaknesses of existing migration tools, proposes an effective data migration strategy targeting the HBase database, and implements a semi-automatic migration tool based on that strategy. Taking the data of the CityDetail city-and-language database system as an example, it explains the working principle of the migration tool and runs repeated queries against the migrated data, demonstrating the efficiency of migrating data with this tool.
Keywords: relational database; HBase; migration tool
In the Web 2.0 era, with the rapid development of network technology, individuals and enterprises continually create massive amounts of data. In this new gold rush, how data is used, and how quickly it can be turned into valuable information, has increasingly become a decisive factor in an enterprise's success. Achieving rapid analysis of data that can guide decisions poses an even greater challenge for data storage. Although traditional databases have matured over many years and are deployed at scale in enterprise applications, they still fall short in big-data applications. First, traditional relational databases cannot scale out horizontally at large scale; although networked solutions address this to some extent, clusters cannot be created dynamically on the network. Second, relational databases cannot effectively store the semi-structured and unstructured data of the Web 2.0 era. In addition, traditional relational databases cannot meet the demand for efficient queries over massive data in the big-data era.
The emergence of non-relational databases makes up for these deficiencies of traditional relational databases in handling large-scale data. "Non-relational database" is a general term for the many databases that support non-relational or weakly relational data storage, such as Cassandra, MongoDB, and HBase. Tables in a non-relational database mainly adopt an aggregated storage structure, which makes data management more convenient [2]. Massive data storage is easily realized through a pre-allocation mechanism, and capacity can be expanded by continuously adding service nodes, with no need for downtime maintenance or data migration. In addition, many non-relational databases are strongly tailored to particular workloads and deliver dramatic improvements in application performance over traditional relational databases. Among them, HBase has been adopted by major Internet enterprises thanks to its seamless integration with Hadoop, its strong scalability, and its large and diverse community.
With the wide application of HBase, the migration of historical data stored in traditional relational databases to HBase has become a research hotspot.
1. Research status at home and abroad.
Regarding the migration of traditional relational data to HBase, the industry has so far proposed only assorted migration methods; there are few authoritative data migration tools, let alone tools that migrate the original table schema or complete the migration automatically.
Existing migration tools have clear limits: Sqoop, Hadoop's official tool, supports only incremental loading of a single table and cannot migrate the many table schemas of a database system; HBase's ImportTsv tool supports only the migration of specific file formats such as TSV; and although the Put method is simple and direct, it only moves the data, and its migration efficiency is poor. In addition, large Internet companies at home and abroad, such as Microsoft and Huawei, are scrambling to develop their own migration tools, but most are tailored to their own business applications [4].
In summary, an automated or semi-automated data migration tool is needed. With one, the historical data stored in the original relational database can be put to greater use, the waste of precious resources such as the relationships between data can be reduced, and manual re-entry can be avoided. This paper studies in depth the storage principles and table structures of the relational database MySQL and the non-relational database HBase, uses the CityDetail system as an example to explain the approach to migrating from a traditional relational database to HBase, and designs and implements a migration tool. Finally, the query efficiency of the database after migration with this method is verified.
2. Analysis of database storage principles
2.1. Storage principle of relational database
Relational database [5] is a kind of database based on relational model. In the relational database, a two-dimensional table is used to represent the entities in the real world, the fields in the table are used to represent the attributes of the entities, and foreign keys and other joint operations are used to represent the relationship between the entities. A row in a table, that is, a record represents an entity, and one or more such tables and their relationships make up a relational database.
The relational database MySQL ships with the INFORMATION_SCHEMA database by default. INFORMATION_SCHEMA stores information such as the table names, column names, record counts, primary keys, foreign keys, and stored procedures and functions of every database in MySQL. The data stored in INFORMATION_SCHEMA is called the metadata of the database system, as shown in figure 1.
Metadata is data that describes data [6]; it supports functions such as locating stored data, tracking historical data, and resource search. Metadata can be regarded as an electronic catalog that assists data retrieval. When SQL statements such as DESCRIBE are used to inspect a relational database, it is the metadata of the database that is queried. Therefore, during data migration, querying the metadata tables of the relational database makes it possible to quickly obtain each table's schema and the relationships between tables, and then migrate them.
2.2. The storage principle of HBase
The non-relational database HBase is an open-source implementation of Google's BigTable. It is often described as a sparse, distributed, persistent multidimensional sorted map [7]. The logical view of HBase is shown in figure 2.
As figure 2 shows, an HBase table is a sparse matrix. HBase differs from traditional relational database tables in that it can store semi-structured data: tables in HBase carry no strict design-time restrictions [8], and data records may contain inconsistent columns, uncertain sizes, and so on. In addition, unlike relational databases, HBase storage is organized by column rather than by row, so it offers better query performance for data in the same column. HBase tables can have millions of columns and billions of rows, so they can store large-scale data. The following data models are defined in HBase:
(1) Table. HBase uses tables to organize data, and the table name is a string.
(2) Row keys. In the HBase table, the data is stored by row. Rows are marked by row keys, which have no data type and are always treated as byte arrays.
(3) Column families. The data in a table is organized within each row into column families, and column families also determine the physical storage of HBase data. Because column families are reflected in HBase's on-disk layout, they are defined when the table is created and cannot easily be modified afterwards. Every row in an HBase table has the same set of column families, and different column qualifiers may appear under the same column family.
(4) Column qualifiers. The data within a column family is addressed by column qualifier (or simply column). Unlike column families, column qualifiers do not have to be defined beforehand and do not have to be consistent between rows. Column qualifiers have no data type and are always treated as byte arrays.
(5) Cells. A row key, column family, and column qualifier together identify a cell. The data stored in a cell is called the cell value. Values have no data type and are always treated as byte arrays (byte[]).
(6) Versions. HBase uses versions to store a cell's values at different times; by default the 3 most recent versions are kept. Each version is identified by a timestamp.
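The six concepts above can be made concrete with a small model. This is an illustrative sketch only, not HBase code: a cell is addressed by (row key, column family, column qualifier, timestamp), and only the newest versions are retained. Plain Strings and longs stand in for HBase's byte arrays, and the 3-version cap mirrors HBase's historical default.

```java
import java.util.*;

// Toy model of HBase's cell coordinates and version retention.
public class CellModel {
    static final int MAX_VERSIONS = 3;   // HBase's historical default

    // full coordinate "row/family:qualifier" -> versions keyed by timestamp, newest first
    static Map<String, TreeMap<Long, String>> store = new HashMap<>();

    static void put(String row, String family, String qualifier, long ts, String value) {
        String key = row + "/" + family + ":" + qualifier;
        TreeMap<Long, String> versions =
            store.computeIfAbsent(key, k -> new TreeMap<>(Comparator.reverseOrder()));
        versions.put(ts, value);
        while (versions.size() > MAX_VERSIONS) {
            versions.remove(versions.lastKey());   // evict the oldest version
        }
    }

    // A plain read returns the newest version of the cell.
    static String get(String row, String family, String qualifier) {
        TreeMap<Long, String> versions = store.get(row + "/" + family + ":" + qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        for (long ts = 1; ts <= 4; ts++) {
            put("USA", "info", "Population", ts, "value@" + ts);
        }
        // Four writes, but only the 3 newest timestamps survive; a read sees the latest.
        System.out.println(get("USA", "info", "Population"));
    }
}
```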
Physically, HBase's data is stored in HDFS, making good use of HDFS's distributed processing model and benefiting from Hadoop's MapReduce programming model. An HBase logical table is divided in the row direction into multiple HRegions; each table starts out with a single region. As records accumulate a region grows, and once it reaches a certain size the HRegion is split evenly into two new HRegions. The HRegion is the smallest unit of distributed storage and load balancing in HBase, but it is not the smallest unit of storage: each HRegion consists of one or more Stores, and each Store holds one column family of the table. Each Store in turn consists of one MemStore and zero or more StoreFiles (HFiles). StoreFiles hold the data and are persisted on HDFS in HFile format [9].
3. The main modules of the migration tool
The main modules of this migration system are as follows.
3.1. Extract the table schema from the source database
From the analysis of the storage structure of traditional relational databases, we know that the INFORMATION_SCHEMA database stores the metadata of every data table in MySQL; by accessing this metadata we can quickly extract the table schemas of all the source data in the MySQL database to be migrated.
The SCHEMATA table in the INFORMATION_SCHEMA database describes all the databases in the current MySQL instance; the result of the SQL statement SHOW DATABASES comes from this table. The TABLES table describes the tables in each database, detailing which schema a table belongs to, as well as its type, name, record count, and creation time. The COLUMNS table describes the columns of each table, detailing every column and its properties. The STATISTICS table provides all the index information for a table, along with metadata describing table-level user privileges, and so on. Joint access to these tables makes it possible to extract the source database schema quickly while avoiding the longer response times that direct access to the data tables would incur.
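As a sketch of this metadata access, the helper below builds the two INFORMATION_SCHEMA queries the tool would issue. The metadata table and column names (TABLES, COLUMNS, TABLE_NAME, DATA_TYPE, COLUMN_KEY, and so on) are standard MySQL; the schema name "CityDetail" is only an illustrative placeholder, and executing the queries would of course require a live MySQL connection, which is omitted here.

```java
// Builds the INFORMATION_SCHEMA queries used to extract table schemas
// from the source MySQL database before migration.
public class SchemaExtractor {

    // Lists every table in the given schema, with its row count and creation time.
    public static String tablesQuery(String schema) {
        return "SELECT TABLE_NAME, TABLE_ROWS, CREATE_TIME "
             + "FROM INFORMATION_SCHEMA.TABLES "
             + "WHERE TABLE_SCHEMA = '" + schema + "'";
    }

    // Lists the columns of one table; COLUMN_KEY = 'PRI' marks primary-key
    // columns, which become the HBase row key during conversion.
    public static String columnsQuery(String schema, String table) {
        return "SELECT COLUMN_NAME, DATA_TYPE, COLUMN_KEY "
             + "FROM INFORMATION_SCHEMA.COLUMNS "
             + "WHERE TABLE_SCHEMA = '" + schema + "' AND TABLE_NAME = '" + table + "'";
    }

    public static void main(String[] args) {
        System.out.println(tablesQuery("CityDetail"));
        System.out.println(columnsQuery("CityDetail", "Country"));
    }
}
```

Querying these metadata tables is what lets the tool discover the source schema without scanning the data tables themselves.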
3.2. Conversion of table mode
From the study of HBase's storage structure, we know that table structure in HBase differs from that of a traditional relational database. There are no associations between HBase tables, and no operations such as the JOIN queries of relational databases. To migrate the table schemas, data that is interrelated in the traditional relational database must be migrated into the same row in HBase. Considering HBase's particular table and storage structures, and to ensure as far as possible that the migrated data does not impair business functions, the table schemas are transformed as follows [10]:
(1) Conversion of basic tables
The basic transformation, applied to every table in the CityDetail system, migrates the source data table directly to the HBase side: the source table's name becomes the HBase table name, its primary key becomes the row key, the combination of table name and column name becomes the column qualifier in the HBase table, and the version count is set to 1.
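The basic rule above can be sketched as a pure function from one source record to one HBase row. The row-key and qualifier choices follow the text (primary key as row key, table name plus column name as qualifier); the single column family name "cf" is an assumption of this sketch, since the source text does not fix a family name for the basic conversion.

```java
import java.util.*;

// Sketch of the basic table conversion: one MySQL record becomes one
// HBase row, modelled here as a map from "family:qualifier" to value.
public class BasicConversion {

    public static Map<String, String> toHBaseRow(String table,
                                                 Map<String, String> record,
                                                 String primaryKey) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("rowkey", record.get(primaryKey));        // primary key -> row key
        for (Map.Entry<String, String> col : record.entrySet()) {
            if (col.getKey().equals(primaryKey)) continue; // key already used as row key
            // qualifier combines table name and column name, per the rule above
            row.put("cf:" + table + "." + col.getKey(), col.getValue());
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("CountryCode", "USA");
        record.put("Name", "United States");
        System.out.println(toHBaseRow("Country", record, "CountryCode"));
    }
}
```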
(2) embedded conversion
In the CityDetail system there is an association between the Country table and the City table, and likewise between the Country table and the CountryLanguage table. HBase's physical storage places different column families of a table in different Store files, and because join queries across tables occur far less often over the source data than single-table operations, the City and CountryLanguage tables can each be stored as a column family of the Country table. To migrate this kind of table, the Country schema is retained, and the City and CountryLanguage tables are split up and added to the Country table, each as one of its column families.
(3) Recursive transformation
In the CityDetail system there is not only the association between the Country and City tables but also associations one level further down, such as between the Detail table and the City table. To migrate this type of table, a deeper embedded transformation of the Country and City tables is performed on top of the embedded transformation of the City and Detail tables. Following the recursive principle, the Detail table is first split into a column family in the corresponding rows of the City table, and the City table is then split and migrated as a column family of the Country table.
(4) Segmentation conversion
According to the relational paradigms of relational databases [11], a single table such as Describe may be a child of three outer tables: Country, City, and CountryLanguage. This kind of relationship is converted by splitting the Describe table and adding its records to the corresponding column families of all three tables.
Through the integrated application of the above four transformation methods, the migration of all the table schemas of the CityDetail system is finally completed.
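The embedded conversion at the heart of transformations (2) through (4) can be sketched as folding child-table records into their parent's row under a dedicated column family. This is an illustrative sketch, not the paper's implementation: the family/qualifier naming scheme (family = child table name, qualifier = child primary key plus column name) is an assumption chosen so that multiple child rows can coexist in one parent row.

```java
import java.util.*;

// Sketch of the embedded conversion: child records (e.g. City rows) are
// folded into their parent Country row as cells of one column family.
public class EmbeddedConversion {

    // parentRows maps parent row key -> ("family:qualifier" -> value); mutated in place.
    public static void embed(Map<String, Map<String, String>> parentRows,
                             String childFamily,
                             List<Map<String, String>> childRecords,
                             String childKey,      // child table's primary-key column
                             String foreignKey) {  // column referencing the parent row key
        for (Map<String, String> child : childRecords) {
            Map<String, String> parent = parentRows.get(child.get(foreignKey));
            if (parent == null) continue;          // orphan record: nothing to embed into
            String id = child.get(childKey);       // keeps sibling children distinct
            for (Map.Entry<String, String> col : child.entrySet()) {
                if (col.getKey().equals(foreignKey)) continue; // FK is implied by the row
                parent.put(childFamily + ":" + id + "." + col.getKey(), col.getValue());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> countries = new LinkedHashMap<>();
        countries.put("USA", new LinkedHashMap<>());

        Map<String, String> city = new LinkedHashMap<>();
        city.put("ID", "3793");
        city.put("Name", "New York");
        city.put("CountryCode", "USA");

        embed(countries, "City", List.of(city), "ID", "CountryCode");
        System.out.println(countries.get("USA"));
    }
}
```

The recursive transformation of section (3) amounts to applying `embed` bottom-up: Detail into City first, then the enriched City rows into Country.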
4. Design and implementation
The flow chart of the migration system designed in this paper is shown in figure 3.
(1) Connect to the relational database MySQL
First, load the MySQL JDBC driver in the Java program with a Class.forName statement, then create a connection with "Connection conn = DriverManager.getConnection(url, user, password)", and then access the database's metadata to obtain the table schemas.
(2) pattern conversion
Traverse all the table schemas obtained in (1) and apply the four transformations described above to derive the table schemas for the target HBase database.
(3) connect to HBase database
Obtain the HBase configuration with "Configuration conf = HBaseConfiguration.create()", create a new table in HBase with "table = new HTable(conf, tablename)", and specify the name of each column family with "byte[] family = Bytes.toBytes("n")" according to the table schemas converted in (2). At this point the table schemas have been migrated.
(4) data migration
The data migration module connects to both databases. First it connects to MySQL, creates a database object for the specified MySQL database, and loops over SQL queries to fetch the records. It then connects to the created HBase tables to obtain HBase objects, inserts the records returned by the SELECT queries into the HBase tables one by one using the Put method, and finally closes the data objects, completing the migration.
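The loop above can be sketched as follows. The JDBC read side uses only the standard java.sql interfaces; the HBase write side is modelled as a plain map standing in for a Put, so the sketch stays self-contained and runnable without a cluster. The family name "cf" and the qualifier format are the same illustrative assumptions as in the conversion sketches; connection details are placeholders and fetch() is shown but not executed here.

```java
import java.sql.*;
import java.util.*;

// Sketch of the migration loop: SELECT rows from MySQL, convert each to
// one put (modelled as a map), and buffer them for insertion into HBase.
public class MigrationLoop {

    // Pure conversion step: one source record -> one put.
    public static Map<String, String> toPut(Map<String, String> record,
                                            String table, String pk) {
        Map<String, String> put = new LinkedHashMap<>();
        put.put("rowkey", record.get(pk));
        for (Map.Entry<String, String> e : record.entrySet()) {
            if (e.getKey().equals(pk)) continue;
            put.put("cf:" + table + "." + e.getKey(), e.getValue());
        }
        return put;
    }

    // JDBC read side (requires a live connection, so it is not called in main).
    public static List<Map<String, String>> fetch(Connection conn, String table)
            throws SQLException {
        List<Map<String, String>> out = new ArrayList<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM " + table)) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                Map<String, String> rec = new LinkedHashMap<>();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    rec.put(md.getColumnName(i), rs.getString(i));
                }
                out.add(rec);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> rec = new LinkedHashMap<>();
        rec.put("CountryCode", "USA");
        rec.put("Name", "United States");
        List<Map<String, String>> batch = new ArrayList<>();
        batch.add(toPut(rec, "Country", "CountryCode"));  // one put per source record
        System.out.println("puts buffered: " + batch.size());
    }
}
```

In a real run, each buffered put would be written with the HBase Put API and both connections closed once the loop finishes.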
5. Test and conclusion
The experimental test is carried out on the Hadoop cluster, which consists of four hosts, each of which has Hadoop, HBase and Zookeeper installed. The cluster information is shown in Table 1.
This experiment uses the data migration tool proposed in this paper and the official data migration tool Sqoop configured by Apache for Hadoop to migrate the CityDetail system, and uses different size data sets to compare the migration process and results of the two tools.
First, regarding the complexity of the migration process: Sqoop performs a migration by passing parameters on the command line, including the database connection string, the table names before and after migration, attribute names, and so on; this parameter passing is tedious and error-prone. With the migration tool of this paper, everything from obtaining the table schemas to creating the HBase tables and migrating the data is completed automatically by the system, a comparatively high degree of automation.
Second, regarding query performance: the migration tool of this paper converts and migrates the table schema, whereas Sqoop merely copies the data of specific tables mechanically without migrating the schema, so the former's migration results have a clear advantage. Take a typical SQL join such as "SELECT Name, Language FROM Country, CountryLanguage WHERE Country.CountryCode = CountryLanguage.CountryCode" as an example; the query results of both are shown in figure 4.
As figure 4 shows, the query performance of this migration tool is much better than Sqoop's. Because the tool's schema transformation stores the information belonging to one record in a single HRegion, and the data of one table in the same Store file, multiple addressing steps are avoided at query time, reducing the system's response time.
6. Conclusion
Through a case study of migrating the CityDetail system from MySQL to HBase, this paper examines the storage principles of the MySQL and HBase databases and proposes a method that quickly extracts table schemas by accessing MySQL's metadata and then transforms and migrates them, solving the problem that earlier migration tools could not migrate table schemas. While preserving data integrity as far as possible, it improves migration speed, the degree of automation, and the query performance of the migrated data. However, because HBase indexes only the row key and lacks secondary indexes, its performance on multi-condition queries will be far lower than MySQL's, so index optimization remains to be studied.