
Construction Plan for a Data Mining and Big Data Analysis Research Platform for a Tourism Research Institute

2025-01-29 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

I. Background

I.1 Industry background and development trends of data mining and big data analysis

With the rapid development of the mobile Internet, e-commerce, and social media, the volume of data that enterprises must handle grows exponentially. According to IDC's Digital Universe research report, the amount of information newly created and copied worldwide in 2020 exceeded 40 ZB, 12 times the 2015 figure, while data volume in China surpassed 8 ZB in 2020, a 22-fold increase over 2015. This rapid growth of data has driven the prosperity of the big data technology and services market. IDC's latest market research on big data and business data analytics (BDA) in Asia Pacific (excluding Japan) shows that the big data technology and services market will grow from US $548 million in 2012 to US $2.38 billion in 2017, a compound annual growth rate of 34.1% over five years, covering the storage, server, network, software, and services markets. In short, data growth is non-linear.

According to an IDC analysis report, application cases of big data and analytics in the Asia-Pacific region have multiplied over the past year. In China, organizations ranging from Internet enterprises to traditional sectors such as telecommunications, finance, and government have begun adopting a variety of big data and analytics techniques and starting their own big data journeys. Application scenarios are also expanding, from structured data analysis to semi-structured and unstructured data analysis; social media analysis in particular has attracted growing attention from users. Users have begun to evaluate new big data technologies, represented by Hadoop, database appliances, and in-memory computing.

The latest research shows that improving competitive advantage, reducing costs, and attracting new customers are the three returns Chinese users most expect from big data projects. Existing big data projects mainly focus on optimizing business processes and improving customer satisfaction. IDC found that many users want big data to bring business innovation to the enterprise and have begun using advanced analytics solutions to manage complex data environments. Over the past year, users' attention to social data collection and analysis has increased significantly. Going forward, geographic location analysis will grow rapidly, which will in turn heighten attention to big data security and privacy management. Within the Asia-Pacific region, users in Australia and Singapore invest in big data mainly through consulting services, focusing on how to design and implement solutions based on new best practices, while China and India show a clear preference for hardware, investing in data-center-related infrastructure.

Traditional data analysis and business data mining usually follow the 80/20 (Pareto) principle: 20% of users provide 80% of the value, so limited resources are concentrated on serving that small group of users. With the development of the Internet, however, more and more low-value users have entered business systems, and these users have become a target of competition among commercial enterprises. In e-commerce, for example, a large share of customers are low-value in the traditional sense. The data show that mining the value of these users can overturn the 80/20 principle and even approach a uniform distribution of value, and advances in computing technology have made this kind of large-scale analysis feasible.
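The 80/20 contrast above can be made concrete with a small sketch (illustrative only, using made-up synthetic values, not data from the platform): compute what share of total value the top 20% of users contribute for a skewed population versus a near-uniform one.

```python
# Illustrative sketch with synthetic numbers: how concentrated "user value"
# is in two hypothetical customer bases, in the spirit of the 80/20 rule.

def value_share_of_top(values, top_fraction=0.2):
    """Return the share of total value contributed by the top fraction of users."""
    ranked = sorted(values, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

# A skewed (Pareto-like) population: a few high-value users, many low-value ones.
skewed = [1000, 800, 600, 400, 200] + [10] * 95
# A near-uniform population, as described for long-tail e-commerce users.
uniform = [50] * 100

print(round(value_share_of_top(skewed), 2))   # most value sits in the top 20%
print(round(value_share_of_top(uniform), 2))  # value spread almost evenly
```

On the skewed population the top 20% of users hold roughly 80% of the value; on the uniform one they hold exactly 20%, the "almost uniform distribution" the text describes.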

I.2 The significance of big data analysis and its application in the tourism industry

The tourism industry spans many sectors, operates at large scale, and involves high mobility, so it depends heavily on big data. The industry now faces both the challenge of upgrading and the opportunity of transformation under the "new normal": while the broader economy slows, per capita GDP growth declines, and many traditional industries restructure, tourism continues to grow faster than the economy as a whole. A tourism big data solution integrates large domestic data sources from multiple channels to form a tourism big data ecosystem, providing big data solutions for domestic tourism and promoting the industry's transformation and upgrading.

I.3 The necessity of building a data mining and big data analysis research platform

Data mining and big data analysis is a comprehensive discipline grounded in computer science, with mining algorithms at its core and a close orientation toward industry applications. Its main techniques draw on probability theory and mathematical statistics, data mining, algorithms and data structures, computer networks, parallel computing, and other specialties, so the discipline places high professional demands on a research platform. The platform must provide not only a basic programming environment but also a big data computing environment and real-world big data cases for research; preparing these materials requires a complete research platform as support.

At present, the disciplines related to data mining and big data analysis in Chinese colleges and universities include computer science and technology, information management and information systems, statistics, economics, finance, trade, bioinformatics, tourism, and public health. These majors emphasize different aspects of a research platform, serve users at different levels, and use different algorithms. It is therefore necessary to build a big data research platform that is convenient, easy to operate, comprehensive, and visual.

II. Overall Planning of the Data Mining and Big Data Analysis Research Platform

II.1 Research platform planning

The basic principle of the platform's construction is to give priority to scientific research while also providing computing and security resources for teaching experiments. The system shares the research system's computing resources within an authorized scope to improve the authenticity of teaching experiments.

The overall architecture of the project is shown in figure 1.

Figure 1. Overall architecture diagram

In the overall system, a gigabit core switch serves as the core node, and two gigabit access switches serve as the switching nodes for the research and experimental environments. The research environment is based on our company's commercial Hadoop cluster, on top of which are integrated an easy-to-operate big data research application system, a 10 TB big data case set, and drag-and-drop data mining and visualization algorithms.

II.2 Functional planning of the research platform

The platform centers on data mining and big data analysis while taking both research and teaching needs into account: it meets the requirements of a high-performance big data analysis platform for research work while remaining simple and easy to use as a teaching experiment platform.

1) Big data resource planning

The platform ships with commercial-grade data resources, organized according to common research classifications so they can be used directly in research, with authorization management and control over data resources.

2) Big data analysis function planning

The platform is built around a commercial distribution of Hadoop as the core of big data analysis and provides MapReduce, Spark, and other big data mining functions, together with complete management and scheduling capabilities.
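The MapReduce model mentioned here can be sketched without any framework (an illustrative pure-Python word count, not the platform's Hadoop code): a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
# Minimal, framework-free sketch of the MapReduce programming model,
# shown as a word count. The real platform runs these phases distributed
# on a Hadoop cluster; this only illustrates the pattern.

from collections import defaultdict

def map_phase(records, mapper):
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group intermediate (key, value) pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    return {key: reducer(key, values) for key, values in groups.items()}

lines = ["big data tourism", "big data analysis"]
mapper = lambda line: [(word, 1) for word in line.split()]
reducer = lambda word, counts: sum(counts)

counts = reduce_phase(shuffle(map_phase(lines, mapper)), reducer)
print(counts)  # {'big': 2, 'data': 2, 'tourism': 1, 'analysis': 1}
```

In a real Hadoop job the mapper and reducer are user code while map, shuffle, and reduce execution are handled by the framework across many nodes.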

3) Hardware resource planning

The system provides 24 Intel Xeon E5 CPUs, more than 40 TB of storage capacity, and more than 1 TB of memory; it can run 1,000 tasks concurrently and is easy to expand.

III. Construction Scheme of the Data Mining and Big Data Analysis Research Platform

III.1 Equipment architecture of the big data research platform

Figure 3. Device architecture

III.1.1 Master node and backup master node

The master node is responsible for the operation of the whole distributed big data platform. It keeps the entire file system's directory structure in memory: which files each directory contains, which blocks make up each file, and which compute node stores each block, and it uses this information to serve read and write requests. The master node also decomposes jobs into subtasks and assigns those subtasks to the compute nodes. When the master node fails, the backup master node takes over all of its duties so that the distributed big data platform can continue to operate normally.

III.1.2 Management node

The management node manages the entire distributed big data platform. It handles node installation, configuration, and service configuration, and provides a web interface that improves the visibility of system configuration and reduces the complexity of cluster parameter settings.

III.1.3 Interface node

End users connect to the distributed big data platform through the interface node to submit tasks and retrieve results. From it they can also use other data analysis tools for further processing and interact with external systems, such as relational databases.

III.1.4 Compute nodes

The distributed big data platform contains multiple compute nodes. A compute node is where data is actually stored and where computation actually runs. Each compute node sends periodic heartbeats to the master node and communicates with client code and other compute nodes as needed. Each compute node also maintains an open socket server through which client code and other compute nodes read and write data; this activity is likewise reported to the master node.

III.2 Underlying architecture of the big data research platform

The underlying architecture of the big data research platform is based on the commercial distribution of Hadoop independently developed by our company. It includes functional modules for big data analysis, data mining, and machine learning, with HDFS and HBase as the storage foundation.

Figure 2. Software architecture

III.2.1 Distributed persistent data storage: HDFS

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are also significant: HDFS is highly fault-tolerant and designed for deployment on inexpensive machines, and it provides high-throughput data access, making it well suited to applications with large data sets. HDFS relaxes some POSIX constraints in order to support streaming access to file system data.
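Two HDFS ideas worth making concrete are fixed-size blocks and replication across data nodes. The toy sketch below is an assumption-laden simplification (tiny 4-byte blocks, round-robin placement across hypothetical nodes "dn1".."dn4"); real HDFS uses 128 MB blocks and rack-aware placement.

```python
# Toy sketch of HDFS block splitting and replica placement.
# Values and node names are illustrative, not from the platform.

BLOCK_SIZE = 4   # bytes here for readability; 128 MB in a real HDFS
REPLICATION = 3  # HDFS's default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Chop a byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Round-robin replica placement; real HDFS is rack-aware."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(b"tourism big data!")
nodes = ["dn1", "dn2", "dn3", "dn4"]
print(len(blocks))                            # 17 bytes -> 5 blocks
print(place_replicas(len(blocks), nodes)[0])  # 3 replicas of block 0
```

Replication is what makes HDFS fault-tolerant on cheap machines: losing any single node leaves at least two copies of every block.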

III.2.2 Distributed real-time database: HBase

HBase is a distributed, column-oriented open-source database. The technology originates from Fay Chang's Google paper "Bigtable: A Distributed Storage System for Structured Data." Just as Bigtable builds on the distributed storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a subproject of Apache's Hadoop project. Unlike a typical relational database, HBase is suited to storing unstructured data, and it is column-based rather than row-based.
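The Bigtable-style data model behind HBase can be illustrated with a toy in-memory table (an assumption-based sketch, not HBase's API): rows are keyed, cells live under a column family and qualifier, and each cell keeps timestamped versions.

```python
# Toy sketch of HBase's data model: row key -> column family:qualifier ->
# timestamped versions. A nested dict stands in for the real store.

import time

class ToyHBaseTable:
    def __init__(self, column_families):
        self.families = set(column_families)
        self.rows = {}  # row_key -> {(family, qualifier): [(ts, value), ...]}

    def put(self, row_key, family, qualifier, value, ts=None):
        assert family in self.families, "unknown column family"
        cell = self.rows.setdefault(row_key, {}).setdefault((family, qualifier), [])
        cell.append((ts if ts is not None else time.time(), value))

    def get(self, row_key, family, qualifier):
        """Return the newest version of a cell, like a default HBase get."""
        versions = self.rows.get(row_key, {}).get((family, qualifier), [])
        return max(versions)[1] if versions else None

table = ToyHBaseTable(["info"])
table.put("site:001", "info", "name", "West Lake", ts=1)
table.put("site:001", "info", "name", "West Lake Scenic Area", ts=2)
print(table.get("site:001", "info", "name"))  # newest version wins
```

The column-family layout is why HBase suits sparse, semi-structured data: rows need not share the same columns, and new qualifiers cost nothing to add.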

III.2.3 Distributed resource scheduling and management: YARN

YARN is the resource management framework introduced in Hadoop 2.0. At the heart of YARN's hierarchy is the ResourceManager, the entity that controls the entire cluster and manages the allocation of applications to the underlying computing resources. The ResourceManager carefully assigns individual resources (compute, memory, bandwidth, and so on) to the underlying NodeManagers (YARN's per-node agents). It also allocates resources to each application's ApplicationMaster and, together with the NodeManagers, launches and monitors the applications. In this architecture, the ApplicationMaster takes on some of the roles of the former TaskTracker, while the ResourceManager assumes the role of the JobTracker.
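The ResourceManager/NodeManager split described above can be sketched as a toy scheduler (an illustrative simplification with made-up node names and capacities, not YARN's actual scheduling policy): the ResourceManager grants containers on whichever NodeManager still has capacity.

```python
# Toy sketch of the YARN allocation flow: a ResourceManager hands
# containers (vcores + memory) on NodeManagers to applications.

class NodeManager:
    def __init__(self, name, vcores, memory_mb):
        self.name, self.vcores, self.memory_mb = name, vcores, memory_mb

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, vcores, memory_mb):
        """Grant a container on the first node with enough free resources."""
        for node in self.nodes:
            if node.vcores >= vcores and node.memory_mb >= memory_mb:
                node.vcores -= vcores
                node.memory_mb -= memory_mb
                return {"app": app_id, "node": node.name,
                        "vcores": vcores, "memory_mb": memory_mb}
        return None  # no capacity: in real YARN the request waits in a queue

rm = ResourceManager([NodeManager("nm1", 4, 8192), NodeManager("nm2", 8, 16384)])
c1 = rm.allocate("spark-job-1", 4, 8192)
c2 = rm.allocate("hive-query-2", 6, 8192)
print(c1["node"], c2["node"])  # nm1 nm2
```

Real YARN adds queues, fairness and capacity policies, and per-application ApplicationMasters that negotiate these containers on the application's behalf.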

III.2.4 Interactive SQL engine: Hive

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables, provides a simple SQL query capability, and translates SQL statements into MapReduce jobs to run. Its advantage is a low learning cost: simple MapReduce statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis over a data warehouse.
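The kind of SQL-over-tables workflow Hive offers can be shown with Python's built-in sqlite3 standing in for Hive (an illustrative substitution; the table and numbers are made up). The query is the point: in Hive the same GROUP BY would be compiled into MapReduce or Spark jobs over HDFS files.

```python
# SQL-on-tables illustration using sqlite3 as a stand-in for Hive.
# In Hive, the table would map onto structured files in HDFS and the
# query would be translated into distributed jobs.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (site TEXT, province TEXT, visitors INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", [
    ("West Lake", "Zhejiang", 1200),
    ("The Bund", "Shanghai", 900),
    ("Lingyin Temple", "Zhejiang", 400),
])

# A typical warehouse-style aggregate a researcher might run on the platform.
rows = conn.execute(
    "SELECT province, SUM(visitors) FROM visits "
    "GROUP BY province ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('Zhejiang', 1600), ('Shanghai', 900)]
```

This is the "low learning cost" the section describes: the researcher writes declarative SQL and never touches MapReduce code.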

III.2.5 In-memory computing: Spark

Spark is a general-purpose parallel computing framework in the mold of Hadoop MapReduce, open-sourced by UC Berkeley's AMPLab. Spark retains the advantages of Hadoop MapReduce but, unlike MapReduce, can keep intermediate job output in memory, avoiding repeated reads and writes to HDFS. This makes Spark a better fit for algorithms that require iteration, such as those used in data mining and machine learning.
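Why in-memory reuse matters for iterative jobs can be shown without Spark at all (a pure-Python simulation, not Spark's API): count how many times storage is touched when each iteration re-reads its input versus when the dataset is loaded once and cached.

```python
# Simulated comparison of MapReduce-style re-reading vs. Spark-style
# in-memory caching for an iterative computation. "Reads" stands in
# for round trips to HDFS.

reads = {"uncached": 0, "cached": 0}

def load_dataset(mode):
    reads[mode] += 1              # one simulated HDFS read
    return list(range(1, 101))

# MapReduce style: every iteration goes back to storage.
for _ in range(10):
    data = load_dataset("uncached")
    total = sum(data)

# Spark style: load once, keep the dataset in memory, iterate on it.
cached = load_dataset("cached")
for _ in range(10):
    total = sum(cached)

print(reads)  # {'uncached': 10, 'cached': 1}
```

Ten iterations cost ten storage reads in the first style and one in the second, which is exactly the advantage the paragraph attributes to Spark for iterative mining and learning algorithms.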

III.3 Functions of the research platform

III.3.1 Research project management

On the research platform, research computations are organized as computing projects, covering project creation, maintenance, design, execution, and result visualization. Technically, a computing project comprises algorithm components, algorithm workflows, and data sets; once designed, it can be run for computation, and the algorithms can later be adjusted and recomputed on new data resources.

After a computing project is completed, its algorithm model can be trained, and the trained model can then be used to make predictions on data in new computing projects: trained once, the algorithm can be reused many times.

III.3.2 Built-in datasets of the platform

In scientific research, obtaining large volumes of high-quality big data resources is the greatest difficulty. It is currently hard to find the data sources research requires through channels such as the Internet, especially high-quality data that has been cleaned and governed.

The data supermarket platform uses the following models to bring in external resources and provide high-quality data for research in colleges and universities:

1) Through business cooperation, communicating directly and flexibly with data owners to obtain authorization to use their data for research.

2) Inviting high-quality third-party data service providers in the industry onto the data supermarket platform.

3) Through data collection: after data source search, collection, management, and cleaning, introducing data resources whose copyright is in the public domain.

All imported data is strictly reviewed by data engineers to ensure its cleanliness and quality, so it can be used directly in data computation.

For example, the patent data built into the platform includes nearly 20 million domestic records, is continuously updated, and can be used directly in all aspects of tourism research. Unlike the databases currently offered in other fields, the data supermarket provides the original data directly, which supports in-depth data analysis and economic forecasting.

III.3.3 Uploading research data

Teachers can upload their existing research data to the platform to take part in computations: they create data tables on the platform and then upload local data files into them. They can also register external JDBC data sources, and the platform will automatically extract the external data for computation and prediction.

III.3.4 Integrated algorithm components

To help research teachers process, analyze, and compute research data quickly, the data supermarket platform integrates more than 50 general-purpose big data algorithm components, including regression, classification, clustering, association rule mining, recommendation, prediction and evaluation, data preprocessing, and machine learning algorithms. None of the algorithms requires programming; computations are assembled by drag-and-drop, as shown in the following figure:

Algorithm components can be configured to achieve powerful custom computations, and the tuned model can carry out the data analysis and prediction teachers need.
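As a concrete illustration, one of the clustering components listed above might behave like this minimal 1-D k-means (a pure-Python sketch with made-up spend-per-visitor values, not the platform's implementation):

```python
# Tiny 1-D k-means, standing in for a drag-and-drop clustering component.
# Naive initialization and fixed iteration count keep the sketch short.

def kmeans_1d(points, k, iters=20):
    """Cluster 1-D points into k groups; return the final centroids."""
    centroids = [float(c) for c in sorted(points)[:k]]  # naive init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest centroid
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # update step: move each centroid to its cluster mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups of hypothetical spend-per-visitor values:
spend = [10, 11, 12, 98, 99, 101]
centroids = kmeans_1d(spend, k=2)
print(centroids)  # one centroid per spending group
```

On the platform, the same assignment/update logic would run distributed over the cluster, with `k` and the iteration count exposed as component configuration.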

III.3.5 Visualization functions of the research platform

The platform provides more than 20 visualization modes, selectable and switchable with one click. Users can present big data according to their needs, display the relevant dimensions, and generate high-quality PNG files with one click for local storage and use in research reports and papers.

IV. Platform dataset inventory

To help users quickly carry out research and generate research data reports, the platform provides a set of general-purpose datasets, including various standard research data.

The platform also has hundreds of built-in optional datasets, divided into multiple packages and totaling close to 10 TB; the total continues to grow as business and collection work progresses.

V. Customized data services

According to the needs of research teachers, the data supermarket platform offers customized data introduction modes such as data collection and business cooperation. Once introduced, the data can be loaded directly into the data supermarket for teachers to use.

For example, if a teacher needs tourism service evaluation data to analyze and forecast service conditions, the request can be submitted directly through the data customization module in the data supermarket. After the platform administrator consolidates the requests, the data is prepared through the platform and handed to the teacher for use.

VI. Algorithm list of the research platform

The platform integrates 72 algorithms, all drawn from scientific research sources. After verification by commercial organizations, they were distributed-optimized upon introduction to the platform and execute efficiently. Details are as follows:

Each algorithm's details can be viewed on the research platform, including its introduction, inputs, outputs, usage, and applicable scenarios.
