Example Analysis of Data Computing Middleware Technology in the Database
This article introduces data computing middleware technology for databases through an example analysis. It has some reference value for interested readers; I hope you gain a lot from it.
The problem with big data architecture in traditional enterprises
A familiar open source big data architecture is the one based on the Hadoop ecosystem. It can be roughly divided into three layers. The lowest layer is data collection, which usually uses Kafka or Flume to deliver web logs through message queues to the storage layer or computing layer. For data storage, the Apache community currently offers a variety of storage engines: in addition to traditional HDFS files and HBase, there are columnar formats such as Kudu, ORC and Parquet, which you can choose according to your own needs and characteristics. In the data computing layer on top of this, the choice is even richer. If you want to make real-time recommendations, you can use stream computing engines such as Storm or Spark Streaming to process the data delivered by Kafka or Flume in real time. If you want to build customer profiles, you can use the machine learning algorithms in Mahout or Spark MLlib for classification. If you want to see the day's sales rankings, you can use HBase, Impala or Presto. If you want to do a more complex funnel analysis of the sales of certain goods, Hive or Spark may be more appropriate.
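To make the collection layer concrete, here is a minimal sketch, in Java, of a Kafka consumer that pulls web-log lines off the message queue for the storage or computing layer to process. The broker address, consumer group id and topic name ("weblogs") are placeholders, not anything from a specific deployment.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class WebLogConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "weblog-loaders");          // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("weblogs")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // hand the raw log line to the storage or computing layer
                    System.out.printf("offset=%d line=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```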
Of course, according to your own needs, you can layer on a Redis cache, ElasticSearch full-text search, or products like MongoDB or Cassandra. We find, then, that there is actually no particularly mature framework for big data computing. Most of what we do is constantly innovate, improve and revise around problem points, and then find ways to integrate several products. This is because, as an emerging field, big data computing does not yet have enough technological accumulation; many difficulties remain unsolved, and the field is still growing. In big data technology development and innovation, Internet enterprises are leading the trend: most of the big data products currently in demand come from Internet enterprises. Hadoop, the cornerstone of big data technology, is based on the ideas of Google's MapReduce and the Google File System; Presto came from Facebook. Cloudera, which contributed Impala and Flume, is not an Internet company, but it has strong Internet genes. Domestic Internet companies such as BAT have also contributed greatly to the big data open source community.
But this also brings a problem: these big data products and architectures are designed around the needs and scenarios of Internet enterprises. Although those needs and scenarios are fairly universal, the overall IT architecture of a traditional enterprise differs greatly from that of an Internet enterprise.
First of all, there are great differences in technical staffing between traditional enterprises and Internet enterprises. Internet enterprises gather a large number of high-level software design, development and maintenance personnel, which most traditional enterprises lack. One difference is quantity: in a traditional enterprise, an information center with hundreds of technical staff is already a considerable team, while Internet enterprises often have thousands, and companies like BAT have tens of thousands of engineers for development and maintenance. Another difference is quality: Internet enterprises usually have dedicated platform support teams with the ability to fix bugs in open source products promptly and keep system services running stably. Due to salary and other reasons, it is often difficult for traditional enterprises to recruit top developers who master the core technology of open source products. This is a hidden danger in using open source products: once a bug or other problem appears and no one can deal with it in time, it can cause great losses to the enterprise's production services.
Secondly, the IT architecture of traditional enterprises is also very different from that of Internet enterprises. Internet enterprises have relatively short histories and have the genes to develop their own applications on open source software; each knows its technical details and business logic very well, and its big data systems are closely tied to its business systems, so there are few integration problems. Traditional enterprises, however, often have long histories; their IT construction has gone through a variety of technical routes, leaving a large number of legacy systems with inconsistent architectures. Many enterprises built enterprise data warehouses in the past and are now starting to build big data platforms, but there is no strict division of labor between the two, which causes a lot of functional overlap and data redundancy. Much data is kept in multiple copies in different systems, and many enterprises even need to transfer the same data back and forth between systems frequently. This brings serious integration problems.
Third, compared with Internet enterprises, most traditional enterprises actually do not have that much data. Next to Google's more than 100,000 searches per second or Alipay's more than 250,000 transactions per second, the data volumes of the vast majority of traditional enterprises hardly constitute an insurmountable problem; traditional technology may well be able to handle them, and a heavy architecture like Hadoop is not necessarily needed. For mining the value of this data, the complex multi-source, heterogeneous environment may be the more troublesome problem.
By others' faults, wise men correct their own
Sometimes, when considering a solution to a problem, it is a good start to learn from solutions to similar problems.
In fact, a similar situation arose long ago in the field of transaction applications. All kinds of application systems run in an enterprise; they are developed by different vendors, and the technical routes, architectures and standards they follow differ widely. The result is islands of information: data that needs to be shared cannot be exchanged between systems, causing information lag and data inconsistency.
So were these problems ever solved? How? Someone invented middleware.
No one has given middleware a rigorous scientific definition. Generally speaking, it is a concept put forward to solve the problem of distributed heterogeneity. Middleware sits between the platform (hardware and operating system) and applications, and provides general services to two or more parties; these services have standard program interfaces and protocols, and for different operating systems and hardware platforms there can be multiple implementations that conform to the interface and protocol specifications. Solving multi-source heterogeneity is not the only reason middleware emerged, but it is an important problem that middleware solves. Generally speaking, middleware has the following characteristics:
1. Meet the needs of a large number of applications
2. Runs on a variety of hardware and OS platforms
3. Support for distributed computing, providing transparent application or service interaction across network, hardware, and OS platforms
4. Protocols that support standards
5. Support for standard interfaces
In other words, the main role of middleware is to establish a standardized interactive interface across platforms. According to the different application scenarios, open source middleware is divided into network communication middleware, RPC middleware, message middleware, transaction middleware, Web middleware, security middleware and so on. These different middleware have different functions and implementation methods, and play different roles in their respective fields, but they all meet the characteristics listed above and have the basic functions described above.
So why not consider using middleware technology in the field of data applications?
Data computing middleware
Why is the concept of data computing middleware proposed? Because in the process of developing data applications, people are usually plagued by the following problems.
- Cross-system and cross-platform operations are required to compute data from different data sources together
- Requirements change frequently: new requirements keep emerging and old ones keep being modified
- Business logic and data are too tightly coupled
- Complex calculations are difficult to implement and perform poorly
By introducing heterogeneous data computing middleware, the problem of fusion computing in a multi-source, heterogeneous environment can be solved well. Of course, solving interactive access to heterogeneous data alone is not enough. To address the difficulties above, data computing middleware also needs to offer high development efficiency, excellent computing performance and convenient code management. Taken together, we can evaluate data computing middleware from the following aspects.
- Compatibility (Cross-platform)
Compatibility here mainly refers to cross-platform data access. As mentioned earlier, a distinctive feature of traditional enterprise IT systems is the large number of heterogeneous systems with very poor interoperability. The primary task of data computing middleware is to break through this barrier and act as a connector, integrating data from the different heterogeneous platforms.
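As a minimal sketch of what such fusion computing looks like without middleware, the Java fragment below pulls customer rows from a MySQL database and order rows from a PostgreSQL database over plain JDBC and joins them in memory. The connection URLs, credentials and table names are hypothetical, and the MySQL and PostgreSQL JDBC drivers must be on the classpath; the point is only that the developer must hand-code the cross-source plumbing that a data computing middleware would standardize.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class CrossSourceJoin {
    public static void main(String[] args) throws SQLException {
        Map<Integer, String> customers = new HashMap<>();

        // Source 1: customer master data in MySQL (URL, credentials and schema are placeholders)
        try (Connection my = DriverManager.getConnection(
                 "jdbc:mysql://mysql-host:3306/crm", "user", "pass");
             Statement st = my.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name FROM customer")) {
            while (rs.next()) customers.put(rs.getInt("id"), rs.getString("name"));
        }

        // Source 2: order data in PostgreSQL, joined in memory with the MySQL rows
        try (Connection pg = DriverManager.getConnection(
                 "jdbc:postgresql://pg-host:5432/sales", "user", "pass");
             Statement st = pg.createStatement();
             ResultSet rs = st.executeQuery("SELECT customer_id, amount FROM orders")) {
            while (rs.next()) {
                String name = customers.get(rs.getInt("customer_id"));
                System.out.println(name + " -> " + rs.getBigDecimal("amount"));
            }
        }
    }
}
```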
- Hot deployment (Hot-deploy)
One characteristic of data applications is that requirements change rapidly. Our appetite for data analysis is endless: we are always exploring new goals and always hoping to mine more information from the data. Constantly changing requirements are therefore the norm for data applications. This puts new demands on deployment: if the service has to be stopped every time a new functional module is deployed, quality of service inevitably suffers. If application modules can instead be hot-swapped, with no need to stop the whole service and with modules isolated from each other, the application runs more smoothly and quality of service can be guaranteed.
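The sketch below illustrates the hot-deployment idea in Java, under the simplifying assumption that a module's logic lives in an external script file: the file is re-read whenever its modification time changes, so new logic takes effect without restarting the service. It is a toy illustration of the principle, not any product's actual mechanism; the file name is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// The query text lives in an external file that is re-read whenever it
// changes, so a new version can be dropped in without stopping the service.
public class HotReloadingScript {
    private final Path scriptFile = Path.of("reports/top-sales.sql"); // hypothetical path
    private long lastModified = -1;
    private String cachedScript = "";

    public synchronized String currentScript() throws IOException {
        long mtime = Files.getLastModifiedTime(scriptFile).toMillis();
        if (mtime != lastModified) {              // file was replaced or edited
            cachedScript = Files.readString(scriptFile);
            lastModified = mtime;                 // pick up the new version, no restart
        }
        return cachedScript;
    }
}
```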
- High performance (Efficient)
Computing performance is also very important for data computing middleware. Even if a traditional enterprise's data volume is not as large as an Internet enterprise's, the data that data applications must handle is still of considerable scale, so high computing performance is an important index for evaluating data computing middleware. Although there are no hard performance metrics, we always want processing to be as fast as possible.
- Agility (Agile)
Agility here emphasizes development. Precisely because the requirements of data applications keep changing, development is an ongoing task; unlike a traditional business application, a data application will not stay unchanged for long. Agile development ensures that new functions can be delivered and put to use in the shortest possible time, which in some scenarios can save an enterprise from huge losses.
- Scalability (Scalable)
Data computing middleware needs good scalability and support for distributed computing. With this ability, it can calmly handle the different data volumes found in real application environments, and when data volume and business continue to grow, it can still keep the data services it provides continuously available.
- Integration (Embeddable)
As middleware, integration is also necessary. Integration here includes two aspects: integrating third-party software, and being integrated into third-party software. The scenarios of data applications are very diverse, and only with strong integration can the middleware be applied in more environments.
These are several key considerations in our evaluation of data computing middleware, which can be referred to as CHEASE for short. If the six aspects corresponding to CHEASE are well satisfied, then this is an excellent data computing middleware.
The Runqian aggregator
Data computing middleware is a brand-new concept. Among current data computing products, the closest is the aggregator, a lightweight big data fusion computing platform independently developed by Beijing Runqian Information System Technology Co., Ltd. It is a new computing engine designed and developed for structured and semi-structured data, with the design goal of improving the efficiency of both describing and executing calculations. The aggregator has the following characteristics.
1. To achieve this design goal, Runqian first designed a process-oriented scripting language for the aggregator, SPL (Structured Process Language), which can easily describe a data calculation process. The aggregator uses new data and calculation models and provides a wealth of basic calculation methods, making it especially suitable for the complex multi-step operations of business rules; the calculation itself becomes easy to describe, which improves the efficiency of code development.
2. The aggregator has done a lot of optimization work in its internal calculation implementation; these algorithmic optimizations greatly improve the speed and efficiency of sorting, summarizing and joining data sets.
3. The aggregator has a large number of built-in data access interfaces, which can easily connect to various data sources and obtain data from them. Supported data sources include, but are not limited to:
- Commercial RDBMS: Oracle, MS SQL Server, DB2, Informix
- Open source RDBMS: MySQL, PostgreSQL
- Open source NoSQL: MongoDB, Redis, Cassandra, ElasticSearch
- Hadoop family: HDFS, Hive, HBase
- Application software: SAP ECC, BW
- Files: Excel, JSON, TXT
- Others: HTTP RESTful, Web Services, multidimensional databases supporting olap4j, Aliyun
4. SPL is an interpreted language and does not need to be compiled. This makes it very convenient to deploy task scripts inside the aggregator and easy to achieve dynamic hot deployment.
5. The aggregator provides parallel multithreaded computing and distributed cluster computing, and cluster nodes can be added dynamically, giving it excellent scalability.
6. The core functions of the aggregator are implemented in a few Java JAR packages, which are small and light, with excellent integration, flexibility, scalability, openness and customizability, and are very easy to integrate deeply with Java applications. In addition, it provides JDBC, RESTful, Web Services and other standard interfaces, which makes it very easy to integrate with third-party applications.
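For example, a Java application could run a deployed aggregator script through the JDBC interface. The driver class "com.esproc.jdbc.InternalDriver" and the URL "jdbc:esproc:local://" follow the aggregator's (esProc's) published JDBC documentation as far as I know, but treat them as assumptions to verify against the version you deploy; the script name "salesSummary" is hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AggregatorJdbcDemo {
    public static void main(String[] args) throws Exception {
        // Driver class and URL as given in the aggregator's JDBC documentation
        // (assumption: verify against the version you deploy).
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             Statement st = con.createStatement();
             // "salesSummary" is a hypothetical SPL script shipped with the application
             ResultSet rs = st.executeQuery("call salesSummary()")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}
```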
These six characteristics correspond precisely to the six aspects of CHEASE. Although the concept of data computing middleware had not yet been put forward when the aggregator was designed, the design of the whole product has always revolved around CHEASE, so it is quite balanced across compatibility, hot deployment, computing performance, agility, scalability and integration. If you feel you need data computing middleware in your data computing architecture, the aggregator is probably the only option at the moment.
Some difficulties to be solved
Of course, the concept of data computing middleware has only just been proposed, and the aggregator is also a new product; the concept needs to be continually verified and improved, and the product certainly still has many shortcomings. At present the difficulties can be seen in the following two points.
- Performance of data retrieval
Data applications differ from other applications in that they always involve reading large amounts of data, so read performance is very important. It depends not only on the data computing middleware itself, but also on the data source and the interface type. Through a standard interface such as JDBC, data access itself is not a problem, but the reading speed makes it difficult to meet the performance requirements of data applications. For this problem, Runqian provides the aggregator with internal file storage in multiple formats as a data caching mechanism to speed up computing, which is a very practical compromise. At the same time, Runqian is also trying to develop targeted high-performance interfaces to improve the speed of obtaining data from external sources. Of course, data computing middleware involves many interfaces, so solving this problem completely is a great challenge.
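The caching idea can be sketched generically in Java: persist the rows of a slow query to a local file and reuse them while they are fresh. This is only an illustration of the compromise described above, not the aggregator's own storage formats; the TTL policy and plain-text row format are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.function.Supplier;

public class FileResultCache {
    private final Path cacheFile;  // local file holding the cached rows
    private final Duration ttl;    // how long a cached copy stays valid

    public FileResultCache(Path cacheFile, Duration ttl) {
        this.cacheFile = cacheFile;
        this.ttl = ttl;
    }

    public List<String> rows(Supplier<List<String>> slowQuery) throws IOException {
        if (Files.exists(cacheFile)) {
            Instant written = Files.getLastModifiedTime(cacheFile).toInstant();
            if (Instant.now().isBefore(written.plus(ttl))) {
                return Files.readAllLines(cacheFile);   // cache hit: skip the slow source
            }
        }
        List<String> fresh = slowQuery.get();           // cache miss: hit the data source
        Files.write(cacheFile, fresh);                  // refresh the local copy
        return fresh;
    }
}
```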
- Support for machine learning
Nowadays everyone is talking about machine learning. Traditional data analysis is still the mainstream; in most fields machine learning is not yet mature, and the results of practical applications are not always satisfactory. It is undeniable, however, that machine learning is the direction of the future and will be an indispensable part of data applications, so machine learning functionality should be a necessity for data computing middleware. At present the aggregator does not have machine learning capabilities, which limits its use; of course, the aggregator itself is still developing, and this can be expected in the future.