2025-01-18 Update. From: SLTechnology News&Howtos (shulou)
Shulou(Shulou.com)06/01 Report--
What is the technical principle behind the FineBI big data engine under a Hadoop architecture? Many people without hands-on experience are unsure, so this article summarizes the problems involved and how the engine addresses them. I hope it helps you answer the question.
As business systems multiply and the volume of data in each one grows, business users ask for more analyses and change their requirements quickly, so the work of IT data support becomes more and more complex.
The main challenges are as follows:
1. The data comes from many different systems, which raises problems such as cross-data-source analysis and the need to connect different data sources.
2. The volume of data to be analyzed keeps growing, yet analysis results must still be obtained quickly.
3. Some of the data still needs to be processed again.
To meet the growing demand for analyzing large volumes of data, most companies build a big data analysis platform on frameworks such as Hadoop and Spark, combined with BI tools for data-level analysis.
A key point of big data analysis is performance: is data retrieval fast, is the analysis response fast, and can results be delivered in real time?
Besides the underlying architecture of the platform, this performance also depends heavily on the BI (business intelligence) tool.
People generally think of BI as a data presentation tool. It does not seem to involve much technical work at the front end, but the logic behind it is complex and difficult to implement. What you see is the tip of the iceberg; what you cannot see is the bulk of the support beneath the surface.
Every good BI tool depends on a data engine. The data engine determines data response performance (data volume and speed) and, just as importantly, whether the tool can adapt to the different business scenarios of an enterprise, for example fast reads of small data sets, distributed parallel computing over big data, and real-time display of node data.
FineBI v5.0 is a tool that can support the above requirements, relying on the Spider big data engine.
The Spider high-performance engine supports fast drag-and-drop analysis and display of billion-scale data in the BI front end, and its high-availability architecture is designed to keep the data engine serving business analysis throughout the year.
The Past and Present of the Spider Engine
Why is it called the Spider engine? It sounds like crawler software, so what does it have to do with data analysis?
The first reason is the literal meaning: a spider is easily associated with a web, and a web suggests networking. Networking here has two senses. One is connecting all the existing engine functions together, because the 5.0 engine supports docking and flexible switching between real-time data and extracted data. The other, more important one is the distributed mode of the 5.0 data engine, a framework composed of various components that have to be connected into a network.
The second reason is that the name echoes a Ferrari convertible sports car. Sports cars are fast, and this one is lengthened and widened to make it more stable, more functional and safer, which happens to match the concept behind our data engine.
Therefore, it is named Spider engine.
Let's talk about its history.
FineBI's data engine started with the cube/FineIndex engine, which worked on extracted data, and later added the direct-connection FineDirect engine. A distributed engine followed in 2016 and, over 2017 and 2018, quickly grew to more than 60 customers. Engine functionality and the volume of data supported kept advancing with the times. However, with so many engine types, users found them hard to understand and use.
Therefore, in v5.0 the engines were unified. The Spider engine includes all the previous engine functions, extracted data and real-time data can be switched between each other, and local mode can be extended to distributed mode as the amount of data grows, which makes the engine easier to use and understand.
Highlights:
(1) The engine lets the front end display and analyze data quickly, delivering second-level response on data at the 100 million level.
(2) Users can freely choose real-time or extracted data according to data volume, real-time requirements and frequency of use, flexibly covering both real-time analysis and analysis of large volumes of historical data.
(3) High-performance incremental updates of extracted data cover a variety of update scenarios, shorten update time and reduce the load on the database server.
(4) A sound engine architecture is designed to keep the system running without failure, so it can be used normally throughout the year.
In terms of data sources, the conventional ones are all supported, so data source support is not a concern.
3. Direct connection mode (Direct Mode)
In direct connection mode, the Spider engine connects directly to the database for real-time big data analysis. The user's drag-and-drop operations in the FineBI front end are translated in real time into query statements, so that the data in the enterprise database is analyzed in real time. When real-time requirements are high or the database's computing performance is excellent, this mode provides real-time query and calculation and takes full advantage of the database's computing power.
Real-time data in direct connection mode can be flexibly switched with extracted data in local mode and distributed mode: large volumes of historical data use extracted data, while data with stricter real-time requirements uses real-time data. Data from both modes can be displayed on the same dashboard page in the front end, which makes flexible analysis convenient for users.
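To make the translation step concrete, here is a minimal sketch of how a drag-and-drop selection (dimensions, measures, filters) might be turned into a query statement and sent straight to the database. It is not FineBI's actual generator, which the article does not describe; the function `build_query`, the `orders` table and its columns are all assumptions made for illustration, and SQLite stands in for the enterprise database.

```python
import sqlite3

ALLOWED_AGGREGATES = {"SUM", "AVG", "COUNT", "MIN", "MAX"}

def build_query(table, dimensions, measures, filters=None):
    """Build a parameterized aggregate query from a drag-and-drop selection.

    dimensions: column names to group by
    measures:   list of (aggregate_function, column) pairs
    filters:    optional dict of column -> required value
    Identifiers are interpolated as-is; a real tool would have to
    validate them against the table metadata first.
    """
    select_parts = list(dimensions)
    for agg, column in measures:
        if agg.upper() not in ALLOWED_AGGREGATES:
            raise ValueError(f"unsupported aggregate: {agg}")
        select_parts.append(f"{agg.upper()}({column}) AS {agg.lower()}_{column}")
    sql = f"SELECT {', '.join(select_parts)} FROM {table}"
    params = []
    if filters:
        conditions = []
        for column, value in filters.items():
            conditions.append(f"{column} = ?")
            params.append(value)  # values travel as bound parameters
        sql += " WHERE " + " AND ".join(conditions)
    if dimensions:
        sql += " GROUP BY " + ", ".join(dimensions)
        sql += " ORDER BY " + ", ".join(dimensions)
    return sql, params

# Hypothetical data: the user drags "region" as a dimension,
# "amount" as a summed measure, and filters on year 2024.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("East", 2024, 100.0), ("East", 2024, 50.0),
     ("West", 2024, 70.0), ("West", 2023, 30.0)],
)
sql, params = build_query("orders", ["region"], [("SUM", "amount")], {"year": 2024})
rows = conn.execute(sql, params).fetchall()
```

The aggregation itself runs inside the database, which is the point of direct connection mode: the BI layer only builds the statement and renders the rows that come back.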
Detailed explanation of the underlying technology
What are the underlying technical details? Take a closer look at the introduction below:
1. Columnar data storage, dictionary compression
Extracted data is stored by column: the values of one column are stored contiguously, which greatly reduces I/O during queries and improves query efficiency. Contiguously stored column data also offers larger compression units and higher data similarity, which greatly improves compression efficiency.
2. Intelligent bitmap index
A bitmap index is a common technique for speeding up filtering over big data, and it can also be used for concurrent computation over large volumes of data.
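The two ideas in this section can be sketched in a few lines. The example below pivots row records into per-column lists and then dictionary-encodes one column, replacing repeated values with small integer codes. It only illustrates the general technique; the function names and the sample table are assumptions, and the Spider engine's on-disk format is not described in the article.

```python
def to_columns(rows, column_names):
    """Pivot row-oriented records into one contiguous list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}

def dictionary_encode(values):
    """Replace each value with an integer code into a list of distinct values."""
    dictionary = []   # code -> value
    code_of = {}      # value -> code
    codes = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    """Recover the original column from its dictionary and codes."""
    return [dictionary[code] for code in codes]

# Hypothetical extracted table.
rows = [
    ("Beijing", "A", 10),
    ("Shanghai", "B", 20),
    ("Beijing", "A", 30),
    ("Beijing", "B", 40),
]
columns = to_columns(rows, ["city", "grade", "sales"])
city_dictionary, city_codes = dictionary_encode(columns["city"])
```

A query that only sums `sales` now touches a single list instead of every row, and the `city` column shrinks to two dictionary entries plus one small integer per row.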
Suppose you have the following table:
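The table from the original article is not reproduced in this text, so the sketch below uses a small made-up table instead; the column names and values are assumptions for illustration only. Each distinct value gets a bitmap with one bit per row, and a filter becomes a bitwise AND or OR over those bitmaps. Python integers are used as bitmaps here for simplicity; production engines typically use compressed bitmap structures.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def matching_rows(bitmap):
    """Return the row ids whose bits are set in the bitmap."""
    rows = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows

# Hypothetical table with two columns, five rows (row ids 0..4).
cities = ["Beijing", "Shanghai", "Beijing", "Nanjing", "Shanghai"]
genders = ["F", "M", "M", "F", "F"]

city_index = build_bitmap_index(cities)
gender_index = build_bitmap_index(genders)

# city = 'Beijing' AND gender = 'M': one bitwise AND, no row scan.
beijing_men = matching_rows(city_index["Beijing"] & gender_index["M"])

# city IN ('Beijing', 'Nanjing'): one bitwise OR.
beijing_or_nanjing = matching_rows(city_index["Beijing"] | city_index["Nanjing"])
```

Because each bitmap is independent, different filters or different row ranges can be evaluated in parallel and combined afterwards, which is one way such an index lends itself to concurrent computation.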
4. Asynchronous data import
During data extraction and import, compression starts while JDBC is still extracting data, and the compression work does not block extraction. The data is sliced before compression so that each compression unit stays small and uses few resources. Independent compression threads run alongside extraction, and this parallel processing shortens the data extraction time. Combined with the data storage optimizations, massive data imports avoid performance problems such as OOM (out of memory).
The following figure shows the import of a data table with 100 columns and 1 billion rows (including long string columns with more than 100 million distinct rows). Memory stays below 4 GB, a remarkable result (resource usage recorded with JProfiler).
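The extract-while-compressing pipeline described in this section can be sketched as a producer and a pool of compression threads joined by a bounded queue. This is an illustrative model, not FineBI's implementation: `extract_slices` is a hypothetical stand-in for reading a JDBC result set, and the slice size, worker count and queue bound are arbitrary assumptions.

```python
import queue
import threading
import zlib

def extract_slices(total_rows, slice_size):
    """Stand-in for a JDBC result set: yield the data one slice at a time."""
    for start in range(0, total_rows, slice_size):
        end = min(start + slice_size, total_rows)
        yield "\n".join(f"row-{i}" for i in range(start, end)).encode("utf-8")

def import_table(total_rows, slice_size=1000, workers=2, max_pending=4):
    """Extract slices on the calling thread while worker threads compress them."""
    pending = queue.Queue(maxsize=max_pending)  # bounded, so raw data cannot pile up
    compressed = {}
    lock = threading.Lock()

    def compress_worker():
        while True:
            item = pending.get()
            if item is None:          # sentinel: no more slices
                break
            slice_no, raw = item
            data = zlib.compress(raw)
            with lock:
                compressed[slice_no] = data

    threads = [threading.Thread(target=compress_worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for slice_no, raw in enumerate(extract_slices(total_rows, slice_size)):
        pending.put((slice_no, raw))  # waits only if the compressors fall behind
    for _ in threads:
        pending.put(None)             # one sentinel per worker
    for thread in threads:
        thread.join()
    return [compressed[i] for i in sorted(compressed)]

slices = import_table(2500, slice_size=1000)
```

The bounded queue is the design choice that keeps memory flat: at most `max_pending` uncompressed slices exist at any moment, regardless of how large the source table is.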
5. Distributed file storage system
Both important modes of the Spider engine (local mode and distributed mode) work on extracted data, so the storage medium matters a great deal. For small data volumes, to stay lightweight and easy to use, the local disk serves directly as the storage medium: the data sits alongside the application and no time is spent on network transfer.
For large data volumes, an inexpensive storage mode is needed that can hold unstructured data and support distributed computing. The first candidate that comes to mind is HDFS, the distributed file system in Hadoop. Its stability and fault-tolerance mechanisms are relatively mature, and since Hadoop 2.x it supports HA (high availability), so the stored data remains available all year round. Its ecosystem in the big data field is also strong, so we adopt it as the long-term redundant backup storage system.
However, HDFS storage is still disk-based, and its I/O performance struggles to meet the latency required by streaming computation. Frequent data exchange over the network slows computation further. We therefore introduce Alluxio as the core of the distributed storage system. Alluxio's memory-centric storage makes data access for upper-layer applications several orders of magnitude faster than existing conventional schemes. Its tiered storage makes combined use of memory, SSD and disk. Caching strategies provided by Alluxio, such as LRU and LFU, keep hot data in memory while cold data is persisted to second-tier or even third-tier storage devices, with HDFS at the bottom as the long-term persistent file store.
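The tiering behaviour described above can be modelled with a toy example. The class below is a pure-Python illustration and does not use or imitate the Alluxio API: `TieredStore`, its tier sizes and the dictionary standing in for HDFS are all assumptions. Each tier is a bounded LRU map; a read promotes the block to the fastest tier, and the least recently used block is demoted one tier down when a tier overflows.

```python
from collections import OrderedDict

class TieredStore:
    """Toy model of tiered storage: bounded LRU tiers over a backing store."""

    def __init__(self, tier_capacities, backing_store):
        # tiers[0] is the fastest (memory); later tiers are slower (SSD, disk).
        self.tiers = [OrderedDict() for _ in tier_capacities]
        self.capacities = list(tier_capacities)
        self.backing_store = backing_store  # stands in for HDFS

    def _put(self, level, key, value):
        if level >= len(self.tiers):
            return  # dropped from the last tier; still held by the backing store
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)  # mark as most recently used
        if len(tier) > self.capacities[level]:
            old_key, old_value = tier.popitem(last=False)  # least recently used
            self._put(level + 1, old_key, old_value)       # demote one tier down

    def read(self, key):
        """Return (value, source) and promote the block to the fastest tier."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self._put(0, key, value)
                return value, f"tier{level}"
        value = self.backing_store[key]  # slowest path: go to the backing store
        self._put(0, key, value)
        return value, "backing"

# Hypothetical blocks; two tiers with room for two blocks each.
store = TieredStore([2, 2], {"a": 1, "b": 2, "c": 3})
first = store.read("a")    # cold read from the backing store
store.read("b")
store.read("c")            # tier 0 overflows, "a" is demoted to tier 1
second = store.read("a")   # served from tier 1 and promoted again
```

Repeated reads of the same block stay in the top tier, while blocks that stop being read drift downward, which is the hot/cold separation the paragraph describes.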
6. Data localization computing
Distributed computing combines the computing power of multiple machines, which inevitably means transferring data between machine nodes. To reduce network transfer and avoid unnecessary shuffles, Spark's scheduling mechanism is used to run computation where the data lives. Knowing the location of the data each task needs, the scheduler assigns the task to a node that holds that data, which saves the cost of data transfer, so that even large-scale computations can return in seconds.
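As a rough illustration of locality-aware assignment, the sketch below places each task on the least-loaded node that already stores the task's data block and falls back to any node only when no replica holder is available. It mimics the idea, not Spark's actual scheduler; the function `assign_tasks`, the block ids and the node names are hypothetical.

```python
def assign_tasks(block_locations, nodes):
    """Assign each block's task to the least-loaded node holding a replica.

    block_locations: dict of block id -> list of nodes storing that block
    nodes:           nodes available for computation
    Returns a dict of block id -> (chosen node, ran_locally).
    """
    load = {node: 0 for node in nodes}
    assignment = {}
    for block, replicas in block_locations.items():
        local_candidates = [node for node in replicas if node in load]
        # Prefer nodes that already hold the data; otherwise use any node,
        # which implies a network transfer for this block.
        candidates = local_candidates or list(nodes)
        chosen = min(candidates, key=lambda node: load[node])
        load[chosen] += 1
        assignment[block] = (chosen, bool(local_candidates))
    return assignment

# Hypothetical cluster: block b4 lives only on a node that is not available.
block_locations = {
    "b1": ["n1", "n2"],
    "b2": ["n1"],
    "b3": ["n2", "n3"],
    "b4": ["n9"],
}
plan = assign_tasks(block_locations, ["n1", "n2", "n3"])
```

In this plan only `b4` has to move over the network; every other task reads its block from local storage.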
7. Intelligent caching
Intelligent caching mainly targets direct connection mode (Direct Mode), so that the system can also support concurrent queries effectively. Because queries go directly to the database, performance is inevitably limited by the database. At the same time, users often run queries over the same data during analysis, so the Ehcache framework is introduced for intelligent caching, with multi-level caches and intelligent hit strategies applied to the operations on returned data to avoid repeated caching, which greatly improves query performance. The following is a comparison between the first query and the second query.
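The effect of such a cache can be shown with a small single-level sketch. It does not use Ehcache (a Java library) and is far simpler than the multi-level strategy mentioned above; the class `QueryCache`, its whitespace-only key normalization and the sample table are assumptions for illustration, with SQLite standing in for the source database.

```python
import sqlite3
from collections import OrderedDict

class QueryCache:
    """Small LRU cache of query results keyed by SQL text and parameters."""

    def __init__(self, conn, max_entries=128):
        self.conn = conn
        self.max_entries = max_entries
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def query(self, sql, params=()):
        # Naive normalization: collapse whitespace so trivially different
        # spellings of the same statement share one cache entry.
        key = (" ".join(sql.split()), tuple(params))
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)     # keep recently used results
            return self.entries[key]
        self.misses += 1
        rows = self.conn.execute(sql, params).fetchall()  # goes to the database
        self.entries[key] = rows
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict the least recently used
        return rows

    def invalidate(self):
        """Drop all cached results, for example after the source data changes."""
        self.entries.clear()

# Hypothetical table; the second query is answered without touching the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?)", [(1,), (2,), (3,)])
cache = QueryCache(conn)
first_result = cache.query("SELECT SUM(amount) FROM sales")
second_result = cache.query("SELECT  SUM(amount)   FROM sales")
```

A real deployment also needs an invalidation policy tied to data changes, since a direct connection is expected to reflect the current state of the database.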