Shulou (Shulou.com) 06/02 Report — SLTechnology News & Howtos, updated 2025-01-16
This article looks at how Milvus performs in streaming data scenarios. We hope you find something useful in it.
Milvus, an open source feature-vector similarity search engine, has been adopted by hundreds of enterprises and organizations worldwide in the six months since it was open-sourced. Its users span many fields, including finance, the Internet, e-commerce, and biopharmaceuticals. In many of their production scenarios, most of the data is generated continuously and dynamically, and it must become searchable shortly after it is written to the database.
Big data processing falls into two categories: batch big data (also called "historical big data") and streaming big data (also called "real-time big data"). Streaming processing has a clear advantage when dealing with continuously generated new data. Streaming data is data produced continuously by many sources, usually sent as small records on the order of kilobytes each. It can take many forms: online shopping events, social network updates, geospatial services, or measurements from remote sensors.
| Milvus Application |
Driven by user needs, Milvus keeps adding features and exploring new application scenarios. Its dynamic data management strategy lets users insert, delete, search, and update data at any time, without being constrained by a static dataset. Inserted or updated data becomes retrievable almost immediately, and Milvus guarantees accurate search results and data consistency throughout. At the same time, Milvus maintains excellent retrieval performance while data import is ongoing. These properties make Milvus a good fit for streaming big data scenarios.
Many user scenarios combine batch and streaming big data in a hybrid architecture that supports both real-time and batch processing. Recommendation systems are a typical example. Whether a platform recommends articles, music, and videos or e-commerce products, it holds a large amount of historical data. Part of that historical data is still worth recommending, so it is re-processed, filtered, and stored in Milvus. On top of the historical data, the system produces new data every day (new articles, trending topics, new products) that must be imported promptly and become searchable quickly. This continuously generated data is the streaming data.
As more and more users need dynamic data insertion with real-time retrieval, this article introduces the parameter configuration and retrieval performance of Milvus in streaming data scenarios built on Kafka.
| Scenario Simulation |
Kafka is an open source stream processing platform. Below are two examples of using Milvus with Kafka on streaming data.
Example one
In this system, Kafka receives the data generated by each client, simulating a stream. Whenever data is available in the Kafka message queue, the consumer reads it and immediately inserts it into Milvus. The size of each insert can be large or small: ten vectors at a time, or hundreds of thousands. This setup suits scenarios with strict real-time requirements. The whole process is shown in the figure:
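The read-and-insert loop above can be sketched as follows. This is a minimal sketch, not the original implementation: the topic name "vectors", the collection name "stream_demo", and the message format (a JSON array of 128 floats) are assumptions, and it assumes the kafka-python package and the Milvus 1.x Python client.

```python
import json
from typing import List

DIM = 128  # vector dimensionality used throughout this article's examples


def decode_message(raw: bytes) -> List[float]:
    """Decode one Kafka message (assumed JSON array) into a 128-D vector."""
    vec = json.loads(raw.decode("utf-8"))
    if len(vec) != DIM:
        raise ValueError(f"expected {DIM} dims, got {len(vec)}")
    return [float(x) for x in vec]


def run_pipeline() -> None:
    """Wire Kafka to Milvus: read each message and insert it immediately.

    Requires a running Kafka broker and Milvus 1.x server; host/port
    values below are placeholders.
    """
    from kafka import KafkaConsumer   # kafka-python
    from milvus import Milvus         # Milvus 1.x client

    consumer = KafkaConsumer("vectors", bootstrap_servers="localhost:9092")
    client = Milvus(host="localhost", port="19530")
    for msg in consumer:
        vec = decode_message(msg.value)
        # Insert as soon as data arrives, so it is searchable almost
        # immediately (the article reports a second or two).
        client.insert(collection_name="stream_demo", records=[vec])


if __name__ == "__main__":
    run_pipeline()
```

Inserting one record per message maximizes freshness at the cost of many small inserts; example two below trades freshness for throughput by batching.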
Configuration:
index_file_size: in Milvus, data is stored in separate files, and the size of each data file is set by the index_file_size parameter when the collection is created. Data written to disk first forms a raw data file that holds the original vectors. Whenever a raw data file grows to index_file_size, index building is triggered for it, and an index file is generated once building completes.
During a search, Milvus searches the index files; data that has not yet been indexed is scanned in the raw data files. Because scanning unindexed data is slow, index_file_size should not be set too large: an overly large value leaves larger unindexed raw files and reduces retrieval performance. In this example it is set to 512 (MB).
nlist: this value is the number of "clusters" into which the vectors in each data file are partitioned when the index is built. In this example it is set to 1024.
While data is being inserted continuously, Milvus keeps building indexes in the background. To preserve retrieval efficiency, we assign index building to GPU resources and retrieval to CPU resources.
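In Milvus 1.x this division of labor is set in the server's server_config.yaml. The fragment below is a sketch under the assumption of the 1.x gpu section; key names and defaults vary across versions, so verify against the configuration reference for the release you run.

```yaml
# Sketch of the gpu section of server_config.yaml (Milvus 1.x layout,
# assumed; check key names against your version's documentation).
gpu:
  enable: true                  # allow GPU resources to be used
  cache_size: 1GB
  gpu_search_threshold: 10000   # searches with nq below this run on CPU
  search_devices:
    - gpu0
  build_index_devices:
    - gpu0                      # build indexes on the GPU
```

With a high gpu_search_threshold, typical queries stay on the CPU while index building still uses the GPU, matching the split described above.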
Performance:
In this example, before the continuous import begins, 100 million 128-dimensional vectors are inserted into the collection and an IVF_SQ8 index is built on them to simulate historical data. After that, batches of 250-350 vectors are inserted into the collection at random intervals of 1-8 seconds. A number of searches are then performed; retrieval performance is as follows:
In the performance records above, the first retrieval time is the time of a search run right after each new data import; the second retrieval time is the time of a search run after the first one, with no new data imported in between.
Comparing horizontally, the first retrieval takes longer than the second because the newly imported data has to be loaded from disk into memory during the first retrieval.
Viewed vertically, the first retrieval time keeps growing as data import continues. This is because newly arrived data files are merged with previously unindexed data files, and the merged file must be loaded from disk into memory at retrieval time; as more data is imported, the merged file keeps growing, so loading it takes longer. In addition, the imported data is not yet indexed, and as the amount of unindexed data grows, searching through it takes gradually longer. The second retrieval time also grows, but more slowly than the first: the second retrieval does not load data from disk, and its extra cost comes only from the growing amount of unindexed data. Index building is triggered when roughly 1 million vectors have been imported (with 512 MB data files and 128-dimensional vectors, each data file holds about 1 million vectors). Once the index is built, searches run against the index file, and the second retrieval time returns to the level seen before the dynamic import began.
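The roughly-1-million threshold follows directly from the configuration: each data file triggers indexing at 512 MB, and a 128-D single-precision vector occupies 128 × 4 bytes. A quick check of the arithmetic:

```python
# index_file_size = 512 MB per raw data file.
FILE_SIZE_BYTES = 512 * 1024 * 1024

# One 128-D float32 vector: 128 components * 4 bytes each.
VECTOR_BYTES = 128 * 4

vectors_per_file = FILE_SIZE_BYTES // VECTOR_BYTES
print(vectors_per_file)  # 1048576, i.e. about 1 million vectors
```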
During the continuous import in this example (about 1 million vectors cumulatively), a query is sampled every 5 seconds and its time recorded. The query performance trend over the whole run is shown in the figure below; the ordinate is query time, and the abscissa is the elapsed time of the query process in seconds.
In this line chart, most of the points (the upper ones) correspond to the first retrieval time in the table above; as the chart shows, the first retrieval time after importing data trends clearly upward. A few points (the lower ones) correspond to the second retrieval time, which trends up only slightly. Because data is imported frequently in this example, most searches happen right after a new import. The figure also shows that once the total amount of imported data reaches the indexing threshold, post-index query time returns to the level before the dynamic import began.
Testing also showed that newly inserted data becomes retrievable within a second or two.
Example two
In this system, Kafka again receives the data generated by each client to simulate a stream. As data arrives in the Kafka queue it is read out, and once a certain amount has accumulated (100,000 records in this example) it is inserted into Milvus in one batch. This reduces the number of insert operations and improves overall retrieval performance, and it suits scenarios with looser real-time requirements. The process flow is shown in the figure:
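The accumulate-then-flush logic can be sketched with a small buffer class. This is a sketch, not the original code: the flush callback stands in for a Milvus batch insert (e.g. wrapping client.insert on a hypothetical "stream_demo" collection), and the Kafka wiring is the same as in example one.

```python
from typing import Callable, List

BATCH_SIZE = 100_000  # flush threshold used in this example


class BatchInserter:
    """Buffer incoming vectors and flush them to Milvus in large batches."""

    def __init__(self, flush: Callable[[List[List[float]]], None],
                 batch_size: int = BATCH_SIZE):
        # flush: called with the accumulated records, e.g.
        #   lambda recs: client.insert(collection_name="stream_demo", records=recs)
        self._flush = flush
        self._batch_size = batch_size
        self._buffer: List[List[float]] = []

    def add(self, vector: List[float]) -> None:
        """Buffer one vector; flush automatically at the threshold."""
        self._buffer.append(vector)
        if len(self._buffer) >= self._batch_size:
            self.drain()

    def drain(self) -> None:
        """Insert everything buffered so far as a single batch."""
        if self._buffer:
            self._flush(self._buffer)
            self._buffer = []
```

On shutdown, call drain() once more so a partially filled final batch is not lost.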
Configuration: the configuration of this example is the same as that of example 1.
Performance: before new data is imported, a query takes about 0.027 seconds. During subsequent imports, 100,000 records are inserted per batch. The first and second retrieval times after each import are roughly the same as those shown in the table for example one. Since there are no frequent import operations, most retrieval times correspond to the second retrieval time in that table.
During the continuous batch import in this example (again about 1 million vectors cumulatively), a query is sampled every 5 seconds and its time recorded. The trend over the whole run is shown in the figure below; the ordinate is query time, and the abscissa is the elapsed time of the query process in seconds.
As the line chart shows, because inserts are less frequent, most searches correspond to the second retrieval time in the table from example one; retrieval is relatively slow only right after each batch of 100,000 records is imported. As before, once the index is built, query time returns to its pre-import level.
The performance charts of the two examples suggest that when searches are frequent and the real-time requirements on new data are loose, accumulating data and inserting it in batches is the better choice.
| Conclusion | As its user base grows, more and more requirements are put to Milvus. Driven by those needs, Milvus will continue to improve and enrich its functionality, and it is committed to delivering more value to users in unstructured data processing. We also hope that more like-minded partners will join the Milvus open source community to take part in, and witness, its growth.