
What are the problems of short text clustering?


This article introduces what the problems of short text clustering are. In practice many people run into exactly these difficulties, so let the editor walk you through how to handle them. I hope you read carefully and come away with something!

I. Introduction: background and problems

Search is a scenario in which users express their needs explicitly. Baidu Search handles a huge number of requests every day, and the phrasing of queries varies widely. The core purpose of a large-scale short text clustering system is to summarize the short texts represented by queries: accurately and efficiently group short texts with the same meaning while keeping differently-meaning ones apart, condensing them into "requirements" with clear, cohesive meaning. This not only compresses the short texts, but also helps us better understand user needs and helps content providers supply better content. It already supports content production for Baidu's UGC products.

In this section, we first introduce the short text clustering problem, then survey common clustering algorithms, and finally analyze the difficulties and challenges of clustering in the search scenario.

1.1 The short text clustering problem

Clustering is a commonly used unsupervised algorithm. It divides a data set into classes or clusters according to some distance measure, so that data objects within the same cluster are as similar as possible, while data objects in different clusters are as dissimilar as possible.

Generally speaking, search queries are mostly short text. Given a massive set of short texts, aggregating all texts with the same meaning, as completely as possible, into "requirement clusters" is the text clustering problem addressed here.

For example:

For instance, query = "what if the screen is broken?" and query = "ask for help, the screen is broken" express the same need and can be aggregated into one cluster; likewise, differently worded queries asking for the meaning and the origin of the same phrase express one need and can be aggregated into another cluster.

It can be seen that short text clustering in the search query scenario differs from conventional clustering in two obvious ways: the connotation of a "cluster" is tightly focused, meaning that after aggregation each cluster is small in magnitude; and the data within the same cluster must be very close in meaning.

1.2 Common algorithms and frameworks

SimHash

Among classical clustering methods, simhash is a common locality-sensitive hashing algorithm, often used for deduplication in search engines. It maps similar documents to the same hash value with high probability, thereby achieving clustering (deduplication).

The simhash algorithm generally segments a document into words, assigns each word a weight, computes a hash value for each word, takes the weighted sum of all word hashes bit by bit, and then reduces the result by sign (a positive bit-sum becomes 1, a non-positive one becomes 0) to produce the document's final hash value. Because word importance enters as a weight, important words influence the final document hash more and unimportant words influence it less, so similar documents are more likely to produce the same hash value.
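To make the procedure concrete, here is a minimal simhash sketch in Python. It illustrates the general algorithm just described, not Baidu's implementation; the md5-based token hash, the 64-bit width, and the example weights are all illustrative assumptions.

```python
import hashlib

def simhash(weighted_tokens, bits=64):
    # weighted_tokens: iterable of (token, weight) pairs, e.g. tf-idf weights.
    v = [0.0] * bits
    for token, weight in weighted_tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            # Add the weight where the token hash has a 1 bit, subtract elsewhere.
            v[i] += weight if (h >> i) & 1 else -weight
    # Sign-based reduction: a positive bit-sum becomes a 1 in the fingerprint.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

# Similar documents should land at a small Hamming distance.
doc1 = [("screen", 3.0), ("broken", 2.5), ("phone", 1.0)]
doc2 = [("screen", 3.0), ("broken", 2.5), ("help", 0.5)]
print(hamming(simhash(doc1), simhash(doc2)))
```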

For document-level clustering (deduplication), simhash is an effective algorithm; but for short texts, which carry far less information, its effectiveness drops, and the accuracy of its output cannot be guaranteed.

Vectorized clustering

Another common approach to short text clustering is to vectorize the text first and then cluster with a conventional clustering method. Text vectors can take many forms, from early tf-idf, to the fast word2vec, to vectors produced by pre-trained models such as BERT and ERNIE; any of these can serve as the vector representation of the text. Generally, these vectorization methods build on the distributional hypothesis, and the resulting text vectors can, to some extent, handle synonyms.

There are likewise many clustering algorithms to choose from, such as kmeans, hierarchical clustering, single-pass, and so on. Note in particular that when designing the distance measure for the clustering algorithm, the vectorization method must be taken into account: different vectorizations call for different distance measures.
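As one concrete instance of this vectorize-then-cluster pattern, here is a minimal single-pass clustering sketch using cosine similarity; the 0.8 threshold and the running-centroid update are illustrative assumptions, not the configuration described in this article.

```python
import numpy as np

def single_pass(vectors, threshold=0.8):
    # Assign each vector to the most similar existing centroid if the cosine
    # similarity clears the threshold; otherwise open a new cluster.
    centroids, members = [], []
    for idx, v in enumerate(vectors):
        v = v / (np.linalg.norm(v) + 1e-12)
        best, best_sim = -1, threshold
        for c, cen in enumerate(centroids):
            sim = float(v @ cen) / (np.linalg.norm(cen) + 1e-12)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best == -1:
            centroids.append(v.copy())
            members.append([idx])
        else:
            members[best].append(idx)
            n = len(members[best])
            centroids[best] = centroids[best] * (n - 1) / n + v / n  # running mean
    return members

vecs = np.random.RandomState(0).randn(10, 16)
print(single_pass(vecs, threshold=0.5))
```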

There are three problems with this kind of algorithm:

1. Clustering algorithms typified by kmeans have a hyperparameter, the number of clusters, which can only be set from experience and auxiliary indicators.

2. For short text clustering, the number of clusters is often very large, which sharply degrades the performance of such algorithms.

3. The accuracy of the results is affected by both the vectorization and the clustering algorithm, making it hard to capture fine-grained differences in the data.

Other algorithms

Although short text clustering has many application scenarios in industry, it remains a fairly active topic in research, mainly pursued through schemes that improve the text vector representation: for example, building the sentence vector as a weighted average of word vectors with the first principal component removed (SIF), or improving the representation through clustering-guided vector iteration, such as SIF+Aut. Whichever improvement is chosen, however, it increases the computational overhead, which is impractical in industry.

1.3 Difficulties of large-scale short text clustering

The problem we face is large-scale short text clustering, which carries three implicit constraints:

1. The accuracy requirements on clustering results are strict.

2. The scale of the short text data is massive.

3. The timeliness requirements on data output are high.

It is not difficult to see that, in terms of efficiency and accuracy, conventional algorithms struggle to satisfy all three requirements at once; industry has no mature framework for this problem; and academic algorithms are still far from industrial application. We need a new way to solve it.

When designing an algorithm, the following issues need to be considered:

1. Huge data volume, high system throughput: the magnitude of search queries speaks for itself. Computing efficiently over data at the hundred-million scale tests both algorithm design and engineering architecture.

2. High accuracy demands on the clustering algorithm: clustering is an unsupervised algorithm that judges results by a distance measure in vector space, so in essence it has no notion of accuracy. In the search query scenario, however, there is a clear measure: with a unified text similarity evaluation standard, we can check a posteriori whether results in the same cluster are similar and whether data in different clusters are dissimilar, and thereby measure clustering accuracy. This puts a "tightening spell" on the short text clustering algorithm: not every clustering algorithm is suitable, and we must consider algorithms that are tightly coupled with text similarity.

"closer combination" is considered from the definition of distance in the vector space. For example, the similarity measure given by the similarity model is not necessarily the "distance" in the vector space. Specifically, a well-defined "distance" in a vector space needs to satisfy a triangular inequality: distance (AMagi B) + distance (BMague C) > = distance (AMague C). However, for the degree of similarity, similarity (AMague B) + similarity (BMague C) and similarity (AMague C) do not necessarily have a stable quantitative relationship. Therefore, the clustering model can not be used directly, and the similarity can only be used as the "off-site guidance" to realize the clustering algorithm under the guidance of similarity. As a result, general clustering algorithms can not be directly used in short text clustering.

3. Text similarity must be precise and fast: text similarity is a scenario-dependent problem; the same query pair may be judged differently in different scenarios. In search, the bar for similarity is very high: a single word's difference can mean a completely different need, so the model must accurately capture fine-grained textual differences. At the same time, we want to aggregate queries with the same need into one cluster as completely as possible, to reduce missed recalls. In other words, short text clustering demands both high precision and high recall from text similarity. In addition, to sustain large-scale calls, the text similarity service must respond quickly and scale out easily.

4. Text representation is complex: text representation means converting text into semantic vectors in some way for subsequent clustering. Text vectors can be produced in many ways: in simhash, weighted hash functions express the text's information; word2vec word vectors can also be used to build text vectors. Beyond that, the short text's categories, keywords, and other information are important parts of the representation. When vectorizing, the key question is how similarity is reflected in the vector space. Text representation is an important, fundamental building block: the choice of representation determines the similarity measure, which directly affects the choice of clustering model and indirectly affects the final result.
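As an example of the word-vector option, a sentence vector can be built as an idf-weighted average of word vectors. The sketch below assumes pre-built word-vector and idf lookups (here filled with toy values):

```python
import numpy as np

def sentence_vector(tokens, word_vectors, idf, dim=100):
    # Weighted average of word vectors; idf serves as the per-token weight.
    acc, total = np.zeros(dim), 0.0
    for tok in tokens:
        if tok in word_vectors:
            w = idf.get(tok, 1.0)
            acc += w * word_vectors[tok]
            total += w
    return acc / total if total > 0 else acc

rng = np.random.RandomState(0)
wv = {t: rng.randn(100) for t in ["screen", "is", "broken"]}
idf = {"screen": 2.0, "is": 0.1, "broken": 2.5}
print(sentence_vector(["screen", "is", "broken"], wv, idf).shape)  # (100,)
```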

5. Error discovery and repair: from text representation to text similarity to the clustering algorithm, errors accumulate at every step and affect the final result. For large-scale text clustering this problem is especially pronounced, and easy to overlook.

1.4 The general approach to large-scale short text clustering

Given the difficulties above, we adopt a general approach of multi-level splitting and divide-and-conquer.

Figure 1: the general approach to large-scale short text clustering

1. First, split the massive short text data into multiple levels of buckets, basically ensuring that queries with the same meaning fall into the same bucket with high probability:

1.1 First-level split: the split must be made at the semantic level, with meanings mutually exclusive across buckets; that is, queries with the same meaning must enter the same bucket.

1.2 "level split: after the" level "split, the order of query in each barrel is still very low, and it needs to be further split to ensure that the order of magnitude of the subsequent calculation can be controlled.

2. Perform fine-grained semantic aggregation on the queries within the same bucket, merging queries with the same meaning into clusters.

3. For the semantic clusters within the same first-level bucket, run an error check and merge the clusters that should be merged.

Description:

1. Why split? Suppose the number of queries is N. In the worst case, completing the clustering requires on the order of N² similarity computations. But if we split the data first and guarantee, with reasonably high probability, that the splits are mutually exclusive in meaning, then only within-bucket similarity computations are needed, an order of magnitude far below N². A back-of-the-envelope check follows.
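In this sketch, N and the bucket size m are illustrative assumptions, not the article's actual figures:

```python
N = 10**8   # total queries (illustrative)
m = 10**4   # average bucket size after splitting (illustrative)

pairwise = N * (N - 1) // 2                 # brute-force similarity calls
bucketed = (N // m) * (m * (m - 1) // 2)    # within-bucket calls only

print(f"{pairwise:.1e} vs {bucketed:.1e}")  # ~5.0e15 vs ~5.0e11
```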

2. Why multi-level splitting? If the original data were split directly into very small buckets, the number of similarity calls would drop; but the finer the split, the harder it is to guarantee semantic mutual exclusion, which makes the number of errors needing checks surge. With multi-level splitting, we can guarantee the accuracy of the top-level split while keeping the amount of data that needs error checking small.

3. How is the semantic aggregation done? After the data is split, the data within each bucket must be clustered; this is the core of the whole scheme. There are three ways to approach it:

3.1 Retrieval-based clustering on raw text: although the amount of data landing in the same bucket is greatly reduced, computing pairwise similarity over it is still expensive. We found that conventional keyword retrieval reliably recalls similar queries, so relevance only needs to be computed between each query and the queries it recalls.

3.2 Clustering based on text representation: keyword retrieval covers many similar queries, but its recall is incomplete. To handle the misses caused by synonyms and sentence restructuring, we add recall based on text representation: the vectorized queries are clustered at fine granularity, and similarity is computed only between queries in the same class.

3.3 Retrieval-based clustering on text representation: since a fine-grained clustering algorithm is hard to control, vector retrieval can be introduced, using text-vector recall to fix the incomplete-recall problem.
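A minimal sketch of the vector-recall step in 3.3, assuming the faiss library and stand-in random vectors (the article does not name a specific retrieval library); each recalled candidate pair would then be verified by the precise similarity model:

```python
import numpy as np
import faiss  # assumed available

rng = np.random.RandomState(0)
vecs = rng.randn(1000, 64).astype("float32")
faiss.normalize_L2(vecs)            # normalized, so inner product = cosine

index = faiss.IndexFlatIP(64)       # exact inner-product index
index.add(vecs)

# Recall each query's top-k neighbors as merge candidates.
sims, neighbors = index.search(vecs, 10)
print(neighbors[0])                 # candidate indices for the first query
```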

Effectiveness analysis

1. It solves the problems of huge data volume and tight time budget: the pipeline splits large-scale data into smaller blocks assigned to different computing units, cutting wall-clock time. We estimated that on hundred-million-scale data, the traditional pairwise comparison method would take 58 years, while the hierarchical method takes only 4 days. Hierarchical vectorized clustering alone (without similarity verification) can cut this to 2 days, but with low precision and recall.

2. It optimizes similarity and its serving performance: we customized the model for short text similarity and deployed multiple instances, improving the similarity service's throughput 100-fold.

3. The clustering is highly accurate: by introducing the error-correction mechanism, the overall accuracy of the short text clustering algorithm rose from 93% to 95%, and recall rose from 71% to 80%.

Based on this analysis, we traded off computational performance against effect, and chose hierarchical splitting with vectorized clustering plus an optimized similarity model as the final scheme.

II. Evolution of and reflections on the short text clustering algorithm

The above is the scheme for large-scale short text clustering that we settled on after a period of exploration. Along the way, we went through two version iterations.

2.1 v1.0: settling on the splitting approach

At the start, we faced two candidate technical routes: one was to explore a representation suited to short text and turn the task into a special vector clustering problem, as simhash does; the other was to split the data set, shrink the data magnitude, speed up similarity computation, and solve the clustering problem that way. On analysis, we concluded that route one offers no intermediate metric to guide the optimization direction when choosing a text representation, making iteration difficult. We therefore committed to splitting the data set.

First, based on query classification and other business requirements, we split the queries at the first level, ensuring the split results are semantically mutually exclusive.

Second, we perform the second-level split. To keep each bucket's data magnitude moderate and make within-bucket fine-grained semantic aggregation convenient, we adopt a hierarchical bucketing method:

1. Compute each query's high-level semantic feature and binarize it, producing a 256-dimensional vector of 0s and 1s.

2. First bucket by hashing on the first 50 dimensions as a rough clustering; if a bucket's count exceeds a set limit, extend the prefix to 100 dimensions and re-bucket, repeating until the dimension reaches 256 or the bucket's count falls below the limit (see the sketch after this list).

Third, we perform fine-grained semantic aggregation on the queries in each bucket, merging queries with similar meanings into clusters.
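A minimal sketch of the hierarchical bucketing in step 2, assuming the 256-dimensional binary codes have already been produced; the bucket-size limit and the prefix widths are illustrative assumptions:

```python
import numpy as np

def hierarchical_buckets(codes, max_bucket=1000, widths=(50, 100, 150, 256)):
    # Bucket binary codes by a growing prefix: re-split any oversized bucket
    # with the next, longer prefix until it is small enough or 256 bits are used.
    def split(indices, level):
        buckets = {}
        for i in indices:
            buckets.setdefault(codes[i][: widths[level]].tobytes(), []).append(i)
        out = []
        for members in buckets.values():
            if len(members) > max_bucket and level + 1 < len(widths):
                out.extend(split(members, level + 1))
            else:
                out.append(members)
        return out
    return split(list(range(len(codes))), 0)

codes = (np.random.RandomState(0).rand(5000, 256) > 0.5).astype(np.uint8)
print(len(hierarchical_buckets(codes, max_bucket=200)))
```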

Analysis of advantages and disadvantages of v1.0

Thanks to the data-splitting idea, once the data is divided into buckets the semantics between buckets are mutually exclusive, so fine-grained semantic clustering only needs to run within each bucket. This makes parallel computation straightforward and contributes greatly to cutting the total time.

In the process, we also found several areas needing improvement:

1. The accuracy of the clustering results depends strongly on the similarity model, and specifically on a fine-grained one: a coarse-grained similarity model causes false recalls, so we need a model that can distinguish fine-grained degrees of similarity.

2. Hierarchical bucketing for the second-level split does not guarantee that semantically similar data enter the same bucket. The query vector representation it relies on implicitly assumes that semantics are progressively refined as the dimension grows, but this constraint is not imposed when the query vectors are produced, so the assumption's validity cannot be guaranteed. In v2.0 we changed the second-level split to overcome this defect.

3. Lack of an error detection and correction mechanism: however accurate the similarity model and the short text classification, errors will occur; we need a mechanism to find and correct them.

2.2 v2.0: introducing fine-grained similarity

In response to the problems found in v1.0, we made three changes:

Introducing fine-grained similarity

A typical use of the clustering described here is merging search queries at the semantic level: queries with the same need but different phrasing are merged into one "requirement", so the bar for judging queries similar is strict, and existing similarity models no longer apply. After a detailed analysis of queries, we found that features such as syntactic parsing and phrase collocation greatly improve the similarity model's precision and recall. We therefore built a similarity model that integrates syntactic analysis, phrase collocation, and other features, with good gains.

Introducing a clustering model into the second-level split

In v1.0, the accuracy of hierarchical bucketing could not be guaranteed, so a more accurate second-level split was needed. The second-level split only has to ensure that similar short texts land in the same bucket; it does not require any two short texts in a bucket to be similar. This setting suits a vectorize-then-cluster method well. Considering the runtime cost of pre-trained models, we produce short text vectors with the traditional weighted word-vector approach, then cluster the short texts within each first-level bucket using kmeans, which secures the accuracy of the second-level split.

One might ask: why not use vector recall here? In essence, vector recall is vector clustering supplemented by efficient vector lookup. The second-level split has no need for vector lookup, and introducing it would add extra time overhead, so we do not use vector recall.

Error correction

In v1.0, errors accumulated layer by layer and were never corrected. To overcome this, we introduce an error-correction step at the final output stage.

Errors take two main forms. One is incomplete clustering: data that should have been aggregated together was not; this is mainly caused by the multi-level splitting, and can be addressed by cross-level checks. The other is inaccurate clustering, mainly caused by similarity computation. We focus on solving incomplete clustering.

To reduce the magnitude of data to check, we limit error checking to the interior of each second-level bucket. After the second-level split, we first obtain clusters through fine-grained semantic aggregation, then verify the relevance between cluster centers within the same second-level bucket; where relevance is high, we verify further and merge, as sketched below.
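A minimal sketch of this check-and-merge step; the cosine stand-in for the fine-grained similarity service and the 0.85 threshold are illustrative assumptions:

```python
import numpy as np

def check_and_merge(centroids, clusters, similarity_fn, threshold=0.85):
    # Compare cluster centers within one second-level bucket; merge clusters
    # whose centers the similarity model scores above the threshold.
    parent = list(range(len(clusters)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if find(i) != find(j) and similarity_fn(centroids[i], centroids[j]) >= threshold:
                parent[find(j)] = find(i)
    merged = {}
    for idx, members in enumerate(clusters):
        merged.setdefault(find(idx), []).extend(members)
    return list(merged.values())

cos = lambda a, b: float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
cents = [np.array([1.0, 0.0]), np.array([0.99, 0.1]), np.array([0.0, 1.0])]
print(check_and_merge(cents, [[0], [1], [2]], cos))  # [[0, 1], [2]]
```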

V2.0 effect

After v2.0 went live, it could complete fine-grained clustering of large-scale short texts in a very short time, with 95% clustering accuracy and 80% recall. It now serves content production for Baidu's UGC business lines.

Continuous optimization

v2.0 delivers the basic capability of fine-grained clustering of large-scale short texts, but plenty of room for improvement remains: persisting clustering results, avoiding repeated computation, more efficient aggregation algorithms, and so on. We will keep optimizing to provide more efficient and stable algorithmic support.

This concludes "what are the problems of short text clustering". Thank you for reading. If you want to learn more about the industry, you can follow the site, where the editor will keep publishing practical articles for you!
