2025-01-19 Update From: SLTechnology News&Howtos shulou
Shulou(Shulou.com)06/02 Report--
This article explains how to use bitmaps to implement tag-based user selection (the "tag circle" function) in a user profile system. Many people run into this problem in real projects, so let's walk through how to handle it. I hope you read it carefully and take something away from it.
What types of tags are there?
Enumerated tags: describe attributes such as gender and geographical location. The values of such tags can usually be enumerated.
Time tags: describe when the business reached the user and when the user churned. Note: time tags can be stored as numeric values.
Numeric tags: such as account balance, number of credits, etc.
So in essence there are only two kinds of tags: discrete enumerations and continuous values.
Once you have tags, how do you model and store them in a computer?
The simplest and most intuitive way is to build a wide table, that is, one field per tag. A small profile system usually gets by with a few hundred tags, so for most scenarios a wide table is simple and reliable enough.
Wide tables are typically stored in Hive and queried through Impala for performance reasons. When the data volume is large, Impala generally cannot meet the query performance requirements, because Impala has no indexes and every query scans the table. To take advantage of indexes, large wide tables are therefore typically moved from Impala to Elasticsearch (ES).
When a user ID carries hundreds of tags, the ES storage model consumes a lot of storage, and importing the data into ES also becomes a performance bottleneck. The workaround is to store all the tags in a single array field in ES. In essence, though, this is still a wide table scheme.
The biggest problem with the wide table scheme is that adding a new tag takes too long, so a profile system is basically T+1: a new tag only takes effect the next day. If there is no stringent requirement on response time, an ad hoc query engine from the Hadoop ecosystem can work on top of wide tables. For example, Impala or Presto can join multiple wide tables so that a newly added tag takes effect at T+0. After all, a big data system has plenty of storage.
Unfortunately, the business demand for the system is: higher, faster, stronger, just like sports.
We store tags in ES, and queries are fast because ES builds an inverted index. When we build tags, the data is keyed by user ID; in ES, from the point of view of the inverted index, the data is keyed by tag. The two are exact opposites.
Selecting users by tag is essentially a matter of set intersection, union and complement. So we can simply go one step further and build the tag-to-user-ID mapping directly, instead of the original user-ID-to-tag mapping.
In this way, the data structure looks something like this:
Male: Zhang San, Li Si, Wang Wu.
Since one tag can cover hundreds of millions of users, how do we store such a structure? With RoaringBitmap. Stored this way, tag-based selection breaks free of SQL and ES syntax and returns to the most basic set operations: A and B or (C and D).
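The tag-to-user-ID mapping and the set expression above can be sketched as follows. This is a minimal illustration only: plain Python sets stand in for RoaringBitmap, and the sample users and tag names are made up for the example.

```python
from collections import defaultdict

# Hypothetical sample data in the original direction: user ID -> tags.
user_tags = {
    1: {"male", "beijing"},
    2: {"male", "shanghai", "vip"},
    3: {"female", "beijing", "vip"},
    4: {"female", "shanghai"},
}

def invert(user_tags):
    """Build the tag -> user-ID mapping (one set of user IDs per tag)."""
    tag_users = defaultdict(set)
    for user_id, tags in user_tags.items():
        for tag in tags:
            tag_users[tag].add(user_id)
    return tag_users

tag_users = invert(user_tags)

# "A and B or (C and D)": (male AND beijing) OR (female AND vip)
selected = (tag_users["male"] & tag_users["beijing"]) | (
    tag_users["female"] & tag_users["vip"]
)
print(sorted(selected))  # [1, 3]
```

A compressed bitmap would presumably replace the sets in a real system with hundreds of millions of users, but the intersection, union and difference operations keep the same shape.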
Modeling data as tag-to-user-ID has one big problem: handling numeric tags, such as user credits. One common solution is to split the values into segments, but that loses precision and is inflexible. Another is to build a bitmap for every value, which consumes space and also handles interval queries poorly.
In the tag-to-user-ID approach, a bitmap stores the IDs of the users whose tag value equals XXX. The core point is that the bitmap stores an "equals" relationship. A bitmap could just as well store a "greater than" or "less than" relationship.
For numeric tags, we redefine what is stored: bitmap(2) holds the IDs of all users whose value is greater than 2. Similarly, bitmap(5) holds the IDs of all users whose value is greater than 5. Then, to find the users whose value lies in the interval (3, 1000), you can compute bitmap(3) andNot bitmap(999). This solves the interval query problem nicely. One problem remains: you still need a bitmap for every value.
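The "greater than" bitmaps and the andNot interval query can be sketched like this. It is an illustration under assumptions: values are integers, Python sets stand in for bitmaps, and the sample data and function names are hypothetical. Each bitmap is computed on demand here rather than stored.

```python
# Hypothetical sample data: user ID -> integer tag value (e.g. credits).
credits = {1: 2, 2: 3, 3: 4, 4: 500, 5: 999, 6: 1000, 7: 1500}

def greater_than(values, threshold):
    """bitmap(threshold): IDs of all users whose value is > threshold."""
    return {uid for uid, v in values.items() if v > threshold}

def open_interval(values, low, high):
    """Users with low < value < high, for integer values.

    value < high is the same as NOT (value > high - 1), so the result is
    bitmap(low) andNot bitmap(high - 1).
    """
    return greater_than(values, low) - greater_than(values, high - 1)

print(sorted(open_interval(credits, 3, 1000)))  # [3, 4, 5]
```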
The solution to this problem is ingenious: represent a number with a combination of several bitmaps. For example, 200 is split into three parts, the ones, tens and hundreds digits, and each part is stored using 10 bitmaps. This bounds the number of bitmaps. For example, an int has at most 10 decimal digits, so it needs at most 100 bitmaps.
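A rough sketch of the decimal decomposition follows. The text does not say which relationship each digit bitmap stores, so this example assumes the simplest one: an "equals" bitmap per (digit position, digit value) pair. It also assumes non-negative values; the names and sample data are hypothetical, and Python sets stand in for bitmaps.

```python
from collections import defaultdict

NUM_DIGITS = 10  # a 32-bit int has at most 10 decimal digits: 10 * 10 = 100 bitmaps

def build_digit_bitmaps(values):
    """bitmaps[(position, digit)] = IDs of users whose decimal digit at
    `position` (0 = ones, 1 = tens, ...) equals `digit`."""
    bitmaps = defaultdict(set)
    for user_id, value in values.items():
        for position in range(NUM_DIGITS):
            digit = (value // 10 ** position) % 10
            bitmaps[(position, digit)].add(user_id)
    return bitmaps

def users_with_value(bitmaps, all_users, value):
    """Intersect one bitmap per digit position to find users whose value == `value`."""
    result = set(all_users)
    for position in range(NUM_DIGITS):
        digit = (value // 10 ** position) % 10
        result &= bitmaps[(position, digit)]
    return result

credits = {1: 200, 2: 210, 3: 200, 4: 2}
bitmaps = build_digit_bitmaps(credits)
print(sorted(users_with_value(bitmaps, credits.keys(), 200)))  # [1, 3]
```

However many distinct values occur in the data, the number of bitmaps never exceeds NUM_DIGITS * 10.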
Optimization never ends, and we can go further. If the value is expressed in binary, each bit needs only 2 bitmaps, so an int needs at most 64 bitmaps. With binary, the storage rules can be set as follows:
bitmap(0) is the set of IDs of users whose bit is 0. bitmap(1) is the set of IDs of users whose bit is 0 or 1.
Since a binary bit can only be 0 or 1, bitmap(1) contains every user, and only bitmap(0) needs to be stored for each bit. So at most 33 bitmaps are needed (presumably one bitmap(0) per bit of a 32-bit int, plus one shared bitmap of all users).
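The binary layout can be sketched as below. This is one reading of the scheme, not a definitive implementation: it assumes non-negative values that fit in 32 bits, keeps one bitmap(0) per bit plus one set of all users (which is how this sketch arrives at 33), and uses Python sets in place of bitmaps. All names and data are hypothetical.

```python
NUM_BITS = 32  # assumption: tag values are non-negative and fit in 32 bits

def build_bit_slices(values):
    """Return (all_users, zero_bits), where zero_bits[i] holds the IDs of
    users whose bit i (0 = least significant) is 0, i.e. bitmap(0) for bit i."""
    all_users = set(values)
    zero_bits = [set() for _ in range(NUM_BITS)]
    for user_id, value in values.items():
        for i in range(NUM_BITS):
            if not (value >> i) & 1:
                zero_bits[i].add(user_id)
    return all_users, zero_bits

def users_with_value(all_users, zero_bits, value):
    """Recover the users whose value == `value` from the per-bit bitmaps."""
    result = set(all_users)
    for i in range(NUM_BITS):
        if (value >> i) & 1:
            result -= zero_bits[i]  # bit must be 1: drop users whose bit is 0
        else:
            result &= zero_bits[i]  # bit must be 0: keep only those users
    return result

credits = {1: 5, 2: 6, 3: 5, 4: 0}
all_users, zero_bits = build_bit_slices(credits)
print(sorted(users_with_value(all_users, zero_bits, 5)))  # [1, 3]
print(len(zero_bits) + 1)  # 33: one bitmap per bit plus the all-users set
```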
To sum up, we have solved both the number of bitmaps and the interval query problem. However, handling interval queries with a combination of binary bits raises a new question: how do multiple bitmaps combine to represent an interval?
Let's simplify the problem a little: how do multiple bitmaps represent a "less than or equal to" interval? Such as I
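One common way to answer a "less than or equal to" query over per-bit bitmaps is to walk the bits of the limit from most significant to least significant, tracking which users are already strictly smaller and which are still equal on every bit seen so far. The sketch below shows that general technique; it is not necessarily the method the original article goes on to describe. It assumes non-negative 32-bit values, uses Python sets in place of bitmaps, and all names and data are hypothetical.

```python
NUM_BITS = 32  # assumption: tag values are non-negative and fit in 32 bits

def build_zero_bits(values):
    """zero_bits[i] = IDs of users whose bit i (0 = least significant) is 0."""
    zero_bits = [set() for _ in range(NUM_BITS)]
    for user_id, value in values.items():
        for i in range(NUM_BITS):
            if not (value >> i) & 1:
                zero_bits[i].add(user_id)
    return zero_bits

def less_or_equal(all_users, zero_bits, limit):
    """IDs of users whose value is <= limit, using only set operations."""
    less = set()            # users already known to be strictly below limit
    equal = set(all_users)  # users matching limit on every bit seen so far
    for i in reversed(range(NUM_BITS)):
        if (limit >> i) & 1:
            # limit has a 1 here: "equal" users with a 0 are strictly smaller
            less |= equal & zero_bits[i]
            equal -= zero_bits[i]
        else:
            # limit has a 0 here: "equal" users with a 1 are larger, drop them
            equal &= zero_bits[i]
    return less | equal

credits = {1: 2, 2: 3, 3: 4, 4: 500, 5: 999, 6: 1000}
zero_bits = build_zero_bits(credits)
users = set(credits)
print(sorted(less_or_equal(users, zero_bits, 4)))  # [1, 2, 3]
# The interval 3 < value <= 999 as the difference of two "<=" results.
print(sorted(less_or_equal(users, zero_bits, 999) - less_or_equal(users, zero_bits, 3)))  # [3, 4, 5]
```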
© 2024 shulou.com SLNews company. All rights reserved.