Appboy's Data-Intensive Practice on MongoDB

2025-03-29 Update From: SLTechnology News&Howtos

Shulou(Shulou.com) 06/02 Report--

In this article, the editor looks at Appboy's data-intensive practice on MongoDB. The content is analyzed and described from a professional point of view; I hope you get something out of it.

Appboy explores emerging channels such as mobile to help organizations build better relationships with their customers, making it a frontier explorer in the marketing-automation industry. The company has had real success in mobile, counting well-known names such as iHeartMedia, PicsArt, Etsy, Samsung, and Urban Outfitters among its customers.

The following is an excerpt from the talk:

To support its marketing-automation platform, Appboy uses MongoDB as the main data storage layer for its analytics and targeting engine. Today, Appboy processes billions of data points from tens of thousands of users every day. This article shares Appboy's best practices on MongoDB and looks at how the company stays agile even while scaling rapidly. It covers several topics: random document sampling, multivariate testing and its multi-arm bandit optimization, field tokenization, and how Appboy stores multidimensional data on a per-user basis so that messages reach end users at the best possible time.

Part 1: Statistical Analysis

Appboy serves a wide range of customers, from entry-level customers with only tens of thousands of users to customers with tens of millions. There is no doubt that even customers with hundreds of millions of users can easily collect and store user data with Appboy's marketing-automation technology.

The core of the Appboy platform is customer segmentation. Segmentation lets organizations target users based on behavioral data, purchase history, technical characteristics, social profiles, and so on. Innovative, intelligent use of segmentation and message automation lets organizations seamlessly convert installs into active users and hit their KPIs; segments can be customized on demand.

When a customer defines a segment in the Appboy dashboard, Appboy computes certain properties in real time, such as the segment size, how many of its users have push notifications enabled, and the users' average spend. These calculations must be real-time and interactive, and interactive analysis on infrastructure that is not Google-sized is extremely challenging. The question is how to support this scale efficiently while serving customers of all sizes. For these reasons, random sampling is a good choice.

On Statistical sampling

In the real world, random statistical sampling happens all the time. A public-opinion poll about the president of the United States cannot ask every citizen individually, and national TV ratings do not come from a rating agency checking every viewer's television. Instead, these numbers come from statistical sampling, which estimates the characteristics of a large population by sampling a small subset. With sound statistics, a small sample can produce an accurate assessment of a large group: many political polls survey only a few thousand adults to estimate the leanings of millions of citizens. That is also why polling agencies report a confidence interval, also known as the margin of error, alongside their results.

Use of statistical sampling

The same principle applies here. Compared with a traditional analytics database, sampling users has a significant advantage: you can sample from each user's overall state rather than replaying the raw event stream. Note that Appboy samples only for the interactive segment feedback in the web dashboard. When a segment is actually used, as a marketing campaign or exported as a Facebook Custom Audience, the exact users are computed, and these approximations do not apply.

To start, a random number in a known range, called a "bucket," is added to each user document. The range should be chosen so that buckets stay small enough to reason about individual users, while the number of buckets still covers the sample sizes you need. For example, buckets 1 to 10 are clearly too coarse to support hundreds of millions of people, while 1 to 1 million is more than enough. Appboy uses 10,000 buckets, so the values run from 0 to 9999.

Assume there are 10 million documents (each representing one user). Each document gets a random bucket value, and the field is indexed:

{
  random: 4583,
  favorite_color: "blue",
  age: 29,
  gender: "M",
  favorite_food: "pizza",
  city: "NYC",
  shoe_size: 11
}

db.users.ensureIndex({random: 1})
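To illustrate the bucket assignment, here is a minimal Python sketch (a hypothetical helper, not Appboy's code) that attaches a random bucket in the 0-9999 range to a user document at creation time:

```python
import random

NUM_BUCKETS = 10000  # buckets 0..9999, as described above

def assign_bucket(doc, num_buckets=NUM_BUCKETS):
    """Attach a random sampling bucket to a user document at creation time."""
    doc["random"] = random.randrange(num_buckets)
    return doc

user = assign_bucket({"favorite_color": "blue", "age": 29})
# user["random"] is now a uniform value in [0, 9999]
```

With the `random` field indexed, any single bucket or contiguous range of buckets can then be fetched cheaply.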

The first step is to get a random sample. With 10 million documents and 10,000 random buckets, each bucket should hold roughly 1,000 users:

db.users.find({random: 123}).count() == ~1000
db.users.find({random: 9043}).count() == ~1000
db.users.find({random: 4982}).count() == ~1000

To sample 1% of the entire user base, that is 100,000 users, you select a random range of buckets to "host" them. For example, all of these would work:

db.users.find({random: {$gt: 0, $lt: 101}})
db.users.find({random: {$gt: 503, $lt: 604}})
db.users.find({random: {$gt: 8938, $lt: 9039}})
db.users.find({$or: [{random: {$gt: 9955}}, {random: {$lt: 56}}]})

Once you have a random sample, the next step is to analyze it. To measure its true size you first need a count, because given the randomness it will not be exactly 100,000.

At the same time, you can apply any query to the sample, for example to find the proportion of male users whose favorite color is blue:

sample_size = db.users.find({random: {$gt: 503, $lt: 604}}).count()
observed = db.users.find({random: {$gt: 503, $lt: 604}, gender: "M", favorite_color: "blue"}).count()

Say the sample size comes back as 100,000 and the observed count is 11,302. You can then infer that about 11.3% of the 10 million users meet the criteria. To be a good statistician, you should also report a confidence interval to estimate the error. The math behind confidence intervals is a bit involved, but if you want to try it yourself, there are countless sample-size calculators to refer to. For the case used here, the confidence interval is ±0.2%.
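As a quick check of the numbers above, here is a small Python sketch using the standard normal approximation for a sampled proportion (this is textbook statistics, not Appboy's code):

```python
import math

def proportion_ci(observed, sample_size, z=1.96):
    """95% confidence interval (normal approximation) for a sampled proportion."""
    p = observed / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin

p, margin = proportion_ci(11302, 100000)
# p ≈ 0.113 and margin ≈ 0.002, i.e. 11.3% ± 0.2%, matching the figures above
```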

Optimize

In practice, Appboy layers many optimizations on top of these high-level concepts when doing statistical sampling. First, Appboy uses the MongoDB aggregation framework and caches heavily. Because a memory-mapped storage engine is used, the advantage of sampling with MongoDB is that once the random sample is loaded into memory, you can run arbitrary queries against it. This makes for an excellent experience on the web dashboard, where users explore interactively by adding and removing selection criteria and see the statistics update immediately.

Part 2: Multivariate Testing and Rate Limiting

Getting started with multivariate testing

In today's highly competitive market, user segmentation is essential. As experiences and brands shift rapidly to emerging channels such as mobile, message customization and relevance matter more than ever in marketing, which is why user segmentation is a prerequisite for engaging customers.

Once a segment is defined, the next goal is to optimize the message to maximize conversions, and multivariate testing is one way to achieve that. A multivariate test is an experiment that compares users' responses to multiple versions of the same campaign message. The versions share the same marketing goal but differ in wording and style, and the aim of the test is to determine which version achieves the best conversion rate.

For example, suppose you have three different push notification messages:

Message 1: This deal expires tomorrow!

Message 2: This deal expires in 24 hours!

Message 3: Fourth of July is almost over! All deals end tomorrow!

Besides the text itself, images paired with the copy are usually tested as well.

With multivariate testing, an organization can find out which wording produces a higher conversion rate, so the next time it sends a push notification it knows which tone and wording are more effective. Even better, by limiting the size of the test, say, to a small audience, it can identify the winning message first and then send it to everyone else.

When a multivariate test runs, the message versions go to the test groups, while other users in the same segment receive nothing and serve as a control group. The organization can then evaluate the campaign by comparing the responses.

Application of technology

From a technical point of view, the recipients should be random. In other words, if you have 1 million users and want to send a test to 50,000 of them, those 50,000 should be randomly distributed across your user base (you also want another group of 5,000 users as a control group). Similarly, if you run ten tests of 50,000 users each, randomness helps ensure that each test group is distinct.

This problem lines up with rate-limiting a single message. Many customers want to send a message to only a small part of their user base. For example, an e-commerce company may want to distribute 50,000 discount codes randomly among its users. Conceptually, it is the same problem.

To do this, you can again scan users by the random value stored on each document.

Appboy processes users in parallel by assigning workers different random ranges, and tracks global state so it knows when the rate limit has been reached. For a multivariate test, a message version is then chosen either according to the configured send ratio or at random.
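A rough Python sketch of the bucket-range idea, picking a contiguous (possibly wrapping) range of the 10,000 buckets that covers the desired fraction of users. This is a hypothetical helper, not Appboy's implementation:

```python
import random

NUM_BUCKETS = 10000

def sample_bucket_range(fraction, rng=random):
    """Return a set of bucket ids covering roughly `fraction` of a
    uniformly bucketed user base, wrapping around past 9999 if needed."""
    width = round(fraction * NUM_BUCKETS)
    start = rng.randrange(NUM_BUCKETS)
    return {(start + i) % NUM_BUCKETS for i in range(width)}

# To reach 50,000 of 1,000,000 users (5%), target 500 buckets:
buckets = sample_bucket_range(0.05)
```

Each worker can then query its own slice of the range in parallel, while shared state tracks how many messages have actually gone out.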

A note of caution

The mathematically minded may have noticed a subtlety: if the same random field is used both for statistical analysis and for choosing who receives a message, bias can creep in under some circumstances. To illustrate, suppose every recipient of a campaign was drawn from bucket value 10. The users in that bucket are then no longer representative, since receiving the message may itself change their behavior. As a simple fix, Appboy stores several random values per user and is careful never to use the same value for more than one purpose.

Part 3: Schema Flexibility - Extensible User Profiles

Whenever a user opens an app built on Appboy, a rich user profile is created. The user's basic fields look like this:

{first_name: "Jane", email: "jane@example.com", dob: "1994-10-24", gender: "F", country: "DE",...}

The Appboy client can also store "custom attributes" on each user. For example, a sports application may want to store a user's "favorite players," while an e-commerce application may store the brands a customer has recently purchased, and so on.

{first_name: "Jane", email: "jane@example.com", dob: "1994-10-24", gender: "F", custom: {brands_purchased: "Puma and Asics", credit_card_holder: true, shoe_size: 37, ...}, ...}

A huge benefit of this is that custom attributes can be set in the same update as any other field. Because MongoDB offers flexible schemas, it is easy to add any number of custom fields without worrying about their type (boolean, string, integer, float, or whatever). MongoDB handles it all, and querying custom attributes is easy to understand. There is no need for a complex join against a key-value table, which you would typically have to define in advance in a traditional relational database.

db.users.update({…}, {$set: {"custom.loyalty_program": true}})
db.users.find({"custom.shoe_size": {$gt: 35}})

The drawback is that if users accidentally define attributes with very long names ("this_is_my_really_long_custom_attribute_name_it_represents_shoe_size"), earlier versions of MongoDB store the full key in every document, which wastes a lot of space. Also, because types are not enforced, values can mismatch across documents: one document may hold {visited_website: true}, but if you are not careful, another may hold {visited_website: "yes"}.

Tokenization

To solve the first problem, a map is used to tokenize user attribute names. Typically this is a document in MongoDB that maps, say, the name "shoe_size" to a unique, predictable short string. The mapping can be generated simply by using MongoDB's atomic operations.

In the map, an array is used for storage, and the array index serves as the "token." Each customer has at least one map document containing an array field named list. When a new custom attribute appears for the first time, it is atomically pushed onto the end of the array, its index becomes the token, and the mapping is cached after the first retrieval:

db.custom_attribute_map.update({_id: X, list: {$ne: "Favorite Color"}}, {$push: {list: "Favorite Color"}})

The result might be a list like this:

["Loyalty Program", "Shoe Size", "Favorite Color"]
   token 0          token 1      token 2

MongoDB best practice warns against unbounded document growth, and as defined so far this map grows indefinitely. Appboy accounts for that by capping the array size, letting each customer use multiple map documents. When adding a new item, the $push is made conditional on the array being shorter than a certain size; if the $push does not take effect because the array is full, a findAndModify atomically creates a new document and adds the element there.

Tokenization does add some indirection and complexity, but the attribute mapping can be threaded through the entire code base. The same approach also solves the other problem, data-type mismatches across documents: the map can track each attribute's type as well, for example recording that "visited_website" is a boolean that accepts only true or false.
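An in-memory Python sketch of the tokenization idea (in production the map lives in MongoDB and is updated atomically, as shown above; this is only an illustration):

```python
class AttributeMap:
    """Maps long custom-attribute names to short, stable integer tokens."""

    def __init__(self):
        self.list = []    # position in this array is the token
        self.cache = {}   # name -> token, filled after first lookup

    def tokenize(self, name):
        if name not in self.cache:
            self.cache[name] = len(self.list)
            self.list.append(name)        # analogous to the conditional $push
        return self.cache[name]

m = AttributeMap()
m.tokenize("Loyalty Program")   # → 0
m.tokenize("Shoe Size")         # → 1
m.tokenize("Loyalty Program")   # → 0 again: tokens are stable
```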

Part 4: Data-Intensive Algorithms

Intelligent Selection and Multi-Arm Bandit Multivariate Testing

The goal of multivariate testing is to find the best-converting message version in the shortest time. Testing is now available on a large number of platforms, and customers test regularly to find what works best.

Appboy has a feature called Intelligent Selection, which analyzes the performance of a multivariate test and automatically adjusts the proportion of users receiving each message version according to a statistical algorithm. The algorithm, a multi-arm bandit, ensures that the allocation reflects real performance rather than random chance.

The mathematics behind multi-arm bandits is complex enough that I won't elaborate here. Consider instead this 1979 remark by Peter Whittle, professor of mathematical statistics at the University of Cambridge:

[The bandit problem] was formulated during the war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.

What that remark hints at is that, to run effectively, a multi-arm bandit algorithm needs a large amount of input data. For each message version, the algorithm needs the number of recipients and the conversion rate. This is where MongoDB shines, because pre-aggregated analytics documents accumulate the data automatically, in real time:

{
  company_id: BSON::ObjectId,
  campaign_id: BSON::ObjectId,
  date: 2015-05-31,
  message_variation_1: {
    unique_recipient_count: 1000000,
    total_conversion_count: 5000,
    total_open_count: 8000,
    hourly_breakdown: {
      0: {
        unique_recipient_count: 1000,
        total_conversion_count: 40,
        total_open_count: 125,
        ...
      },
      ...
    },
    ...
  },
  message_variation_2: {...}
}

With a schema like this, you can quickly inspect conversions, opens, and sends by day and hour. The actual Appboy schema is a bit more complicated, because there are other factors to account for, but the idea is the same.

Pre-aggregated documents make it possible to evaluate, and terminate, an experiment quickly. And because the collection is sharded by company, optimizing all of a company's campaigns at once scales well.
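To make the bandit step concrete, here is a short Python sketch using Thompson sampling, one common multi-arm bandit strategy, over per-variation counts like those in the pre-aggregated document above. The counts and the strategy choice are illustrative; the source does not say which bandit algorithm Appboy actually uses:

```python
import random

# Illustrative counts, shaped like the pre-aggregated fields above.
variations = {
    "message_variation_1": {"recipients": 1000, "conversions": 40},
    "message_variation_2": {"recipients": 1000, "conversions": 55},
}

def choose_variation(stats, rng=random):
    """Thompson sampling: draw a conversion-rate estimate from each
    variation's Beta posterior and send the next message using the
    variation with the best draw."""
    best, best_draw = None, -1.0
    for name, s in stats.items():
        draw = rng.betavariate(1 + s["conversions"],
                               1 + s["recipients"] - s["conversions"])
        if draw > best_draw:
            best, best_draw = name, draw
    return best
```

Over many sends, the higher-converting variation is chosen more and more often, while the weaker one still receives occasional traffic so the estimates can keep correcting themselves.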

Intelligent Delivery

Another proprietary algorithm Appboy offers customers is called Intelligent Delivery. When scheduling a campaign, Appboy analyzes the optimal time to send the message to each individual user and delivers it at precisely that moment. If Alice is more likely to open app push messages in the evening, while Bob prefers the morning before work, Appboy sends each of them the message at the time they are most receptive. It is a remarkable feature, as Jim Davis, director of CRM and Interactive Marketing at Urban Outfitters, put it:

Comparing open rates before and after adoption, performance has doubled. The weekly activity target for male Urban On members has been exceeded by 138%. The segmentation features are also commendable, lifting engagement from users inactive for more than 3 months by 94%.

There is no doubt that this algorithm is data-intensive. To intelligently predict the best time to message each customer, the user's behavioral profile must be analyzed thoroughly. On top of that, Appboy sends tens of millions of Intelligent Delivery messages every day, so predictions must be made at very high throughput.

The approach is similar to Intelligent Selection: pre-aggregate multiple dimensions on a per-user basis, in real time. With MongoDB, each user has a number of documents similar to the following:

{
  _id: BSON::ObjectId of user,
  dimension_1: [DateTime, DateTime, …],
  dimension_2: [Float, Float, …],
  dimension_3: [...],
  ...
}

As dimension data of interest arrives for a user, it is normalized and a copy is stored in these documents. Each document is sharded with {_id: "hashed"} to optimize the distribution of reads and writes. When an Intelligent Delivery message needs to go out, a handful of documents can be queried quickly and fed to the machine-learning algorithm. MongoDB has helped a great deal here; its scalability now supports a system with dozens of dimensions. As new dimensions are added, the machine-learning algorithm is continually refined, and the process keeps benefiting from MongoDB's flexibility.
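As a toy illustration of per-user send-time selection, here is a Python sketch that picks the hour in which a user has historically opened the most messages. Appboy's real Intelligent Delivery is a proprietary machine-learning model over many dimensions; this only shows the flavor of the per-user aggregation:

```python
from collections import Counter
from datetime import datetime

def best_send_hour(open_times):
    """Pick the hour of day (0-23) with the most historical opens."""
    by_hour = Counter(t.hour for t in open_times)
    return by_hour.most_common(1)[0][0]

# Hypothetical open timestamps for one user, mostly in the evening:
opens = [datetime(2015, 5, d, h) for d, h in [(1, 21), (2, 21), (3, 8), (4, 21)]]
best_send_hour(opens)  # → 21
```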

That concludes this look at Appboy's data-intensive practice on MongoDB. If you have run into similar questions, I hope the analysis above is a useful reference.
