2025-02-22 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article introduces the application and practice of Mars in detail; interested readers may find it a useful reference.
I. A brief introduction to Mars
Mars is a unified data science platform that accelerates the traditional Python data science stack. It can exploit multiple cores on a single machine or scale out across a cluster, and it can be deployed standalone, on a distributed cluster, or on Kubernetes and Hadoop YARN.
The whole Mars framework is built on a scheduler that handles both single-machine parallelism and distributed execution. Its data science foundation consists of three core parts: Tensor, DataFrame, and Remote. Built on top of these is the Mars Learn module, which is compatible with the scikit-learn API and makes it simple to run distributed processing on larger datasets. Mars also integrates with machine learning and deep learning frameworks, so it can easily run TensorFlow, PyTorch, and so on, and visualization can be done on Mars as well. In addition, Mars supports a wide range of data sources.
Migrating from the traditional Python stack to Mars is also very simple: for NumPy and Pandas code, you only need to replace the import, and execution then becomes deferred.
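The import swap can be sketched as follows. The pandas part below runs as-is; the Mars lines are shown in comments (assuming Mars is installed via `pip install pymars`) so you can see that only the import and the final `.execute()` differ:

```python
# Ordinary pandas code: eager execution.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
result = df["a"].sum() + df["b"].mean()  # computed immediately

# With Mars, only the import changes and execution becomes deferred:
#   import mars.dataframe as md
#   df = md.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
#   result = (df["a"].sum() + df["b"].mean()).execute()

print(result)  # 11.0
```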
An ordinary Python function is not called directly; instead it is wrapped with mr.spawn, which defers it, and the deferred calls finally run concurrently via execute, without the caller having to care whether Mars is running on a single machine or distributed.
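The spawn/execute pattern can be illustrated with a rough standard-library analogy, where concurrent.futures plays the role that Mars Remote plays in the text (this is an analogy, not the Mars API itself; the real calls, `mr.spawn(...)` and `.execute()`, are noted in the comments):

```python
# A rough stdlib analogy of the spawn/execute pattern:
# "spawn" builds deferred tasks, "execute" runs them concurrently.
from concurrent.futures import ThreadPoolExecutor


def add(a, b):
    return a + b


# With Mars Remote this would look like:
#   tasks = [mr.spawn(add, args=(i, i)) for i in range(4)]
#   results = mr.ExecutableTuple(tasks).execute()
# Sketched here with plain futures:
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(add, i, i) for i in range(4)]
    results = [f.result() for f in futures]

print(results)  # [0, 2, 4, 6]
```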
Running TensorFlow on Mars is also mostly the same; the difference lies in the main function. Finally, the script is submitted to Mars through run_tensorflow_script.
II. Typical scenarios
Scenario 1. Hybrid CPU and GPU computing
In the security and finance domains, Mars can be used for hybrid CPU/GPU computing to speed up existing workflows.
In these domains, traditional big data platforms have long mining cycles and tight resources, so tasks take a long time and cannot meet customer needs. Mars DataFrame can be used to accelerate data processing: it can sort data at scale and help users with high-order statistics and aggregation analysis.
The security field also uses many unsupervised learning algorithms. Mars Learn can accelerate unsupervised learning, and it can also launch distributed deep learning jobs to speed up existing training. On top of that, GPUs can be used to accelerate certain computing tasks.
Scenario 2. Interpretable computation
In advertising, the interpretation algorithms used for ad attribution and feature insight take a long time because of their heavy computation. Such workloads are hard to accelerate on a single machine, and distribution on traditional big data platforms is not very flexible; with Mars Remote, however, the computation can easily be spread across dozens of machines, achieving a hundredfold performance improvement.
Scenario 3. Large-scale K-nearest neighbors
Mars is widely used for K-nearest-neighbor algorithms: as embeddings become more and more popular, representing entities as vectors is now very common. Mars's NearestNeighbors algorithm is compatible with scikit-learn and includes the brute-force algorithm, which users still need for large-scale computation; spreading it across multiple workers can improve performance a hundredfold. Finally, Mars can accelerate Faiss and Proxima in a distributed fashion, scaling to tens of millions or hundreds of millions of vectors.
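To make concrete what the brute-force algorithm computes, here is a minimal NumPy sketch of exact K-nearest-neighbor search, i.e. the per-chunk work that a scikit-learn-compatible NearestNeighbors distributes across workers (an illustrative sketch with made-up data, not the Mars API):

```python
# Minimal brute-force K-nearest-neighbor search with NumPy.
import numpy as np


def knn_brute_force(queries, corpus, k):
    """Return indices of the k nearest corpus vectors for each query."""
    # Pairwise squared Euclidean distances, shape (n_queries, n_corpus).
    d2 = ((queries[:, None, :] - corpus[None, :, :]) ** 2).sum(axis=-1)
    # Sort each row by distance and keep the first k column indices.
    return np.argsort(d2, axis=1)[:, :k]


corpus = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
queries = np.array([[0.1, 0.2]])
print(knn_brute_force(queries, corpus, k=2))  # [[0 2]]
```

Distributing this is embarrassingly parallel: each worker scores the queries against its shard of the corpus, and the per-shard top-k results are merged.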
III. Demo
Demo 1. Analyzing Douban movie data
This demo shows how Mars accelerates pandas data processing and visualization.
Before the demonstration we need to install Mars. A Jupyter environment has already been created here, so we simply run pip install pymars.
After installation we can verify it in IPython; once we see the expected results, we can open a Jupyter notebook.
Now let's start the demo. The dataset can be downloaded from its GitHub address; we then analyze the movie data with pandas and use ipython_memory_usage to watch memory consumption.
The data consists mainly of four CSV files: movies, ratings, users, and comments.
Next, count how many movies were released per year. We first process the data so that only the year of the release date is kept, dropping the month and day, and then aggregate by year.
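The year-extraction and aggregation step can be sketched like this (column names such as "release_date" are assumptions for illustration, not the actual Douban schema):

```python
# Counting releases per year: keep only the year, then aggregate.
import pandas as pd

movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_date": ["1994-09-10", "1994-07-06", "2001-07-20", "2001-01-01"],
})

movies["year"] = movies["release_date"].str[:4]  # drop month and day
per_year = movies.groupby("year")["title"].count()
print(per_year.to_dict())  # {'1994': 2, '2001': 2}
```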
Once the data is ready, we can plot it with pandas-bokeh and explore it interactively.
Next, let's look at the statistics of movie ratings. First the rated movies are filtered out, then the count of each Douban score is sorted in descending order. As you can see, the most common score is 6.8.
Again we draw a bar chart with pandas-bokeh; the scores are almost normally distributed.
Next we build a tag word cloud to see which tags appear most often across the movies. We take the tags column from movies, split it on the slash separator, and set max_words to 50.
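The slash-splitting and counting step (the input to the word cloud) can be sketched as follows; the "tags" column layout is an assumption for illustration:

```python
# Counting slash-separated tags with split/explode/value_counts.
import pandas as pd

movies = pd.DataFrame({
    "tags": ["drama/romance", "drama/crime", "comedy"],
})

tag_counts = (
    movies["tags"]
    .str.split("/")      # split each cell into a list of tags
    .explode()           # one row per tag
    .value_counts()      # frequency of each tag
)
print(tag_counts.to_dict())  # {'drama': 2, 'romance': 1, 'crime': 1, 'comedy': 1}
```

These frequencies would then be fed to a word cloud library with max_words=50.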
Next we analyze the top-K movies. First we aggregate by movie ID to obtain the average rating and the number of ratings, then filter by the number of reviews and sort from highest to lowest to get the top 20 movies.
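That aggregate-filter-rank pipeline can be sketched like this (column names and the thresholds are illustrative assumptions; the article uses a minimum review count and top 20):

```python
# Top-K movies by average rating, filtered by review count.
import pandas as pd

ratings = pd.DataFrame({
    "movie_id": [1, 1, 1, 2, 2, 3],
    "score":    [9.0, 8.0, 10.0, 5.0, 6.0, 10.0],
})

# Aggregate by movie: mean score and number of reviews.
stats = ratings.groupby("movie_id")["score"].agg(mean="mean", count="count")
# Require at least 2 reviews, then take the top 2 by average score.
top = stats[stats["count"] >= 2].nlargest(2, "mean")
print(list(top.index))  # [1, 2]
```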
Then we analyze the review data. Because the comments are in Chinese, we need to tokenize them, split each sentence into words, and sort the counts when computing the statistics. A progress bar can be added here to make the progress visible while processing. The whole process takes about 20 minutes, so running such a big task on a single machine puts a lot of pressure on it.
This is the final word cloud picture.
Next we use Mars to run the same analysis. The first step is to deploy a Mars environment; here there are five workers, each with 8 CPUs and 32 GB of memory. As before, we turn on memory monitoring and do the imports, replacing import pandas with import mars.dataframe and import numpy with import mars.tensor.
Then we read the data into a Mars DataFrame, which uses almost no client memory, and the final result is the same as before.
We analyze the number of movies per release year and the movie ratings in the same way. Thanks to Mars's high compatibility with pandas, we can still use pandas-bokeh to present the results.
The movie review analysis works the same way, except that when displaying results Mars only pulls the first and last few rows, so the client uses very little memory. The entire run took only 45 seconds, dozens of times faster than the previous 20 minutes.
Next we use Mars to compute per-region statistics with a dynamic visualization. We start from the released-movies DataFrame computed earlier, take the movies from 1980 to 2019, split the regions field on the slash (a movie may belong to more than one region), and finally produce the top 10 regions by movie count.
Then we use bar-chart-race to generate the animation.
Demo 2. Douban movie recommendation
In the second demo we build a recommender on the same Douban movie data. We first train with TensorFlow on Mars, then use Mars's distributed KNN algorithm to speed up the recall computation.
We start with the single-machine stack. The data is already split into training and test sets, so we first download it locally with pandas, then label-encode the users and movies so they become numbers instead of string values. We then process the data: sort by time, group by user, and produce the grouped aggregation.
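The label-encoding and per-user aggregation steps can be sketched with pandas (column names like "user", "movie", and "ts" are assumptions for illustration):

```python
# Label-encode string IDs into integers, then build per-user histories.
import pandas as pd

df = pd.DataFrame({
    "user":  ["u3", "u1", "u3", "u2"],
    "movie": ["m9", "m9", "m7", "m8"],
    "ts":    [4, 1, 3, 2],
})

# factorize maps each distinct string to a small integer code.
df["user_id"], _ = pd.factorize(df["user"])
df["movie_id"], _ = pd.factorize(df["movie"])

# Sort by time, then group by user to aggregate each user's watch history.
history = df.sort_values("ts").groupby("user_id")["movie_id"].agg(list)
print(history.to_dict())  # {0: [1, 0], 1: [0], 2: [2]}
```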
Next we start training. We use TensorFlow to train embeddings that represent each user. As mentioned before, an embedding can describe any entity as a vector; once we have the embeddings, recommending movies to a user becomes a matter of finding movie embeddings close to the user in that vector space.
After training we save the vectors. The search here is 600,000 users against 70,000 movies and takes 22 minutes on a single machine; at the 10-million scale, the search would take more than 800 hours, which is unacceptable.
Next let's see how to implement the same process with Mars. We start by creating a Mars cluster with eight workers. Then, as above, we preprocess the data: label-encode, sort by time, and group by user to produce the grouped aggregation.
The only difference here is that Mars automatically infers the dtypes of the resulting DataFrame; if the inference fails, you need to supply dtypes and the output type yourself.
Then comes execution and training. Here the TensorFlow code can be written as a standalone Python file instead of inside the notebook.
Then we run the script with Mars's run_tensorflow_script, specifying 8 workers. As you can see, the training time drops to 23 minutes. We also obtain the final embeddings, and the nearest-neighbor search with Mars takes only 1 minute and 25 seconds, roughly ten times faster than before. Even a 14-million-by-14-million search stays at about 1 hour, a huge improvement over the 800 hours on a single machine.
IV. Best practices
First, avoid to_pandas and to_numpy where possible, because they turn Mars's distributed data back into single-machine data and lose Mars's advantage; use them only when an operation cannot be implemented in Mars. Second, Mars Tensor, DataFrame, and Learn are limited by their APIs, so operations they do not cover need hand-written functions; consider abstracting such operations into functions and accelerating them with Mars Remote. Third, pandas acceleration techniques still apply to Mars DataFrame, such as using more efficient data types, preferring built-in operations, and using apply instead of loops.
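The third practice can be demonstrated with plain pandas, whose API Mars DataFrame mirrors; here a more efficient dtype (category) shrinks a repetitive string column (the data is made up for illustration):

```python
# A more efficient dtype: categorical storage for a repetitive string column.
import pandas as pd

regions = pd.Series(["USA", "China", "USA", "Japan"] * 1000)

as_object = regions.memory_usage(deep=True)
as_category = regions.astype("category").memory_usage(deep=True)

# The categorical column stores each distinct string once plus small
# integer codes, so it uses far less memory than the object column.
print(as_category < as_object)  # True
```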
That's all on the application and practice of Mars. I hope the content above has been helpful and that you have learned something from it. If you think the article is good, feel free to share it with more people.