What are the ways of dealing with Cartesian products in PyODPS DataFrame?

Today I will talk about the ways PyODPS DataFrame handles Cartesian products, a topic many people may not understand well. To help you understand it better, I have summarized the following content, and I hope you get something out of this article.

PyODPS provides a DataFrame API with pandas-like interfaces for large-scale data analysis and preprocessing. This article mainly introduces how to perform Cartesian product operations with PyODPS.

The most common scenario for a Cartesian product is pairwise comparison or computation between two tables. Take computing geographic distances as an example. Suppose the large table Coordinates1 stores the longitude and latitude of target points, with M rows of data, and the small table Coordinates2 stores the longitude and latitude of starting points, with N rows of data. We now need to find, for each target point, the closest starting point. For each target point, we must compute its distance to every starting point and then take the minimum, so the intermediate result contains M * N rows, which is a Cartesian product problem.
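For intuition, here is a minimal local sketch of that M * N blowup using plain pandas (the data below is made up for illustration; merge(how='cross') requires pandas >= 1.2):

import pandas as pd

# M = 3 hypothetical target points and N = 2 hypothetical starting points
targets = pd.DataFrame({'id': [0, 1, 2], 'latitude': [76.3, 42.4, 34.7], 'longitude': [59.6, 30.1, 147.2]})
starts = pd.DataFrame({'id': [0, 1], 'latitude': [84.0, 59.1], 'longitude': [6.5, 0.8]})

# every target paired with every start: the intermediate result has M * N rows
pairs = targets.merge(starts, how='cross', suffixes=('_x', '_y'))
print(len(pairs))  # 6 == 3 * 2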

Haversine formula

First, a bit of background. Given the longitude and latitude of two geographic locations, the distance between them can be computed with the haversine formula. In Python, it can be expressed as follows:

def haversine(lat1, lon1, lat2, lon2):
    # lat1, lon1: latitude and longitude of location 1
    # lat2, lon2: latitude and longitude of location 2
    import numpy as np
    dlon = np.radians(lon2 - lon1)
    dlat = np.radians(lat2 - lat1)
    a = np.sin(dlat / 2) ** 2 + np.cos(np.radians(lat1)) * np.cos(np.radians(lat2)) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # mean radius of the earth, in kilometers
    return c * r
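As a quick sanity check (not part of the original article), calling the function on the Beijing and Shanghai city-center coordinates gives roughly the commonly quoted great-circle distance:

print(haversine(39.9042, 116.4074, 31.2304, 121.4737))  # ~1067 km, Beijing to Shanghai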

MapJoin

Currently, the most recommended approach is to use mapjoin. In PyODPS this is very simple: just specify mapjoin=True when joining the two DataFrames, and a mapjoin against the right table will be performed during execution.

In [3]: df1 = o.get_table('coordinates1').to_df()

In [4]: df2 = o.get_table('coordinates2').to_df()

In [5]: df3 = df1.join(df2, mapjoin=True)

In [6]: df1.schema
Out[6]:
odps.Schema {
  latitude    float64
  longitude   float64
  id          string
}

In [7]: df2.schema
Out[7]:
odps.Schema {
  latitude    float64
  longitude   float64
  id          string
}

In [8]: df3.schema
Out[8]:
odps.Schema {
  latitude_x    float64
  longitude_x   float64
  id_x          string
  latitude_y    float64
  longitude_y   float64
  id_y          string
}

You can see that when the join is executed, the renamed columns are suffixed with _x and _y by default; you can customize the suffixes by passing a 2-tuple to the suffixes parameter, as sketched below. Once you have the joined table, you can compute the distance with PyODPS DataFrame's built-in functions, which is concise and efficient.
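A minimal sketch of customizing the suffixes (the suffix names here are illustrative, not from the article):

df3 = df1.join(df2, mapjoin=True, suffixes=('_target', '_start'))
# the joined columns would then be named latitude_target, latitude_start, and so on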

In [9]: r = 6371
   ...: dis1 = (df3.latitude_y - df3.latitude_x).radians()
   ...: dis2 = (df3.longitude_y - df3.longitude_x).radians()
   ...: a = (dis1 / 2).sin() ** 2 + df3.latitude_x.radians().cos() * df3.latitude_y.radians().cos() * (dis2 / 2).sin() ** 2
   ...: df3['dis'] = 2 * a.sqrt().arcsin() * r

In [12]: df3.head(10)
Out[12]:
   latitude_x  longitude_x id_x  latitude_y  longitude_y id_y          dis
0   76.252432    59.628253    0   84.045210     6.517522    0  1246.864981
1   76.252432    59.628253    0   59.061796     0.794939    1  2925.953147
2   76.252432    59.628253    0   42.368304    30.119837    2  4020.604942
3   76.252432    59.628253    0   81.290936    51.682749    3   584.779748
4   76.252432    59.628253    0   34.665222   147.167070    4  6213.944942
5   76.252432    59.628253    0   58.058854   165.471565    5  4205.219179
6   76.252432    59.628253    0   79.150677    58.661890    6   323.070785
7   76.252432    59.628253    0   72.622352   123.195778    7  1839.380760
8   76.252432    59.628253    0   80.063614   138.845193    8  1703.782421
9   76.252432    59.628253    0   36.231584    90.774527    9  4717.284949

In [13]: df1.count()
Out[13]: 2000

In [14]: df2.count()
Out[14]: 100

In [15]: df3.count()
Out[15]: 200000

df3 now contains M * N rows of data. Next, if you only need the minimum distance, you can call groupby on df3 followed by the min aggregation function to get the minimum distance for each target point.

In [16]: df3.groupby('id_x').dis.min().head(10)
Out[16]:
       dis_min
0   323.070785
1    64.755493
2  1249.283169
3   309.818288
4  1790.484748
5   385.107739
6   498.816157
7   615.987467
8   437.765432
9   272.589621

DataFrame custom function

If we also need to know which point produced the minimum distance, that is, the corresponding id in the table, one option is to run MapReduce after the mapjoin, but another is to use DataFrame's apply method. To run a custom function over each row of data, use apply with the axis parameter set to 1, indicating a row-wise operation.
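As a minimal sketch of the row-wise apply signature (the labeling logic below is made up for illustration and is not part of the article's task):

# tag each target point by latitude band; axis=1 passes one row at a time,
# and reduce=True means the function returns a single scalar per row
labeled = df1[df1.id,
              df1.apply(lambda row: 'north' if row.latitude > 45 else 'south',
                        axis=1, reduce=True, types='string').rename('zone')]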

Table resources

Note that apply runs as a UDF on the server side, so inside the function you cannot fetch table data with an expression like df = o.get_table('table_name').to_df(); for details, see the article on where PyODPS DataFrame code runs. In the case of this article, to compute over all records of Table 1 and Table 2, we need to use Table 2 as a table resource and reference that resource inside the custom function. Using table resources in PyODPS is also convenient: just pass a collection into the resources parameter. Inside the UDF, the resource is an iterable object, not a DataFrame object, so you cannot call DataFrame interfaces on it directly; each iterated value is a namedtuple whose fields can be accessed by name or by offset.

## use dataframe udf
df1 = o.get_table('coordinates1').to_df()
df2 = o.get_table('coordinates2').to_df()

def func(collections):
    import pandas as pd
    collection = collections[0]
    ids = []
    latitudes = []
    longitudes = []
    # read the table resource into a pandas DataFrame row by row
    for r in collection:
        ids.append(r.id)
        latitudes.append(r.latitude)
        longitudes.append(r.longitude)
    df = pd.DataFrame({'id': ids, 'latitude': latitudes, 'longitude': longitudes})

    def h(x):
        df['dis'] = haversine(x.latitude, x.longitude, df.latitude, df.longitude)
        return df.iloc[df['dis'].idxmin()]['id']

    return h

df1[df1.id,
    df1.apply(func, resources=[df2], axis=1, reduce=True, types='string').rename('min_id')
   ].execute(libraries=['pandas.zip', 'python-dateutil.zip', 'pytz.zip', 'six.tar.gz'])

In the custom function, the table resource is read into a pandas DataFrame in a loop. With pandas's idxmin it is easy to locate the row with the minimum distance, and thus obtain the id of the nearest starting point. In addition, if you need a third-party package inside a custom function (such as pandas in this example), you can refer to the related article on using third-party packages.

Global variable

When the amount of data in a small table is very small, we can even use the small table data as a global variable in a custom function.

df1 = o.get_table('coordinates1').to_df()
df2 = o.get_table('coordinates2').to_df()
df = df2.to_pandas()  # the small table is pulled to the client and used as a global variable

def func(x):
    df['dis'] = haversine(x.latitude, x.longitude, df.latitude, df.longitude)
    return df.iloc[df['dis'].idxmin()]['id']

df1[df1.id,
    df1.apply(func, axis=1, reduce=True, types='string').rename('min_id')
   ].execute(libraries=['pandas.zip', 'python-dateutil.zip', 'pytz.zip', 'six.tar.gz'])

When the function is uploaded, the global variables it uses (df in the code above) are pickled into the UDF. However, this approach has very limited applicability: ODPS limits the size of uploaded file resources, so if the pickled data is too large, the resource cannot be uploaded. Moreover, it is best to ensure that the client and server versions of any third-party package match, otherwise deserialization errors are likely to occur. So this approach is recommended only when the amount of data is very small.

Solving Cartesian product problems with PyODPS mainly comes down to two approaches. One is mapjoin, which is intuitive and performs well; whenever a problem can be solved with mapjoin, we recommend it, ideally computing with built-in functions, which achieves the highest efficiency, though it is not very flexible. The other is the DataFrame custom function, which is more flexible but has relatively poor performance (you can use pandas or numpy to improve it); by means of table resources, you can pass the small table into the custom function as a table resource to complete the Cartesian product operation.

After reading the above, do you have a better understanding of the ways PyODPS DataFrame deals with Cartesian products? If you want to learn more, please follow the industry information channel. Thank you for your support.
