How to implement row-column conversion in Spark, that is, wide table and narrow table transformation


Many beginners are not clear about how to implement row-column conversion in Spark, that is, converting between wide and narrow tables. To help you solve this problem, the editor explains it in detail below; anyone with this need can learn from it, and I hope you will gain something.

Without further ado, here is the Spark column-to-row (wide table to narrow table) code:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, functions as F
from pyspark.sql.functions import array, col, explode, struct, lit

conf = SparkConf().setAppName("test").setMaster("local[*]")
sc = SparkContext(conf=conf)
spark = SQLContext(sc)

# df is the source DataFrame; the columns listed in `by` are kept as keys
# and excluded from the conversion.
def df_columns_to_line(df, by):
    # Cast every column to string, then split dtypes into names and types
    df_a = df.select([col(c).cast("string") for c in df.columns])
    cols, dtypes = zip(*((c, t) for (c, t) in df_a.dtypes if c not in by))
    # Spark SQL arrays support only homogeneous element types
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("feature"), col(c).alias("value")) for c in cols
    ])).alias("kvs")
    return df_a.select(by + [kvs]).select(by + ["kvs.feature", "kvs.value"])

df = sc.parallelize([(1, 0.0, 0.6), (1, 0.6, 0.7)]).toDF(["A", "col_1", "col_2"])
df_row_data = df_columns_to_line(df, ["A"])

>>> df.show()
+---+-----+-----+
|  A|col_1|col_2|
+---+-----+-----+
|  1|  0.0|  0.6|
|  1|  0.6|  0.7|
+---+-----+-----+

>>> df_row_data.show()
+---+-------+-----+
|  A|feature|value|
+---+-------+-----+
|  1|  col_1|  0.0|
|  1|  col_2|  0.6|
|  1|  col_1|  0.6|
|  1|  col_2|  0.7|
+---+-------+-----+
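For completeness, a minimal usage sketch (the extra key column "B" and its values are made up for illustration) showing that `by` can hold several key columns at once:

df2 = sc.parallelize([(1, "x", 0.0, 0.6), (2, "y", 0.5, 0.9)]).toDF(["A", "B", "col_1", "col_2"])
df2_narrow = df_columns_to_line(df2, ["A", "B"])  # keeps A and B, unpivots the rest
df2_narrow.show()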

Note that feature and value are the two new column names assigned after the original multiple column names are converted into row data.
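As an aside, the same wide-to-narrow step can be written with Spark SQL's built-in stack() expression; a minimal sketch, assuming the fixed columns col_1 and col_2 of the df above (the generic helper is still needed when column names are not known in advance):

df_unpivot = df.selectExpr(
    "A",
    "stack(2, 'col_1', col_1, 'col_2', col_2) as (feature, value)"  # 2 = number of (name, value) pairs
)
df_unpivot.show()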

Spark row-to-column (narrow table back to wide table):

df_features = df_row_data.select('feature').distinct().collect()
features = [r.feature for r in df_features]  # list of distinct feature names
df_column_data = (df_row_data.groupby("A")
                  .pivot('feature', features)
                  .agg(F.first('value', ignorenulls=True)))

>>> df_column_data.show()
+---+-----+-----+
|  A|col_2|col_1|
+---+-----+-----+
|  1|  0.6|  0.0|
+---+-----+-----+

Row-to-column conversion is comparatively simple: it works directly on the result above, and the key is the use of the pivot function.
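One design note: pivot() can also infer the distinct values by itself, at the cost of an extra pass over the data, which is why the code above collects the feature list first. A minimal sketch of the inferred-values variant:

df_auto = (df_row_data.groupby("A")
           .pivot("feature")  # no values list: Spark computes the distinct features itself
           .agg(F.first("value", ignorenulls=True)))
df_auto.show()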

Did the above content help you? If you want to learn more related knowledge or read more related articles, please follow the industry information channel. Thank you for your support.
