
Handling datasets with column delimiters in the data with PySpark


This article shows you how to process a dataset that contains the column delimiter inside the data itself with PySpark. It is concise and easy to follow, and I hope you get something out of the detailed walkthrough below.

The main topic below is the special case where the column delimiter (or separator) also appears inside the values of the dataset. This kind of dataset can be a headache for PySpark developers, but sooner or later you have to deal with it.

The dataset is basically as follows:

# first line is the header
NAME|AGE|DEP
Vivek|Chaudhary|32|BSC
John|Morgan|30|BE
Ashwin|Rao|30|BE

The dataset has three columns, "NAME", "AGE" and "DEP", separated by the delimiter "|". Looking closely at the data, the NAME values themselves also contain the '|' character (first name | last name).
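To follow along, you can drop the sample rows into a local file; a minimal sketch that assumes the filename delimit_data.txt used in the reads below:

# write the sample dataset to a local file so the steps below can be reproduced
sample = (
    "NAME|AGE|DEP\n"
    "Vivek|Chaudhary|32|BSC\n"
    "John|Morgan|30|BE\n"
    "Ashwin|Rao|30|BE\n"
)
with open('delimit_data.txt', 'w') as f:
    f.write(sample)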

Let's see how to handle this step by step:

Step 1. Read the dataset with Spark's read.csv() method:

# create spark session
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('delimit').getOrCreate()

The commands above connect us to the Spark environment and let us read the dataset with spark.read.csv():

# create dataframe
df = spark.read.option('delimiter', '|').csv('delimit_data.txt', inferSchema=True, header=True)

df.show()

After reading the file into memory, we find that the last column of data has gone missing, and the AGE column, which should hold an integer data type, contains something else. This is not what we expected; it's a complete mismatch, because the delimiter inside the NAME values has thrown the parsing off.
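Roughly, df.show() would then print something like this (an illustration, not captured output, assuming Spark's default permissive parsing drops the extra fourth token in each row):

+------+---------+---+
|  NAME|      AGE|DEP|
+------+---------+---+
| Vivek|Chaudhary| 32|
|  John|   Morgan| 30|
|Ashwin|      Rao| 30|
+------+---------+---+

NAME only gets the first name, AGE holds the last name, DEP holds the age, and the department value is lost.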

Now, let's learn how to solve this problem.

Step 2. Read the data again, but this time use the read.text() method:

df = spark.read.text('delimit_data.txt')

df.show(truncate=0)

# extract first row as this is our header
head = df.first()[0]

schema = ['fname', 'lname', 'age', 'dep']

print(schema)

Output: ['fname', 'lname', 'age', 'dep']

The next step is to split the dataset based on column delimiters:

# filter the header, separate the columns and apply the schema
df_new = df.filter(df['value'] != head).rdd.map(lambda x: x[0].split('|')).toDF(schema)

df_new.show()
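As a side note, the same split can be done without dropping to the RDD by using the DataFrame API's split function; a minimal sketch under the same assumptions (same df, head and schema as above):

from pyspark.sql.functions import split, col

# split the raw 'value' column on the literal '|' (escaped, since split() takes a regex)
parts = split(col('value'), r'\|')
df_new_alt = (df.filter(col('value') != head)
                .select(parts.getItem(0).alias('fname'),
                        parts.getItem(1).alias('lname'),
                        parts.getItem(2).alias('age'),
                        parts.getItem(3).alias('dep')))
df_new_alt.show()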

Now we have successfully separated the columns: the "|"-separated NAME data has been split into two columns, fname and lname. The data is much cleaner and can be used easily.
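At this point every column is still a string. If AGE should be an integer, as noted earlier, a cast can be applied here; a minimal sketch, assuming df_new from the step above:

from pyspark.sql.functions import col

# cast the age column from string to integer
df_new = df_new.withColumn('age', col('age').cast('int'))
df_new.printSchema()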

Next, concatenate the columns "fname" and "lname":

from pyspark.sql.functions import concat, col, lit

df1 = df_new.withColumn('fullname', concat(col('fname'), lit('|'), col('lname')))

df1.show()
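For what it's worth, concat_ws does the same thing with the separator passed once up front; a small alternative sketch:

from pyspark.sql.functions import concat_ws, col

# concat_ws takes the separator first, then the columns to concatenate
df1 = df_new.withColumn('fullname', concat_ws('|', col('fname'), col('lname')))
df1.show()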

To verify the data conversion, we write the converted dataset to a CSV file and then read it back with the read.csv() method.

df1.write.option('sep', '|').mode('overwrite').option('header', 'true').csv(r'\cust_sep.csv')

The next step is data validation:

df = spark.read.option('delimiter', '|').csv(r'\cust_sep.csv', inferSchema=True, header=True)

df.show()

Now the data looks like what we want.

That is how PySpark handles datasets with column delimiters in the data. Did you pick up any new knowledge or skills? If you want to learn more skills or enrich your knowledge, you are welcome to follow the industry information channel.
