Spark 3.0 pandas support and conversion between DataFrames

This article introduces pandas support in Spark 3.0 (via the Koalas project) and shows, with examples, how to convert between pandas and Spark DataFrames. The content is straightforward; I hope it helps resolve your doubts.
pandas is a data analysis library widely used by Python users. Spark 3.0 supports the pandas interface, making up for pandas' inability to process big data across multiple machines. pandas DataFrames can also be converted to and from Spark's native DataFrame, making it easy for Spark and Python libraries to interoperate.
1. Koalas: pandas API on Apache Spark
The Koalas project (https://koalas.readthedocs.io/en/latest/) enables data scientists to work with big data more efficiently by implementing the pandas DataFrame API on top of Spark. pandas is the de facto standard for Python data processing, while Spark is the de facto standard for big data processing. With Koalas, you can:
Be immediately productive with Spark on big data. If you are already familiar with pandas, there is no new API to learn.
Keep a single data analysis codebase that works both with pandas (for tests and smaller datasets) and with Spark (for distributed datasets), easing the move from a research environment to production, as sketched below.
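As a minimal sketch of that single-codebase idea (assuming Koalas and PySpark are already installed, per the installation steps below), the same pandas-style call works on both a pandas and a Koalas DataFrame:

import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({'x': [1, 2, 3]})  # single-machine pandas DataFrame
kdf = ks.from_pandas(pdf)             # distributed Koalas DataFrame

# Identical pandas-style API on both objects
print(pdf['x'].mean())  # 2.0
print(kdf['x'].mean())  # 2.0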
1.1 Installation guide
Koalas requires PySpark, so PySpark must be installed first.
There are several ways to install Koalas:
Conda
PyPI
Installation from source
To install PySpark, you can use:
Installation from the official release channel
Conda
PyPI
Installation from source
1.2 Supported Python versions
Python 3.5 and above are recommended.
1.3 Installing Koalas
Install via Conda
First install Conda, then create a conda environment as follows:
conda create --name koalas-dev-env
This creates a minimal environment containing only Python. Activate it:
conda activate koalas-dev-env
Install Koalas:
conda install -c conda-forge koalas
To install a specific version of Koalas:
conda install -c conda-forge koalas=0.19.0
Install from PyPI
Koalas can be installed from PyPI using pip:
pip install koalas
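After either install path, a quick import check confirms the package is visible (a sketch; the printed version depends on what was installed):

import databricks.koalas as ks
print(ks.__version__)  # e.g. 0.19.0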
Install from source code
See the Contribution Guide for details.
1.4 Installing PySpark
Installation from the official release channel:
Download PySpark from the official release channel, then unpack it:
tar xzvf spark-2.4.4-bin-hadoop2.7.tgz
Set the SPARK_HOME environment variable:
cd spark-2.4.4-bin-hadoop2.7
export SPARK_HOME=`pwd`
Make sure PYTHONPATH includes the PySpark and Py4J zip archives under $SPARK_HOME/python/lib so that Python can find them:
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
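With those variables set, a quick sanity check (a sketch, assuming the 2.4.4 download above) confirms that PySpark is importable:

import pyspark
print(pyspark.__version__)  # expected: 2.4.4 for this download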
Install from Conda:
PySpark can also be installed from Conda:
conda install -c conda-forge pyspark
Install from PyPI:
PySpark can be installed from PyPI:
pip install pyspark
2. Quick use of Koalas
First, import Koalas as follows:
import pandas as pd
import numpy as np
import databricks.koalas as ks
from pyspark.sql import SparkSession
Data object creation
Create a Koalas Series from a sequence of values (including a NaN):
s = ks.Series([1, 3, 5, np.nan, 6, 8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
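As a small follow-on sketch (using the series just created), a Koalas Series supports the usual pandas-style operations:

print(s.max())              # 8.0
print((s + 1).to_pandas())  # element-wise arithmetic, collected back to pandas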
Create a Koalas DataFrame by passing a dictionary of columns, with an explicit index:
kdf = ks.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])
kdf
    a    b      c
10  1  100    one
20  2  200    two
30  3  300  three
40  4  400   four
50  5  500   five
60  6  600    six
Create a pandas DataFrame by passing a numpy array, with a datetime index and labeled columns:
dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
pdf
                   A         B         C         D
2013-01-01 -0.407291  0.066551 -0.073149  0.648219
2013-01-02 -0.848735  0.437700  0.635700  0.312861
2013-01-03 -0.415537 -1.787072  0.222100  0.125543
2013-01-04 -1.637271  1.134810  0.282532  0.133995
2013-01-05 -1.230477 -1.925734  0.736288 -0.547677
2013-01-06  1.092894 -1.071281  0.318752 -0.477591
(the values are random and will differ on each run)
Now, convert the pandas DataFrame into a Koalas DataFrame:
kdf = ks.from_pandas(pdf)
type(kdf)
databricks.koalas.frame.DataFrame
It looks and behaves almost the same as a pandas DataFrame.
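Going the other way is just as direct; a sketch using the kdf just created (note that to_pandas() collects the data to the driver, so it should only be used on data that fits in a single machine's memory):

pdf_back = kdf.to_pandas()
print(type(pdf_back))  # <class 'pandas.core.frame.DataFrame'>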
More examples: https://koalas.readthedocs.io/en/latest/getting_started/10min.html
3. Conversion between pandas and Spark DataFrames
pandas DataFrames, Spark DataFrames, and Koalas DataFrames can all be converted into one another. Note that converting between pandas and Spark DataFrames is relatively inefficient, and pandas' native interface is single-machine, so Koalas is recommended where possible.
3.1 pandas DataFrame to Spark DataFrame
from pyspark.sql import SparkSession

# Initialize the Spark session
spark = SparkSession \
    .builder \
    .getOrCreate()

spark_df = spark.createDataFrame(pandas_df)
3.2 Spark DataFrame to pandas DataFrame
import pandas as pd

pandas_df = spark_df.toPandas()
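A side note beyond the original walkthrough: both directions can often be sped up by enabling Apache Arrow, Spark's columnar format for pandas interchange. A sketch (the config key below exists since Spark 2.3; in Spark 3.x the equivalent key is spark.sql.execution.arrow.pyspark.enabled):

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()
# Use Arrow for pandas <-> Spark conversions
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pandas_df = pd.DataFrame({'a': range(1000)})
spark_df = spark.createDataFrame(pandas_df)  # Arrow-accelerated
pandas_df2 = spark_df.toPandas()             # Arrow-accelerated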
Since pandas is single-machine, toPandas() above is also single-machine: it collects the entire DataFrame to the driver. Following breeze_lsw, it can be rewritten as a distributed version that converts each partition in parallel:
import pandas as pd

def _map_to_pandas(rdds):
    # Build one pandas DataFrame per partition
    return [pd.DataFrame(list(rdds))]

def topas(df, n_partitions=None):
    if n_partitions is not None:
        df = df.repartition(n_partitions)
    # Convert partitions in parallel, then collect and concatenate on the driver
    df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
    df_pand = pd.concat(df_pand)
    df_pand.columns = df.columns
    return df_pand

pandas_df = topas(spark_df)
That is all the content of "Spark 3.0 pandas support and conversion between DataFrames". Thank you for reading, and I hope it has helped you.