Spark 3.0 pandas support and conversion between DataFrames

This article introduces pandas support in Spark 3.0 (via the Koalas project) and shows, with examples, how to convert between pandas and Spark DataFrames. The content is straightforward; I hope it helps resolve your doubts.
pandas is a data analysis library widely used by Python users. Spark 3.0 supports the pandas interface, making up for pandas' inability to process big data across multiple machines. pandas DataFrames can also be converted to and from Spark's native DataFrame, making it easy for Spark and Python libraries to interoperate.
1. Koalas: pandas API on Apache Spark
The Koalas project (https://koalas.readthedocs.io/en/latest/) enables data scientists to work with big data more efficiently by implementing the pandas DataFrame API on top of Spark. pandas is the de facto standard for Python data processing, while Spark is the de facto standard for big data processing. With Koalas, you can:
Be immediately productive with Spark on big data. If you are already familiar with pandas, there is no new API to learn.
Keep a single data analysis codebase that works both with pandas (for tests and smaller datasets) and with Spark (for distributed datasets), easing the move from a research environment to production, as sketched below.
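As a minimal sketch of that single-codebase idea (assuming Koalas and PySpark are already installed, per the installation steps below), the same pandas-style call works on both a pandas and a Koalas DataFrame:

import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({'x': [1, 2, 3]})  # single-machine pandas DataFrame
kdf = ks.from_pandas(pdf)             # distributed Koalas DataFrame

# Identical pandas-style API on both objects
print(pdf['x'].mean())  # 2.0
print(kdf['x'].mean())  # 2.0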
1.1 Installation guide
Koalas requires PySpark, so PySpark must be installed first.
There are several ways to install Koalas:
Conda
PyPI
Installation from source
To install PySpark, you can use:
Installation from the official release channel
Conda
PyPI
Installation from source
1.2 Supported Python versions
Python 3.5 and above are recommended.
1.3 Installing Koalas
Install via Conda
First install Conda, then create a conda environment as follows:
conda create --name koalas-dev-env
This creates a minimal environment containing only Python. Activate it:
conda activate koalas-dev-env
Install Koalas:
conda install -c conda-forge koalas
To install a specific version of Koalas:
conda install -c conda-forge koalas=0.19.0
Install from PyPI
Koalas can be installed from PyPI using pip:
pip install koalas
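After either install path, a quick import check confirms the package is visible (a sketch; the printed version depends on what was installed):

import databricks.koalas as ks
print(ks.__version__)  # e.g. 0.19.0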
Install from source code
See the Contribution Guide for details.
1.4 Installing PySpark
Installation from the official release channel:
Download PySpark from the official release channel, then unpack it:
tar xzvf spark-2.4.4-bin-hadoop2.7.tgz
Set the SPARK_HOME environment variable:
cd spark-2.4.4-bin-hadoop2.7
export SPARK_HOME=`pwd`
Make sure PYTHONPATH includes the PySpark and Py4J zip archives under $SPARK_HOME/python/lib so that Python can find them:
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
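With those variables set, a quick sanity check (a sketch, assuming the 2.4.4 download above) confirms that PySpark is importable:

import pyspark
print(pyspark.__version__)  # expected: 2.4.4 for this download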
Install from Conda:
PySpark can also be installed from Conda:
conda install -c conda-forge pyspark
Install from PyPI:
PySpark can be installed from PyPI:
pip install pyspark
2. Quick use of Koalas
First, import Koalas as follows:
import pandas as pd
import numpy as np
import databricks.koalas as ks
from pyspark.sql import SparkSession
Data object creation
Create a Koalas Series from a sequence of values (including a NaN):
s = ks.Series([1, 3, 5, np.nan, 6, 8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
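As a small follow-on sketch (using the series just created), a Koalas Series supports the usual pandas-style operations:

print(s.max())              # 8.0
print((s + 1).to_pandas())  # element-wise arithmetic, collected back to pandas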
Create a Koalas DataFrame by passing a dictionary of columns, with an explicit index:
kdf = ks.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])
kdf
    a    b      c
10  1  100    one
20  2  200    two
30  3  300  three
40  4  400   four
50  5  500   five
60  6  600    six
Create a pandas DataFrame by passing a numpy array, with a datetime index and labeled columns:
dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
pdf
                   A         B         C         D
2013-01-01 -0.407291  0.066551 -0.073149  0.648219
2013-01-02 -0.848735  0.437700  0.635700  0.312861
2013-01-03 -0.415537 -1.787072  0.222100  0.125543
2013-01-04 -1.637271  1.134810  0.282532  0.133995
2013-01-05 -1.230477 -1.925734  0.736288 -0.547677
2013-01-06  1.092894 -1.071281  0.318752 -0.477591
(the values are random and will differ on each run)
Now, convert the pandas DataFrame into a Koalas DataFrame:
kdf = ks.from_pandas(pdf)
type(kdf)
databricks.koalas.frame.DataFrame
It looks and behaves almost the same as a pandas DataFrame.
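Going the other way is just as direct; a sketch using the kdf just created (note that to_pandas() collects the data to the driver, so it should only be used on data that fits in a single machine's memory):

pdf_back = kdf.to_pandas()
print(type(pdf_back))  # <class 'pandas.core.frame.DataFrame'>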
More examples: https://koalas.readthedocs.io/en/latest/getting_started/10min.html
3. Conversion between pandas and Spark DataFrames
pandas DataFrames, Spark DataFrames, and Koalas DataFrames can all be converted into one another. Note that converting between pandas and Spark DataFrames is relatively inefficient, and pandas' native interface is single-machine, so Koalas is recommended where possible.
3.1 pandas DataFrame to Spark DataFrame
from pyspark.sql import SparkSession

# Initialize the Spark session
spark = SparkSession \
    .builder \
    .getOrCreate()

spark_df = spark.createDataFrame(pandas_df)
3.2 Spark DataFrame to pandas DataFrame
import pandas as pd

pandas_df = spark_df.toPandas()
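A side note beyond the original walkthrough: both directions can often be sped up by enabling Apache Arrow, Spark's columnar format for pandas interchange. A sketch (the config key below exists since Spark 2.3; in Spark 3.x the equivalent key is spark.sql.execution.arrow.pyspark.enabled):

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()
# Use Arrow for pandas <-> Spark conversions
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pandas_df = pd.DataFrame({'a': range(1000)})
spark_df = spark.createDataFrame(pandas_df)  # Arrow-accelerated
pandas_df2 = spark_df.toPandas()             # Arrow-accelerated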
Since pandas is single-machine, toPandas() above is also single-machine: it collects the entire DataFrame to the driver. Following breeze_lsw, it can be rewritten as a distributed version that converts each partition in parallel:
import pandas as pd

def _map_to_pandas(rdds):
    # Build one pandas DataFrame per partition
    return [pd.DataFrame(list(rdds))]

def topas(df, n_partitions=None):
    if n_partitions is not None:
        df = df.repartition(n_partitions)
    # Convert partitions in parallel, then collect and concatenate on the driver
    df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
    df_pand = pd.concat(df_pand)
    df_pand.columns = df.columns
    return df_pand

pandas_df = topas(spark_df)
That is all the content of "Spark 3.0 pandas support and conversion between DataFrames". Thank you for reading, and I hope it has helped you.