How to use the big data tool pyspark

2025-10-25 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to use the big data tool pyspark. I find it very practical, so I am sharing it here in the hope that you get something out of it.

Spark is currently the core technology stack in the big data field. Many people who work with data want to tame it and become a "dragon trainer", riding a cluster dragon made up of hundreds of machines across the sea of big data.

But most of them have not managed it. Some agonize for a long time over whether to learn pyspark or spark-scala. A considerable number stumble at the initial environment configuration. Some lose their way among the dozens or hundreds of functions. A few have mastered some simple usage but not performance optimization, and are helpless once they meet truly complex big data.

First, pyspark or spark-scala

pyspark is stronger for analysis, while spark-scala is stronger for engineering.

If the application scenario has very high performance requirements, you should choose spark-scala.

If the application scenario involves a lot of visualization and machine learning algorithms, pyspark is recommended, since it works better with the related Python libraries.

In addition, spark-scala supports Spark's GraphX graph computation module, while pyspark does not.

The pyspark learning curve is gentle, while the spark-scala learning curve is steep.

The learning curve of spark-scala is steep not only because scala is a difficult language, but also because endless environment configuration pain awaits readers along the way.

By contrast, the learning cost of pyspark is relatively low, and its environment configuration is relatively easy. If the learning cost of pyspark is 3, then that of spark-scala is about 9.

If you learn quickly and have plenty of study time, spark-scala is recommended: it unlocks all of Spark's capabilities and gives the best performance, and it is also the most common way Spark is used in industry.

If you have limited study time and a strong preference for Python, pyspark is recommended. Its use in industry is becoming more and more common.

Second, the study plan for this book

1. Study plan

This book is well suited for use as a pyspark tool manual, and as a library of reference examples to consult when putting a project into practice.

2. Learning environment

All the source code was written and tested in jupyter. It is recommended that you clone it locally with git, then run it and learn interactively in jupyter.

In order to be able to open markdown files directly in jupyter, it is recommended that you install jupytext and convert markdown to ipynb files.
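The conversion step could be scripted roughly as follows. This is only a sketch: it assumes jupytext is already installed and on the PATH, it relies on the `jupytext --to ipynb <file>` command-line form, and the helper names and the `root` directory argument are hypothetical, not something the book provides.

```python
import subprocess
from pathlib import Path


def build_jupytext_commands(root):
    """Return one `jupytext --to ipynb <file>` command per markdown file under root."""
    return [
        ["jupytext", "--to", "ipynb", str(path)]
        for path in sorted(Path(root).rglob("*.md"))
    ]


def convert_all(root, dry_run=True):
    """Print each command; actually run it only when dry_run is False."""
    for cmd in build_jupytext_commands(root):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if jupytext exits with an error.
            subprocess.run(cmd, check=True)
```

Calling `convert_all(".")` in the cloned repository only prints the commands it would run; `convert_all(".", dry_run=False)` performs the conversion. Note that the glob also picks up files such as a README, so you may prefer to point `root` at the chapter directory only.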

Follow these 2 steps to configure a single-machine spark 3.0.1 environment for practice.

# step1: install java8
# jdk
# step2: install pyspark, findspark
pip install -i

Alternatively, you can run pyspark directly in the cloud notebook of the whale community, with no environment configuration pain at all.
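Before starting Spark locally, it can help to confirm that the pieces from the two steps are actually discoverable. The following is a minimal sketch using only the Python standard library; the function name `check_environment` is hypothetical, and the check only looks for a `java` executable or a `JAVA_HOME` variable without verifying that the JDK is version 8.

```python
import importlib.util
import os
import shutil


def check_environment():
    """Report which pieces of a local pyspark setup are discoverable."""
    return {
        # A `java` executable on PATH, or JAVA_HOME being set, suggests a JDK is installed.
        "java": shutil.which("java") is not None or bool(os.environ.get("JAVA_HOME")),
        # find_spec returns None when a top-level package cannot be found.
        "pyspark": importlib.util.find_spec("pyspark") is not None,
        "findspark": importlib.util.find_spec("findspark") is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

Anything reported as missing points back to the corresponding installation step.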

import findspark

# specify spark_home and the python path
spark_home = "/Users/liangyun/anaconda3/lib/python3.7/site-packages/pyspark"
python_path = "/Users/liangyun/anaconda3/bin/python"
findspark.init(spark_home, python_path)

import pyspark
from pyspark import SparkContext, SparkConf

# run Spark locally with 4 worker threads
conf = SparkConf().setAppName("test").setMaster("local[4]")
sc = SparkContext(conf=conf)

print("spark version:", pyspark.__version__)

# build an RDD from a local list and join its elements with a space
rdd = sc.parallelize(["hello", "spark"])
print(rdd.reduce(lambda x, y: x + ' ' + y))
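For intuition about the final reduce call, the sketch below imitates a partition-wise reduce in plain Python: the data is split into partitions, each partition is reduced on its own, and the partial results are then combined. It is an illustration under stated assumptions, not Spark's implementation; `split_into_partitions` and `partitioned_reduce` are hypothetical helpers, and the partials are combined in partition order. Spark's documentation asks for a commutative and associative function in `reduce`, and string concatenation is associative but not commutative, so the result here depends on that ordering.

```python
from functools import reduce


def split_into_partitions(items, num_partitions):
    """Split a list into contiguous chunks, preserving order (hypothetical helper)."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def partitioned_reduce(items, func, num_partitions=4):
    """Reduce each non-empty partition, then reduce the partial results in order."""
    partitions = split_into_partitions(items, num_partitions)
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


words = ["hello", "spark"]
joined = partitioned_reduce(words, lambda x, y: x + " " + y)
print(joined)
```

With an operation such as addition, which is both commutative and associative, the order in which the partial results are combined no longer matters.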

That covers how to use the big data tool pyspark. Some of these points are likely to come up in day-to-day work, and I hope this article helps you learn more.

