
What can SparkMagic do?


This article introduces what SparkMagic can do. In practice, many people run into difficulties with cases like these, so let Xiaobian walk you through how to handle them. I hope you read carefully and learn something!

SparkMagic for Jupyter Notebook

Sparkmagic is a project that lets Jupyter Notebook interact with remote Spark clusters through the Livy REST API. It provides a set of Jupyter Notebook cell magics and kernels that turn Jupyter into an integrated environment for working with remote Spark clusters.

SparkMagic can:

Run Spark code in multiple languages

Provide visual SQL queries

Provide easy access to Spark application logs and information

Automatically create a SparkSession with SparkContext and HiveContext for any remote Spark cluster

Capture the output of Spark queries as a local Pandas DataFrame, for easy interaction with other Python libraries (e.g. matplotlib)

Send local files or Pandas DataFrames to the remote cluster (e.g. send a locally pre-trained ML model directly to the Spark cluster)

You can use the following Dockerfile to build a Jupyter Notebook image with SparkMagic support:

FROM jupyter/all-spark-notebook:7a0c7325e470

USER $NB_USER

RUN pip install --upgrade pip
RUN pip install --upgrade --ignore-installed setuptools
RUN pip install pandas --upgrade
RUN pip install sparkmagic
RUN mkdir /home/$NB_USER/.sparkmagic
RUN wget https://raw.githubusercontent.com/jupyter-incubator/sparkmagic/master/sparkmagic/example_config.json
RUN mv example_config.json /home/$NB_USER/.sparkmagic/config.json
RUN sed -i 's/localhost:8998/host.docker.internal:9999/g' /home/$NB_USER/.sparkmagic/config.json
RUN jupyter nbextension enable --py --sys-prefix widgetsnbextension
RUN jupyter-kernelspec install --user --name SparkMagic $(pip show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/sparkkernel
RUN jupyter-kernelspec install --user --name PySparkMagic $(pip show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/pysparkkernel
RUN jupyter serverextension enable --py sparkmagic

USER root
RUN chown $NB_USER /home/$NB_USER/.sparkmagic/config.json

CMD ["start-notebook.sh", "--NotebookApp.iopub_data_rate_limit=1000000000"]

USER $NB_USER

Build and tag the image with the following command:

docker build -t sparkmagic .

Then launch the local Jupyter container with SparkMagic support, mounting the current working directory:

docker run -ti --name "${PWD##*/}-pyspark" -p 8888:8888 --rm -m 4GB --mount type=bind,source="${PWD}",target=/home/jovyan/work sparkmagic

To be able to connect to the Livy REST API on the remote Spark cluster, you must set up SSH port forwarding on your local machine. Get your remote cluster's IP address and run:

ssh -L 0.0.0.0:9999:localhost:8998 REMOTE_CLUSTER_IP

First, create a new Notebook using the PySpark kernel with SparkMagic enabled, as follows:

In a SparkMagic-enabled notebook, you can use a series of cell magics to work with your local laptop and the remote Spark cluster as one integrated environment. The %%help magic outputs all available magic commands:

The remote Spark application can be configured with the %%configure magic:
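For example, a minimal sketch of such a cell (the memory and core settings here are illustrative assumptions, not values from the article); the -f flag forces the session to be recreated if one already exists:

%%configure -f
{"executorMemory": "4g", "executorCores": 2, "driverMemory": "2g"}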

As shown, SparkMagic automatically starts a remote PySpark session and provides some useful links to connect to the Spark UI and logs.

The notebook integrates two environments (see the sketch after this list):

%%local, where cells are executed locally in the Anaconda environment provided by the Jupyter Docker image on your laptop

%%spark, where cells are executed remotely in the PySpark REPL on the remote Spark cluster, via the Livy REST API
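A minimal sketch of mixing the two environments in one notebook (the cell contents are illustrative):

%%local
# Runs in the local Anaconda environment inside the Jupyter container
import pandas as pd
print(pd.__version__)

%%spark
# Runs in the PySpark REPL on the remote cluster via Livy
print(spark.version)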

The following code cell first imports the Spark SQL data types on the remote cluster; second, it uses the remote SparkSession to load the Enigma-JHU Covid-19 dataset into our remote Spark cluster. You can see the output of the remote .show() command in the notebook:
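A rough sketch of such a cell (the S3 path and read options are assumptions, not taken from the article):

%%spark
# Spark SQL data types, imported on the remote cluster
from pyspark.sql.types import *

# Hypothetical location of the Enigma-JHU Covid-19 dataset
covid = spark.read.csv("s3://covid19-lake/enigma-jhu/csv/", header=True, inferSchema=True)
covid.show(5)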

But this is where the magic begins. You can register DataFrames as Hive tables and use the %%sql magic to run Hive queries against the data on the remote cluster, with the results automatically displayed in your local notebook. This isn't difficult, but it's handy for data analysts and for quick data exploration early in a data science project.
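A minimal sketch (the table and column names are assumed for illustration):

%%spark
# Make the remote DataFrame visible to SQL
covid.createOrReplaceTempView("covid")

%%sql
SELECT country_region, SUM(confirmed) AS confirmed
FROM covid
GROUP BY country_region
ORDER BY confirmed DESC
LIMIT 10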

Where SparkMagic is really useful is in enabling seamless data transfer between the local notebook and the remote cluster. A daily challenge for data scientists is to create and maintain their Python environment while working against ad hoc clusters to interact with their company's data lake.

In the following example, we can see how seaborn is imported as a local library and used to plot the covid_data Pandas DataFrame.

Where does this data come from? It was created on, and sent by, the remote Spark cluster. The %%spark -o magic lets us define a remote variable that is transferred to the local notebook context when the cell executes. Our variable covid_data is a Spark SQL DataFrame on the remote cluster and a Pandas DataFrame in the local Jupyter Notebook.
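A sketch of this round trip (the aggregation and the plot are illustrative assumptions):

%%spark -o covid_data
# Executed remotely; the result is also shipped to the local context as a Pandas DataFrame named covid_data
covid_data = spark.sql("SELECT country_region, SUM(confirmed) AS confirmed FROM covid GROUP BY country_region")

%%local
# Executed locally; covid_data is now a Pandas DataFrame
import seaborn as sns
sns.barplot(data=covid_data.nlargest(10, "confirmed"), x="confirmed", y="country_region")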

The ability to work locally with Pandas in Jupyter Notebook on big data that was aggregated on the remote cluster is very helpful for data exploration. For example, use Spark to pre-aggregate histogram data into bins, then draw the histogram in Jupyter as a simple bar chart over the pre-aggregated counts.
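For instance, a sketch (the bin width and column names are assumptions):

%%spark -o confirmed_bins
# Pre-aggregate on the cluster: bucket confirmed case counts into bins of width 1000
from pyspark.sql import functions as F
confirmed_bins = (covid
    .withColumn("bin", (F.col("confirmed") / 1000).cast("int") * 1000)
    .groupBy("bin").count()
    .orderBy("bin"))

%%local
# Plot the pre-aggregated counts locally as a simple bar chart
confirmed_bins.plot.bar(x="bin", y="count", figsize=(10, 4))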

Another useful feature is the ability to sample remote Spark DataFrames using the magic %%spark -o covid_data -m sample -r 0.5. The integrated environment also allows you to send local data to the remote Spark cluster using the %%send_to_spark magic.
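A sketch of sending a local Pandas DataFrame to the cluster (the variable names are illustrative):

%%local
import pandas as pd
local_df = pd.DataFrame({"country_region": ["Germany", "France"], "population_millions": [83, 67]})

%%send_to_spark -i local_df -t df -n population

%%spark
# 'population' is now available as a Spark DataFrame in the remote session
population.show()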

Two data types are supported: Pandas DataFrames and strings. To send something more complex (e.g., a trained scikit-learn model for scoring) to the remote Spark cluster, you can use serialization to create a string representation for transmission:

import pickle
import gzip
import base64

serialised_model = base64.b64encode(
    gzip.compress(
        pickle.dumps(trained_scikit_model)
    )
).decode()
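A sketch of the rest of the round trip, assuming the string is then shipped with %%send_to_spark and unpacked on the cluster (and assuming scikit-learn is installed on the cluster, see the caveat below):

%%send_to_spark -i serialised_model -t str -n serialised_model

%%spark
# Rebuild the scikit-learn model inside the remote PySpark session
import pickle, gzip, base64
model = pickle.loads(gzip.decompress(base64.b64decode(serialised_model)))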

But as you can see, this short-lived PySpark cluster pattern has a major downside: the EMR cluster still has to be bootstrapped with the required Python packages, and this problem does not go away when you deploy production workloads.

That's all for "What can SparkMagic do?". Thank you for reading. If you want to learn more industry-related knowledge, you can follow this website, where Xiaobian will keep publishing high-quality practical articles for everyone!
