
4. Getting Started with Spark



  The Spark framework is written in Scala and runs on the Java Virtual Machine (JVM). Client applications can be written in Python, Java, Scala, or R.

Download Spark

  Visit http://spark.apache.org/downloads.html and select a precompiled package to download.

Unpack Spark

  Open a terminal, change the working directory to the directory containing the downloaded Spark archive, and then unpack it.

You can use the following commands:

cd ~
tar -xf spark-2.2.2-bin-hadoop2.7.tgz -C /opt/module/
cd /opt/module/spark-2.2.2-bin-hadoop2.7
ls

  Note: the x flag tells tar to extract, and the f flag specifies the file name of the archive.

Spark main directory structure

README.md

  Contains brief instructions for getting started with Spark

bin

  Contains a set of executable scripts that you can use to interact with Spark in various ways

core, streaming, python

  Contains the source code of the main components of the Spark project

examples

  Contains a number of example Spark programs that you can read and run; they are very helpful for learning the Spark API

Run Examples and the Interactive Shells

Run an example:

./bin/run-example SparkPi 10

Scala shell:

./bin/spark-shell --master local[2]
# The --master option specifies the running mode: local means running locally with one thread; local[N] means running locally with N threads.

Python shell:

./bin/pyspark --master local[2]

R shell:

./bin/sparkR --master local[2]

Submit an application script (submissions in multiple languages are supported):

./bin/spark-submit examples/src/main/python/pi.py 10
./bin/spark-submit examples/src/main/r/dataframe.R
...

Interactive analysis using the Spark shell

scala

  Interactive analysis using spark-shell scripts.

Basic

scala> val textFile = spark.read.textFile("README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]

scala> textFile.count() // Number of items in this Dataset
res0: Long = 126 // May differ from yours, since README.md changes over time; the same applies to the other outputs

scala> textFile.first() // First item in this Dataset
res1: String = # Apache Spark

// Return a subset of the original Dataset using the filter operator
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]

// Chained style
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15

Advanced

// Find the line with the most words using Dataset transformations and actions
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15

// Count words
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).groupByKey(identity).count()
wordCounts: org.apache.spark.sql.Dataset[(String, Long)] = [value: string, count(1): bigint]

scala> wordCounts.collect()
res6: Array[(String, Long)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)

python

  Interactive analysis using pyspark scripts

Basic

>>> textFile = spark.read.text("README.md")

>>> textFile.count()  # Number of rows in this DataFrame
126

>>> textFile.first()  # First row in this DataFrame
Row(value=u'# Apache Spark')

# Filter
>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))

# Chained style
>>> textFile.filter(textFile.value.contains("Spark")).count()  # How many lines contain "Spark"?
15

Advanced

# Find the line with the most words
>>> from pyspark.sql.functions import *
>>> textFile.select(size(split(textFile.value, "\s+")).name("numWords")).agg(max(col("numWords"))).collect()
[Row(max(numWords)=15)]

# Count words
>>> wordCounts = textFile.select(explode(split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
>>> wordCounts.collect()
[Row(word=u'online', count=1), Row(word=u'graphs', count=1), ...]

standalone application

  In addition to running interactively, Spark can also be used from standalone programs written in Java, Scala, or Python.

  The main difference between a standalone application and the shell is that the application must initialize its own SparkContext (or SparkSession), whereas the shell pre-creates them for you.
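  As an illustration, here is a minimal sketch of that explicit initialization using the lower-level SparkConf/SparkContext API; the object name, application name, and local[2] master below are made up for this example. The standalone examples that follow use SparkSession instead, which creates a SparkContext internally and exposes it as spark.sparkContext.

// Minimal sketch: the setup a standalone app must do itself (the shells pre-create this as `sc` / `spark`)
import org.apache.spark.{SparkConf, SparkContext}

object InitExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("InitExample").setMaster("local[2]")
    val sc = new SparkContext(conf)           // explicit initialization
    println(sc.parallelize(1 to 100).count()) // quick sanity check: prints 100
    sc.stop()                                 // release resources when done
  }
}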

scala

Count the number of lines containing the letter "a" and the letter "b", respectively

/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
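The spark-submit command below expects the application packaged as a jar under target/scala-2.11/. A minimal build.sbt sketch that would produce that jar with `sbt package` is shown here; the Scala and Spark versions are assumptions chosen to match the spark-2.2.2 / scala-2.11 paths used in this article, so adjust them to your installation.

// Minimal build.sbt sketch (assumed versions; adjust to your installation).
// Running `sbt package` produces target/scala-2.11/simple-project_2.11-1.0.jar.
name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.8"

// "provided": spark-submit supplies the Spark classes at runtime
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.2" % "provided"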

running applications

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.11/simple-project_2.11-1.0.jar
...
Lines with a: 46, Lines with b: 23

java

Count the number of lines containing the letter "a" and the letter "b", respectively

/* SimpleApp.java */
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();
    Dataset<String> logData = spark.read().textFile(logFile).cache();
    long numAs = logData.filter(s -> s.contains("a")).count();
    long numBs = logData.filter(s -> s.contains("b")).count();
    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    spark.stop();
  }
}

running applications

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/simple-project-1.0.jar
...
Lines with a: 46, Lines with b: 23

python

Count the number of lines containing the letter "a" and the letter "b", respectively

If you package the application, add PySpark to your setup.py:

install_requires=[
    'pyspark=={site.SPARK_VERSION}'
]

"""SimpleApp.py"""
from pyspark.sql import SparkSession

logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()

running applications

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --master local[4] \
  SimpleApp.py
...
Lines with a: 46, Lines with b: 23

