The Spark framework is written in Scala and runs on the Java Virtual Machine (JVM). Client applications can be written in Python, Java, Scala, or R.
Download Spark
Visit http://spark.apache.org/downloads.html to select the precompiled version for download.
Unpack Spark
Open a terminal, change to the directory containing the downloaded Spark archive, and then extract it.
You can use the following commands:
cd ~
tar -xf spark-2.2.2-bin-hadoop2.7.tgz -C /opt/module/
cd /opt/module/spark-2.2.2-bin-hadoop2.7
ls
Note: the x flag tells tar to extract the archive, and the f flag specifies the archive file name.
Spark main directory structure

README.md: brief instructions for getting started with Spark
bin: executable files used to interact with Spark in various ways
core, streaming, python: source code of the main Spark components
examples: runnable Spark example programs, very helpful for learning the Spark API
Running examples and interactive shells

Run an example:
./bin/run-example SparkPi 10

Scala shell:
./bin/spark-shell --master local[2]    # the --master option sets the run mode: local runs locally with one thread; local[N] runs locally with N threads

Python shell:
./bin/pyspark --master local[2]

R shell:
./bin/sparkR --master local[2]

Submit application scripts (submission in multiple languages is supported):
./bin/spark-submit examples/src/main/python/pi.py 10
./bin/spark-submit examples/src/main/r/dataframe.R
...

Interactive analysis with the Spark shell

Scala
Interactive analysis using the spark-shell script.
Basic

scala> val textFile = spark.read.textFile("README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]

scala> textFile.count() // Number of items in this Dataset
res0: Long = 126 // May differ from yours, since README.md changes over time; the same applies to the other outputs

scala> textFile.first() // First item in this Dataset
res1: String = # Apache Spark

// Return a subset of the original Dataset using the filter transformation
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]

// Chain a transformation and an action together
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15

Advanced

// Find the line with the most words, using Dataset transformations and actions
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15

// Count words
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).groupByKey(identity).count()
wordCounts: org.apache.spark.sql.Dataset[(String, Long)] = [value: string, count(1): bigint]

scala> wordCounts.collect()
res6: Array[(String, Long)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)
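The collected counts above come back unordered. As a small optional extension of the same session (not part of the original walkthrough), the result can be sorted before inspection; the column name count(1) is taken from the Dataset schema printed above, and the $ column syntax is available because spark-shell imports spark.implicits._ automatically.

scala> // Hypothetical follow-up: show the five most frequent words (output omitted)
scala> wordCounts.orderBy($"count(1)".desc).show(5)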
Python

Interactive analysis using the pyspark script.
Basic

>>> textFile = spark.read.text("README.md")

>>> textFile.count()  # Number of rows in this DataFrame
126

>>> textFile.first()  # First row in this DataFrame
Row(value=u'# Apache Spark')

# Filter
>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))

# Chain a transformation and an action together
>>> textFile.filter(textFile.value.contains("Spark")).count()  # How many lines contain "Spark"?
15

Advanced

# Find the line with the most words
>>> from pyspark.sql.functions import *
>>> textFile.select(size(split(textFile.value, "\s+")).name("numWords")).agg(max(col("numWords"))).collect()
[Row(max(numWords)=15)]

# Count words
>>> wordCounts = textFile.select(explode(split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
>>> wordCounts.collect()
[Row(word=u'online', count=1), Row(word=u'graphs', count=1), ...]

Standalone applications
In addition to being used interactively, Spark can also be embedded in standalone applications written in Java, Scala, or Python.
The main difference between a standalone application and the shell is that the application has to initialize its own SparkSession.
Scala
Count the number of lines that contain "a" and the number that contain "b", respectively.
/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
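The spark-submit command in the next step expects the application to be packaged as target/scala-2.11/simple-project_2.11-1.0.jar. That jar is typically produced with sbt; a minimal build.sbt along the following lines would work, where the Scala and Spark versions are assumptions chosen to match the Spark 2.2.2 download used above and should be adjusted to your environment.

// build.sbt (minimal sketch; version numbers are assumptions)
name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.2"

Running sbt package from the project root then builds the jar referenced below.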
Running the application
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.11/simple-project_2.11-1.0.jar
...
Lines with a: 46, Lines with b: 23

Java
Count the number of lines that contain "a" and the number that contain "b", respectively.
/* SimpleApp.java */
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();
    Dataset<String> logData = spark.read().textFile(logFile).cache();
    long numAs = logData.filter(s -> s.contains("a")).count();
    long numBs = logData.filter(s -> s.contains("b")).count();
    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    spark.stop();
  }
}
Running the application
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/simple-project-1.0.jar
...
Lines with a: 46, Lines with b: 23

Python
Count the number of lines that contain "a" and the number that contain "b", respectively.
If you package the application, add the following to your setup.py script:

install_requires=[
    'pyspark=={site.SPARK_VERSION}'
]

"""SimpleApp.py"""
from pyspark.sql import SparkSession

logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()
Running the application
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --master local[4] \
  SimpleApp.py
...
Lines with a: 46, Lines with b: 23