In this article, the editor shares how to install, configure, and make basic use of Spark. It is shared for your reference; I hope you find it helpful.
7. Spark
This topic describes the installation, configuration, and basic use of Spark.
Spark Basic Information
Official website: http://spark.apache.org/
Official tutorial: http://spark.apache.org/docs/latest/programming-guide.html

7.1. Environment preparation

# switch to the workspace
cd /opt/workspaces
# create the Spark data directory
mkdir data/spark
# create the Spark log directory
mkdir logs/spark
Official tutorial
http://spark.apache.org/docs/latest/spark-standalone.html

7.2. Install

wget http://mirrors.hust.edu.cn/apache/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
tar -zxf spark-1.6.1-bin-hadoop2.6.tgz
rm -rf spark-1.6.1-bin-hadoop2.6.tgz
mv spark-1.6.1-bin-hadoop2.6 ./frameworks/spark

7.3. Configuration (pseudo-distributed)
vi ./frameworks/spark/conf/spark-env.sh

export SPARK_MASTER_IP=bd
export SPARK_MASTER_PORT=7077
export MASTER=spark://${SPARK_MASTER_IP}:${SPARK_MASTER_PORT}
# specify the Spark data directory
export SPARK_LOCAL_DIRS=/opt/workspaces/data/spark/
# specify the Spark log directory
export SPARK_LOG_DIR=/opt/workspaces/logs/spark/
# specify the JDK directory
export JAVA_HOME=/opt/env/java
# specify the Scala directory
export SCALA_HOME=/opt/env/scala

7.4. Start and stop

./frameworks/spark/sbin/start-all.sh

7.5. Test

# run the bundled pi-calculation example
./frameworks/spark/bin/run-example org.apache.spark.examples.SparkPi

# if the run fails, increase the memory values below
./frameworks/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://bd:6066 \
  --deploy-mode cluster \
  --driver-memory 512m \
  --executor-memory 256m \
  ./frameworks/spark/lib/spark-examples-1.6.1-hadoop2.6.0.jar \
  1000

7.6. Word Count
http://spark.apache.org/docs/latest/quick-start.html

Word Count

./frameworks/spark/bin/spark-shell

// basic
val textFile = sc.textFile("./frameworks/spark/README.md")
val words = textFile.flatMap(line => line.split(" "))
val exchangeVal = words.map(word => (word, 1))
val count = exchangeVal.reduceByKey((a, b) => a + b)
count.collect

// optimized
sc.textFile("./frameworks/spark/README.md").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect

// with sort
sc.textFile("./frameworks/spark/README.md").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).map(_.swap).sortByKey(false).map(_.swap).collect

// final version: keep only real words and save the result to HDFS
val wordR = """\w+""".r
sc.textFile("./frameworks/spark/README.md").flatMap(_.split(" ")).filter(wordR.pattern.matcher(_).matches).map((_, 1)).reduceByKey(_ + _).map(_.swap).sortByKey(false).map(_.swap).saveAsTextFile("hdfs://bd:9000/wordcount")
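As a quick check, the output written by the final version can be read back in the same spark-shell session. This is a small sketch; it assumes the HDFS path used above and that sc is still available:

// read the saved word counts back from HDFS and print the ten most frequent words
// (the output is already sorted descending because of sortByKey(false) above)
sc.textFile("hdfs://bd:9000/wordcount").take(10).foreach(println)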
You can visit http://bd:8080 (the master web UI) to view the job.

7.7. Parameter description
Where to configure:
Spark properties are set per application, either through a SparkConf object in code or via Java system properties.
Environment variables control per-node settings such as the IP address and port; they are configured in conf/spark-env.sh.
Logging is configured through log4j.properties.
Spark properties
Specify the configuration in the code
val conf = new SparkConf()
  // run with 2 local threads; in local mode we can use n threads (n >= 1),
  // but scenarios such as Spark Streaming may need more than one
  .setMaster("local[2]")
  .setAppName("CountingSheep")
val sc = new SparkContext(conf)
Specify the configuration in the script
./bin/spark-submit --name "My app" --master local[4] --conf spark.eventLog.enabled=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
Table 1. Common configuration
Attribute name | Default value | Description
spark.app.name | (none) | The name of the Spark application
spark.driver.cores | 1 | Number of cores used by the driver process in cluster mode
spark.driver.memory | 1g | Total amount of memory available to the driver process (e.g. 1g, 2g); in client mode, setting this in the application has no effect, so use --driver-memory on the command line or set it in the default properties file instead
spark.executor.memory | 1g | Total amount of memory used by a single executor (e.g. 2g, 8g)
spark.master | (none) | Cluster manager URL
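To illustrate Table 1, here is a minimal sketch that sets a few of these properties through SparkConf; the application name, memory size, and master URL are example values only:

val conf = new SparkConf()
  .setAppName("MyApp")                     // spark.app.name
  .set("spark.executor.memory", "2g")      // memory per executor
  .set("spark.master", "spark://bd:7077")  // cluster manager URL
val sc = new SparkContext(conf)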
Environment variables
Environment variables are set in the ${SPARK_HOME}/conf/spark-env.sh script.
Table 2. Common configuration
Scope | Property name | Default value | Description
- | JAVA_HOME | - | Java installation directory
- | SCALA_HOME | - | Scala installation directory
- | SPARK_LOCAL_IP | - | Locally bound IP
- | SPARK_LOG_DIR | ${SPARK_HOME}/logs | Log directory
Standalone | SPARK_MASTER_IP | (current IP) | Master IP
Standalone | SPARK_MASTER_PORT | 7077 (6066) | Master port
Standalone | MASTER | - | Default master URL
Standalone | SPARK_WORKER_CORES | All | Upper limit of CPU cores used per node
Standalone | SPARK_WORKER_MEMORY | All memory on the node minus 1 GB | Upper limit of memory used per node
Standalone | SPARK_WORKER_INSTANCES | 1 | Number of worker instances started per node
Standalone | SPARK_WORKER_PORT | Random | Port bound by the worker
If your slave nodes are powerful, you can set SPARK_WORKER_INSTANCES to a value greater than 1; in that case you should also set SPARK_WORKER_CORES to limit the number of CPU cores used by each worker instance, otherwise every worker instance will try to use all of the cores.
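For example, a hypothetical spark-env.sh fragment for a powerful slave node (the numbers are illustrative only):

# two worker instances per node, each capped at 4 cores and 4 GB of memory
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=4g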
Logging
Logs are configured in ${SPARK_HOME}/conf/log4j.properties.
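As a minimal illustration, following the format of the log4j.properties.template shipped with Spark (the WARN level is only an example, used here to quiet the console):

# print only warnings and errors to the console
log4j.rootCategory=WARN, console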
Hadoop cluster configuration
When using HDFS, you need to copy hdfs-site.xml and core-site.xml from Hadoop to the classpath of Spark
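For example, the files can be copied into ./frameworks/spark/conf, which is on Spark's classpath; the ./frameworks/hadoop path below is an assumption based on the directory layout used elsewhere in this series:

# assumes Hadoop 2.x is installed under ./frameworks/hadoop
cp ./frameworks/hadoop/etc/hadoop/hdfs-site.xml ./frameworks/spark/conf
cp ./frameworks/hadoop/etc/hadoop/core-site.xml ./frameworks/spark/conf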
http://spark.apache.org/docs/latest/configuration.html

7.8. Resource scheduling
Standalone mode currently supports only a simple first-in, first-out (FIFO) scheduler. The scheduler can still serve multiple users, because you can cap the maximum resources used by each application. By default, a Spark application requests all CPU cores in the cluster.
Restrict resources in your code
val conf = new SparkConf()
  .setMaster(...)
  .setAppName(...)
  .set("spark.cores.max", "10")
val sc = new SparkContext(conf)
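The cap can also be passed per job on the spark-submit command line with --total-executor-cores; a sketch that reuses the SparkPi example from section 7.5:

./frameworks/spark/bin/spark-submit \
  --master spark://bd:7077 \
  --total-executor-cores 10 \
  --class org.apache.spark.examples.SparkPi \
  ./frameworks/spark/lib/spark-examples-1.6.1-hadoop2.6.0.jar \
  100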
Restrict resources in the configuration file spark-env.sh
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores="

7.9. Performance tuning
http://spark.apache.org/docs/latest/tuning.html

7.10. Hardware configuration
Each node:
* 4-8 disks
* more than 8 GB of memory
* Gigabit network card
* 8-16 core CPU
At least 3 nodes
http://spark.apache.org/docs/latest/hardware-provisioning.html

7.11. Integrate Hive
Add configuration items in spark-env.sh
# Hive directory
export HIVE_HOME=$HIVE_HOME
SPARK_CLASSPATH
Some tutorials say you should add:
export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-x.jar:$SPARK_CLASSPATH
However, this setting is not required in the current version, and adding it causes Zeppelin to fail with:
org.apache.spark.SparkException: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former.
Copy several configuration files for Hive
cp ./frameworks/hive/conf/hive-site.xml ./frameworks/spark/conf
cp ./frameworks/hive/conf/hive-log4j.properties ./frameworks/spark/conf
Start the thrift server to provide JDBC access to external clients
./frameworks/spark/sbin/start-thriftserver.sh
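Spark also ships a matching stop script to shut the thrift server down again:

./frameworks/spark/sbin/stop-thriftserver.sh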
Test connection
./frameworks/spark/bin/beeline
!connect jdbc:hive2://bd:10000
show tables;
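Besides beeline, once hive-site.xml is in place the Hive tables can also be queried directly from spark-shell; a minimal sketch for Spark 1.6.x:

// create a HiveContext on top of the existing SparkContext and list the Hive tables
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("show tables").show()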
That covers how to install, configure, and use Spark. Thank you for reading; I hope the article has been helpful.