This article shows you how to get started with Spark. The content is concise and easy to understand, and I hope you get something useful out of the detailed introduction that follows.
1. Brief introduction to Spark
Spark was born in 2009 in the AMPLab at UC Berkeley. At the time it was just an experimental project with a very small code base, a lightweight framework. In 2010, Berkeley officially open-sourced the Spark project. In June 2013, Spark became a project under the Apache Foundation and entered a period of rapid development; third-party developers contributed a lot of code and activity was very high. In February 2014, Spark became an Apache top-level project at remarkable speed. At the same time, the big data company Cloudera announced that it would increase its investment in the Spark framework to replace MapReduce. In April 2014, the big data company MapR joined the Spark camp, and Apache Mahout abandoned MapReduce in favor of Spark as its computing engine. Spark 1.0.0 was released in May 2014. Since 2015, Spark has become more and more popular in the domestic IT industry, and more and more companies have begun to deploy or use Spark to replace traditional big data parallel computing frameworks such as MR2, Hive, and Storm.
2. What is Spark?
Apache Spark™ is a unified analytics engine for large-scale data processing. Spark is a general-purpose, memory-based parallel computing framework for large data sets, designed to make data analysis faster. Spark includes the computing frameworks commonly used in the big data field:
Spark Core (offline computing)
Spark SQL (interactive queries)
Spark Streaming (real-time computing)
Spark MLlib (machine learning)
Spark GraphX (graph computing)
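As a rough, minimal sketch (not from the original article) of how these frameworks share a single entry point, the following Scala program assumes Spark 2.x; the application name and the /tmp/people.json path are hypothetical.
import org.apache.spark.sql.SparkSession
object SparkComponentsSketch {
  def main(args: Array[String]): Unit = {
    // Spark 2.x entry point; "local[*]" is only for a quick local test
    val spark = SparkSession.builder()
      .appName("components-sketch")
      .master("local[*]")
      .getOrCreate()
    // Spark Core: the low-level RDD API (offline / batch computing)
    val rdd = spark.sparkContext.parallelize(1 to 100)
    println(rdd.sum())
    // Spark SQL: interactive queries over structured data
    // ("/tmp/people.json" is a hypothetical example file)
    // val people = spark.read.json("/tmp/people.json")
    // people.createOrReplaceTempView("people")
    // spark.sql("SELECT count(*) FROM people").show()
    spark.stop()
  }
}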
3. Can Spark replace hadoop?
Not exactly.
Spark Core can only replace MR for offline computation; data storage still depends on HDFS.
The combination of Spark + Hadoop is the hottest and most promising combination in the big data field for the future.
4. Characteristics of Spark
Speed
In-memory computation is up to 100 times faster than MR, and disk-based computation is more than 10 times faster than MR.
Easy to use
Provides API interfaces for the Java, Scala, Python and R languages.
One-stop solution
Spark Core (offline computing), Spark SQL (interactive queries), Spark Streaming (real-time computing).
Runs on a variety of platforms
YARN, Mesos, standalone
5. Shortcomings of Spark
The JVM memory overhead is high: 1 GB of data typically consumes about 5 GB of memory (Project Tungsten is trying to solve this problem).
There is no effective shared-memory mechanism between different Spark applications (Project Tachyon is trying to introduce distributed memory management so that different Spark applications can share cached data).
6. Spark vs MR
6.1 Limitations of MR
The level of abstraction is low; everything has to be coded by hand, which makes it hard to use.
Only two operations, Map and Reduce, are provided, so expressiveness is limited.
A Job has only two phases, Map and Reduce; complex computations require a large number of Jobs, and the dependencies between Jobs have to be managed by the developers themselves.
Intermediate results (the output of reduce) are also written to the HDFS file system.
High latency: it is only suitable for batch data processing, and its support for interactive, real-time, and iterative data processing is insufficient.
6.2 Which of MR's problems does Spark solve?
The level of abstraction is low and code has to be written by hand, which makes it hard to use.
In Spark, data processing is abstracted through RDDs (Resilient Distributed Datasets).
Only two operations, Map and Reduce, are provided, which limits expressiveness.
Spark provides many operators.
A Job has only two phases: Map and Reduce.
A Spark job can have many stages.
Intermediate results are also written to the HDFS file system (slow).
Spark keeps intermediate results in memory; when they have to be written out, they go to local disk rather than HDFS.
High latency: only suitable for batch data processing; support for interactive and real-time data processing is insufficient.
Spark SQL and Spark Streaming address interactive and real-time processing.
The performance of iterative data processing is poor.
Spark improves the performance of iterative computation by caching data in memory (see the sketch after this list).
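To make the contrast concrete, here is a minimal Scala sketch (not from the original article) that chains several operators, produces more than one stage through a shuffle, and caches an intermediate result for reuse; it borrows the hdfs://ns1/sparktest/ path from the word-count example later in this article and runs locally only for illustration.
import org.apache.spark.{SparkConf, SparkContext}
object MrVsSparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mr-vs-spark").setMaster("local[*]"))
    // many operators beyond map and reduce: flatMap, filter, reduceByKey, sortBy, ...
    val words = sc.textFile("hdfs://ns1/sparktest/")
      .flatMap(_.split(","))
      .filter(_.nonEmpty)
    // reduceByKey introduces a shuffle, so this single job already has more than one stage
    val counts = words.map((_, 1)).reduceByKey(_ + _)
    // intermediate results can be cached in memory and reused by later jobs,
    // which is what makes iterative processing cheap compared with MR
    counts.cache()
    println(counts.count())                  // first action: computes and caches the RDD
    counts.sortBy(_._2, ascending = false)   // second job reuses the cached counts
      .take(10)
      .foreach(println)
    sc.stop()
  }
}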
It is therefore a clear trend that Hadoop MapReduce will be replaced by a new generation of big data processing platforms, and among this new generation Spark is currently the most widely recognized and supported.
7. Version of Spark
Spark 1.6.3: Scala 2.10.5
Spark 2.2.0: Scala 2.11.8 (Spark 2.x is recommended for new projects)
Hadoop 2.7.5
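For a project built with sbt, dependencies matching the versions above would look roughly like the following sketch (the "provided" scope is an assumption for cluster deployment; adjust it to your own build).
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.2.0" % "provided"
)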
8. Installation of Spark stand-alone version
Prepare the installation package spark-2.2.0-bin-hadoop2.7.tgz
tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C /opt/
mv /opt/spark-2.2.0-bin-hadoop2.7/ /opt/spark
Modify spark-env.sh
export JAVA_HOME=/opt/jdk
export SPARK_MASTER_IP=uplooking01
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=4
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=2g
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
Configure environment variables
# configure the Spark environment variables
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Start the stand-alone spark
start-all-spark.sh
Check the startup status
http://uplooking01:8080
9. Installation of Spark distributed cluster
Configure spark-env.sh
[root@uplooking01 /opt/spark/conf]
export JAVA_HOME=/opt/jdk
# configure the master host
export SPARK_MASTER_IP=uplooking01
# configure the master communication port
export SPARK_MASTER_PORT=7077
# configure the number of cpu cores spark uses on each worker
export SPARK_WORKER_CORES=4
# configure one worker instance per host
export SPARK_WORKER_INSTANCES=1
# configure 2g of memory per worker
export SPARK_WORKER_MEMORY=2g
# hadoop configuration file directory
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
Configure slaves
[root@uplooking01 /opt/spark/conf]
uplooking03
uplooking04
uplooking05
Distribute spark
[root@uplooking01 /opt/spark/conf]
scp -r /opt/spark uplooking02:/opt/
scp -r /opt/spark uplooking03:/opt/
scp -r /opt/spark uplooking04:/opt/
scp -r /opt/spark uplooking05:/opt/
Distribute environment variables configured on uplooking01
[root@uplooking01 /]
scp -r /etc/profile uplooking02:/etc/
scp -r /etc/profile uplooking03:/etc/
scp -r /etc/profile uplooking04:/etc/
scp -r /etc/profile uplooking05:/etc/
Start spark
[root@uplooking01 /] start-all-spark.sh
10. Spark high availability cluster
Stop the running spark cluster first
Modify spark-env.sh
# comment out the following two lines
# export SPARK_MASTER_IP=uplooking01
# export SPARK_MASTER_PORT=7077
Add content
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=uplooking03:2181,uplooking04:2181,uplooking05:2181 -Dspark.deploy.zookeeper.dir=/spark"
Distribute the modified configuration
scp /opt/spark/conf/spark-env.sh uplooking02:/opt/spark/conf
scp /opt/spark/conf/spark-env.sh uplooking03:/opt/spark/conf
scp /opt/spark/conf/spark-env.sh uplooking04:/opt/spark/conf
scp /opt/spark/conf/spark-env.sh uplooking05:/opt/spark/conf
Start the cluster
[root@uplooking01 /] start-all-spark.sh
[root@uplooking02 /] start-master.sh
11. The first Spark-Shell program
spark-shell --master spark://uplooking01:7077
# at startup, spark-shell can specify the resources used by the spark-shell application (total number of cores and the memory used on each worker)
spark-shell --master spark://uplooking01:7077 --total-executor-cores 6 --executor-memory 1g
# if not specified, by default all cores on each worker and 1g of memory on each worker are used
sc.textFile("hdfs://ns1/sparktest/").flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).collect
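Broken down step by step, the one-liner above corresponds roughly to the following (same path, run inside spark-shell):
// inside spark-shell the SparkContext is already available as sc
val lines  = sc.textFile("hdfs://ns1/sparktest/")   // read the text files under this HDFS path
val words  = lines.flatMap(_.split(","))            // split each line on commas
val pairs  = words.map((_, 1))                      // pair each word with a count of 1
val counts = pairs.reduceByKey(_ + _)               // sum the counts per word (this causes a shuffle)
counts.collect().foreach(println)                   // bring the results back to the driver and print them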
12. Roles in Spark
Master
The Master is responsible for receiving submitted job requests and for scheduling resources (starting CoarseGrainedExecutorBackend processes on the Workers).
Worker
The Executors on a Worker are responsible for executing tasks.
Spark-Submitter ==> Driver
Submits the Spark application to the Master.
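As a rough sketch (not from the original article) of how these roles show up in code: the driver program creates a SparkContext that points at the Master; the Master schedules Executors on the Workers, and the Executors run the tasks. The object name below is hypothetical.
import org.apache.spark.{SparkConf, SparkContext}
object DriverRoleSketch {
  def main(args: Array[String]): Unit = {
    // the driver registers the application with the Master at spark://uplooking01:7077;
    // the Master then schedules Executors on the Workers, and the Executors run the tasks
    val conf = new SparkConf()
      .setAppName("driver-role-sketch")
      .setMaster("spark://uplooking01:7077")
    val sc = new SparkContext(conf)
    // each action here is split into tasks that run inside the Executors
    println(sc.parallelize(1 to 1000).map(_ * 2).sum())
    sc.stop()
  }
}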
The above is how to get started with Spark; hopefully you have picked up some new knowledge or skills from it.