Most readers are not familiar with the topics covered in "How to Become a Big Data Spark Master", so the editor has summarized them below. The content is detailed, the steps are clear, and it has real reference value. I hope you gain something from it; let's take a look.
Stage 1: Become proficient in Scala and Java
The Spark framework is written in Scala, and the code is exquisite and elegant. To become a Spark master, you must read the Spark source code, and that requires mastering Scala.
Although Spark applications can now be developed in several languages, including Java and Python, the fastest and best development API is still, and will remain, the Scala API, so you must master Scala to write complex, high-performance Spark distributed programs.
In particular, you need to be proficient in Scala's traits, apply methods, functional programming, generics, and covariance and contravariance (see the sketch at the end of this stage).
Master Java multithreading, Netty, RPC, ClassLoaders, the JVM runtime environment, and so on (reading source code is required).
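Before reading the Spark source code, it helps to have these Scala features at your fingertips. Below is a minimal, self-contained sketch (all names are invented for illustration) covering traits, the companion-object apply method, functional collection operators, generics, and covariance/contravariance:

// Trait mixed into classes, much like Spark mixes in Logging or Serializable.
trait Describable {
  def describe: String
}

// Covariant container: if Record <: Describable, then Box[Record] <: Box[Describable].
class Box[+A](val value: A)

// Contravariant consumer: a Printer[Any] can stand in wherever a Printer[Record] is expected.
trait Printer[-A] {
  def print(a: A): Unit
}

// Case class: the compiler generates a companion object with apply(), so no `new` is needed.
case class Record(name: String, version: Int) extends Describable {
  def describe: String = s"$name v$version"
}

object ScalaFeaturesDemo {
  def main(args: Array[String]): Unit = {
    val records = List(Record("spark", 3), Record("scala", 2))

    // Functional style: map/filter with function literals, the same shape as RDD transformations.
    records.filter(_.version >= 2).map(_.describe).foreach(println)

    val box: Box[Describable] = new Box(Record("kafka", 3)) // covariance in action
    println(box.value.describe)

    val anyPrinter: Printer[Any] = (a: Any) => println(s"got: $a")
    val recordPrinter: Printer[Record] = anyPrinter // contravariance in action
    recordPrinter.print(Record("flink", 1))
  }
}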
Stage 2: Master the APIs that the Spark platform provides to developers
Master the RDD-oriented development model and Spark's deployment modes: local (for debugging), Standalone, YARN, and so on; master the use of the various transformation and action functions (a sketch follows at the end of this stage).
Master wide and narrow dependencies in Spark, as well as the lineage mechanism.
Master the computation flow of an RDD, such as how stages are divided, the basic process of submitting a Spark application to the cluster, and the basic working principles of Worker nodes.
Be proficient in the mechanics and tuning of Spark on YARN.
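As a concrete reference, here is a minimal sketch (assuming the usual spark-core dependency and a local installation; the data is synthetic) that runs in local debugging mode, chains narrow transformations, triggers a shuffle through a wide dependency, and finishes with an action:

import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    // local[*] is the local debugging deployment mode mentioned above.
    val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.parallelize(Seq("a b a", "b c", "a c c"))

    // Narrow dependencies: flatMap/map keep each partition's data in place.
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Wide dependency: reduceByKey shuffles data and introduces a stage boundary.
    val counts = pairs.reduceByKey(_ + _)

    // Action: collect() submits the job; everything above was lazy.
    counts.collect().foreach { case (word, n) => println(s"$word -> $n") }

    sc.stop()
  }
}

Submitting the same program to Standalone or YARN only changes the master setting (or the --master flag of spark-submit); the transformation/action code stays the same.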
Stage 3: Go deep into the Spark kernel
This stage is mainly about going deep into the Spark kernel by studying the source code of the Spark framework:
Master Spark's job submission process through the source code.
Master Spark's cluster task scheduling through the source code.
In particular, be proficient in the details of every step of the work inside the DAGScheduler, TaskScheduler, Driver, and Executor nodes.
Understand the runtime environment of the Driver and Executors and the RPC flow between them.
Understand the caching and temporary-data cleanup mechanisms, such as cached RDDs, Checkpoint, and Shuffle files.
Be proficient in the principles of BlockManager, Broadcast, Accumulator, caching, and other mechanisms (the user-facing side of these is sketched at the end of this stage).
Be proficient in the Shuffle source code and its tuning.
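The kernel mechanisms above also have a user-facing side. The following minimal sketch (local mode; the checkpoint directory is a placeholder) exercises broadcast variables, an accumulator, cache() and checkpoint(), which is a useful starting point before tracing how the BlockManager and the schedulers implement them:

import org.apache.spark.{SparkConf, SparkContext}

object KernelMechanisms {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kernel-demo").setMaster("local[*]"))
    sc.setCheckpointDir("/tmp/spark-checkpoints") // placeholder directory

    val data = sc.parallelize(1 to 1000)

    // Broadcast: ship a read-only value to every executor once instead of with every task.
    val threshold = sc.broadcast(500)

    // Accumulator: executors add to it, only the driver reads the final value.
    val dropped = sc.longAccumulator("dropped-records")

    val filtered = data.filter { x =>
      val keep = x <= threshold.value
      if (!keep) dropped.add(1)
      keep
    }

    // cache() keeps the RDD's blocks in memory via the BlockManager for reuse;
    // checkpoint() truncates the lineage by persisting to reliable storage.
    filtered.cache()
    filtered.checkpoint()

    println(s"kept=${filtered.count()}, dropped=${dropped.value}")
    sc.stop()
  }
}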
Stage 4: Mastery of Spark Streaming
Spark is a dominant framework of the cloud-computing and big data era, and its Spark Streaming component is a basic prerequisite for quasi-real-time processing in enterprises, so big data practitioners need to master it proficiently:
Spark Streaming is an excellent quasi-real-time stream-processing framework; you need to master its DStream abstraction, transformations, checkpointing, and so on.
Skillfully master the two ways of integrating Kafka with Spark Streaming (receiver-based and direct) and how to tune them; the direct approach is sketched at the end of this stage.
Be proficient in the principles and capabilities of Structured Streaming and its integration with Kafka.
Be proficient in the Spark Streaming source code, especially the source-level principles of the two Kafka integration approaches.
Be proficient with the Spark Streaming web UI and its metrics, such as per-batch processing time, scheduling delay, and the waiting-batch queue, and be able to tune based on these indicators.
Be able to build a custom monitoring system.
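For reference, here is a minimal sketch of the Kafka direct approach with the DStream API (assuming the spark-streaming-kafka-0-10 artifact is on the classpath; the broker address, topic, group id and checkpoint path are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaDirectDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-direct").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/streaming-checkpoint") // placeholder path

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "demo-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Direct stream: each micro-batch reads offsets straight from the Kafka partitions,
    // producing one RDD partition per Kafka partition.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
    )

    stream.map(_.value)
      .flatMap(_.split(" "))
      .map((_, 1L))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The per-batch processing time and scheduling delay of this job are exactly the indicators that show up on the Streaming tab of the web UI.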
Stage 5: Mastery of Spark SQL
In enterprise environments, data warehouse workloads are still the majority. Given the demands on timeliness, Spark SQL is our favorite choice as the warehouse analysis engine (the two clusters Langjian is responsible for are mainly used for computational analysis with Spark SQL):
For Spark SQL you need to understand the concept of a Dataset, how it differs from an RDD, and its various operators (a sketch follows at the end of this stage).
Understand the difference between permanent tables backed by Hive and temporary views that do not require Hive.
Spark SQL plus the Hive metastore is basically the standard setup, both for SQL support and for permanent-table features.
Master the available storage formats (e.g. Parquet, ORC) and how their performance compares.
You should also be familiar with how Spark SQL's optimizer, Catalyst, works.
Understand the principle of Dataset chained computation in Spark SQL and the source code that translates logical plans into physical plans (not essential; SQL source-level tuning rarely comes up in interviews or in enterprises).
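To make the Dataset and table concepts concrete, here is a minimal sketch (local mode; the table name and schema are invented, and enableHiveSupport assumes a configured Hive metastore, otherwise that line can be dropped):

import org.apache.spark.sql.SparkSession

object SqlBasics {
  case class Order(id: Long, product: String, amount: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-basics")
      .master("local[*]")
      .enableHiveSupport() // assumes a Hive metastore is available
      .getOrCreate()
    import spark.implicits._

    // Dataset: typed rows with a known schema, optimized by Catalyst;
    // a plain RDD has no schema and bypasses the optimizer entirely.
    val orders = Seq(Order(1, "book", 12.5), Order(2, "pen", 2.0)).toDS()

    // Temporary view: session-scoped, never written to the metastore.
    orders.createOrReplaceTempView("orders_tmp")
    spark.sql("SELECT product, SUM(amount) AS total FROM orders_tmp GROUP BY product").show()

    // Permanent table: registered in the metastore and survives the session.
    orders.write.mode("overwrite").format("parquet").saveAsTable("orders")

    // explain(true) prints the logical and physical plans produced by Catalyst.
    orders.groupBy("product").count().explain(true)

    spark.stop()
  }
}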
Stage 6: Master Spark-based machine learning and graph computation
The use of Spark as an engine for machine learning and deep learning analysis in enterprise environments is also increasing, and there are many ways to combine them (a spark.ml sketch follows after the lists below):
JVM (Java/Scala) stack:
Spark ML/MLlib, Spark's own machine learning library, which is gradually being combined with open-source deep learning and NLP frameworks (spaCy, CoreNLP, OpenNLP, Mallet, GATE, Weka, UIMA, NLTK, gensim, Negex, word2vec, GloVe).
At present these combinations are more commonly used than DeepLearning4j.
Python stack:
PySpark
Combining Spark with TensorFlow
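As a reference point for the JVM stack, here is a minimal spark.ml Pipeline sketch (synthetic data, default hyperparameters): raw columns are assembled into a feature vector and a logistic regression model is fitted.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MlPipelineDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ml-pipeline").master("local[*]").getOrCreate()

    // Tiny synthetic training set: two numeric features and a binary label.
    val training = spark.createDataFrame(Seq(
      (0.0, 1.0, 0.0),
      (1.0, 0.2, 1.0),
      (0.3, 0.9, 0.0),
      (0.9, 0.1, 1.0)
    )).toDF("f1", "f2", "label")

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")

    val lr = new LogisticRegression().setMaxIter(10)

    val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
    model.transform(training).select("features", "label", "prediction").show()

    spark.stop()
  }
}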
Stage 7: Master the ecosystem around Spark
Using Spark in an enterprise inevitably involves the surrounding Spark ecosystem. Here are a few commonly used software frameworks (a sketch of a typical input/output integration follows after the list):
Hadoop family: Kafka, HDFS, YARN
Input sources and result outputs, mainly: MySQL/Redis/HBase/MongoDB
In-memory acceleration frameworks: Redis, Alluxio
Elasticsearch, Solr
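A very common pattern across these edge systems is reading input from HDFS and writing results to a relational store. Here is a minimal sketch (the HDFS path, JDBC URL, table name and credentials are all placeholders; the MySQL JDBC driver must be on the classpath):

import org.apache.spark.sql.{SaveMode, SparkSession}

object EdgeIoDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("edge-io").getOrCreate()

    // Input source: Parquet files on HDFS (placeholder path).
    val events = spark.read.parquet("hdfs:///warehouse/events")

    val daily = events.groupBy("event_date").count()

    // Result output: a MySQL table over JDBC (placeholder connection settings).
    daily.write
      .mode(SaveMode.Overwrite)
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/reports")
      .option("dbtable", "daily_event_counts")
      .option("user", "report_user")
      .option("password", "change_me")
      .save()

    spark.stop()
  }
}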
Stage 8: Complete a production-grade Spark project
Work through a complete, representative Spark project that covers every aspect of Spark: the project's architecture design, analysis of the technologies used, development and implementation, operations and maintenance, and so on. Fully grasp each stage and its details so that you can calmly face the vast majority of Spark projects in the future.
Stage 9: Provide Spark solutions
Thoroughly master every detail of the Spark framework's source code.
Provide Spark solutions for different scenarios according to the needs of different business cases.
Based on actual needs, carry out secondary development on top of the Spark framework and build your own Spark framework.
That covers "How to Become a Big Data Spark Master"; I believe you now have a general understanding of it. I hope the content shared by the editor is helpful to you. If you want to learn more, please follow the industry information channel.