I. An introduction to big data-related work
II. Skill requirements for a big data engineer
III. The big data learning path
An introduction to big data
In view of big data's four main characteristics (volume, variety, velocity, and value), we need to consider the following questions:
With such a wide range of data sources, how do we collect and aggregate the data? Tools such as Sqoop, Camel, and DataX emerged in answer.
Once collected, how do we store the data? Hence distributed file storage systems such as GFS, HDFS, and TFS.
Because data grows so quickly, storage must scale horizontally.
Once the data is stored, how do we quickly transform it into a consistent format, and how do we quickly compute the results we want?
Distributed computing frameworks such as MapReduce solve this problem, but writing MapReduce requires a great deal of Java code, so engines such as Hive and Pig appeared to translate SQL into MapReduce.
Plain MapReduce can only process data in batches, and the latency is too high. To produce a result for every arriving record, low-latency stream computing frameworks such as Storm/JStorm emerged.
But if you need both batch and stream processing, you have to run two clusters, a Hadoop cluster (HDFS + MapReduce + YARN) and a Storm cluster, which is hard to manage. Hence one-stop computing frameworks such as Spark, which handle both batch and stream processing (the latter, in essence, as micro-batching).
Finally, the Lambda and Kappa architectures appeared, providing general-purpose blueprints for business processing.
To improve efficiency and speed things up, a number of auxiliary tools have emerged:
Oozie, Azkaban: scheduled task (workflow) scheduling tools.
Hue, Zeppelin: graphical tools for managing task execution and viewing results.
Scala: the best language for writing Spark programs, though you can also choose Python.
Python: used for writing scripts.
Alluxio, Kylin, etc.: tools that speed up computation by preprocessing the stored data.
The above roughly enumerates the problems solved by the tools in the big data ecosystem, so you know why each of them appeared and what problem it addresses.
Main text
I. An introduction to big data-related work
At present, big data work is divided into three main directions:
Big data engineer
Data analyst
Big data scientist
Other (data mining, etc.)
II. Skill requirements for a big data engineer
Attached is the skill map for a big data engineer:
11 skills that must be mastered
Advanced Java (JVM, concurrency)
Basic operation of Linux
Hadoop (HDFS+MapReduce+Yarn)
HBase (JavaAPI operation + Phoenix)
Hive (basic HQL operations and an understanding of its principles)
Kafka
Storm/JStorm
Scala
Python
Spark (Core + Spark SQL + Spark Streaming)
Auxiliary tools (Sqoop/Flume/Oozie/Hue, etc.)
6 advanced skills
Machine learning algorithms, plus the Mahout and MLlib libraries
R language
Lambda architecture
Kappa architecture
Kylin
Alluxio
III. The big data learning path
Suppose you can set aside 3 hours of effective study time each weekday, plus 10 hours on each weekend day.
That gives roughly (21 × 3 + 4 × 2 × 10) × 3 = 429 hours of study over 3 months.
The first stage (the foundation stage)
1) Linux study (Brother Bird's Linux guide is enough) - 20 hours
Introduction and installation of Linux operating system.
Linux common commands.
Linux commonly used software installation.
Linux network.
Firewall.
Shell programming and so on.
2) Advanced Java study ("In-Depth Understanding of the Java Virtual Machine", "Java High Concurrency in Action") - 30 hours
Master multithreading.
Master the queues under the concurrency package (java.util.concurrent); a short sketch follows this list.
Learn about JMS.
Master JVM technology.
Master reflection and dynamic proxy.
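To make the JUC queues concrete, here is a minimal producer/consumer sketch. It is written in Scala (the language used for all code sketches in this post), which calls the java.util.concurrent classes directly; the queue capacity and item count are arbitrary choices:

```scala
import java.util.concurrent.{ArrayBlockingQueue, Executors, TimeUnit}

object BlockingQueueDemo {
  def main(args: Array[String]): Unit = {
    // A bounded JUC queue: put() blocks when full, take() blocks when
    // empty, which gives natural backpressure between the two threads.
    val queue = new ArrayBlockingQueue[Int](10)
    val pool  = Executors.newFixedThreadPool(2)

    // Producer thread.
    pool.submit(new Runnable {
      def run(): Unit = (1 to 100).foreach(queue.put(_))
    })
    // Consumer thread.
    pool.submit(new Runnable {
      def run(): Unit = (1 to 100).foreach(_ => println(queue.take()))
    })

    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
  }
}
```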
3) ZooKeeper learning
An introduction to the ZooKeeper distributed coordination service.
Installation and deployment of a ZooKeeper cluster.
ZooKeeper data structures and commands.
ZooKeeper's principles and its election mechanism (a client sketch follows this list).
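As a small taste of the client API, here is a minimal sketch that connects, creates an ephemeral znode, and reads it back; the connect string node1:2181 is a placeholder for your own cluster:

```scala
import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

object ZkDemo {
  def main(args: Array[String]): Unit = {
    // Connect with a 30-second session timeout; the watcher is notified
    // of session and node events.
    val zk = new ZooKeeper("node1:2181", 30000, new Watcher {
      def process(event: WatchedEvent): Unit = println(s"zk event: ${event.getType}")
    })

    // An ephemeral znode disappears when the session ends, which is the
    // building block behind liveness checks and leader election.
    zk.create("/demo", "hello".getBytes("UTF-8"),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)

    println(new String(zk.getData("/demo", false, null), "UTF-8"))
    zk.close()
  }
}
```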
The second stage (attack the key stage)
4) Hadoop ("Hadoop: The Definitive Guide") - 80 hours
HDFS
The concept and characteristics of HDFS.
Shell operation of HDFS.
The working mechanism of HDFS.
Java application development of HDFS.
MapReduce
Run the WordCount sample program (a sketch follows this list).
Understand the inner workings of MapReduce.
An analysis of how a MapReduce program runs.
The mechanism that determines the number of concurrent MapTasks.
Application of combiner components in MapReduce.
Serialization framework and application in MapReduce.
Sort in MapReduce.
Custom partition implementation in MapReduce.
The shuffle mechanism of MapReduce.
MapReduce uses data compression for optimization.
The relationship between MapReduce program and YARN.
Optimization of MapReduce parameters.
Java application development with MapReduce.
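For orientation, below is a minimal WordCount sketch against the standard Hadoop MapReduce API. It is usually written in Java; this version uses Scala, like the other sketches in this post, and takes input/output paths from the command line:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Mapper: emit (word, 1) for each token in the input line.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      ctx.write(word, one)
    }
}

// Reducer (also used as the combiner): sum the counts for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    ctx.write(key, new IntWritable(sum))
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenizerMapper])
    job.setMapperClass(classOf[TokenizerMapper])
    job.setCombinerClass(classOf[SumReducer])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```

Package this into a jar and submit it with hadoop jar; the input and output HDFS paths on the command line are examples.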
5) Hive ("Hive Development Guide") - 20 hours
Basic concepts of Hive
Hive application scenarios.
The relationship between Hive and Hadoop.
Hive compared with traditional databases.
Data storage mechanism of Hive.
Basic operation of Hive
DDL operation in Hive.
How to implement efficient JOIN query in Hive.
Built-in function application of Hive.
Advanced usage of Hive shell.
Hive common parameter configuration.
Hive custom functions and the use of Transform.
A Hive UDF/UDAF development example (a sketch follows this list).
Hive execution process analysis and optimization strategies.
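As one illustration of UDF development, here is a minimal sketch using Hive's classic UDF base class, written in Scala like the rest of this post's sketches; the jar path, function name, and table in the comments are hypothetical:

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// A trivial UDF that upper-cases a string column.
// Hive finds the evaluate() method by reflection.
class ToUpper extends UDF {
  def evaluate(input: Text): Text =
    if (input == null) null else new Text(input.toString.toUpperCase)
}

// From the Hive shell (jar path, function and table names are hypothetical):
//   ADD JAR /path/to/udf.jar;
//   CREATE TEMPORARY FUNCTION to_upper AS 'ToUpper';
//   SELECT to_upper(name) FROM users;
```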
6) HBase ("HBase: The Definitive Guide") - 20 hours
An introduction to HBase.
HBase installation.
The HBase data model.
HBase commands.
HBase development (a client sketch follows this list).
HBase principles.
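A minimal sketch of the HBase Java client API, called from Scala; the table name, column family, and row key are made-up examples:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseDemo {
  def main(args: Array[String]): Unit = {
    // Reads hbase-site.xml from the classpath to find the ZooKeeper quorum.
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("user"))

    // Write one cell: row "row1", column family "info", qualifier "name".
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"))
    table.put(put)

    // Read it back.
    val result = table.get(new Get(Bytes.toBytes("row1")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))))

    table.close()
    conn.close()
  }
}
```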
7) Scala ("Scala for the Impatient") - 20 hours
Overview of Scala.
Scala compiler installation.
The basics of Scala.
Arrays, maps, tuples, and collections.
Classes, objects, inheritance, and traits.
Pattern matching and case classes (a sketch follows this list).
Understand Scala Actor concurrent programming.
Understand Akka.
Understand Scala higher-order functions.
Understand Scala implicit conversions.
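A short sketch of case classes, pattern matching, and higher-order functions, the features above you will meet most often when reading Spark code; the Event types are invented for illustration:

```scala
// Case classes give you immutable data plus pattern matching for free.
sealed trait Event
case class Click(page: String)      extends Event
case class Purchase(amount: Double) extends Event

object ScalaFeatures {
  // Pattern matching with extraction and a guard.
  def describe(e: Event): String = e match {
    case Click(p)                 => s"clicked $p"
    case Purchase(a) if a > 100.0 => f"big purchase: $a%.2f"
    case Purchase(a)              => f"purchase: $a%.2f"
  }

  def main(args: Array[String]): Unit = {
    val events = List(Click("/home"), Purchase(20.0), Purchase(150.0))
    events.map(describe).foreach(println)   // map/foreach are higher-order functions

    // collect combines filtering and mapping via a partial function.
    val revenue = events.collect { case Purchase(a) => a }.sum
    println(s"revenue: $revenue")
  }
}
```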
8) Spark ("Spark: The Definitive Guide") - 60 hours
Spark Core
Overview of Spark.
Spark cluster installation.
Run the first Spark example program (estimating Pi); a sketch follows.
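A minimal sketch of that first program, the Monte Carlo estimate of Pi; the sample count is arbitrary, and the master URL is expected to come from spark-submit:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object SparkPi {
  def main(args: Array[String]): Unit = {
    // The master URL is supplied by spark-submit (use local[*] for a quick test).
    val sc = new SparkContext(new SparkConf().setAppName("SparkPi"))
    val n  = 1000000
    // Throw n random darts at the unit square; the fraction landing inside
    // the circle approximates Pi/4.
    val inside = sc.parallelize(1 to n).map { _ =>
      val x = Random.nextDouble() * 2 - 1
      val y = Random.nextDouble() * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * inside / n}")
    sc.stop()
  }
}
```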
RDD
Overview of RDDs.
Creating an RDD.
The RDD programming API (transformations and actions).
RDD dependency relationships.
RDD caching.
The DAG (directed acyclic graph); a sketch tying these together follows.
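A short spark-shell sketch tying these pieces together: transformations build the DAG lazily, cache() marks an RDD for reuse, and the first action triggers execution; the HDFS path is a placeholder:

```scala
// In spark-shell, where sc (the SparkContext) is predefined.
val lines  = sc.textFile("hdfs:///data/input.txt")   // create an RDD (path is a placeholder)
val counts = lines
  .flatMap(_.split("\\s+"))                          // transformation (narrow)
  .map(word => (word, 1))                            // transformation (narrow)
  .reduceByKey(_ + _)                                // transformation (wide: causes a shuffle)
  .cache()                                           // mark the RDD for caching

counts.take(10).foreach(println)                     // action: triggers the whole DAG
counts.count()                                       // second action reuses the cached data
```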
Spark SQL and DataFrame/DataSet
Overview of Spark SQL.
DataFrames.
Common DataFrame operations.
Writing Spark SQL query programs (a sketch follows).
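A minimal sketch showing the DataFrame API next to the equivalent SQL query; the Person case class and sample rows are invented:

```scala
import org.apache.spark.sql.SparkSession

// Defined at top level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object SqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-demo").getOrCreate()
    import spark.implicits._

    val df = Seq(Person("alice", 30), Person("bob", 25)).toDF()

    // DataFrame API...
    df.filter($"age" > 26).select("name").show()

    // ...and the same query through SQL.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 26").show()

    spark.stop()
  }
}
```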
Spark Streaming
Overview of Spark Streaming.
Understand DStreams.
DStream operations (transformations and output operations); a sketch follows.
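A minimal DStream sketch: word count over 5-second micro-batches from a socket source (feed it with nc -lk 9999; host and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamDemo {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing.
    val conf = new SparkConf().setAppName("stream-demo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" "))
         .map((_, 1))
         .reduceByKey(_ + _)    // DStream transformation
         .print()               // output operation

    ssc.start()
    ssc.awaitTermination()
  }
}
```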
Structured Streaming
Other (MLlib and GraphX)
If your work is not data mining, machine learning is generally not needed; you can study this part in depth later, when you actually need it.
9) Python
10) Build a cluster of virtual machines, install all of the above tools, and develop a small demo - 30 hours
You can use VMware to create 4 virtual machines and install the software above to form a small cluster (personally tested: a 64-bit i7 with 16 GB of RAM can run it).
Big data's prospects are promising and many people are entering the profession; how to complete the transition quickly and how to enter the big data field quickly are questions that both career-changers and newcomers need to think about carefully.
Beginners learning big data need to pay attention to many points, but in any case, since you have chosen to enter the big data industry, press on through the ups and downs. As the saying goes, stay true to your original aspiration and you will see it through; what learning big data requires most is a persevering heart.