I. Spark SQL and DataFrame
Spark SQL is the largest and most closely watched component after Spark Core because:
A) It can handle data on virtually any storage medium and in a wide range of formats (and Spark SQL can easily be extended to support additional data sources, such as Apache Kudu).
B) Spark SQL pushes the computing power of the data warehouse to a new level. Its speed is outstanding (Spark SQL is an order of magnitude faster than Shark, which in turn is an order of magnitude faster than Hive), especially as Project Tungsten matures. More importantly, it raises the computational expressiveness of the data warehouse: the DataFrame introduced by Spark lets the warehouse apply machine learning, graph computing, and other algorithm libraries directly, mining deeper value from the data.
C) Spark SQL (DataFrame, Dataset) is not only a data-warehouse engine but also a data-mining engine, and, more importantly, an engine for scientific computing and analysis.
D) The later DataFrame made Spark SQL the dominant big data computing engine (especially with the strong support of Project Tungsten).
E) A typical stack is Hive + Spark SQL + DataFrame (sketched in the example after this list):
1) Hive is responsible for cheap data storage
2) Spark SQL is responsible for high-speed computation
3) DataFrame is responsible for complex data mining
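As an illustration of this division of labor, here is a minimal Scala sketch: Hive holds the data, Spark SQL queries it, and the resulting DataFrame feeds further analysis. The table name `sales` and its columns are hypothetical, and the deployment is assumed to include Hive support.

```scala
import org.apache.spark.sql.SparkSession

object HiveSparkSqlExample {
  def main(args: Array[String]): Unit = {
    // Hive provides the cheap storage layer; Spark SQL provides the fast compute.
    val spark = SparkSession.builder()
      .appName("HiveSparkSqlExample")
      .enableHiveSupport()   // requires a Spark build/deployment with Hive support
      .getOrCreate()

    // Query a Hive table with plain SQL; the result comes back as a DataFrame.
    // The table `sales` and its columns are hypothetical.
    val salesDF = spark.sql("SELECT product, amount FROM sales WHERE amount > 0")

    // The DataFrame can then feed deeper analysis (aggregations, MLlib, GraphX, ...).
    salesDF.groupBy("product").sum("amount").show()

    spark.stop()
  }
}
```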
II. DataFrame and RDD
A) R and Python both have a DataFrame; what sets Spark's DataFrame apart is that it is inherently distributed. You can think of a DataFrame simply as a distributed table, as follows:
Name   | Age | Tel
-------+-----+------
String | Int | Long
String | Int | Long
String | Int | Long
String | Int | Long
String | Int | Long
String | Int | Long
The form of an RDD holding the same data, by contrast, is as follows:
Person
Person
Person
Person
Person
Person
An RDD knows nothing about the internal structure of each record, whereas a DataFrame knows the name and type of every column (see the first sketch after this list).
B) The fundamental difference between RDD and DataFrame:
An RDD treats each record as an opaque unit. Spark cannot see inside a record when processing an RDD, so it cannot optimize beyond the record boundary, which greatly limits performance.
A DataFrame, by contrast, carries schema metadata for every record, so optimization can work at the level of individual columns (for example, pruning unused columns and pushing filters down) rather than being limited to whole rows (see the second sketch below).
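The first sketch contrasts the two forms from item A: the same records held as an RDD of Person objects, where Spark sees only opaque values, and as a DataFrame, where the Name/Age/Tel columns and their types are visible. The Person fields and sample values are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// The fields mirror the Name / Age / Tel table above; the values below are made up.
case class Person(name: String, age: Int, tel: Long)

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddVsDataFrame")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Alice", 30, 13800000001L), Person("Bob", 25, 13800000002L))

    // As an RDD, each element is an opaque Person record; Spark sees no column structure.
    val personRDD = spark.sparkContext.parallelize(people)
    println(personRDD.count())

    // As a DataFrame, Spark also knows the column names and types.
    val personDF = people.toDF()
    personDF.printSchema()   // name: string, age: int, tel: long
    personDF.show()

    spark.stop()
  }
}
```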
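A second sketch for item B: because the DataFrame's schema is known, the optimizer can prune the unused tel column and push the filter down before any rows are processed; explain(true) prints the parsed, analyzed, optimized, and physical plans so the effect can be inspected. The column names and data are again illustrative.

```scala
import org.apache.spark.sql.SparkSession

object CatalystColumnPruning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CatalystColumnPruning")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", 30, 13800000001L), ("Bob", 25, 13800000002L))
      .toDF("name", "age", "tel")

    // Only `name` survives the query and only `age` is needed for the filter,
    // so the optimizer can drop `tel` entirely and push the filter down.
    // explain(true) prints the parsed, analyzed, optimized, and physical plans.
    df.filter($"age" > 26).select("name").explain(true)

    spark.stop()
  }
}
```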
III. Spark enterprise best practices
Stage 1: file system + C language processing
Stage 2: Java EE + traditional database (scalability is too poor to support distribution; even databases that do support distribution are very slow because of transaction-consistency requirements)
Stage 3: Hive (Hive's computing power is limited, and it is very slow)
Stage 4: migration from Hive to Hive + Spark SQL
Stage 5: Hive + Spark SQL + DataFrame
Stage 6: Hive + Spark SQL + DataFrame + Dataset (see the sketch below)
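Stage 6 adds the typed Dataset API on top of the DataFrame. A minimal sketch, with a hypothetical Sale case class and made-up data: the same Catalyst and Tungsten optimizations apply, but field access inside lambdas is now checked at compile time.

```scala
import org.apache.spark.sql.SparkSession

// A hypothetical record type for the typed Dataset API.
case class Sale(product: String, amount: Double)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // An untyped DataFrame (i.e. Dataset[Row]) ...
    val df = Seq(("book", 12.5), ("pen", 1.2)).toDF("product", "amount")

    // ... converted to a strongly typed Dataset[Sale]: the same Catalyst/Tungsten
    // optimizations apply, plus compile-time checking of field access in lambdas.
    val ds = df.as[Sale]
    val total = ds.filter(_.amount > 2.0).map(_.amount).reduce(_ + _)
    println(s"total = $total")

    spark.stop()
  }
}
```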