An Analysis of PySpark SQL and Related Knowledge


This article presents "An Analysis of PySpark SQL and Related Knowledge". It is easy to understand and well organized, and I hope it helps resolve your doubts. Let the editor lead you through the article "An Analysis of PySpark SQL and Related Knowledge".

1 A Brief Introduction to Big Data

Big data is one of the hottest topics of this era. But what is big data? It describes huge data sets that are growing at an alarming rate. Besides volume and velocity, variety and veracity are also major features of big data. Let's discuss volume, velocity, variety, and veracity in detail. These are known as the 4V characteristics of big data.

1.1 Volume

Volume specifies the amount of data to be processed. For large amounts of data, we need large machines or distributed systems. The computation time increases with the amount of data, so if we can parallelize the computation, a distributed system is the best choice. Data can be structured, unstructured, or somewhere in between. If we have unstructured data, the situation becomes more complex and computationally intensive. You may wonder: how big does data have to be to count as big data? This is a debatable question. But generally speaking, data that we cannot handle with traditional systems can be considered big data. Now let's talk about the velocity of data.

1.2 Velocity

More and more organizations are starting to pay attention to data. A large amount of data is collected all the time, which means the velocity of data is increasing. How can a system handle this velocity? The problem becomes more complicated when a large inflow of data has to be analyzed in real time. Many systems are being developed to handle this huge inflow. Another factor that distinguishes traditional data from big data is the variety of the data.

1.3 Variety

The variety of data makes it so complex that traditional data analysis systems cannot analyze it correctly. What variety are we talking about? Isn't data just data? Image data is different from tabular data because it is organized and stored differently. Data can arrive in any number of file formats, and each format needs to be handled differently: reading and writing JSON files is different from working with CSV files. Today, a data scientist has to deal with combinations of data types. The data you work with may be a mix of images, videos, text, and so on. This variety is what makes big data analysis more complicated.

1.4 Veracity

Can you imagine a computer program with flawed logic producing correct output? Likewise, inaccurate data will produce misleading results. Veracity, or data correctness, is an important concern. For big data, we must also account for anomalies in the data.

2 An Introduction to Hadoop

Hadoop is a distributed and scalable framework for solving big data problems. Hadoop was developed by Doug Cutting and Mike Cafarella and is written in Java. It can be installed on a cluster of commodity hardware and scales horizontally across distributed systems.

Working on commodity hardware makes it very cost-effective. If we work on commodity hardware, failures are inevitable, but Hadoop provides a fault-tolerant system for data storage and computation. This fault tolerance is what makes Hadoop so popular.

Hadoop has two components: the first is HDFS (Hadoop Distributed File System), a distributed file system; the second is MapReduce. HDFS is used for distributed data storage, and MapReduce is used to perform computations on the data stored in HDFS.

2.1 Introduction to HDFS

HDFS is used to store large amounts of data in a distributed and fault-tolerant manner. HDFS is written in Java and runs on ordinary hardware. It was inspired by Google's research paper on the Google File System (GFS). It is a write-once, read-many system and is effective for large amounts of data. HDFS has two components: the NameNode and the DataNode.

These two components are Java daemons. The NameNode, which is responsible for maintaining the metadata of the files distributed across the cluster, is the master of the many DataNodes. HDFS divides large files into small blocks and saves these blocks on different DataNodes; the actual file data blocks reside on the DataNodes. HDFS provides a set of Unix shell-like commands, but we can also use the Java filesystem API provided by HDFS to work with large files at a finer level. Fault tolerance is achieved by replicating data blocks.

We can access HDFS files with parallel single-threaded processes. HDFS provides a very useful utility called distcp, which is commonly used to transfer data from one HDFS system to another in parallel; it copies data using parallel map tasks.

2.2 Introduction to MapReduce

The MapReduce computational model first appeared in a research paper by Google. Hadoop's MapReduce is the computing engine of the Hadoop framework; it performs computations on the distributed data in HDFS. MapReduce has been shown to scale horizontally on distributed systems built from commodity hardware, and it is also well suited to large problems. In MapReduce, problem solving is divided into a Map phase and a Reduce phase. In the Map phase, the data blocks are processed, and in the Reduce phase, aggregation or reduction operations are run on the results of the Map phase. Hadoop's MapReduce framework is also written in Java.

MapReduce is a master-slave model. In Hadoop 1, MapReduce computation was managed by two daemons, the JobTracker and the TaskTracker. The JobTracker is the master process that coordinates many TaskTrackers; the TaskTrackers are the slaves of the JobTracker. In Hadoop 2, the JobTracker and TaskTracker were replaced by YARN.

We can write MapReduce code using the Java API provided by the framework. The Hadoop Streaming module enables programmers who know Python or Ruby to write MapReduce programs.
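
As a rough illustration of Hadoop Streaming with Python, here is a minimal word-count sketch. The script names are placeholders; the mapper and reducer are ordinary Python programs that read from standard input and write to standard output, which is all Hadoop Streaming requires.

# mapper.py (placeholder name): emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py (placeholder name): Hadoop Streaming sorts mapper output by key,
# so counts for the same word arrive on consecutive lines and can be summed
# in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")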

The MapReduce paradigm has many uses. For example, many machine learning algorithms are implemented by Apache Mahout, which can run on Hadoop and can also be used together with Pig and Hive.

However, MapReduce is not suitable for iterative algorithms. At the end of every Hadoop job, MapReduce saves the data to HDFS and reads it back again for the next job. We know that reading data from and writing data to files are costly operations. Apache Spark mitigates these shortcomings of MapReduce by providing in-memory data persistence and computation.

For more information about Mapreduce and Mahout, please see the following web page:

https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean_html/index.html

https://mahout.apache.org/users/basics/quickstart.html

3 An Introduction to Apache Hive

Computer science is a world of abstractions. Everyone knows that data is information in the form of bits. Programming languages like C provide an abstraction over machine and assembly language, and other high-level languages provide further abstractions. Structured Query Language (SQL) is one of these abstractions, and many data modelers around the world use it. Hadoop is well suited to big data analysis, so how can users who know SQL take advantage of Hadoop's computing power on big data? To write MapReduce programs for Hadoop, users have to know a programming language in which MapReduce programs can be written.

Everyday problems in the real world follow certain patterns. Some problems are common in daily life, such as data manipulation, handling missing values, data transformation, and data aggregation. Writing MapReduce code for these everyday problems is a dizzying task for non-programmers, and writing code just to solve routine problems is not the smartest use of effort; writing efficient code that is performant and scalable is what is valuable. With this in mind, Apache Hive was developed at Facebook so that day-to-day problems could be solved without writing MapReduce code for common tasks.

According to the Hive wiki, Hive is a data warehouse infrastructure based on Apache Hadoop. Hive has its own SQL dialect, called the Hive Query Language, abbreviated HiveQL or sometimes HQL. Using HiveQL, Hive queries data in HDFS. Hive not only runs on HDFS but also on other big data execution frameworks, such as Apache Spark and Apache Tez.

Hive gives users an abstraction over structured data in HDFS that is similar to a relational database management system. You can create tables and run SQL-like queries on them. Hive stores the table schemas in an RDBMS; Apache Derby is the default RDBMS shipped with the Apache Hive distribution. Apache Derby is written entirely in Java and is an open source RDBMS released under the Apache License, Version 2.0.

HiveQL commands are converted into Hadoop MapReduce code, which then runs on the Hadoop cluster.

People who know SQL can easily learn Apache Hive and HiveQL and can use the storage and computing power of Hadoop in their daily big data analysis work. PySpark SQL also supports HiveQL: you can run HiveQL commands in PySpark SQL. In addition to executing HiveQL queries, you can read data directly from Hive into PySpark SQL and write results back to Hive.
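
As a minimal sketch of what this looks like in code, assuming a Hive metastore is reachable from Spark and using hypothetical table names (sales, sales_summary):

from pyspark.sql import SparkSession

# Hive support must be enabled on the SparkSession; the table names
# below are hypothetical.
spark = (SparkSession.builder
         .appName("HiveQLExample")
         .enableHiveSupport()
         .getOrCreate())

# Run a HiveQL query; the result comes back as a PySpark DataFrame.
summary = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
summary.show()

# Write results back to Hive as a table.
summary.write.mode("overwrite").saveAsTable("sales_summary")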

Related links:

https://cwiki.apache.org/confluence/display/Hive/Tutorial

https://db.apache.org/derby/

4 An Introduction to Apache Pig

Apache Pig is a data flow framework for performing data analysis on very large amounts of data. It was developed by Yahoo! and donated to the Apache Software Foundation; it is now available under the Apache License, Version 2.0. The programming language of Pig is the Pig Latin scripting language. Pig is loosely coupled to Hadoop, which means we can connect it to Hadoop and perform extensive analysis, but Pig can also be used with other tools such as Apache Tez and Apache Spark.

Apache Hive is used as a reporting tool, whereas Apache Pig is used for extract, transform, and load (ETL) work. We can extend the functionality of Pig with user-defined functions (UDFs), which can be written in many languages, including Java, Python, Ruby, JavaScript, Groovy, and Jython.

Apache Pig uses HDFS to read and store data and Hadoop MapReduce to execute the algorithms. Apache Pig is similar to Apache Hive in how it uses a Hadoop cluster: on Hadoop, Pig commands are first converted into MapReduce code, which then runs on the Hadoop cluster.

The best part of Pig is that its code is already optimized and tested to handle day-to-day problems, so users can install Pig and start using it right away. Pig provides the Grunt shell for running interactive Pig commands. As a result, anyone who knows Pig Latin can enjoy the benefits of HDFS and MapReduce without having to learn an advanced programming language such as Java or Python.

Related links

http://pig.apache.org/docs/

https://en.wikipedia.org/wiki/Pig_(programming_tool)

https://cwiki.apache.org/confluence/display/PIG/Index

5 An Introduction to Apache Kafka

Apache Kafka is a publish-subscribe distributed messaging platform. It was developed at LinkedIn and later donated to the Apache Software Foundation. It is fault-tolerant, scalable, and fast. A message, the smallest unit of data in Kafka terminology, flows from producer to consumer through the Kafka server and can be persisted and read at a later time.

Kafka provides built-in APIs that developers can use to build their applications. Next, let's discuss the three main components of Apache Kafka.

5.1 Producer

A Kafka producer publishes messages to Kafka topics; a producer can publish data to multiple topics.

5.2 Broker

This is a Kafka server running on a dedicated machine; messages are pushed to the broker by producers. The broker stores topics in different partitions, and these partitions are replicated to different brokers to handle failures. The broker is essentially stateless with respect to consumption, so the consumer must keep track of the messages it has consumed.

5.3 Consumer

Consumers fetch messages from the Kafka broker. Remember, they fetch them: the Kafka broker does not push messages to consumers; instead, consumers pull data from the Kafka broker. Consumers subscribe to one or more topics on the Kafka broker and read the messages. The consumer also keeps a record of everything it has consumed. Data is retained in the broker for a configured time, so if a consumer fails, it can fetch the data again after a restart.
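
As a rough sketch outside of PySpark itself, here is a minimal producer and consumer written with the third-party kafka-python package; the broker address and the topic name "events" are assumptions for this example.

from kafka import KafkaConsumer, KafkaProducer

# The broker address and topic name are assumptions for this example.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"hello from the producer")  # message payloads are bytes
producer.flush()

consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.topic, message.offset, message.value)
    break  # stop after the first message in this sketch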

Related links:

https://kafka.apache.org/quickstart

https://kafka.apache.org/documentation/

6 An Introduction to Apache Spark

Apache Spark is a general-purpose distributed programming framework. It is considered very well suited to both iterative and batch processing of data. It was developed at the AMP Lab and provides a framework for in-memory computing. It is open source software. On the one hand, it is excellent for batch processing; on the other hand, it works very well on real-time or near-real-time data. Machine learning and graph algorithms are iterative in nature, and this is where the magic of Spark lies. According to its research papers, it is much faster than its counterpart, Hadoop. Data can be cached in memory, and caching intermediate data in iterative algorithms provides astonishingly fast processing. Spark can be programmed with Java, Scala, Python, and R.

If you think of Spark as an improved Hadoop, then to some extent you can: we can implement the MapReduce algorithm in Spark, and Spark takes advantage of HDFS. This means it can read data from HDFS and store data in HDFS, and it handles iterative computation efficiently because the data can be kept in memory. Besides in-memory computation, it is also well suited to interactive data analysis.

Many other libraries are built on top of PySpark to make PySpark easier to use. We discuss some of them below:

MLlib: MLlib is a wrapper around the PySpark core for machine learning algorithms. The machine learning API provided by the MLlib library is very easy to use. MLlib supports many kinds of machine learning algorithms, including classification, clustering, text analysis, and more.

ML: ML is also a machine learning library built on top of the PySpark core. ML's machine learning API operates on DataFrames (a short example follows this list).

GraphFrames: the GraphFrames library provides a set of APIs for efficient graph analysis using the PySpark core and PySpark SQL.
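
As a small illustration of the DataFrame-based ML API, here is a hedged sketch that trains a logistic regression model on a tiny, made-up dataset:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLExample").getOrCreate()

# A tiny, made-up training set of (label, features) rows.
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5])),
], ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)           # train on the DataFrame
model.transform(training).show()   # predictions come back as a DataFrame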

7 An Introduction to PySpark SQL

Most of the data that data scientists deal with is either structured or semi-structured in nature. To handle structured and semi-structured datasets, the PySpark SQL module provides a higher level of abstraction on top of the PySpark core. We will learn PySpark SQL throughout the book. It is built into PySpark, which means that it does not require any additional installation.

With PySpark SQL, you can read data from many sources. PySpark SQL supports reading from many file formats, including text files, CSV, ORC, Parquet, JSON, and so on. You can read data from relational database management systems (RDBMSs) such as MySQL and PostgreSQL, and you can save analysis results to many systems and file formats.
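
As a small sketch of what reading from different file formats looks like (the file paths below are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadSources").getOrCreate()

# The paths below are placeholders for this example.
csv_df = spark.read.csv("data/people.csv", header=True, inferSchema=True)
json_df = spark.read.json("data/events.json")
parquet_df = spark.read.parquet("data/metrics.parquet")
orc_df = spark.read.orc("data/logs.orc")

csv_df.show(5)  # each call above returns a DataFrame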

7.1 DataFrames

A DataFrame is an abstraction similar to a table in a relational database system. It consists of named columns: a DataFrame is a collection of Row objects, which are defined in PySpark SQL, organized into named column objects. Because the user knows the schema in tabular form, it is easy to operate on the data.

The elements in a DataFrame column all have the same data type, whereas a row in a DataFrame may contain elements of different data types. The underlying data structure is the resilient distributed dataset (RDD): DataFrames are a wrapper around RDDs, specifically RDDs of Row objects.
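
A small sketch of creating and using a DataFrame from in-memory rows (the column names and values are made up):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Build a DataFrame from Row objects; the names and ages are made up.
people = spark.createDataFrame([
    Row(name="Alice", age=34),
    Row(name="Bob", age=45),
])

people.printSchema()                   # every column has a single data type
people.filter(people.age > 40).show()  # SQL-like operations on columns
print(people.rdd.take(1))              # the underlying RDD of Row objects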

Related links:

https://spark.apache.org/docs/latest/sql-programming-guide.html

7.2 SparkSession

The SparkSession object is the entry point that replaces SQLContext and HiveContext. To keep PySpark SQL code compatible with previous versions, SQLContext and HiveContext continue to work in PySpark. In the PySpark console, we get a SparkSession object automatically. We can create a SparkSession object with the following code.

To create a SparkSession object, we must first import SparkSession, as shown below.

from pyspark.sql import SparkSession

After importing SparkSession, we can use SparkSession.builder to create one:

spark = SparkSession.builder.appName("PythonSQLAPP").getOrCreate()

The appName function sets the name of the application. The getOrCreate() function returns an existing SparkSession object if one exists; otherwise it creates a new object and returns it.

7.3 Structured Streaming

We can use the Structured Streaming framework (a wrapper around PySpark SQL) for streaming data analysis. With Structured Streaming, we can perform analysis on streaming data in the same way that we use PySpark SQL to perform batch analysis on static data. Just as the Spark Streaming module processes streams in small batches, the Structured Streaming engine also operates on micro-batches. The best part of Structured Streaming is that it uses an API similar to PySpark SQL, so the learning curve is gentle. Operations on data streams are optimized by the engine, and the Structured Streaming API performs comparably in that respect.
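
A minimal streaming word-count sketch, assuming a text source is listening on a local socket (the host and port are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()

# Read a stream of text lines from a socket; the host and port are placeholders.
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The same DataFrame API as batch processing: split lines into words and count them.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Emit the running counts to the console as micro-batches arrive.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()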

7.4 Catalyst Optimizer

SQL is a declarative language. With SQL, we tell the SQL engine what to do; we do not tell it how to carry out the task. Similarly, PySpark SQL commands do not tell the engine how to perform a task, only what to do. Therefore, PySpark SQL queries need to be optimized before the task is executed. The Catalyst optimizer performs this query optimization in PySpark SQL. PySpark SQL queries are translated into low-level resilient distributed dataset (RDD) operations: the Catalyst optimizer first converts a PySpark SQL query into a logical plan, then converts this logical plan into an optimized logical plan. From the optimized logical plan, multiple physical plans are created, and a cost analyzer chooses the best one. Finally, the low-level RDD operation code is generated.
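
You can inspect the plans that Catalyst produces with explain(); the DataFrame and query below are made up for this example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CatalystExample").getOrCreate()

# A made-up DataFrame and query, used only to show the plans.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "key"])
query = df.filter(df.id > 1).groupBy("key").count()

# extended=True prints the parsed and analyzed logical plans, the optimized
# logical plan, and the physical plan chosen for execution.
query.explain(extended=True)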

8 Cluster Managers

In a distributed system, a job or application is divided into different tasks that can run in parallel on different machines in the cluster. If a machine fails, its tasks must be rescheduled on another machine.

Due to poor resource management, distributed systems often face scalability problems. Consider a job that is already running on the cluster while another person wants to run a second job: the second job has to wait until the first is finished, and in that case we are not making the best use of resources. Resource management is easy to describe but difficult to implement on a distributed system. Cluster managers were developed to optimize the management of cluster resources. There are three cluster managers available for Spark: Standalone, Apache Mesos, and YARN. The best part of these cluster managers is that they provide an abstraction layer between the user and the cluster; thanks to this abstraction, the user experience feels like working on a single machine even though the work runs on a cluster. The cluster manager schedules cluster resources for the running applications.

8.1 Standalone Cluster Manager

Apache Spark comes with a standalone cluster manager. It provides a master-slave architecture to coordinate the cluster. It is a Spark-only cluster manager: you can run only Spark applications with it. Its components are a master and workers, where the workers are the slaves of the master process, and it is the simplest cluster manager. You can configure the Spark standalone cluster manager using the scripts in the sbin directory of Spark.
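
Once a standalone master is running, a PySpark application can attach to it by pointing the session at the master URL; the host and port below are placeholders:

from pyspark.sql import SparkSession

# The master URL is a placeholder for a running standalone master,
# started with the scripts in Spark's sbin directory.
spark = (SparkSession.builder
         .appName("StandaloneExample")
         .master("spark://master-host:7077")
         .getOrCreate())

print(spark.sparkContext.master)  # shows which cluster manager the session uses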

8.2 Apache Mesos Cluster Manager

Apache Mesos is a general-purpose cluster manager. It was developed at the AMP Lab at the University of California, Berkeley. Apache Mesos helps distributed solutions scale efficiently. With Mesos, you can run applications that use different frameworks on the same cluster. What does "applications from different frameworks" mean? It means that you can run, for example, both Hadoop and Spark applications on Mesos, and when multiple applications run on Mesos, they share the resources of the cluster. Apache Mesos has two important components, a master component and slave components, a master-slave architecture similar to the Spark standalone cluster manager. Applications that run on Mesos are called frameworks. The slaves tell the master which resources are available in the form of resource offers, and they make these offers periodically. The allocation module of the master server decides which framework gets the resources.

8.3 YARN Cluster Manager

YARN stands for Yet Another Resource Negotiator. YARN was introduced in Hadoop 2 to make Hadoop more scalable: resource management was separated from job management, and separating the two components makes Hadoop scale better. The main components of YARN are the ResourceManager, the ApplicationMaster, and the NodeManager. There is one global ResourceManager, and many NodeManagers run in each cluster; the NodeManagers are slaves of the ResourceManager. The scheduler, a component of the ResourceManager, allocates resources to the different applications on the cluster. The best part is that you can run Spark applications alongside any other applications, such as Hadoop or MPI, on a cluster managed by YARN. Each application has an ApplicationMaster that handles the tasks running in parallel on the distributed system; Hadoop and Spark each have their own ApplicationMaster.

9 An Introduction to PostgreSQL

Relational database management systems are still very common in many organizations. What does relational mean here? It refers to relational tables. PostgreSQL is a relational database management system. It runs on all major operating systems, such as Microsoft Windows, Unix-based operating systems, macOS, and so on. It is an open source program, and the code is available under the PostgreSQL License, so you are free to use it and modify it according to your needs.

PostgreSQL databases can be accessed from other programming languages, such as Java, Perl, Python, C, and C++, and from many other languages through different programming interfaces. It can also be programmed using PL/pgSQL (Procedural Language/PostgreSQL), a procedural programming language similar to PL/SQL. You can add custom functions to the database, and you can write those custom functions in C/C++ and other programming languages. You can also read data from PostgreSQL into PySpark SQL using a JDBC connector.
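
A hedged sketch of reading a PostgreSQL table into PySpark SQL over JDBC; the URL, table name, and credentials are placeholders, and the PostgreSQL JDBC driver must be available to Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PostgresExample").getOrCreate()

# All connection details below are placeholders, and the PostgreSQL JDBC
# driver jar must be on Spark's classpath for this to run.
users = (spark.read.format("jdbc")
         .option("url", "jdbc:postgresql://localhost:5432/testdb")
         .option("dbtable", "public.users")
         .option("user", "spark_reader")
         .option("password", "secret")
         .option("driver", "org.postgresql.Driver")
         .load())

users.show()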

PostgreSQL follows the ACID (Atomicity, Consistency, Isolation, and Durability) principles. It has many features, some of which are unique to PostgreSQL: it supports updatable views, transactional integrity, complex queries, triggers, and so on. PostgreSQL uses a multi-version concurrency control model for concurrency management.

PostgreSQL has received extensive community support. PostgreSQL is designed and developed to be extensible.

10 An Introduction to MongoDB

MongoDB is a document-based NoSQL database. It is an open source distributed database developed by MongoDB Inc. MongoDB is written in C++ and scales horizontally. Many organizations use it as a backend database and for many other purposes.

MongoDB comes with the mongo shell, a JavaScript interface to the MongoDB server. The mongo shell can be used to run queries and perform administrative tasks, and we can also run JavaScript code in it.

Using PySpark SQL, we can read data from MongoDB, perform analysis on it, and write the results back.
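
A rough sketch of reading from MongoDB via the MongoDB Spark Connector; the connector package must be added to the Spark session, and the format name and option keys below follow the 3.x connector, so treat the URI, database, and collection as assumptions:

from pyspark.sql import SparkSession

# The URI, database, and collection are placeholders; the "mongo" format
# name follows the 3.x MongoDB Spark Connector and may differ in other
# connector versions.
spark = (SparkSession.builder
         .appName("MongoExample")
         .config("spark.mongodb.input.uri", "mongodb://localhost:27017/testdb.events")
         .getOrCreate())

events = spark.read.format("mongo").load()
events.printSchema()
events.show()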

11 An Introduction to Cassandra

Cassandra is an open source distributed database released under the Apache License. It is a NoSQL database originally developed at Facebook. It is horizontally scalable and best suited to handling structured data. It provides high availability with tunable consistency and has no single point of failure. It uses a peer-to-peer distributed architecture to replicate data across different nodes, and nodes exchange information using the gossip protocol.

The above is the full content of "An Analysis of PySpark SQL and Related Knowledge". Thank you for reading! I believe everyone now has some understanding of these topics, and I hope the content shared here helps you. If you want to learn more, welcome to follow the industry information channel!
