
How to convert JSON files to DataFrame


This article mainly introduces how to convert JSON files into a DataFrame. Many people run into questions about this in day-to-day work, so the editor has consulted a range of material and put together a simple, easy-to-follow set of steps. Hopefully it helps answer those questions; please follow along to learn more!

One: a brief introduction to Spark SQL.

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed, and Spark SQL uses this extra information internally to perform additional optimizations. There are several ways to interact with Spark SQL, including SQL, the DataFrame API, and the Dataset API. The same execution engine is used to compute a result, regardless of which API or language is used to express the computation. This unification means developers can easily switch back and forth between the different APIs, choosing whichever provides the most natural way to express a given transformation.

Spark SQL is a module in Spark used mainly for processing structured data. The core programming abstraction it provides is the DataFrame, and Spark SQL can also serve as a distributed SQL query engine. One of its most important capabilities is querying data stored in Hive.
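As a rough, hedged illustration of that Hive integration (it is not shown in the original article), Spark 1.x provides a HiveContext whose sql method can query tables registered in the Hive metastore. The class name and table name below are only placeholders, and the spark-hive module plus a reachable metastore are assumed:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

// Illustrative example, not part of the original article; requires the spark-hive dependency.
public class HiveQueryDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Hive Query Demo").setMaster("local");
        conf.set("spark.testing.memory", "269522560000");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // HiveContext extends SQLContext and reads table definitions from the Hive metastore.
        HiveContext hiveContext = new HiveContext(sc.sc());

        // "src" is a placeholder table name; any table known to Hive works here.
        hiveContext.sql("SELECT * FROM src").show();

        sc.stop();
    }
}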

Two: a brief introduction to DataFrame.

DataFrame is a distributed dataset organized into named columns. It is equivalent to a table in a relational database or a data frame in R/Python, but with more optimizations under the hood. A DataFrame can be built from many sources, including structured data files, tables in Hive, external relational databases, and existing RDDs.
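As a hedged sketch (not from the original article) of what loading DataFrames from a few of those sources looks like with the same Spark 1.x Java API used below; the Parquet path and the JDBC settings are placeholders:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Illustrative example, not part of the original article.
public class DataFrameSourcesDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("DataFrame Sources Demo").setMaster("local");
        conf.set("spark.testing.memory", "269522560000");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Structured data file: the JSON reader infers the schema automatically.
        DataFrame fromJson = sqlContext.read().json("hdfs://192.168.226.129:9000/txt/sparkshell/people.json");
        fromJson.printSchema();

        // Another structured file format: Parquet (placeholder path).
        DataFrame fromParquet = sqlContext.read().parquet("hdfs://192.168.226.129:9000/txt/sparkshell/people.parquet");

        // External relational database over JDBC (placeholder URL and table; the JDBC driver must be on the classpath).
        DataFrame fromJdbc = sqlContext.read().format("jdbc")
                .option("url", "jdbc:mysql://localhost:3306/test")
                .option("dbtable", "people")
                .load();

        sc.stop();
    }
}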

The next step is to operate on structured and unstructured datasets.

Three: structured datasets: how to convert JSON files to DataFrame

3.1 Two JSON files are placed on HDFS, namely:

people.json, with the following content:

{"id": "19", "name": "berg", "sex": "male", "age": 19} {"id": "20", "name": "cccc", "sex": "female", "age": 20} {"id": "21", "name": "xxxx", "sex": "male", "age": 21 {"id": "22", "name": "jjjj", "sex": "female" "age": 21}

student.json, with the following content:

{"id": "1", "name": "china", "sex": "female", "age": 100} {"id": "19", "name": "xujun", "sex": "male", "age": 22}

3.2 Use the DataFrame API to manipulate the data and get familiar with the methods DataFrame provides:

public class SparkSqlDemo {
    private static String appName = "Test Spark RDD";
    private static String master = "local";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.set("spark.testing.memory", "269522560000");
        JavaSparkContext sc = new JavaSparkContext(master, appName, conf);

        // Create the SQLContext; it is the entry point for working with DataFrames.
        SQLContext sqlContext = new SQLContext(sc);

        // Load the JSON file from HDFS into a DataFrame.
        DataFrame df = sqlContext.read().json("hdfs://192.168.226.129:9000/txt/sparkshell/people.json");

        // Print the table structure (schema).
        df.printSchema();

        // Show the contents of the DataFrame.
        df.show();

        // Select only the name column.
        df.select(df.col("name")).show();

        // Select people younger than 21, keeping only the name column.
        df.filter(df.col("age").lt(21)).select("name").show();

        // Select the name column and increment the age column by 1.
        df.select(df.col("name"), df.col("age").plus(1)).show();

        // Count people by age; one age value should have a count of 2.
        df.groupBy("age").count().show();

        // Convert the other JSON file into a DataFrame.
        DataFrame df2 = sqlContext.read().json("hdfs://192.168.226.129:9000/txt/sparkshell/student.json");
        df2.show();

        // Join the two tables on the id column.
        df.join(df2, df.col("id").equalTo(df2.col("id"))).show();

        // Run SQL programmatically: register the DataFrame as a temporary (virtual) table first.
        df.registerTempTable("people");
        sqlContext.sql("select age, count(*) from people group by age").show();
        System.out.println("-");
        sqlContext.sql("select * from people").show();
    }
}
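For reference (the article does not show the output), df.printSchema() on people.json should print something along these lines, since Spark's JSON reader infers numeric fields as long and lists the inferred fields alphabetically:

root
 |-- age: long (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- sex: string (nullable = true)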

Run SQL queries programmatically and get the results back as DataFrames, operating on the data through registered temporary tables and plain SQL:

public class SparkSqlDemo1 {
    private static String appName = "Test Spark RDD";
    private static String master = "local";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.set("spark.testing.memory", "269522560000");
        JavaSparkContext sc = new JavaSparkContext(master, appName, conf);

        // Create the SQLContext; it is the entry point for working with DataFrames.
        SQLContext sqlContext = new SQLContext(sc);

        // Load both JSON files from HDFS into DataFrames.
        DataFrame df = sqlContext.read().json("hdfs://192.168.226.129:9000/txt/sparkshell/people.json");
        DataFrame df2 = sqlContext.read().json("hdfs://192.168.226.129:9000/txt/sparkshell/student.json");

        // Run SQL programmatically: register the DataFrames as temporary (virtual) tables.
        df.registerTempTable("people");
        df2.registerTempTable("student");

        // Query all data in the virtual table people.
        sqlContext.sql("select * from people").show();

        // Look at a single column.
        sqlContext.sql("select name from people").show();

        // Look at multiple columns, including an expression.
        sqlContext.sql("select name, age+1 from people").show();

        // Filter on the value of a column.
        sqlContext.sql("select name, age from people where age >= 21").show();

        // Group by a column and count.
        sqlContext.sql("select age, count(*) countage from people group by age").show();

        // Join: inner join of the two tables.
        sqlContext.sql("select * from people inner join student on people.id = student.id").show();
        /*
         * +---+---+----+----+---+---+-----+----+
         * |age| id|name| sex|age| id| name| sex|
         * +---+---+----+----+---+---+-----+----+
         * | 19| 19|berg|male| 22| 19|xujun|male|
         * +---+---+----+----+---+---+-----+----+
         */
    }
}

Four: unstructured datasets:

Spark SQL supports two different methods for converting existing RDDs into DataFrames.

The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method is a programmatic interface that lets you construct a schema and then apply it to an existing RDD. Although this approach is more verbose, it allows you to build DataFrames whose columns and types are not known until runtime (a sketch of this second approach is given after the reflection example at the end of this article).

4.1 An unstructured dataset file, user.txt, with the following content (each line holds a comma-separated id, name and age):

1, "Hadoop", 202,202,213,213,224,224,235,235,246,227,227,227,213,224,235,246,220,227,227,220,224,23,23,205,246,227,227,227,213,220,224,23,23,23,205,21,21,2,2,2,2,2,2,2,2,2,2,2,2,21,21,21,21,21,21,21,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,2,3,2,2,2,2,2,3,2,2,246,21,22,22,224,23,23,23,224,23,23,23,235,235,246,227,227,227,2

Register a table through class reflection.

Spark SQL supports automatically converting an RDD of JavaBeans into a DataFrame. The BeanInfo, obtained through reflection, defines the schema of the table. Currently, Spark SQL does not support JavaBeans that contain nested structures or complex types such as lists or arrays. You can create a JavaBean by writing a class that implements Serializable and has getters and setters for all of its fields.
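The User class used by the program below is not listed in the article; a minimal JavaBean consistent with the setters the code calls and with the Serializable requirement described above might look like this (the toString format is illustrative only):

import java.io.Serializable;

// A minimal sketch of the User JavaBean assumed by the example below; not shown in the original article.
public class User implements Serializable {
    private static final long serialVersionUID = 1L;

    private String id;
    private String name;
    private int age;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }

    @Override
    public String toString() {
        return "User [id=" + id + ", name=" + name + ", age=" + age + "]";
    }
}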

public class SparkSqlDemo2 {
    private static String appName = "Test Spark RDD";
    private static String master = "local";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.set("spark.testing.memory", "269522560000");
        JavaSparkContext sc = new JavaSparkContext(master, appName, conf);

        // Create the SQLContext; it is the entry point for working with DataFrames.
        SQLContext sqlContext = new SQLContext(sc);

        // Load the text file and convert every line into a JavaBean.
        JavaRDD<String> rdd = sc.textFile("hdfs://192.168.226.129:9000/txt/sparkshell/user.txt");
        JavaRDD<User> userRDD = rdd.map(new Function<String, User>() {
            private static final long serialVersionUID = 1L;

            public User call(String line) throws Exception {
                String[] parts = line.split(",");
                User user = new User();
                user.setId(parts[0].trim());
                user.setName(parts[1].trim());
                user.setAge(Integer.parseInt(parts[2].trim()));
                return user;
            }
        });

        // collect is an action operator: it submits the job and triggers the computation.
        List<User> list = userRDD.collect();
        for (User user : list) {
            System.out.println(user);
        }

        // Register a table through class reflection.
        DataFrame df = sqlContext.createDataFrame(userRDD, User.class);
        df.registerTempTable("user");

        DataFrame df1 = sqlContext.sql("SELECT id,name,age FROM user WHERE age >= 21 AND age
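The reflection-based approach above is the only one the article demonstrates. Purely as a hedged sketch, the second, programmatic-schema approach mentioned in section four could look roughly like this for the same user.txt file (the class name SparkSqlDemo3 and the other identifiers are illustrative, not from the original):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Illustrative example of the programmatic-schema method; not part of the original article.
public class SparkSqlDemo3 {
    private static String appName = "Test Spark RDD";
    private static String master = "local";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.set("spark.testing.memory", "269522560000");
        JavaSparkContext sc = new JavaSparkContext(master, appName, conf);
        SQLContext sqlContext = new SQLContext(sc);

        JavaRDD<String> lines = sc.textFile("hdfs://192.168.226.129:9000/txt/sparkshell/user.txt");

        // Build the schema at runtime from a list of field definitions.
        List<StructField> fields = Arrays.asList(
                DataTypes.createStructField("id", DataTypes.StringType, true),
                DataTypes.createStructField("name", DataTypes.StringType, true),
                DataTypes.createStructField("age", DataTypes.IntegerType, true));
        StructType schema = DataTypes.createStructType(fields);

        // Convert each comma-separated line into a Row that matches the schema.
        JavaRDD<Row> rows = lines.map(new Function<String, Row>() {
            private static final long serialVersionUID = 1L;

            public Row call(String line) throws Exception {
                String[] parts = line.split(",");
                return RowFactory.create(parts[0].trim(), parts[1].trim(),
                        Integer.parseInt(parts[2].trim()));
            }
        });

        // Apply the schema to the RDD of Rows to get a DataFrame.
        DataFrame userDF = sqlContext.createDataFrame(rows, schema);
        userDF.registerTempTable("user");

        sqlContext.sql("SELECT id, name, age FROM user WHERE age >= 21").show();
    }
}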
