2025-04-05 Update From: SLTechnology News&Howtos
This article explains how to use Python to define a Schema and generate Parquet files. The techniques are practical, and I hope you get something useful out of reading it.
I. Simple field definitions
1. Define a Schema and generate a Parquet file

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Define the Schema
schema = pa.schema([
    ('id', pa.int32()),
    ('email', pa.string())
])

# Prepare the data
ids = pa.array([1, 2], type=pa.int32())
emails = pa.array(['first@example.com', 'second@example.com'], pa.string())

# Build the Parquet data
batch = pa.RecordBatch.from_arrays([ids, emails], schema=schema)
table = pa.Table.from_batches([batch])

# Write the Parquet file plain.parquet
pq.write_table(table, 'plain.parquet')

2. Verify the Parquet data file
We can use the parquet-tools utility to view the data and Schema of the plain.parquet file:
$ parquet-tools schema plain.parquet
message schema {
  optional int32 id;
  optional binary email (STRING);
}

$ parquet-tools cat --json plain.parquet
{"id": 1, "email": "first@example.com"}
{"id": 2, "email": "second@example.com"}
No problem; this matches our expectations. You can also use pyarrow code to read the Schema and data back:
schema = pq.read_schema('plain.parquet')
print(schema)

df = pd.read_parquet('plain.parquet')
print(df.to_json())
The output is:
id: int32
  -- field metadata --
  PARQUET:field_id: '1'
email: string
  -- field metadata --
  PARQUET:field_id: '2'
{"id": {"0": 1, "1": 2}, "email": {"0": "first@example.com", "1": "second@example.com"}}

II. With nested field definitions
The following Schema definition adds a nested object: email_address and post_address are grouped under address. The code that defines the Schema and generates the Parquet file is as follows:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Inner fields of address
address_fields = [
    ('email_address', pa.string()),
    ('post_address', pa.string()),
]

# Define the Parquet Schema; address nests address_fields
schema = pa.schema([
    ('id', pa.int32()),
    ('address', pa.struct(address_fields))
])

# Prepare the data
ids = pa.array([1, 2], type=pa.int32())
addresses = pa.array(
    [('first@example.com', 'city1'), ('second@example.com', 'city2')],
    pa.struct(address_fields)
)

# Build the Parquet data
batch = pa.RecordBatch.from_arrays([ids, addresses], schema=schema)
table = pa.Table.from_batches([batch])

# Write the Parquet data to the file
pq.write_table(table, 'nested.parquet')

1. Verify the Parquet data file
Again, use parquet-tools to view the nested.parquet file:
$ parquet-tools schema nested.parquet
message schema {
  optional int32 id;
  optional group address {
    optional binary email_address (STRING);
    optional binary post_address (STRING);
  }
}

$ parquet-tools cat --json nested.parquet
{"id": 1, "address": {"email_address": "first@example.com", "post_address": "city1"}}
{"id": 2, "address": {"email_address": "second@example.com", "post_address": "city2"}}
The Schema shown by parquet-tools does not contain the word struct (address appears as a group), but it still reflects the nesting relationship between address and its child attributes.
What do the Schema and data of the nested.parquet file look like when read with pyarrow code?
schema = pq.read_schema('nested.parquet')
print(schema)

df = pd.read_parquet('nested.parquet')
print(df.to_json())
Output:
id: int32
  -- field metadata --
  PARQUET:field_id: '1'
address: struct<email_address: string, post_address: string>
  child 0, email_address: string
    -- field metadata --
    PARQUET:field_id: '3'
  child 1, post_address: string
    -- field metadata --
    PARQUET:field_id: '4'
  -- field metadata --
  PARQUET:field_id: '2'
{"id": {"0": 1, "1": 2}, "address": {"0": {"email_address": "first@example.com", "post_address": "city1"}, "1": {"email_address": "second@example.com", "post_address": "city2"}}}
The data is the same; the difference is that in the Schema printed by pyarrow, address is explicitly identified as a struct type, rather than only showing the nested hierarchy.
That is how to use Python to define a Schema and generate Parquet files. Hopefully some of these techniques will be useful in your daily work; follow the industry information channel for more details.
© 2024 shulou.com SLNews company. All rights reserved.