In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-02-24 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/03 Report--
Importing Excel files into relational database is a common thing in data analysis business, but the format of many Excel files is irregular, so it is necessary to structure the data in advance and then use SQL statements to write them into the database. In general, the structured workload will be relatively large, and it is difficult to generalize, so it is necessary to analyze the file format before development.
The SPL language of the aggregator is an efficient and flexible tool, which can easily read excel data, then structure it into an "ordinal table" and import it into the database. After using the SPL language, the structured storage of Excel data, which used to require thousands of lines of code, can now be done with less than 10 lines of code, or even 2 or 3 lines of code in a simple case. Is it really that amazing? Let's take a look at the magic.
Extracting complex Excel data with 10 lines of code
Next we will discuss how to use the aggregator to structure Excel data on a case-by-case basis. For the functions used in this article, please refer to the aggregator document "function reference".
1. Ordinary line type
Look at the simplest case first: as shown in the following figure, the first row in the Excel file is the column header, and starting with the second row, each row is a data record.
It is also very simple for the aggregator to process such files:
A
one
= file ("student transcript .xlsx") .xlsimport@t ()
two
= connect ("demo")
three
= A2.update (A1 ~ ~ xscj)
A1 opens the "student score sheet .xlsx" file and imports it into the sequence table. The option @ t indicates that the first row of the file is a column header.
A2 connects to demo database
A3 saves the order table in A1 to the xscj table in the demo database. Because the column names in the table are the same as the field names in the order table, you only need to specify the data table name. For more details on the use of the update function, please refer to the function documentation.
The sequence table obtained in A1 is shown in the following figure:
The A2 and A3 two-step methods of connecting to the database and storing the order table in the database are common, so these two steps will not be written in the following examples, but will only focus on how to structure the data in Excel into sequence tables.
two。 Multi-row header row type
Most of the time, the Excel file is not as simple as the above example, and the header is often complex, including table name, project name, page number, filler, date, and so on. Like this:
For this kind of table, we have to skip the header and start reading directly from the data row.
A
one
= file ("list of measures and valuation .xlsx") .xlsimport (; 1pc5)
two
= A1.rename (# 1: serial number, # 2: project code, # 3: project name, # 4: unit of measurement, # 5: quantity, # 6: unit price, # 7: consideration)
A1 opens the file and imports the data sequence table. The parameter "1 sheet 5" indicates that the first file is read from line 5 to the end of the file.
A2 changes the column name of the sequence table read in A1 to "serial number, project code, project name, unit of measurement, quantity, unit price, consideration price", that is, the column name of the data table to be deposited.
The sequence table in A2 after operation is as follows:
3. Free form
Sometimes the data in the Excel file is not a grid-formatted rule table, but a free format in which the field name is followed by the field value, such as the employee information table in the following figure:
Each employee information occupies nine lines and is arranged down in turn. How should such documents be structured? Please look at:
A
B
C
one
= create (employee ID, name, gender, position, birthday, telephone number, address, zip code)
two
= file ("employee information table .xlsx") .xlsopen ()
three
[C,C,F,C,C,D,C,C]
[1,2,2,3,4,5,7,8]
four
For
= A3. (~ / B3 (#)). (eval ($[A2.xlscell (] / ~ / ")
five
If len (B4 (1)) = = 0
Break
six
> A1.record (B4)
seven
> B3=B3. (~ + 9)
A1 create an empty order table with the column "employee ID, name, gender, position, birthday, telephone number, address, zip code"
A2 Open Excel data file
A3 defines the cell number sequence in which the employee information is located
B3 defines the row number sequence of the unit in which the employee information is located
A4 uses for loop to read each employee information
B4A3. (~ / B3 (#)) first calculate the current employee cell number sequence, and then read out these cell values to form the employee information sequence. In the first cycle, it is [C1rect, C2rect, F2, C3, D4, D5, C7, C8], and in the second cycle, it is [C10, C11, F11, C12, C13, D14, C16, C17]. Add 9 to each line number. $[A2.xlscell (] and "A2.xlscell (") are the same, both represent a string, its advantage is that when writing a program in IDE, if the number of the A2 cell changes, the A2 in the $[A2.xlscell (]) will automatically change, such as inserting a line before A2, the expression will become $[A3.xlscell (]), and the quotation marks will not automatically change.
B5 determines whether the employee ID value is empty. If it is empty, it exits the loop and ends the operation.
B6 stores a piece of employee information at the end of the A1 sequence table
B7 adds 9 to the line number sequence of employee information to read the next piece of employee information.
The A1 sequence table obtained after operation is as follows:
4. Crosstab
There is also cross-table data in Excel, as shown in the following figure:
The data structure program is as follows:
A
one
= file ("crosstab .xlsx") .xlsimport@t (; 1pr 2)
two
= A1.rename (# 1: shipper)
three
= A2.pivot@r (shipper; shipper region, order quantity)
A1 opens the file and imports the data sequence table. The parameter "1 sheet 2" indicates that the first file is read from line 2 to the end of the file. The option @ t indicates that the starting row is a column header.
A2 because the first cell in the second row is a picture, the data read is null, and the first column has no column header, so change the name of the first column to the shipper.
A3 converts the order table data into rows and rows according to the shippers. The option @ r converts the column data into row data. After the conversion, the new column names are "shipper area" and "order quantity" respectively.
The A3 sequence table obtained after operation is as follows:
5. Master and child table
In the employee information registration form shown in the following figure, there is not only the employee's own information, but also his family member information. Each sheet holds information about one employee, so there are as many sheet as there are employees.
For the data of this master-child table structure, you need to create two ordinal tables to store the data of the master table and the child table respectively. The aggregator program is as follows:
A
B
C
one
= create (× ×, name, gender, date of birth, nationality, mobile phone number, department, family address, marital status, entry time)
two
= create (× × number, name, relationship, work unit, contact number)
three
[B4,B3,D3,F3,H3,F4,H4,B5,F5,H5]
four
= file ("employee information table .xlsx") .xlsopen ()
five
For A4
six
= A3. (eval ($[A4.xlscell (] / ~ / ",\" / A5. Page name / "\"))
> A1.record (B6)
seven
= A4.xlsimport@t (family member, name, relationship, work unit, contact number; A5. Page name, 6)
eight
= B7.rename (family member: × × number)
> B8.run (× × number = B6 (1))
nine
> A2.insert@r (0:B8)
A1 creates an empty table with the column name "× ×, name, gender, date of birth, nationality, mobile phone number, department, family address, marital status, entry time", which is used to save the employee information of the master table.
A2 creates an empty table with the column name "× × number, name, relationship, work unit, contact number", which is used to save the family member information of the employees in the child table.
A3 defines the cell sequence in which the employee information of the main table resides
A4 Open Excel data file
A5 reads each sheet data of Excel file in a loop
B6 reads employee information sequence
C6 saves the employee information read by B6 to sequence table A1
B7 reads the employee's family member information from line 6 and reads only the specified 5 columns of "family member, name, relationship, work unit, contact number".
B8 renames the family member column of the B7 order table to a × × number
C8 assigns the × × column of the B8 ordered table to the × × number in the employee information.
B9 saves the information of employee family members in B8 to sequence table A2
After the program runs, the sequence table A1 is shown in the following figure:
Table A2 is shown in the following figure:
The above situations basically list the common Excel data formats, and if you encounter more complex files, you can flexibly use the techniques in the examples to deal with them. To sum up, the aggregator provides a very flexible function to locate and read data in excel files, not only to read grid data into slices, but also to accurately locate cells for reading. Combined with the unique "sequence table" object, the structured storage of Excel data, which used to require thousands of lines of code, can now be done with less than 10 lines of code, or even 2 or 3 lines of code in a simple case.
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.