This article explains how to use Python to insert data into a database in batches. The content is fairly detailed; interested readers are welcome to follow along, and I hope it is helpful to you.
I am a pythoner struggling at the bottom of the programming chain. Besides processing data, I also have to keep up friendly "communication" with an ERP system and a web site. From time to time I will share things from my work: programming tips I think are worth recording, problems and their solutions, notes from reading source code, and perhaps the occasional reflection on life as a programmer. Right now, I am refactoring one of my programs.
Based on Python, we use pandas, pymysql and other third-party libraries to insert data into the database in batches. On the one hand, this provides a clear answer where many blind reposts on the Internet only confuse (I was confused by them, too); on the other hand, it serves as notes that are easy to look up later.
Background of the requirement
Recently I handled a requirement to bulk insert data into a database, described as follows.
The original program updated and modified the data through SQL stored procedures. Because of the large data volume this put too much pressure on the database, so the program had to be refactored: read the files with Python and do the calculation and processing there, which removes that part of the load, and finally only call an AWS Lambda service to write the calculated results back to the database. This relieves a great deal of pressure and also lowers cost. The database operations involved are mainly insert and update.
Environment and library versions
Based on a Linux system.
Third-party libraries >> pandas 1.0.5, pymysql 0.9.3
Python version >> 3.7
Standard library >> os
Logic overview
In fact, by the last step the data to be written to the database is already sitting in memory, because all the calculation after reading the files is done in memory. There is therefore no need to write the results to local files, read them back, and then write them to the database; that would only hurt the program's efficiency. The logic is as follows:
Read the files
Join and calculate the files to generate a new df
Initialize the connection to the database
Convert the required df data into tuples (the exact format depends on how the database's third-party library interface supports batch operations)
Write data to the database
Check the contents of the database
Step by step implementation and analysis
Read the files
Give the file path, then read the file. A few points deserve attention:
Absolute path: the simplest option, just give the path string directly; but once the directory structure changes, the path has to be updated again and again.
Relative path: I generally prefer to locate the current script's position first inside the script, and then find the file via a relative path. As long as the directory structure of the whole package stays the same, nothing needs to change, and even when deploying, the package can go live exactly as it is laid out, which is very convenient.
pandas infers numeric-looking columns as numbers (often float) by default, so declare an explicit type for fields that look like numbers but actually need to be used as strings.
import os
import pandas as pd
import numpy as np

# location of the current script
current_folder_path = os.path.dirname(__file__)
# location of your files
your_file_path2 = os.path.join(current_folder_path, "file name 1")
your_file_path3 = os.path.join(current_folder_path, "file name 2")
# reading csv files as an example; delimiter is the separator between columns
df1 = pd.read_csv(your_file_path2, dtype={"column1": str, "column2": str}, delimiter="\t")
df2 = pd.read_csv(your_file_path3, dtype={"column1": str, "column2": str}, delimiter="\t")
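If a numeric-looking column was not declared in dtype and has already been read as a number, it can still be converted afterwards; a minimal sketch, where "column1" is just a placeholder name:

# hypothetical follow-up: convert a column that was read as a number back to string
df1["column1"] = df1["column1"].astype(str)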
Joining and calculating the files
Joining the files mainly relies on the merge and concat syntax. A couple of small points worth noting:
merge mainly corresponds to SQL's inner join, outer join, left join and right join.
concat is mainly used to simply stack dataframes with the same structure (that is, it increases the total number of rows); see the sketch after the merge example below.
# here is an example of a left join; assume only two files are joined
ret_df = pd.merge(df1, df2, left_on=["column_name"], right_on=["column_name"], how="left")
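And a minimal concat sketch for the point above, assuming df1 and df2 share the same columns:

# stack two frames with identical structure; ignore_index renumbers the rows
stacked_df = pd.concat([df1, df2], ignore_index=True)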
Initialize the connection
Import the third-party library pymysql and initialize the connection.
import pymysql

# pymysql API to get a connection; parameters are passed in explicitly
def mysql_conn(host, user, password, db, port=3306, charset="utf8"):
    conn = pymysql.connect(host=host, user=user, password=password,
                           database=db, port=port, charset=charset)
    return conn
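A minimal usage sketch of the helper above, with host and credentials as placeholders; the cursor is opened as a context manager so it closes itself:

conn = mysql_conn("your db host", "your username", "your password", "db name")
try:
    with conn.cursor() as cursor:   # pymysql cursors support the with statement
        cursor.execute("SELECT 1")  # simple connectivity check
        print(cursor.fetchone())
finally:
    conn.close()                    # always release the connection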
Converting the data for the corresponding interface
The insert should be wrapped in a transaction, so that if it fails the database is left untouched.
Then construct a data format that matches the interface. Looking it up, pymysql has two interfaces for executing statements:
execute (single insert statement)
The interface that executes a single statement.
Something like this: insert into table_name (column) values (value)
executemany (bulk insert statement)
The interface that executes the same statement once for each row of parameters.
Something like this: insert into table_name (column1, column2, column3) values (value1, value2, value3)
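A quick sketch of the difference between the two calls, assuming a cursor already exists and using placeholder table and column names:

# execute: one statement, one parameter tuple
cursor.execute(
    "insert into table_name (column1, column2) values (%s, %s)",
    ("value1", "value2"),
)

# executemany: the same statement run once per tuple in the list
cursor.executemany(
    "insert into table_name (column1, column2) values (%s, %s)",
    [("value1", "value2"), ("value3", "value4")],
)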
The specific implementation is as follows
# first create the connection and a cursor to operate on it
conn = mysql_conn("your db host", "your username", "your password", "db name")
cursor = conn.cursor()
# open a transaction
conn.begin()

# construct the batch data
# keep only the columns that are needed / exist in the database table
columns = list(ret_df.columns)
# remove columns that are not wanted or that the database table does not have
columns.remove("column name")
# rebuild the df with those columns; make sure every remaining column should be written
new_df = ret_df[columns].copy()
# build the comma-separated column list for the sql statement
# (this corresponds to the (column1, column2, column3) part of the statement above)
columns = ', '.join(list(new_df.columns))
# build the data for each row, corresponding to the (value1, value2, value3) part;
# each tuple is one row, so the number of tuples equals the number of df rows
data_list = [tuple(i) for i in new_df.values]
# build one "%s" placeholder per column of a row
s_count = len(data_list[0]) * "%s,"
# construct the sql statement
insert_sql = "insert into " + "database table name" + " (" + columns + ") values (" + s_count[:-1] + ")"
Write data to the database
This is simple. Go straight to the code.
cursor.executemany(insert_sql, data_list)
conn.commit()
cursor.close()
conn.close()
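Following the transaction point made earlier, a minimal sketch that rolls the whole batch back if the write fails:

try:
    cursor.executemany(insert_sql, data_list)
    conn.commit()
except Exception:
    conn.rollback()   # undo the partial batch on any failure
    raise
finally:
    cursor.close()
    conn.close()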
Check whether the database is inserted successfully
If everything looks fine, you can read multiple files at the same time, run the calculations, and finally use multithreading to write the data to the database concurrently, which is very efficient! A rough sketch follows.
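The threading part is not shown in this article's code; a rough sketch with the standard library, assuming a hypothetical process_and_insert(path) helper that wraps the read, calculate and insert steps above for one file:

from concurrent.futures import ThreadPoolExecutor

file_paths = [your_file_path2, your_file_path3]  # placeholder list of input files

def process_and_insert(path):
    # hypothetical wrapper around the read / calculate / insert steps above
    ...

with ThreadPoolExecutor(max_workers=4) as executor:
    # each file is read, calculated and written in its own thread
    list(executor.map(process_and_insert, file_paths))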
Complete code
import os
import pandas as pd
import numpy as np
import pymysql


# pymysql interface
def mysql_conn(host, user, password, db, port=3306, charset="utf8"):
    conn = pymysql.connect(host=host, user=user, password=password,
                           database=db, port=port, charset=charset)
    return conn


# location of the current script
current_folder_path = os.path.dirname(__file__)
# location of your files
your_file_path2 = os.path.join(current_folder_path, "name of the file 1")
your_file_path3 = os.path.join(current_folder_path, "name of the file 2")
# reading csv files as an example; delimiter is the column separator we agreed on internally
df1 = pd.read_csv(your_file_path2, dtype={"column1": str, "column2": str}, delimiter="\t")
df2 = pd.read_csv(your_file_path3, dtype={"column1": str, "column2": str}, delimiter="\t")

# merge
ret_df = pd.merge(df1, df2, left_on=["column_name"], right_on=["column_name"], how="left")

# create the connection and a cursor to operate on it
conn = mysql_conn("your db host", "your username", "your password", "db name")
cursor = conn.cursor()
# open a transaction
conn.begin()

# keep only the columns that are needed / exist in the database table
columns = list(ret_df.columns)
# remove columns that are not wanted or that the database table does not have
columns.remove("column name")
# rebuild the df with those columns; make sure every remaining column should be written
new_df = ret_df[columns].copy()
# build the comma-separated column list for the sql statement
# (this corresponds to the (column1, column2, column3) part of the statement)
columns = ', '.join(list(new_df.columns))
# build the data for each row, corresponding to the (value1, value2, value3) part;
# each tuple is one row, so the number of tuples equals the number of df rows
data_list = [tuple(i) for i in new_df.values]
# build one "%s" placeholder per column of a row
s_count = len(data_list[0]) * "%s,"
# construct the sql statement
insert_sql = "insert into " + "database table name" + " (" + columns + ") values (" + s_count[:-1] + ")"

try:
    cursor.executemany(insert_sql, data_list)
    conn.commit()
    cursor.close()
    conn.close()
except Exception as e:
    # roll back on failure so the database is left untouched
    conn.rollback()
    cursor.close()
    conn.close()

That is all about how to use Python to insert data into the database in batches. I hope the above content is of some help and lets you learn something new. If you think the article is good, feel free to share it so more people can see it.