
How to improve the reading speed of very large datasets using Pandas and SQLite


This article mainly introduces how to use Pandas and SQLite to improve the reading speed of very large datasets. The article is quite detailed and has practical reference value; interested readers should find it worth reading!

The first method: chunking

Imagine a concrete example: you are running for election, and to that end you have found a CSV file containing information about every voter in your city. You want to send people door to door to invite all the registered voters on a particular street to vote.

Pandas keeps an index on its DataFrame objects, but to use it you first have to read the data into memory, and this CSV file is too large to fit in memory. The idea, then, is to load only the records you care about.

This is the first method, chunking:

import pandas as pd

def get_voters_on_street(name):
    # Read the CSV in 1,000-row chunks, keep only the rows for the
    # requested street in each chunk, and concatenate the results.
    return pd.concat(
        df[df["street"] == name]
        for df in pd.read_csv("voters.csv", chunksize=1000)
    )

The CSV file is loaded in chunks, each chunk is filtered by street name, and the matching records are then concatenated.

Although we only care about a relatively small subset of the data, every row still has to be loaded and parsed, so there is considerable overhead. For example, on a dataset of only 70,000 rows, the function above takes 574 ms on my computer. New York City had about 4.6 million registered voters in 2018; at that scale, a single street lookup would take about 30 s.
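If you want to reproduce a measurement like this on your own data, here is a minimal sketch using Python's standard time module; the street name is a made-up placeholder, and the numbers will depend on your machine and data:

import time

# Time one street lookup with the chunked-CSV approach.
# "MAIN STREET" is a hypothetical value; use a street that exists in your file.
start = time.perf_counter()
get_voters_on_street("MAIN STREET")
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Chunked CSV lookup took {elapsed_ms:.0f} ms")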

If we only do this once, it does not matter much; but if we need to do it over and over again, this method is not the best.

The second method: creating an index

An index is a summary of the data, something that says: if you care about this value, here is where to find the rows that match it. In our example, we want to create an index on the street name, so that we can quickly load just the voters who live on a given street.

If you are worried that the index data will also not fit in memory, a database such as PostgreSQL or MySQL can serve as the container that stores it. But if you don't like installing and maintaining those services, SQLite is exactly what you need.

SQLite is a fully functional relational database that works much like other databases, but it does not require a server. Python supports it out of the box. SQLite stores the data in a single file, so you manage one SQLite data file instead of a CSV file.
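As a minimal illustration of that point, the sketch below uses only Python's built-in sqlite3 module to create a throwaway in-memory database; nothing has to be installed or configured:

import sqlite3

# sqlite3 ships with Python; no server process is needed.
conn = sqlite3.connect(":memory:")  # a temporary database living in memory
conn.execute("CREATE TABLE demo (name TEXT)")
conn.execute("INSERT INTO demo VALUES ('hello')")
print(conn.execute("SELECT name FROM demo").fetchall())  # [('hello',)]
conn.close()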

Using SQLite to store data

Here's how to manipulate SQLite with Pandas:

1. Load data into SQLite and create an index

A SQLite database can hold multiple tables. First, load the data from the voters.csv file into SQLite and save it as a voters.sqlite file. In this file, we create a table called voters.

Next, create an index on the street column in SQLite.

You only need to do the following:

import sqlite3
import pandas as pd

# Create a new database file:
db = sqlite3.connect("voters.sqlite")

# Load the CSV in chunks:
for c in pd.read_csv("voters.csv", chunksize=1000):
    # Append all rows to a new database table, which
    # we name 'voters':
    c.to_sql("voters", db, if_exists="append")

# Add an index on the 'street' column:
db.execute("CREATE INDEX street ON voters(street)")
db.close()

Here we create only a single index, but we could also create indexes on other columns, or on multiple columns, which would let us search the database quickly by those columns as well.
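For example, assuming the voters table also had a last_name column (a hypothetical column, just for illustration), a multi-column index could be created like this:

# Hypothetical: an index covering two columns, so queries filtering on
# street and last_name together can also be answered quickly.
db = sqlite3.connect("voters.sqlite")
db.execute("CREATE INDEX street_and_name ON voters(street, last_name)")
db.close()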

2. Rewrite the query function

Now that all the data has been loaded into SQLite, we can retrieve it by street.

def get_voters_for_street(street_name):
    conn = sqlite3.connect("voters.sqlite")
    q = "SELECT * FROM voters WHERE street = ?"
    values = (street_name,)
    # SQLite does the filtering; only matching rows are handed to Pandas.
    return pd.read_sql_query(q, conn, params=values)

When this function runs, SQLite loads only the rows that match the query, and Pandas returns them as a DataFrame.
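A small usage sketch (the street name is a placeholder):

# Fetch every voter on one street; the result is an ordinary DataFrame.
voters = get_voters_for_street("MAIN STREET")
print(len(voters))    # number of matching voters
print(voters.head())  # first few matching rows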

A more than 50x speedup

For the same CSV file of 70,000 records, the query that used to take 574 ms now takes only about 10 ms.

That is a speedup of more than 50x, because we only load the rows we care about rather than every row in the CSV file.
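If you want to confirm that SQLite really is using the index rather than scanning the whole table, you can ask it for its query plan; a small diagnostic sketch (the street name is a placeholder) looks like this, and output mentioning "USING INDEX street" means the index is in play:

# Ask SQLite how it would execute the query.
conn = sqlite3.connect("voters.sqlite")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM voters WHERE street = ?",
    ("MAIN STREET",),
).fetchall()
print(plan)  # look for a row containing "USING INDEX street"
conn.close()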

The above is all the content of the article "How to improve the reading speed of very large datasets using Pandas and SQLite". Thank you for reading! I hope it was helpful; for more related knowledge, you are welcome to follow the industry information channel!
