2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article introduces easy ways to process very large datasets with Pandas.
Working with large datasets is often tricky, especially when the data does not fit in memory. In resource-constrained situations, Pandas provides features that reduce the memory footprint of a loaded dataset. Available techniques include compression, indexing, and chunking.
Several problems have to be addressed when working with data, and one of them is sheer volume. If the amount of data exceeds the capacity of local memory, the project will run into problems.
What are the solutions to this?
There are several ways to solve the problem of excessive data volume. They either consume time or require increased investment.
Possible solutions
Investment solution: Buy a new computer with a stronger CPU and more memory that can process the entire dataset. Alternatively, rent cloud services or virtual memory and create clusters to handle the workload.
Time-consuming solution: If memory is insufficient to process the entire data set, and the hard disk capacity is much larger than memory, consider using the hard disk to store data. But using hard drives to manage data can significantly reduce processing performance, and even SSDs are much slower than memory.
Both solutions work as long as resources permit. They are the simplest and most straightforward options if the project is well funded or time is not a constraint.
But what if that's not the case? Maybe you have limited funds, or your dataset is so large that loading it from disk would increase processing time by 5 to 6 times or more. Is there a big data solution that doesn't require extra money or time?
That is exactly the question this article addresses.
There are a variety of techniques available for big data processing that do not require additional investment and do not take a lot of loading time. This article describes three of these techniques for processing large-scale datasets using Pandas.
First technique: compression
The first technique is data compression. Compression does not mean packing data into ZIP files, but storing data in memory in a compressed format.
In other words, data compression is a way to represent data using less memory. There are two types of data compression, lossless compression and lossy compression. These two types affect only the loading of data, not the processing code.
Lossless compression
Lossless compression does not cause any loss of data, i.e. the original data and the compressed data are semantically identical. Lossless compression can be performed in three ways. Each is introduced in turn below, using a dataset of COVID-19 cases in the United States.
Load specific data columns
The dataset used in the example has the following structure:
    import pandas as pd

    data = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
    data.sample(10)
Loading the entire dataset takes up 111MB of memory!
Why load the entire dataset if only two of its columns are needed, for example the county name and the number of cases? Loading just those two columns takes only 36MB, about 32% of the original memory usage (a reduction of roughly 68%).
The code to load only the required columns using Pandas is as follows:
    # Load the required library
    import pandas as pd

    # Dataset
    csv = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"

    # Load the entire dataset
    data = pd.read_csv(csv)
    data.info(verbose=False, memory_usage="deep")

    # Create a subset of the data
    df = data[["county", "cases"]]
    df.info(verbose=False, memory_usage="deep")

    # Load only the two required columns
    df_2col = pd.read_csv(csv, usecols=["county", "cases"])
    df_2col.info(verbose=False, memory_usage="deep")
Code Address:
https://gist.github.com/SaraM92/3ba6cac1801b20f6de1ef3cc4a18c843#file-column_selecting-py
Manipulating data types
Another way to reduce memory usage is to downcast numeric values. For example, when a CSV file is loaded into a DataFrame, each integer value is stored as int64 by default and so needs 64 bits (8 bytes). Memory can be saved by storing the values in a smaller integer type:
int8 stores values in the range -128 to 127;
int16 stores values in the range -32768 to 32767;
int32 stores values in the range -2147483648 to 2147483647;
int64 stores values in the range -9223372036854775808 to 9223372036854775807.
If the values are known in advance not to exceed 32767, the column can be stored as int16 instead of int64, which reduces its memory footprint by 75% (int32 would reduce it by 50%).
Assuming that the number of cases per state does not exceed 32767 (although this is not the case in reality), the column can be downcast to int16 rather than kept as int64.
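The downcast described above can be sketched as follows. This is a minimal illustration rather than the article's own code: the DataFrame is a small hypothetical stand-in for the "cases" column, with values chosen so that they fit in int16.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the "cases" column; all values fit in int16.
df = pd.DataFrame({"cases": np.arange(0, 30000, 3, dtype="int64")})
bytes_int64 = df["cases"].memory_usage(index=False)

# Check the value range first, because astype does not warn on overflow.
limits = np.iinfo(np.int16)
if not df["cases"].between(limits.min, limits.max).all():
    raise ValueError("values do not fit in int16")

# Narrow the column from 8 bytes per value to 2 bytes per value.
df["cases"] = df["cases"].astype("int16")
bytes_int16 = df["cases"].memory_usage(index=False)

print(bytes_int64, bytes_int16)
```

When the range is known before loading, the type can also be requested at read time by passing dtype={"cases": "int16"} to pd.read_csv, which avoids ever holding the int64 version in memory.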
Sparse columns
If a dataset has a large number of NaN (null) values in one or more columns, a sparse column representation can be used so that the nulls do not consume memory.
Assuming there are some nulls in the state column, we want to avoid spending memory on them. This can be done easily with the sparse data types in Pandas.
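As a rough illustration of the sparse representation, the sketch below converts a mostly empty numeric column to a sparse dtype. The column is hypothetical (it is not taken from the case dataset), and the example assumes NaN is the value to be left unstored.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column in which 990 of 1000 values are NaN.
values = np.full(1000, np.nan)
values[::100] = 1.0
dense = pd.Series(values, name="cases")

# Sparse dtype with NaN as the fill value: only the 10 real values are stored.
sparse = dense.astype(pd.SparseDtype("float64", np.nan))

dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)
density = sparse.sparse.density  # fraction of values actually stored

print(dense_bytes, sparse_bytes, density)
```

The saving depends entirely on how many values equal the fill value, so a column with few nulls gains little from this representation.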
Lossy compression
What if lossless compression is not enough and further compression is required? Lossy compression reduces memory usage further by giving up 100 percent accuracy of the data.
Lossy compression can be performed in two ways: modifying the values and sampling.
Modifying the values: Sometimes it is not necessary to preserve the full precision of the values, in which case int64 can be truncated to int32 or even int16.
Sampling: If you need to confirm that some states have a higher number of COVID-19 cases than others, you can load a sample of the data and compare the states in that sample. This is lossy compression because not all rows of data are considered.
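One possible way to sample at load time is to skip rows while reading the CSV. The sketch below keeps every n-th data row. The helper name load_every_nth_row and the small in-memory CSV are assumptions made for illustration, not part of the original article.

```python
import io
import pandas as pd

def load_every_nth_row(source, n):
    """Load the header and every n-th data row; all other rows are skipped (lossy)."""
    # read_csv calls the function with each line index; index 0 is the header line.
    return pd.read_csv(source, skiprows=lambda i: i > 0 and i % n != 0)

# Hypothetical CSV with 100 data rows standing in for the real file.
csv_text = "county,cases\n" + "\n".join(f"county_{i},{i}" for i in range(1, 101))
sample = load_every_nth_row(io.StringIO(csv_text), n=10)

print(len(sample))
```

Systematic sampling like this is simple but can be biased if the file has a periodic ordering. For data already in memory, DataFrame.sample draws a random sample instead.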
Second technique: chunking
Another way to deal with large data sets is data chunking. The large-scale data is divided into many small blocks, and then each block is processed separately. After processing all the blocks, the results can be compared and a final conclusion can be drawn.
The dataset used in this article contains 1923 rows of data.
Assuming we need to find the county with the most cases, we can split the dataset into chunks of 100 rows, process each chunk separately, and take the maximum of these partial results.
The code snippet for this section is as follows:
    # Import the required library
    import pandas as pd

    # Dataset
    csv = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"

    # Loop over the chunks and record the maximum of each chunk
    result = {}
    for chunk in pd.read_csv(csv, chunksize=100):
        max_case = chunk["cases"].max()
        max_case_county = chunk.loc[chunk["cases"] == max_case, "county"].iloc[0]
        # Keep the largest value seen so far for this county
        result[max_case_county] = max(result.get(max_case_county, 0), max_case)

    # Print the county with the most cases and its case count
    top_county = max(result, key=result.get)
    print(top_county, result[top_county])
Code Address:
https://gist.github.com/SaraM92/808ed30694601e5eada5e283b2275ed7#file-chuncking-py
Third technique: indexing
Data chunking is ideal for situations where the dataset is loaded only once. But if you need to load the dataset multiple times, you can use indexing techniques.
An index can be understood as a table of contents for a book. You don't need to read the entire book to get the information you need.
For example, chunking is well suited to retrieving the number of cases in a given state. This can be done with a simple function, get_state_info, that reads the file chunk by chunk and keeps only the rows for that state.
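The original function is not reproduced in this text. A chunk-based get_state_info might look like the sketch below, where the signature, the chunksize default and the small in-memory CSV are assumptions made for illustration.

```python
import io
import pandas as pd

def get_state_info(source, state, chunksize=100):
    """Return all rows for one state, reading the CSV one chunk at a time."""
    matches = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Every row of every chunk is read, even though only a few match.
        matches.append(chunk[chunk["state"] == state])
    return pd.concat(matches, ignore_index=True)

# Hypothetical data standing in for the real CSV file.
csv_text = "state,cases\nTexas,1\nOhio,2\nTexas,3\nUtah,4\nTexas,5\n"
texas = get_state_info(io.StringIO(csv_text), "Texas", chunksize=2)

print(texas)
```

Only one chunk is held in memory at a time, but the whole file is still scanned on every call.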
Indexing vs. chunking
Chunking requires reading all data, while indexing requires reading only part of the data.
Such a function loads all the rows in each chunk even though we only care about one state, which leads to a lot of overhead. To avoid this, the data can be stored in a database that Pandas can read from and write to. SQLite is a simple choice.
First, load the DataFrame into a SQLite database. The code is as follows:
    import sqlite3
    import pandas as pd

    csv = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"

    # Create a new database file
    db = sqlite3.connect("cases.sqlite")

    # Load the CSV file chunk by chunk
    for c in pd.read_csv(csv, chunksize=100):
        # Append all rows of the chunk to the database table
        c.to_sql("cases", db, if_exists="append")

    # Add an index on the state column
    db.execute("CREATE INDEX state ON cases(state)")
    db.close()
Code Address:
https://gist.github.com/SaraM92/5b445d5b56be2d349cdfa988204ff5f3#file-load_into_db-py
To use the database, the get_state_info function needs to be rewritten so that it queries the database instead of scanning the CSV file.
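The rewritten function is not reproduced in this text either. A database-backed version could be sketched as follows, assuming a "cases" table with a "state" column as created by the loading code. Passing the connection as a parameter and using an in-memory database are choices made here so that the example is self-contained.

```python
import sqlite3
import pandas as pd

def get_state_info(conn, state):
    """Return all rows for one state using an indexed SQL lookup."""
    query = "SELECT * FROM cases WHERE state = ?"
    return pd.read_sql_query(query, conn, params=(state,))

# Hypothetical in-memory database standing in for cases.sqlite.
conn = sqlite3.connect(":memory:")
demo = pd.DataFrame({"state": ["Texas", "Ohio", "Texas"], "cases": [1, 2, 3]})
demo.to_sql("cases", conn, index=False)
conn.execute("CREATE INDEX state ON cases(state)")

texas = get_state_info(conn, "Texas")
print(texas)
```

With the index in place, SQLite can locate the matching rows directly, so only those rows are loaded into the DataFrame.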
This reduces memory usage by 50%.