
Analysis of how to load and organize data in Python


Many novices are not very clear about how to carry out data loading and tidying in Python. To help you solve this problem, the following explains it in detail; anyone who needs it can follow along, and I hope you gain something.

Data loading

Import text data

1. Import text format data (CSV):

Method 1: use pd.read_csv(), which opens comma-separated files by default.

Several equivalent calls can import text-format data, as sketched below. Passing only the file name works when the running .py script is in the same folder as the target CSV file; otherwise the file name must be prefixed with the file's path.
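A minimal sketch of both calls, assuming a file named ex1.csv (the name used in the original example) with a header row, and a hypothetical data/ subfolder:

import pandas as pd

df = pd.read_csv('ex1.csv')        # works when the script and ex1.csv share a folder
df = pd.read_csv('data/ex1.csv')   # otherwise, prefix the name with the file's path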

Method 2: using pd.read_table(), you need to specify which delimiter the text file uses, via sep=','.
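A sketch of the same import through read_table, again assuming ex1.csv; read_table defaults to tab-delimited, so the comma must be named explicitly:

import pandas as pd

# read_table defaults to tab-delimited, so a comma delimiter must be given
df = pd.read_table('ex1.csv', sep=',')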

2. When the file does not have a header row

You can have pandas automatically assign default column names to it.

You can also define your own column names.
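Both options, sketched with a hypothetical headerless file ex2.csv and illustrative column names:

import pandas as pd

# header=None makes pandas assign default integer column names (0, 1, 2, ...)
df = pd.read_csv('ex2.csv', header=None)

# names= supplies column names of your own
df = pd.read_csv('ex2.csv', names=['a', 'b', 'c', 'd', 'message'])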

3. Use a column as the index, for example the message column, by specifying 'message' through the index_col parameter.

4. To make multiple columns into a hierarchical index, simply pass a list of column numbers or column names to index_col. Both this and the single-column case from item 3 are sketched below.
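A sketch of both index_col uses; the file names ex2.csv and csv_mindex.csv and the column names are illustrative:

import pandas as pd

names = ['a', 'b', 'c', 'd', 'message']

# single column as index: the message column becomes the row index
df = pd.read_csv('ex2.csv', names=names, index_col='message')

# hierarchical index: pass a list of column names (or numbers)
parsed = pd.read_csv('csv_mindex.csv', index_col=['key1', 'key2'])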

5. Handling missing values in text. Missing data is either absent (an empty string) or represented by a sentinel value. By default, pandas recognizes a set of common sentinels, such as NA and NULL, and displays them as NaN.
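A sketch of the default behavior plus custom sentinels via na_values, assuming a hypothetical ex5.csv with columns named message and something:

import pandas as pd

# NA and NULL are recognized out of the box and shown as NaN;
# na_values adds extra sentinels, per column when given as a dict
df = pd.read_csv('ex5.csv', na_values=['NULL'])
df = pd.read_csv('ex5.csv', na_values={'message': ['foo', 'NA'], 'something': ['two']})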

6. Read the text file block by block

If you only want to read a few lines (and avoid reading the entire file), you can specify this through nrows.
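A sketch assuming a hypothetical ex6.csv; nrows limits the read, while chunksize (not named in the text, but the usual tool for true block-by-block reading) returns an iterator of DataFrames:

import pandas as pd

df = pd.read_csv('ex6.csv', nrows=5)   # read only the first 5 rows

# chunksize yields an iterator of DataFrames for block-by-block processing
total = 0
for chunk in pd.read_csv('ex6.csv', chunksize=1000):
    total += len(chunk)                # e.g. count rows one block at a time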

7. For tables that are not separated by a fixed delimiter, you can use a regular expression as the delimiter for read_table.

(Here '\s+' is a regular expression matching one or more whitespace characters.)
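A sketch assuming a hypothetical whitespace-separated file ex3.txt:

import pandas as pd

# columns separated by a variable amount of whitespace parse correctly
result = pd.read_table('ex3.txt', sep=r'\s+')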

Import JSON data

JSON has become one of the standard formats for sending data between web browsers and other applications via HTTP requests. A JSON string can be converted into a Python object through json.loads (after import json).

The corresponding json.dumps converts a Python object back to JSON format.
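A minimal round trip with the standard library; the sample object is illustrative:

import json

obj = '{"name": "Wes", "places_lived": ["United States", "Spain"]}'
data = json.loads(obj)       # JSON string -> Python dict
asjson = json.dumps(data)    # Python object -> JSON string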

Import EXCEL data

Use read_excel(file name or path) directly, similar to reading a file in CSV format.
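A sketch assuming a hypothetical workbook ex1.xlsx and an installed Excel engine such as openpyxl:

import pandas as pd

df = pd.read_excel('ex1.xlsx', sheet_name=0)   # first sheet of the workbook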

Import database data

It mainly covers two kinds of database sources: SQL relational databases, and non-SQL (NoSQL) databases such as MongoDB.

Database files are among the harder cases; I have not worked with databases and have not personally tested this, so no screenshots are pasted. A hedged sketch of the SQL case follows.
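Since the original offers no tested example, here is only a minimal sketch of the SQL case, using the standard library's sqlite3 with pandas.read_sql; the database file and table name are hypothetical:

import sqlite3
import pandas as pd

con = sqlite3.connect('mydata.sqlite')        # hypothetical SQLite database file
df = pd.read_sql('SELECT * FROM test', con)   # 'test' is a hypothetical table
con.close()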

Data collation

Merge dataset

1. Database-style merging

Database-style merging works like join in SQL databases. Merging is done by calling the merge function.

When you do not specify which column to join on, merge automatically joins on the overlapping column name, that is, on the shared "key" column. You can also specify the join column explicitly through on.
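A sketch with two small illustrative DataFrames sharing a 'key' column:

import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c'], 'data1': range(4)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

merged = pd.merge(df1, df2)             # joins on the overlapping column 'key'
merged = pd.merge(df1, df2, on='key')   # the same join with the column named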

When the column names of the two objects differ, that is, when they have no common column, the join columns can also be specified separately:

left_on names the column in the left DataFrame to use as the join key.

right_on names the column in the right DataFrame to use as the join key.

The result of such a statement keeps only the rows whose keys appear in both objects and drops the rest, because by default merge performs an 'inner' join, the inner join of SQL, yielding the intersection of the two objects. There are other join types: 'left', 'right', and 'outer', selected with the how parameter.

You can also merge on multiple keys (columns) by passing a list of column names to on. All of these options are sketched below.
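A sketch of separately named join columns, the how parameter, and multiple keys; all DataFrames are illustrative:

import pandas as pd

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c'], 'data1': range(4)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})

# no common column: name each side's join column separately
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

# how= switches the join type away from the default 'inner'
pd.merge(df3, df4, left_on='lkey', right_on='rkey', how='outer')

# multiple keys: pass a list of column names to on
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
pd.merge(left, right, on=['key1', 'key2'], how='outer')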

2. Merge on index

(1) Merging on an ordinary index

left_index=True means the row index of the left object is used as its join key.

right_index=True means the row index of the right object is used as its join key.

When a DataFrame's join keys are in its index, join with left_index=True or right_index=True, or both at once.
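A sketch of both patterns with illustrative data:

import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7.0]}, index=['a', 'b'])

# the right object's row index serves as its join key
pd.merge(left1, right1, left_on='key', right_index=True)

# when both keys live in the indexes, set both flags
pd.merge(left1.set_index('key'), right1, left_index=True, right_index=True)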

(2) hierarchical index

This works like merging on multiple keys with on: a list of columns on one side is matched against the levels of the other side's hierarchical index.
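A sketch assuming an illustrative right object carrying a two-level MultiIndex:

import pandas as pd

lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Nevada'],
                      'key2': [2000, 2001, 2001],
                      'data': [0.0, 1.0, 2.0]})
righth = pd.DataFrame({'event1': [0, 2, 4, 6], 'event2': [1, 3, 5, 7]},
                      index=pd.MultiIndex.from_tuples(
                          [('Nevada', 2001), ('Nevada', 2000),
                           ('Ohio', 2000), ('Ohio', 2001)]))

# the list of left columns lines up with the levels of the right MultiIndex
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)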

3. Axial concatenation (merging)

Axial concatenation joins objects along the row axis by default; it can also join horizontally by passing axis=1.

(1) NumPy objects (arrays) can be merged with NumPy's concatenate function.

(2) pandas objects (such as Series and DataFrame) can be merged with pandas' concat function.
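A sketch of both, with illustrative arrays and Series:

import numpy as np
import pandas as pd

arr = np.arange(12).reshape((3, 4))
np.concatenate([arr, arr], axis=1)   # NumPy arrays: concatenate side by side

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
pd.concat([s1, s2])                  # stacks vertically (axis=0) by default
pd.concat([s1, s2], axis=1)          # axis=1 lines the objects up as columns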

4. Merge overlapping data

For two datasets whose indexes overlap in whole or in part, we can merge them using NumPy's where function, which behaves like a vectorized if-else.

Where both objects have values, the calling object's values are kept; where one side is missing, values from the other fill the gap. You can also merge with the combine_first method, whose logic matches the where approach while additionally aligning the data by index.
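A sketch of both approaches on two illustrative Series with overlapping indexes:

import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5], index=['f', 'e', 'd', 'c'])
b = pd.Series([0.0, 1.0, 2.0, np.nan], index=['f', 'e', 'd', 'c'])

# where as a vectorized if-else: take b where a is missing, else keep a
np.where(pd.isnull(a), b, a)

# combine_first patches a's missing values with b's, aligned by index
a.combine_first(b)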

Reshape the dataset

1. Rotate the data

(1) Reshaping with the index, split into stack (rotates columns of data into rows) and unstack (rotates rows of data into columns).
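A sketch on an illustrative DataFrame:

import pandas as pd

data = pd.DataFrame([[0, 1, 2], [3, 4, 5]],
                    index=['Ohio', 'Colorado'],
                    columns=['one', 'two', 'three'])

stacked = data.stack()   # columns rotate into an inner row level
stacked.unstack()        # rows rotate back out into columns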

(2) Rotating from 'long' format to 'wide' format.
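A sketch using pivot on an illustrative long-format table:

import pandas as pd

long_df = pd.DataFrame({'date': ['d1', 'd1', 'd2', 'd2'],
                        'item': ['x', 'y', 'x', 'y'],
                        'value': [1, 2, 3, 4]})

# one row per date, one column per item: long records become a wide table
wide = long_df.pivot(index='date', columns='item', values='value')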

2. Convert data

(1) Data replacement: replacing one or more values with new values, most commonly for missing values or outliers. Missing values are generally marked with NULL or NaN, and new values can replace these marker values. The method is replace.

One-to-one replacement: replace -999 with np.nan.

Many-to-one replacement: replace both -999 and -1000 with np.nan.

Many-to-many replacement: replace -999 with np.nan and -1000 with 0.

Replacements can also be passed in the form of a dictionary. All four forms are sketched below.
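A sketch on an illustrative Series containing the sentinel values -999 and -1000:

import numpy as np
import pandas as pd

data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

data.replace(-999, np.nan)                 # one-to-one
data.replace([-999, -1000], np.nan)        # many-to-one
data.replace([-999, -1000], [np.nan, 0])   # many-to-many
data.replace({-999: np.nan, -1000: 0})     # dictionary form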

(2) Discretization or binning, that is, grouping the data according to certain conditions.

A set of ages can be grouped with pd.cut().

By default, cut's bins are open on the left and closed on the right. You can pass right=False to make the intervals closed on the left instead.
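A sketch with illustrative ages and bin edges:

import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)               # intervals like (18, 25]: right-closed
cats = pd.cut(ages, bins, right=False)  # right=False closes the left side instead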

Clean up the dataset

This mainly means cleaning up duplicate values. Duplicate rows often appear in a DataFrame, and cleaning the data is chiefly aimed at removing these duplicate rows.

Using the drop_duplicates method returns a DataFrame with the duplicate rows removed.

By default, this method checks all columns for duplicates; it can also be told to consider only a specific column or columns.

By default, the method keeps the first combination of values that appears; passing take_last=True keeps the last one instead (newer pandas versions spell this keep='last').
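A sketch on an illustrative DataFrame with duplicate rows:

import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

data.drop_duplicates()                     # consider all columns
data.drop_duplicates(['k1'])               # consider only column k1
data.drop_duplicates(['k1'], keep='last')  # keep='last' is the modern take_last=True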

Was the content above helpful to you? If you want to learn more or read more related articles, please follow the industry information channel. Thank you for your support.
