In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-17 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/01 Report--
This article will explain in detail the usefulness of the data analysis toolkit Datatable in r language. The editor thinks it is very practical, so I share it with you as a reference. I hope you can get something after reading this article.
Preface
Data.table is a very general-purpose and high-performance package in R, which is easy to use, easy to use and fast. It is very popular in the R language community, with more than 400000 downloads per month. It is used by nearly 650 CRAN and Bioconductor packages. If you are a user of R, you may have used the data.table package.
For Python users, there is also a datatable package that focuses on big data support, high-performance memory / out-of-memory datasets, and multithreaded algorithms. To some extent, datatable can be called data.table in Python.
Introduction to Datatable
In order to build the model more accurately, machine learning applications usually have to deal with a large amount of data and generate a variety of features, which has become necessary. Python's datatable module provides good support for solving this problem, performing big data operations on a single-node machine at the maximum possible speed (up to 100GB). The development of the datatable package is sponsored by H2O.ai, and its first user is Driverless.ai.
2.1 installation
Mac OS system
Linux system
The installation process needs to be implemented through binary distribution
Unfortunately, the datatable package does not work on Windows systems yet, but Python officials are also working hard to increase their support for Windows. For more information, see the description of Build instructions.
Https://datatable.readthedocs.io/en/latest/install.html
2.2 data reading
The dataset used here is the Lending Club Loan Data dataset from the Kaggle competition, which contains the complete loan data of all lenders for the period 2007-2015, that is, the current loan status (current, delayed, full payment, etc.) and the latest payment information. The entire file contains 2.26 million rows and 145 columns of data, and the size of the data is ideal for demonstrating the functionality of the datatable package.
Dataset:
First load the data into the Frame object, and the basic unit of analysis for datatable is Frame, which is the same as the concept of Pandas DataFrame or SQL table: that is, the data is displayed in a two-dimensional array of rows and columns.
Use datatable to read data
This data set has 2.26 million rows, 145 columns and nearly 1.2 gigabytes of data. It only takes 2.54 seconds to read through datatable.
As shown above, fread () is a powerful and fast function that automatically detects and parses most parameters in text files, including .zip files, URL data, Excel files, and so on. In addition, the datatable parser has the following functions:
Can automatically detect delimiters, headings, column types, reference rules, etc.
Ability to read data from a variety of files, including files, URL,shell, original text, archives and glob, etc.
Provides multi-threaded file reading function for maximum speed.
Include a progress indicator when reading large files.
RFC4180 compatible and incompatible files can be read.
Use pandas to read data
!!! Note: due to the large amount of data, reading data using pandas will often hang up the service, so you can use a dataset with a smaller amount of data to test
As can be seen from this, the results show that the performance of datatable packets is significantly better than that of Pandas,Pandas when reading large data. It takes nearly 30 seconds to read the data, while datatable only takes more than 2 seconds.
2.3frame conversion (Frame Conversion)
For a frame that currently exists, you can convert it to a Numpy or Pandas dataframe, as follows:
Next, convert the data frame read by datatable to Pandas dataframe form and compare the time required, as follows:
Due to the large amount of data in the Lending Club Loan Data dataset and the use of to_padnas operations, the jupyte service is easy to hang up, so use a smaller dataset for testing.
Read the data through datatable and convert it to DataFrame array, a total of 2.62ms.
Reading data through pandas alone requires a total of 14.4ms.
It seems to take less time to read the file as a datatable frame and then convert it to Pandas dataframe than to read the Pandas dataframe directly. Therefore, it is a good idea to import a large data file through the datatable package and convert it to Pandas dataframe.
Basic properties of 2.4 Fram
Here are some of the basic properties of frame in datatable, which are similar to some functions of dataframe in Pandas.
You can also print out the first n lines of output by using the head command, as follows:
Note: color is used to refer to the type of data, where red represents a string, green represents an integer, and blue represents a floating point.
2.5 Statistical Summary
Summarizing and calculating statistics for data is a memory-intensive process in Pandas, but it is convenient in the datatable package. Use the datatable package to calculate statistics for each of the following columns as shown below:
The following uses datatable and Pandas to calculate the average of each column of data and compare the difference in running time between the two.
Datatable read
Pandas read
An exception with a memory error is thrown when calculating with Pandas.
Data operation
Like dataframe, datatable is a cylindrical data structure. In datatable, the main tool for all of these operations is square brackets, inspired by traditional matrix indexes, but it contains more functionality. For example, matrix indexes, Cpact Category quotation, RandasPandasNumpy, all use the same mathematical representation of DT [iQuery j]. Let's take a look at how to use datatable to do some common data processing.
Select a subset of rows / columns
The following code filters all rows and funded_amnt columns from the entire dataset:
Show how to select the first five rows and three columns of data in the dataset, as follows:
Frame sorting
Datatable sorting
In datatable, frames are sorted by specific columns, as follows:
Pandas sorting
You can see that there is a significant difference in sorting time between the two packages.
Delete rows / columns
Here's how to delete the data in the member_id column:
Grouping (GroupBy)
Like Pandas, datatable also has a GroupBy operation. Let's take a look at how to get the mean of the funded_amout column by grouping grade in datatable and Pandas:
Datatable grouping
Pandas grouping
What does .f stand for
In datatable, f stands for frame_proxy, which provides a simple way to reference the frame you are currently working on. In the above example, dt.f represents only dt_df.
Filter Row
In datatable, the syntax for filtering rows is very similar to that of GroupBy. Here's how to filter out values greater than funding_amnt in loan_amnt, as shown below.
Save Fram
In datatable, you can also save the contents of a frame by writing it to a csv file for later use. As follows:
For more features on data manipulation, see the documentation for the datatable package
Address: https://datatable.readthedocs.io/en/latest/using-datatable.html
This is the end of the article on "what is the use of Datatable in r language". I hope the above content can be of some help to you, so that you can learn more knowledge. if you think the article is good, please share it for more people to see.
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.