This article introduces three open source Python data analysis tools in detail, with clear steps and worked examples. I hope it helps answer the question "what are the three open source data analysis tools of Python".
Python is the most commonly used programming language in the big data field, so it is worth knowing its data analysis tools. If you run Python in your own environment using virtualenv, pyenv, or another variant, try the three open source tools recommended in this article.
(Note: the examples in this article use IPython, so make sure it is installed before following along.)
$ mkdir python-big-data
$ cd python-big-data
$ virtualenv ../venvs/python-big-data
$ source ../venvs/python-big-data/bin/activate
$ pip install ipython
$ pip install pandas
$ pip install pyspark
$ pip install scikit-learn
$ pip install scipy
The sample data used in this article is actual production log data collected from a website over a few days. Technically it does not qualify as big data, since it is only about 2 MB in size, but it is enough for demonstration purposes.
If you want the sample data, you can clone it with git from the author's public GitHub repository, admintome/access-log-data:
$ git clone https://github.com/admintome/access-log-data.git
The data is a simple CSV file in which each line represents a separate log entry, with fields separated by commas:
172.68.133.49 - - [01/Aug/2018:17:10:15 +0000] "GET /wp-content/uploads/2018/07/spark-mesos-job-complete-1024x634.png HTTP/1.0" 151587 "https://dzone.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
Each row of the CSV has four fields, datetime, source, type, and log, where the log field holds a raw web server log line like the one shown above.
Because the operations that can be performed on this data are open-ended in complexity, this article focuses on two operations common to all three tools: loading the data and pulling a sample of it.
1. Python Pandas
The first tool we discuss is Pandas. As stated on its website, Pandas is an open source Python data analysis library. It was originally developed at AQR Capital Management starting in April 2008 and open-sourced at the end of 2009; it is now developed and maintained by the PyData development team, which focuses on Python package development, and is part of the PyData project. Pandas began as a financial data analysis tool, so it provides particularly good support for time series analysis.
First, start IPython and do something with the sample data. (Because pandas is a third-party Python library, it must be installed before use; running pip install pandas installs pandas and its related components automatically.)
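The loading step itself did not survive in this copy of the article, so the following is a minimal sketch of what it would look like, assuming the file name from the cloned repository (access_logs.csv), the single-quote field quoting that the PySpark section below relies on, and the four column names visible in the head() output further down:

import pandas as pd

# Column names matching the head() output below; the CSV has no header row.
headers = ["datetime", "source", "type", "log"]

# The log field is wrapped in single quotes, hence quotechar="'".
df = pd.read_csv("access_logs.csv", quotechar="'", names=headers)
print(df)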
After about a second, we get the following reply:
[6844 rows x 4 columns]

In [3]:
As you can see, we have about 7,000 rows of data, and Pandas has found four columns matching the structure described above.
Pandas automatically creates a DataFrame object representing the CSV file; a DataFrame in Pandas can also be stored in a SQL database or written directly to a CSV file. Next, we use the head() function to look at a sample of the data.
In [11]: df.head()
Out[11]:
           datetime source        type                                                log
0  2018-08-01 17:10   www2  www_access  172.68.133.49 - - [01/Aug/2018:17:10:15 +0000]...
1  2018-08-01 17:10   www2  www_access  162.158.255.185 - - [01/Aug/2018:17:10:15 +000...
2  2018-08-01 17:10   www2  www_access  108.162.238.234 - - [01/Aug/2018:17:10:22 +000...
3  2018-08-01 17:10   www2  www_access  172.68.47.211 - - [01/Aug/2018:17:10:50 +0000]...
4  2018-08-01 17:11   www2  www_access  141.101.96.28 - - [01/Aug/2018:17:11:11 +0000]...
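Earlier we noted that a Pandas DataFrame can be stored in a SQL database or a CSV file. As a minimal sketch of both options (the output file name, database file, and table name here are illustrative assumptions, not part of the original walkthrough):

import sqlite3

# Continuing from the df loaded above: write the DataFrame back out as CSV...
df.to_csv("access_logs_copy.csv", index=False)

# ...or store it in a SQL database (an ad-hoc SQLite file here).
with sqlite3.connect("logs.db") as conn:
    df.to_sql("access_logs", conn, if_exists="replace", index=False)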
2. PySpark
The second tool we discuss is PySpark, the big data analysis library of the Apache Spark project.
PySpark provides many features for analyzing big data in Python, and it comes with its own shell that you can run from the command line:
$ pyspark
This loads the pyspark shell:
(python-big-data) [email protected]:~/Development/access-log-data$ pyspark
Python 3.6.5 (default, Apr  1 2018, 05:46:30)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
2018-08-03 18:13:38 WARN  Utils:66 - Your hostname, admintome resolves to a loopback address: 127.0.1.1; using 192.168.1.153 instead (on interface enp0s3)
2018-08-03 18:13:38 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-08-03 18:13:39 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 3.6.5 (default, Apr  1 2018 05:46:30)
SparkSession available as 'spark'.
>>>
When the shell starts, you also get a web GUI for checking the status of your work; browse to http://localhost:4040 to reach the PySpark web GUI.
Let's use PySpark Shell to load the sample data:
dataframe = spark.read.format("csv") \
    .option("header", "false") \
    .option("mode", "DROPMALFORMED") \
    .option("quote", "'") \
    .load("access_logs.csv")
dataframe.show()
PySpark shows us a sample of the DataFrame we just created:
>>> dataframe.show()
+----------------+----+----------+--------------------+
|             _c0| _c1|       _c2|                 _c3|
+----------------+----+----------+--------------------+
|2018-08-01 17:10|www2|www_access|172.68.133.49 - -...|
|2018-08-01 17:10|www2|www_access|162.158.255.185 -...|
|2018-08-01 17:10|www2|www_access|108.162.238.234 -...|
|2018-08-01 17:10|www2|www_access|172.68.47.211 - -...|
|2018-08-01 17:11|www2|www_access|141.101.96.28 - -...|
|2018-08-01 17:11|www2|www_access|141.101.96.28 - -...|
|2018-08-01 17:11|www2|www_access|162.158.50.89 - -...|
|2018-08-01 17:12|www2|www_access|192.168.1.7 - - [...|
|2018-08-01 17:12|www2|www_access|172.68.47.151 - -...|
|2018-08-01 17:12|www2|www_access|192.168.1.7 - - [...|
|2018-08-01 17:12|www2|www_access|141.101.76.83 - -...|
|2018-08-01 17:14|www2|www_access|172.68.218.41 - -...|
|2018-08-01 17:14|www2|www_access|172.68.218.47 - -...|
|2018-08-01 17:14|www2|www_access|172.69.70.72 - - ...|
|2018-08-01 17:15|www2|www_access|172.68.63.24 - - ...|
|2018-08-01 17:18|www2|www_access|192.168.1.7 - - [...|
|2018-08-01 17:18|www2|www_access|141.101.99.138 - ...|
|2018-08-01 17:19|www2|www_access|192.168.1.7 - - [...|
|2018-08-01 17:19|www2|www_access|162.158.89.74 - -...|
|2018-08-01 17:19|www2|www_access|172.68.54.35 - - ...|
+----------------+----+----------+--------------------+
only showing top 20 rows
Once again, we see four columns in the DataFrame that match our schema; a DataFrame here can be thought of as a database table or an Excel spreadsheet.
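Since PySpark assigned generic _c0 through _c3 column names, a small optional step (a sketch, assuming the column order is the same datetime, source, type, log seen in the Pandas section) is to rename them:

# Rename the generic _c0.._c3 columns to the names used in the Pandas section.
dataframe = dataframe.toDF("datetime", "source", "type", "log")
dataframe.show(5)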
3. Python SciKit-Learn
Any discussion of big data leads sooner or later to machine learning, and fortunately Python developers have many options for machine learning algorithms.
Without going into machine learning itself in detail: to run a machine learning algorithm we need suitable training data, and the sample data provided in this article does not work as-is because it is not numeric. We would first need to transform it into a numeric representation, which is beyond the scope of this article. For example, we could aggregate the logs by time to get a DataFrame with two columns: the minute, and the number of log entries in that minute:
+------------------+---+
| 2018-08-01 17:10 | 4 |
+------------------+---+
| 2018-08-01 17:11 | 1 |
+------------------+---+
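A minimal sketch of producing such a per-minute count, assuming the Pandas DataFrame df from section 1, whose datetime column is already at minute resolution:

# Count log entries per minute using the df loaded in section 1.
counts = df.groupby("datetime").size().reset_index(name="count")
print(counts.head())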
With data in this form, we could run machine learning algorithms to predict how many visitors we are likely to get in the future. SciKit-Learn ships with some sample datasets; let's load some sample data to see how it works.
In [1]: from sklearn import datasets

In [2]: iris = datasets.load_iris()

In [3]: digits = datasets.load_digits()

In [4]: print(digits.data)
[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]
This loads two sample datasets, iris and digits, that are commonly used to demonstrate machine learning classification.
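To take one small step further than the article itself does, here is a minimal sketch of training a classifier on the digits dataset just loaded; the SVC parameter value follows the classic scikit-learn tutorial and is an assumption, not something from the original walkthrough:

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# gamma value taken from the scikit-learn tutorial (an assumption here).
clf = svm.SVC(gamma=0.001)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))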
After reading this, the question "what are the three open source data analysis tools of Python" should be answered; to truly master these tools, though, you still need to practice with them yourself.