What Are the Contents of Python Data Crawling, Analysis, Mining and Distributed Computing
This article focuses on the contents of Python data crawling, analysis, mining, and distributed computing. The methods introduced here are simple, fast, and practical; interested readers may wish to follow along.
01 Data Crawling
1. Background investigation
1) Check robots.txt to see what restrictions the site places on crawling.
2) pip install builtwith; pip install python-whois (a usage sketch follows)
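A brief sketch of how these two packages are typically used for background checks; example.com is a placeholder URL:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# A sketch using builtwith and python-whois (installed via the pip commands above)
import builtwith
import whois

print(builtwith.parse('http://example.com'))  # guess the site's technology stack
print(whois.whois('example.com'))             # registrar / owner information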
2. Data crawling:
1) Dynamically loaded content: use Selenium.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import sys
reload(sys)  # Python 2 idiom, as in the original
sys.setdefaultencoding('utf8')

ulist = []  # collected URLs (initialization assumed; the original elides it)
pnum = 1    # page number (likewise assumed)

driver = webdriver.Chrome("/Users/didi/Downloads/chromedriver")
driver.get('http://xxx')
elem_account = driver.find_element_by_name("UserName")
elem_password = driver.find_element_by_name("Password")
elem_code = driver.find_element_by_name("VerificationCode")
elem_account.clear()
elem_password.clear()
elem_code.clear()
elem_account.send_keys("username")
elem_password.send_keys("pass")
elem_code.send_keys("abcd")
time.sleep(10)
driver.find_element_by_id("btnSubmit").submit()
time.sleep(5)
driver.find_element_by_class_name("txtKeyword").send_keys(u"x")  # simulate a search
driver.find_element_by_class_name("btnSerch").click()
# ... steps omitted
dw = driver.find_elements_by_xpath('//li[@class="min"]/dl/dt/a')
for item in dw:
    url = item.get_attribute('href')
    if url:
        ulist.append(url)
        print(url + "---" + str(pnum))
print("#########")
2) Statically loaded content:
(1) Regular expressions
(2) lxml
(3) bs4
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import requests
from lxml import etree
from bs4 import BeautifulSoup

response = requests.get(url)  # url: the target page, assumed defined by the caller

# (1) Regular expressions
string = r'src="(http://imgsrc\.baidu\.com.+?\.jpg)" pic_ext="jpeg"'  # regular expression string
urls = re.findall(string, response.text)

# (2) lxml
html = etree.HTML(response.content)
res = html.xpath('//div[@class="d_post_content j_d_post_content"]/img[@class="BDE_Image"]/@src')

# (3) bs4
soup = BeautifulSoup(response.text, 'lxml')  # parse the response and create the BeautifulSoup object
urls = soup.find_all('img', 'BDE_Image')
3) Anti-crawling and counter-measures (a hedged sketch follows this list):
(1) Request frequency
(2) Request headers
(3) IP proxies
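A minimal sketch of these three counter-measures: throttling the request frequency, sending a browser-like request header, and routing through an IP proxy. The header string, URLs, and proxy address are placeholders:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import time
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}  # browser-like request header
proxies = {'http': 'http://127.0.0.1:8080'}  # hypothetical IP proxy

for url in ['http://example.com/page1', 'http://example.com/page2']:
    response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    print(response.status_code)
    time.sleep(2)  # throttle the request frequency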
4) Crawler frameworks (a minimal spider sketch follows):
(1) Scrapy
(2) Portia
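As a taste of Scrapy, here is a minimal spider sketch (run with scrapy runspider quotes_spider.py); the spider name, start URL, and CSS selectors are illustrative, using Scrapy's public demo site:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yield one item per quote block on the page
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}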
02 Data Analysis
1. Commonly used data analysis libraries:
NumPy: a library built around vectorized operations.
1) List => Matrix
2) ndim: number of dimensions; shape: number of rows and columns; size: number of elements
SciPy: an extension of NumPy covering advanced mathematics, signal processing, statistics, and so on.
Pandas: a package for quickly building advanced data structures on top of NumPy; its core data structures are Series and DataFrame.
1) NumPy is similar to List; Pandas is similar to Dict.
Matplotlib: a plotting library.
1) A powerful drawing tool
2) Supports scatter plots, line charts, bar charts, and more
Simple example:
pip2 install numpy
>>> import numpy as np
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a ** 2
array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])
pip2 install scipy
>>> import numpy as np
>>> from scipy import linalg
>>> a = np.array([[1, 2], [3, 4]])
pip2 install pandas
>>> import pandas as pd
>>> df = pd.DataFrame({'A': pd.date_range("20170802", periods=5), 'B': pd.Series([11, 22, 33, 44, 55]), 'C': pd.Categorical(["t", "a", "b", "c", "g"])})
>>> df
           A   B  C
0 2017-08-02  11  t
1 2017-08-03  22  a
2 2017-08-04  33  b
3 2017-08-05  44  c
4 2017-08-06  55  g
pip2 install matplotlib
>>> import matplotlib.pyplot as plt
>>> plt.plot([1, 2, 3])
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.ylabel("didi")
>>> plt.show()
2. Advanced data analysis library:
Scikit-learn: a machine learning framework.
Scikit-learn's algorithm cheat-sheet flowchart (not reproduced here) starts from the sample size: with fewer than 50 samples you need to collect more data; with 50 or more you move on to a classifier, and so on down the branches.
The cheat sheet divides the algorithms into four categories: classification, regression, clustering, and dimensionality reduction.
KNN:
#!/usr/local/bin/python
# -*- coding: utf-8 -*-
'''
Predict Iris species: https://en.wikipedia.org/wiki/Iris_flower_data_set
'''
# Import modules
from __future__ import print_function
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Create data
iris = datasets.load_iris()
iris_X = iris.data    # sepal length and width, petal length and width
iris_y = iris.target  # species of flower: 0, 1, 2
print(iris_X)
print(iris_y)
print(iris.target_names)

# Define model - train model - predict
X_train, X_test, y_train, y_test = train_test_split(iris_X, iris_y, test_size=0.1)  # hold out 10% as test data
knn = KNeighborsClassifier()    # create the KNN classifier
knn.fit(X_train, y_train)       # train on the data
predicts = knn.predict(X_test)  # get the predictions

# Compare the results
print("#########")
print(X_test)
print(predicts)
print(y_test)
# Calculate the prediction accuracy
print(knn.score(X_test, y_test))

[[ 5.   3.3  1.4  0.2]
 [ 5.   3.5  1.3  0.3]
 [ 6.7  3.1  5.6  2.4]
 [ 5.8  2.7  3.9  1.2]
 [ 6.   2.2  5.   1.5]
 [ 6.   3.   4.8  1.8]
 [ 6.3  2.5  5.   1.9]
 [ 5.   3.6  1.4  0.2]
 [ 5.6  2.9  3.6  1.3]
 [ 6.9  3.2  5.7  2.3]
 [ 4.9  3.   1.4  0.2]
 [ 5.9  3.   4.2  1.5]
 [ 4.8  3.   1.4  0.1]
 [ 5.1  3.4  1.5  0.2]
 [ 4.7  3.2  1.6  0.2]]
[0 0 2 1 1 2 2 0 1 2 0 1 0 0 0]
[0 0 2 1 2 2 2 0 1 2 0 1 0 0 0]
0.933333333333
Linear Regression:
#!/usr/local/bin/python
# -*- coding: utf-8 -*-
'''
Boston house price trends
'''
# Import modules
from __future__ import print_function
from sklearn import datasets
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Create data
loaded_data = datasets.load_boston()  # Boston house prices
data_X = loaded_data.data
data_y = loaded_data.target
print(data_X)
print(data_y)

# Define model - train model - predict
model = LinearRegression()  # linear regression
model.fit(data_X, data_y)   # train on the data
print(model.predict(data_X[:4, :]))  # get the predicted values
print(data_y[:4])  # the actual values
print("#########")

X, y = datasets.make_regression(n_samples=100, n_features=1, noise=10)  # generate 100 regression samples, each with one feature, plus Gaussian noise
plt.scatter(X, y)  # scatter plot
plt.show()
03 Data Mining
1. Mining keywords:
Algorithm involved: TF-IDF
news.txt:
DiDi reached a strategic cooperation with Taxify, a leading mobility company in Europe and Africa, to support cross-regional transportation technology innovation.
2017-08-01 DiDi [Beijing, China / Tallinn, Estonia, August 1, 2017] DiDi today announced a strategic cooperation with Taxify, a mobile travel leader in Europe and Africa. Didi will support Taxify in deeper market expansion and technological innovation across multiple markets through cooperation in investment and in intelligent transportation technology research and development. DiDi is the world's leading mobile travel platform. Relying on artificial intelligence technology, Didi provides diversified travel services, including taxi, chauffeur, express, luxury car, and carpooling services, for more than 400 million users in over 400 cities. While providing flexible employment and income opportunities for more than 17 million drivers, Didi also uses artificial intelligence technology to help city managers build integrated and sustainable smart transportation solutions. Founded in Estonia in 2013, Taxify is the fastest growing mobile travel company in Europe and Africa. Its taxi and private-car sharing service network currently covers central cities in Europe, Africa, and West Asia, reaching 18 countries, including Hungary, Romania, Poland, the Baltic states, South Africa, Nigeria, and Kenya, with more than 2.5 million users. "Taxify provides high-quality and innovative travel services in diversified markets," said Cheng Wei, founder and CEO of DiDi. "We are all committed to using the power of mobile Internet technology to meet rapidly evolving consumer needs and to help transform and upgrade the traditional transportation industry. I believe this cooperation will contribute to building cross-regional smart transportation links between the Asian, European, and African markets."
Marcus Villig, founder and CEO of Taxify, said: "Taxify will take advantage of this strategic partnership to consolidate our leading position in the core markets of Europe and Africa. We believe Didi is the ideal partner that can help us become the most efficient travel choice in Europe and Africa."
#!/usr/local/bin/python
# -*- coding: utf-8 -*-
'''
Analyze the article's keywords
'''
import os
import codecs
import pandas
import jieba
import jieba.analyse

# Prepare the data frame
tagDF = pandas.DataFrame(columns=['filePath', 'content', 'tag1', 'tag2', 'tag3', 'tag4', 'tag5'])
try:
    with open('./houhuiyang/news.txt', 'r') as f:  # load the corpus
        content = f.read().strip()
        tags = jieba.analyse.extract_tags(content, topK=5)  # TF-IDF
        tagDF.loc[len(tagDF)] = ["./news.txt", content, tags[0], tags[1], tags[2], tags[3], tags[4]]
    print(tagDF)
except Exception, ex:  # Python 2 syntax, as in the original
    print(ex)
The Top 5 keywords extracted from the article: Chuxing (travel), Didi, Taxify, Europe and Africa, transportation.
2. Sentiment analysis
1) The simplest approach is based on a sentiment dictionary (a sketch follows this list).
2) The more sophisticated approach is based on machine learning.
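As a minimal illustration of 1), a sketch of dictionary-based scoring; the word lists and the sample sentence are hypothetical stand-ins for a real sentiment lexicon:
# -*- coding: utf-8 -*-
POSITIVE = {"good", "great", "excellent", "convenient"}  # hypothetical lexicon
NEGATIVE = {"bad", "slow", "terrible", "expensive"}

def sentiment_score(text):
    words = text.lower().split()
    score = sum(1 for w in words if w in POSITIVE) - sum(1 for w in words if w in NEGATIVE)
    return score  # > 0 positive, < 0 negative, 0 neutral

print(sentiment_score("the service is great but a bit expensive"))  # prints 0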
pip2 install nltk
>>> import nltk
>>> from nltk.corpus import stopwords  # stop words
>>> nltk.download()  # install the corpora
>>> t = "Didi is a travel company"
>>> word_list = nltk.word_tokenize(t)
>>> filtered_words = [word for word in word_list if word not in stopwords.words('english')]
>>> filtered_words
['Didi', 'travel', 'company']
>>> nltk.download('stopwords')  # download just the stop words corpus
Differences between Chinese and English NLP word segmentation:
1) Heuristic methods
2) Machine learning / statistical methods: HMM, CRF
Processing flow (see the NLTK sketch below): raw_text -> tokenize [pos tag] -> lemma / stemming [pos tag] -> stopwords -> word_list
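A sketch of that English pipeline using NLTK; it assumes the punkt, averaged_perceptron_tagger, wordnet, and stopwords corpora have been downloaded, and the sample sentence is illustrative:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

raw_text = "Didi provides diversified travel services for riders"
tokens = nltk.word_tokenize(raw_text)  # tokenize
tagged = nltk.pos_tag(tokens)          # pos tag
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(w) for w, tag in tagged]  # lemma
word_list = [w for w in lemmas if w.lower() not in stopwords.words('english')]  # stopwords
print(word_list)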
04 Python Distributed Computing
pip2 install mrjob; pip2 install pyspark
1) Python multithreading
2) Python multiprocessing [multiprocessing]
3) The global interpreter lock (GIL)
4) Queue for inter-process communication
5) Process pool: Pool (a sketch follows the mapper/reducer code below)
6) Python's higher-order functions: map / reduce / filter
7) MapReduce over a Linux pipeline [cat word.log | python mapper.py | python reducer.py | sort -k 2 -r]
word.log:
Beijing Chengdu Shanghai Beijing Shanxi Tianjin Guangzhou
#!/usr/local/bin/python
# -*- coding: utf-8 -*-
'''
mapper
'''
import sys
try:
    for lines in sys.stdin:
        line = lines.split()
        for word in line:
            if len(word.strip()) == 0:
                continue
            count = "%s,%d" % (word.strip(), 1)
            print(count)
except IOError, ex:
    print(ex)

#!/usr/local/bin/python
# -*- coding: utf-8 -*-
'''
reducer
'''
import sys
try:
    word_dict = {}
    for lines in sys.stdin:
        line = lines.split(",")
        if len(line) != 2:
            continue
        word_dict.setdefault(line[0], 0)
        word_dict[line[0]] += int(line[1])
    for key, val in word_dict.items():
        stat = "%s %d" % (key, val)
        print(stat)
except IOError, ex:
    print(ex)
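As noted in item 5) above, a minimal sketch of the process pool: a multiprocessing Pool maps a function over the inputs in parallel, and a plain loop performs the reduce step; the word list is illustrative:
# -*- coding: utf-8 -*-
from multiprocessing import Pool

def count_one(word):
    return (word, 1)

if __name__ == '__main__':
    words = "Beijing Chengdu Shanghai Beijing".split()
    pool = Pool(4)                      # a pool of 4 worker processes
    pairs = pool.map(count_one, words)  # map step, spread across the processes
    pool.close()
    pool.join()
    totals = {}
    for word, n in pairs:               # reduce step: aggregate the counts
        totals[word] = totals.get(word, 0) + n
    print(totals)                       # {'Beijing': 2, 'Chengdu': 1, 'Shanghai': 1}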
05 Neural Networks
Both TensorFlow and PyTorch ship CPU and GPU versions.
1) Neural networks built with TensorFlow are static.
2) Neural networks built with PyTorch (http://pytorch.org/#pip-install-pytorch) are dynamic [Torch is written in Lua; PyTorch is the Python version].
Simply put, the data types:
A scalar (Scalar) is a quantity with only magnitude and no direction, such as 1, 2, 3, and so on.
A vector (Vector) is a quantity with magnitude and direction; in practice, a string of numbers, such as (1, 2).
A matrix (Matrix) is a pile of numbers formed by stacking several vectors as rows, such as [1, 2; 3, 4].
A tensor (Tensor) generalizes this to numbers arranged along any number of dimensions: a matrix is just a two-dimensional slice of a three-dimensional tensor, and locating a scalar inside a three-dimensional tensor requires coordinates along three dimensions.
Both TensorFlow and PyTorch use the tensor as the data structure that represents all data.
# -*- coding: UTF-8 -*-
# author: houhuiyang
import torch
import numpy as np
from torch.autograd import Variable
import torch.nn.functional as F
import matplotlib.pyplot as plt

np_data = np.arange(6).reshape((2, 3))
torch_data = torch.from_numpy(np_data)
tensor2np = torch_data.numpy()
print(
    "\nnp_data", np_data,        # matrix
    "\ntorch_data", torch_data,  # tensor
    "\ntensor to numpy", tensor2np
)

# data = [-1, -2, 1, 2, 3]
data = [[1, 2], [3, 4]]
tensor = torch.FloatTensor(data)
# abs / sin / cos / mean / matmul / mm
print(
    "\nnumpy", np.matmul(data, data),
    "\ntorch", torch.mm(tensor, tensor)
)

# tensor variable
tensor_v = torch.FloatTensor([[1, 2], [3, 4]])
variable = Variable(tensor_v, requires_grad=True)
# compute the means
t_out = torch.mean(tensor_v * tensor_v)  # x^2
v_out = torch.mean(variable * variable)
print(tensor_v, variable, t_out, v_out)
# backpropagation
v_out.backward()      # backward pass
print(variable.grad)  # gradient

'''
y = Wx        linear
y = AF(Wx)    nonlinear [activation functions: relu / sigmoid / tanh]
'''
x = torch.linspace(-5, 5, 200)  # take 200 points from -5 to 5
x = Variable(x)
x_np = x.data.numpy()
y_relu = F.relu(x).data.numpy()
y_sigmoid = F.sigmoid(x).data.numpy()
y_tanh = F.tanh(x).data.numpy()
# y_softplus = F.softplus(x).data.numpy()  # probability plot

plt.figure(1, figsize=(8, 6))
plt.plot(x_np, y_relu, c="red", label="relu")
plt.ylim(-1, 5)
plt.legend(loc="best")
plt.show()
# plt.subplot(222)
plt.plot(x_np, y_sigmoid, c="red", label="sigmoid")
plt.ylim(-0.2, 1.2)
plt.legend(loc="best")
plt.show()
# plt.subplot(223)
plt.plot(x_np, y_tanh, c="red", label="tanh")
plt.ylim(-1.2, 1.2)
plt.legend(loc="best")
plt.show()
Build a simple neural network
# -*- coding: UTF-8 -*-
# author: Heart of Watch
'''
Regression / classification
'''
import torch
from torch.autograd import Variable
import torch.nn.functional as F  # activation functions
import matplotlib.pyplot as plt

x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)  # unsqueeze: turn the 1-D tensor into 2-D
y = x.pow(2) + 0.2 * torch.rand(x.size())
x, y = Variable(x), Variable(y)
# print(x)
# print(y)
# plt.scatter(x.data.numpy(), y.data.numpy())
# plt.show()

class Net(torch.nn.Module):  # inherit from torch's Module
    def __init__(self, n_features, n_hidden, n_output):
        super(Net, self).__init__()  # inherit torch's __init__
        self.hidden = torch.nn.Linear(n_features, n_hidden)  # linear output of the hidden layer
        self.predict = torch.nn.Linear(n_hidden, n_output)   # linear output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))  # activation function
        x = self.predict(x)         # output value
        return x

net = Net(1, 10, 1)  # 1 input value, 10 hidden-layer neurons, 1 output value
print(net)  # print the network structure

plt.ion()
plt.show()

# Training tools
optimizer = torch.optim.SGD(net.parameters(), lr=0.5)  # pass in all of net's parameters; lr is the learning rate
loss_func = torch.nn.MSELoss()  # mean squared error
print(net.parameters())

for t in range(100):
    prediction = net(x)              # feed the training data x to net to get predictions
    loss = loss_func(prediction, y)  # compute the error
    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if t % 5 == 0:
        plt.cla()
        plt.scatter(x.data.numpy(), y.data.numpy())
        plt.plot(x.data.numpy(), prediction.data.numpy(), "r-", lw=5)
        plt.text(0.5, 0, 'Loss=%.4f' % loss.data[0], fontdict={'size': 20, 'color': 'red'})  # loss.data[0]: PyTorch 0.x idiom, as in the original
        plt.pause(0.1)

plt.ioff()
plt.show()
06 Mathematics
1. Limits:
Orders of infinitesimals
2. Differential calculus:
Derivatives:
1) The derivative is the slope of the curve, reflecting how quickly the curve changes.
2) The second derivative reflects how the slope itself changes, showing the convexity or concavity of the curve.
Taylor series approximation
Newton's method and gradient descent (a gradient descent sketch follows)
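As a minimal sketch of gradient descent, here is the idea applied to the illustrative function f(x) = (x - 3)^2 with derivative f'(x) = 2(x - 3); the function, starting point, and learning rate are arbitrary choices for the example:
def grad(x):
    return 2 * (x - 3)  # derivative of f(x) = (x - 3)^2

x = 0.0   # starting point
lr = 0.1  # learning rate (step size)
for _ in range(100):
    x -= lr * grad(x)  # step downhill along the negative gradient
print(x)  # converges toward the minimum at x = 3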
3. Jensen's inequality:
Convex functions; Jensen's inequality (stated below)
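For reference, Jensen's inequality states that for a convex function f and a random variable X, f(E[X]) <= E[f(X)]; for a concave function the inequality is reversed.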
Probability theory:
1. Integral calculus:
The Newton-Leibniz formula
2. Probability spaces
Random variables and probability: the integral of the probability density function; conditional probability; conjugate distributions
Probability distributions (a sampling sketch follows this list):
1) Two-point / Bernoulli distribution
2) Binomial distribution
3) Poisson distribution
4) Uniform distribution
5) Exponential distribution
6) Normal / Gaussian distribution
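All of the distributions in this list can be sampled with NumPy's random module, as in the short sketch below; the parameter values are arbitrary examples:
import numpy as np

print(np.random.binomial(1, 0.5, 5))   # two-point / Bernoulli (binomial with n=1)
print(np.random.binomial(10, 0.5, 5))  # binomial
print(np.random.poisson(3.0, 5))       # Poisson
print(np.random.uniform(0, 1, 5))      # uniform
print(np.random.exponential(1.0, 5))   # exponential
print(np.random.normal(0, 1, 5))       # normal / Gaussian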
3. The law of large numbers and the central limit theorem
Linear algebra:
1) Matrices
2) Linear regression
At this point, I believe you have a deeper understanding of Python data crawling, analysis, mining, and distributed computing. You might as well try it out in practice.