This article explains how to use Python to collect form data from an entire site. The content is simple, clear, and easy to follow; read along with the editor's line of thought to study how to use Python to collect the whole site's form data.
Target analysis
The website I was given is this one: https://www.ctic.org/crm?tdsourcetag=s_pctim_aiomsg
Open it like this:
Based on my recent experience learning crawlers, the usual routine is to add parameters such as year, month, and day to the URL, fetch the response with requests.get, and parse the HTML. So this time it should be about the same: apart from figuring out how to get the specific years, place names, and crop names, the rest could reuse my previous code with minor changes. Nothing challenging; life really is boring.
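For reference, here is a minimal sketch of that usual routine. The URL and the query parameters (year, month, day) are purely hypothetical and are not part of the target site:

import requests
from lxml import etree

# Hypothetical example of the usual approach: parameters go straight into the URL.
params = {'year': 2011, 'month': 6, 'day': 2}   # made-up parameters
resp = requests.get('https://example.com/data', params=params, timeout=15)
html = etree.HTML(resp.text)
rows = html.xpath('//table//tr')                # then parse whatever table the page returns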
After clicking View Summary, the target page looks like this.
The data in that big table is the target data, and it doesn't seem to be a big deal--
Something is wrong
The URL of the page where the target data lives is https://www.ctic.org/crm/?action=result. The parameters I just selected do not appear as query parameters in the URL at all! And since both the address and the page itself changed, it isn't AJAX either.
This is very different from what I imagined.
Try to get the target page
Let me take a look at what happens when the View Summary button is clicked: right-click View Summary and choose Inspect, like this:
To be honest, this is the first time I've had to submit a form. In the past I was lucky enough that GET could handle everything; today I finally ran into a POST.
Click View Summary and look at the first item in the Network panel of DevTools:
Without further ado, let's just give POST a try.
import requests

url = 'https://www.ctic.org/crm?tdsourcetag=s_pctim_aiomsg'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/74.0.3729.131 Safari/537.36',
           'Host': 'www.ctic.org'}
data = {'_csrf': 'SjFKLWxVVkkaSRBYQWYYCA1TMG8iYR8ReUYcSj04Jh5EBzIdBGwmLw==',
        'CRMSearchForm[year]': '2011',
        'CRMSearchForm[format]': 'Acres',
        'CRMSearchForm[area]': 'County',
        'CRMSearchForm[region]': 'Midwest',
        'CRMSearchForm[state]': 'IL',
        'CRMSearchForm[county]': 'Adams',
        'CRMSearchForm[crop_type]': 'All',
        'summary': 'county'}
response = requests.post(url, data=data, headers=headers)
print(response.status_code)
Sure enough, the output was 400. I guessed this was the legendary cookies at work. Having only read as far as Chapter 6 of the Python 3 web crawler book, I felt a bit unsure but was eager to try.
Honestly, I didn't know exactly what cookies are; I just knew they are used to maintain a session, and that they should come from the first GET. Let's take a look first:
response1 = requests.get(url, headers=headers)
if response1.status_code == 200:
    cookies = response1.cookies
    print(cookies)
Output:
Hmm, I can't make sense of it. Never mind reading it, just pass it straight to the POST:
response2 = requests.post(url, data=data, headers=headers, cookies=cookies)
print(response2.status_code)
The atmosphere suddenly got a little tense. I gave you the cookies; what more do you want?!
Suddenly I noticed something: the _csrf in the data sent with the POST request had looked suspicious from the start. Where had I seen it before?
There is a _csrf in those cookies I couldn't make sense of either! But the two _csrf values are obviously structured differently, and sure enough, simply swapping the _csrf in data for the _csrf from the cookies didn't work.
But an idea gradually formed: although the two _csrf values are not equal, they should match each other. The data I used just now came from the browser, while the cookies came from the Python program, so of course they don't match!
So I pressed Ctrl+F in the browser's DevTools and searched for it. Hey, it turned up in three places.
The csrf_token on the line right after the first occurrence is clearly the _csrf sent in the POST data, and the other two occurrences are inside JS functions. Even without having studied JS seriously, you can see that those functions fetch the state and county names via POST requests. Bingo! Two problems solved at once.
To verify the conjecture, I planned to use requests to fetch the HTML and cookies of the page before clicking View Summary, extract the csrf_token from that HTML, and use it as the _csrf value in the POST data for the View Summary request, sending it together with the cookies. That way the two _csrf values should match:
from lxml import etree

response1 = requests.get(url, headers=headers)
cookies = response1.cookies
html = etree.HTML(response1.text)
csrf_token = html.xpath('/html/head/meta[3]/@content')[0]
data.update({'_csrf': csrf_token})
response2 = requests.post(url, data=data, headers=headers, cookies=cookies)
print(response2.status_code)
It output 200. That differs from the 302 Chrome showed, but it still indicates success, so never mind. Writing response2.text to an HTML file and opening it looks like this:
Yes, the data is all here! That means my conjecture was right. Next, try requests.Session(), which I had never used before, to maintain the session and handle cookies automatically; a sketch of the idea follows.
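As a rough sketch of that idea (reusing the url, headers, and data variables defined above): a Session carries the cookies from the first GET into the POST automatically, so only the csrf_token still has to be pulled out of the HTML by hand.

session = requests.Session()
resp1 = session.get(url, headers=headers)        # cookies are stored on the session
csrf_token = etree.HTML(resp1.text).xpath('/html/head/meta[3]/@content')[0]
data.update({'_csrf': csrf_token})
resp2 = session.post(url, data=data, headers=headers)   # cookies are sent automatically
print(resp2.status_code)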
Try the pandas library to extract the web page table
Now that I have the HTML of the target page, before going after all the years, regions, states, and counties, let's test pandas.read_html's ability to extract tables from a web page.
I had spotted pandas.read_html in the IDE's autocomplete drop-down while writing code before and always wanted to try it. Today I finally took the opportunity to pull it out:
import pandas as pd

df = pd.read_html(response2.text)[0]
print(df)
Output:
Yes! Got it. It really is more convenient than hand-writing the extraction myself, and numeric strings are automatically converted to numbers. Excellent!
Prepare all parameters
The next step is to get all the years, regions, states, and counties. The years and regions are written directly in the HTML and can be obtained with XPath:
According to the two JS functions found earlier, the state and county names have to be obtained via POST requests: states are looked up by region name, and counties by state name, so a two-layer loop is needed.
def new():
    session = requests.Session()
    response = session.get(url=url, headers=headers)
    html = etree.HTML(response.text)
    return session, html

session, html = new()
years = html.xpath('//*[@id="crmsearchform-year"]/option/text()')
regions = html.xpath('//*[@id="crmsearchform-region"]/option/text()')
_csrf = html.xpath('/html/head/meta[3]/@content')[0]
region_state = {}
state_county = {}
# url_state and url_county are the two endpoints called by the JS functions found earlier
for region in regions:
    data = {'region': region, '_csrf': _csrf}
    response = session.post(url_state, data=data)
    html = etree.HTML(response.json())
    region_state[region] = {x: y for x, y in zip(html.xpath('//option/@value'),
                                                 html.xpath('//option/text()'))}
    for state in region_state[region]:
        data = {'state': state, '_csrf': _csrf}
        response = session.post(url_county, data=data)
        html = etree.HTML(response.json())
        state_county[state] = html.xpath('//option/@value')
Tsk tsk, with requests.Session there's no need to manage cookies at all. Convenient! I won't paste the actual state and county names here; there are simply too many. Next, organize all possible combinations of year, region, state, and county into a CSV file, so the POST data dictionaries can later be constructed directly from that CSV:
remain = [[str(year), str(region), str(state), str(county)]
          for year in years
          for region in regions
          for state in region_state[region]
          for county in state_county[state]]
remain = pd.DataFrame(remain, columns=['CRMSearchForm[year]',
                                       'CRMSearchForm[region]',
                                       'CRMSearchForm[state]',
                                       'CRMSearchForm[county]'])
remain.to_csv('remain.csv', index=False)
# Because each state has both an abbreviation and a full name,
# also save the region -> state mapping locally.
import json
with open('region_state.json', 'w') as json_file:
    json.dump(region_state, json_file, indent=4)
A quick look showed 49,473 rows, which means at least 49,473 POST requests have to be sent to crawl all the data; doing it by hand would take roughly ten times that many clicks.
Official start
So let the crawling begin.
import time
import pyodbc

with open("region_state.json") as json_file:
    region_state = json.load(json_file)
data = pd.read_csv('remain.csv')
# Read what has already been crawled
cnxn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
                      'DBQ=./ctic_crm.accdb')
crsr = cnxn.cursor()
crsr.execute('select Year_, Region, State, County from ctic_crm')
done = crsr.fetchall()
done = [list(x) for x in done]
done = pd.DataFrame(done, columns=['CRMSearchForm[year]',
                                   'CRMSearchForm[region]',
                                   'CRMSearchForm[state]',
                                   'CRMSearchForm[county]'])
done['CRMSearchForm[year]'] = done['CRMSearchForm[year]'].astype('int64')
state2st = {y: x for z in region_state.values() for x, y in z.items()}
done['CRMSearchForm[state]'] = [state2st[x] for x in done['CRMSearchForm[state]']]
# Exclude what has already been crawled
remain = data.append(done)
remain = remain.drop_duplicates(keep=False)
total = len(remain)
print(f'{total} rows left to crawl')
del data

remain['CRMSearchForm[year]'] = remain['CRMSearchForm[year]'].astype('str')
columns = ['Crop',
           'Total_Planted_Acres',
           'Conservation_Tillage_No_Till',
           'Conservation_Tillage_Ridge_Till',
           'Conservation_Tillage_Mulch_Till',
           'Conservation_Tillage_Total',
           'Other_Tillage_Practices_Reduced_Till15_30_Residue',
           'Other_Tillage_Practices_Conventional_Till0_15_Residue']
fields = ['Year_', 'Units', 'Area', 'Region', 'State', 'County'] + columns
data = {'CRMSearchForm[format]': 'Acres',
        'CRMSearchForm[area]': 'County',
        'CRMSearchForm[crop_type]': 'All',
        'summary': 'county'}
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/74.0.3729.131 Safari/537.36',
           'Host': 'www.ctic.org',
           'Upgrade-Insecure-Requests': '1',
           'DNT': '1',
           'Connection': 'keep-alive'}
url = 'https://www.ctic.org/crm?tdsourcetag=s_pctim_aiomsg'
headers2 = headers.copy()
headers2.update({'Referer': url,
                 'Origin': 'https://www.ctic.org'})

def new():
    session = requests.Session()
    response = session.get(url=url, headers=headers)
    html = etree.HTML(response.text)
    _csrf = html.xpath('/html/head/meta[3]/@content')[0]
    return session, _csrf

session, _csrf = new()
for _, row in remain.iterrows():
    temp = dict(row)
    data.update(temp)
    data.update({'_csrf': _csrf})
    while True:
        try:
            response = session.post(url, data=data, headers=headers2, timeout=15)
            break
        except Exception as e:
            session.close()
            print(e)
            print('\nSleep 30s.\n')
            time.sleep(30)
            session, _csrf = new()
            data.update({'_csrf': _csrf})
    df = pd.read_html(response.text)[0].dropna(how='all')
    df.columns = columns
    df['Year_'] = int(temp['CRMSearchForm[year]'])
    df['Units'] = 'Acres'
    df['Area'] = 'County'
    df['Region'] = temp['CRMSearchForm[region]']
    df['State'] = region_state[temp['CRMSearchForm[region]']][temp['CRMSearchForm[state]']]
    df['County'] = temp['CRMSearchForm[county]']
    df = df.reindex(columns=fields)
    for record in df.itertuples(index=False):
        tuple_record = tuple(record)
        sql_insert = f'INSERT INTO ctic_crm VALUES {tuple_record}'
        sql_insert = sql_insert.replace('nan,', 'null,')
        crsr.execute(sql_insert)
        crsr.commit()
    print(total, row.to_list())
    total -= 1
else:
    print('Done.')
    crsr.close()
    cnxn.close()
Note the try...except statement in the middle: Connection aborted errors occur from time to time, sometimes after 9,000 requests and sometimes after just one. That is why I added the steps to read and exclude what has already been crawled, and, worried about being identified as a crawler, fleshed out the headers (it didn't seem to help) and made the script pause for 30 seconds and open a new session after each failure.
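Stripped of the database and parsing details, that retry logic boils down to the following loop, a condensed restatement of the code above using the same session, url, data, headers2, and new() names:

import time

while True:
    try:
        response = session.post(url, data=data, headers=headers2, timeout=15)
        break                          # success: leave the retry loop
    except Exception as e:
        session.close()                # drop the broken session
        print(e)
        time.sleep(30)                 # wait 30 seconds before retrying
        session, _csrf = new()         # open a fresh session and re-fetch the token
        data.update({'_csrf': _csrf})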
Then I left the program running over a weekend, and the command line finally printed the finishing message. Looking at the 816,288 records in Access, I thought: next time, try multithreading and a proxy pool.
Thank you for reading. The above covers how to use Python to collect form data from an entire site. After studying this article, I believe you have a deeper understanding of the topic, though the specifics still need to be verified in practice. The editor will keep pushing more articles on related topics; welcome to follow along!