How to Crawl Company Information from the qcc.com Website with Python


This article explains how to use Python to crawl company information from the qcc.com website. The method introduced here is simple, fast, and practical. Let's walk through it step by step!

Goal: crawl a company's details from the qcc.com website according to the company name entered.

1. Obtain headers.

2. After logging in successfully, query by the entered company name to get the desired content.

3. Parse the retrieved text and collect it into a DataFrame, which can be saved as CSV later.

4. Enter the company name.

5. Finally, execute the code to query the details of all company names in the companys list and save them as CSV.

1. Obtain headers

1. Go to the official qcc.com website, register, and log in.

2. Press F12 to open the developer tools, click the Network tab, and you will see the search request; click it.

There we can find the header we need to copy, which is a critical step. Keep in mind that this header is the one obtained after registering and logging in successfully; once saved, it lets you access the URL repeatedly for a certain period of time.
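For illustration only, the copied headers might take roughly this shape (the key names and values below are placeholders, not real qcc.com credentials; use exactly what your own Network tab shows):

# Hypothetical illustration of the headers copied from the browser after login;
# the actual keys and values come from your own Network tab.
afterLogin_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Cookie': 'the session cookie string copied from the Network tab',
}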

from bs4 import BeautifulSoup
import requests
import time

# Keep the session: create a new session object so the login state is reused.
sess = requests.session()

# Add headers (the header comes from qcc.com: open the login URL, enter your
# account and password, log in, and copy the header shown after login; the
# acquisition method is described at the top of this section).
afterLogin_headers = {'User-Agent': 'the header obtained after login'}

# POST request (represents the login action; after logging in once, it can be
# saved to make the subsequent query requests easier).
login = {'user': 'registered account', 'password': 'password'}
sess.post('https://www.qcc.com', data=login, headers=afterLogin_headers)

This whole block logs in while posing as an ordinary user (a 200 status code in the response indicates a successful login).
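As a minimal sketch (reusing the sess, login, and afterLogin_headers objects defined above), you can capture the response and check the status code yourself:

# Capture the login response and verify the status code.
resp = sess.post('https://www.qcc.com', data=login, headers=afterLogin_headers)
print(resp.status_code)  # 200 indicates the login request succeeded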

2. After logging in successfully, you can query by the entered company name to get the desired content.

def get_company_message(company):
    # Get the content of the search-results page (all of it).
    search = sess.get('https://www.qcc.com/search?key={}'.format(company),
                      headers=afterLogin_headers, timeout=10)
    search.raise_for_status()
    search.encoding = 'utf-8'  # linux utf-8
    soup = BeautifulSoup(search.text, features="html.parser")
    href = soup.find_all('a', {'class': 'title'})[0].get('href')
    time.sleep(4)
    # Get the content of the company detail page (all of it).
    details = sess.get(href, headers=afterLogin_headers, timeout=10)
    details.raise_for_status()
    details.encoding = 'utf-8'  # linux utf-8
    details_soup = BeautifulSoup(details.text, features="html.parser")
    message = details_soup.text
    time.sleep(2)
    return message

The above code performs two steps:

① it searches for the given company;

② it follows the link of the first search result and returns the text content of that page.
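For a quick sanity check (a sketch, assuming the logged-in session above), call the function for one company and preview the returned text:

# Fetch the detail-page text for one company and preview the beginning of it.
text = get_company_message('Shenzhen Tencent Computer System Co., Ltd.')
print(text[:200])  # first 200 characters of the page text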

3. Parse the retrieved text and collect it into a DataFrame, which makes it convenient to save as CSV later.

import pandas as pd

def message_to_df(message, company):
    # The split() keys below are the field labels on the qcc.com detail page
    # (shown here in translation; on the live site they appear in Chinese).
    list_companys = []
    Registration_status = []
    Date_of_Establishment = []
    registered_capital = []
    contributed_capital = []
    Approved_date = []
    Unified_social_credit_code = []
    Organization_Code = []
    companyNo = []
    Taxpayer_Identification_Number = []
    sub_Industry = []
    enterprise_type = []
    Business_Term = []
    Registration_Authority = []
    staff_size = []
    Number_of_participants = []
    sub_area = []
    company_adress = []
    Business_Scope = []
    list_companys.append(company)
    Registration_status.append(message.split('registration status')[1].split('\n')[1].split('date of establishment')[0].replace(' ', ''))
    Date_of_Establishment.append(message.split('date of establishment')[1].split('\n')[1].replace(' ', ''))
    registered_capital.append(message.split('registered capital')[1].split('RMB')[0].replace(' ', ''))
    contributed_capital.append(message.split('paid-up capital')[1].split('RMB')[0].replace(' ', ''))
    Approved_date.append(message.split('approval date')[1].split('\n')[1].replace(' ', ''))
    try:
        credit = message.split('unified social credit code')[1].split('\n')[1].replace(' ', '')
        Unified_social_credit_code.append(credit)
    except IndexError:
        credit = message.split('unified social credit code')[3].split('\n')[1].replace(' ', '')
        Unified_social_credit_code.append(credit)
    Organization_Code.append(message.split('organization code')[1].split('\n')[1].replace(' ', ''))
    companyNo.append(message.split('industrial and commercial registration number')[1].split('\n')[1].replace(' ', ''))
    Taxpayer_Identification_Number.append(message.split('taxpayer identification number')[1].split('\n')[1].replace(' ', ''))
    try:
        sub = message.split('industry')[1].split('\n')[1].replace(' ', '')
        sub_Industry.append(sub)
    except IndexError:
        sub = message.split('industry')[1].split('yes')[1].split(' ')[0]
        sub_Industry.append(sub)
    enterprise_type.append(message.split('enterprise type')[1].split('\n')[1].replace(' ', ''))
    Business_Term.append(message.split('business term')[1].split('registration authority')[0].split('\n')[-1].replace(' ', ''))
    Registration_Authority.append(message.split('registration authority')[1].split('\n')[1].replace(' ', ''))
    staff_size.append(message.split('personnel size')[1].split('people')[0].split('\n')[-1].replace(' ', ''))
    Number_of_participants.append(message.split('number of insured')[1].split('region')[0].replace(' ', '').split('\n')[2])
    sub_area.append(message.split('region')[1].split('\n')[1].replace(' ', ''))
    try:
        adress = message.split('business scope')[0].split('business address')[1].split('view map')[0].split('\n')[2].replace(' ', '')
        company_adress.append(adress)
    except IndexError:
        adress = message.split('business scope')[1].split('business address')[1].split()[0]
        company_adress.append(adress)
    Business_Scope.append(message.split('business scope')[1].split('\n')[1].replace(' ', ''))
    df = pd.DataFrame({
        'company': list_companys,
        'registration status': Registration_status,
        'date of establishment': Date_of_Establishment,
        'registered capital': registered_capital,
        'contributed capital': contributed_capital,
        'approval date': Approved_date,
        'unified social credit code': Unified_social_credit_code,
        'organization code': Organization_Code,
        'industrial and commercial registration number': companyNo,
        'taxpayer identification number': Taxpayer_Identification_Number,
        'industry': sub_Industry,
        'enterprise type': enterprise_type,
        'business term': Business_Term,
        'registration authority': Registration_Authority,
        'personnel size': staff_size,
        'number of insured': Number_of_participants,
        'region': sub_area,
        'business address': company_adress,
        'business scope': Business_Scope})
    return df

This code does plain-text matching on the retrieved page text. It can handle most of the content, but a very few null values may remain; you can rewrite it if you are interested.
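If you want to harden the parsing, one option (a sketch, not the author's code; safe_field is a hypothetical helper) is to wrap each lookup so that a missing label yields an empty string instead of an exception:

def safe_field(message, label, default=''):
    # Hypothetical helper: return the line that follows `label` in the page
    # text, or `default` when the label is missing.
    try:
        return message.split(label)[1].split('\n')[1].replace(' ', '')
    except IndexError:
        return default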

4. Enter the company name

(This is just a demo, so the list below is written by hand. When running your own code, you would normally read the company-name column from your own CSV file and turn it into a list.)

# For testing:
companys = ['Shenzhen Tencent Computer System Co., Ltd.', 'Alibaba (China) Co., Ltd.']
# Actual use:
# df_companys = pd.read_csv('absolute path of your own directory/so-and-so.csv')
# companys = df_companys['company name'].tolist()

5. Finally, execute this code to query the details of all company names in the companys list and save them as CSV.

for company in companys:
    try:
        messages = get_company_message(company)
    except:  # skip companies whose page could not be fetched
        pass
    else:
        df = message_to_df(messages, company)
        if company == companys[0]:
            # First company: create the CSV and write the header row.
            df.to_csv('absolute path to your own directory/xxx.csv', index=False, header=True)
        else:
            # Subsequent companies: append rows without repeating the header.
            df.to_csv('absolute path to your own directory/xxx.csv', mode='a+', index=False, header=False)
    time.sleep(1)

At this point, you can get some detailed information about the two companies.
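To confirm the output (a sketch; the path placeholder matches the one used in the loop above), you can read the CSV back with pandas:

# Read back the CSV that the loop wrote; expect one row per company queried.
result = pd.read_csv('absolute path to your own directory/xxx.csv')
print(result.shape)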

PS: if you encounter an error at soup.find_all('a', {'class': 'title'})[0].get('href'), the site may have updated its page markup; in that case you can update the code by repeating the following inspection.

① Press F12 to enter the developer debugging page.

② Inspect the search-result link: it is an a tag whose class is title in the HTML code. If an error is reported, replace the class name accordingly. For example, if class has been changed to company_title, the code becomes: soup.find_all('a', {'class': 'company_title'})[0].get('href')
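If you want the scraper to survive such renames, one option (a sketch with a hypothetical helper, not the author's code) is to try the known class names in order:

def first_result_href(soup):
    # Hypothetical fallback: try the old and new class names for the result link.
    for cls in ('title', 'company_title'):
        links = soup.find_all('a', {'class': cls})
        if links:
            return links[0].get('href')
    raise ValueError('no search-result link found; the markup may have changed again')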

Finally, note that you need to set the sleep time sensibly when crawling; otherwise the site will detect that a crawler robot is operating and may show a verification pop-up, which interrupts the loop. Also, try not to crawl too much within a short period, or you will likewise be detected.
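One common mitigation (a sketch, not part of the original code) is to randomize the delay between requests instead of sleeping a fixed interval:

import random

# Sleep a random 3-8 seconds between requests to look less like a bot.
time.sleep(random.uniform(3, 8))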

The complete code is pasted below; you can use it as a reference for learning the clever use of BeautifulSoup.

from bs4 import BeautifulSoup
import requests
import time
import pandas as pd

# Keep the session: create a new session object so the login state is reused.
sess = requests.session()

# Add headers (copied from qcc.com after logging in with your account and
# password; the acquisition method is described at the top of this article).
afterLogin_headers = {'User-Agent': 'the header obtained after login'}

# POST request (represents the login action; after logging in once, the
# session can be reused for the query requests that follow).
login = {'user': 'registered account', 'password': 'password'}
sess.post('https://www.qcc.com', data=login, headers=afterLogin_headers)

def get_company_message(company):
    # Get the content of the search-results page (all of it).
    search = sess.get('https://www.qcc.com/search?key={}'.format(company),
                      headers=afterLogin_headers, timeout=10)
    search.raise_for_status()
    search.encoding = 'utf-8'  # linux utf-8
    soup = BeautifulSoup(search.text, features="html.parser")
    href = soup.find_all('a', {'class': 'title'})[0].get('href')
    time.sleep(4)
    # Get the content of the company detail page (all of it).
    details = sess.get(href, headers=afterLogin_headers, timeout=10)
    details.raise_for_status()
    details.encoding = 'utf-8'  # linux utf-8
    details_soup = BeautifulSoup(details.text, features="html.parser")
    message = details_soup.text
    time.sleep(2)
    return message

def message_to_df(message, company):
    # The split() keys below are the field labels on the qcc.com detail page
    # (shown here in translation; on the live site they appear in Chinese).
    list_companys = []
    Registration_status = []
    Date_of_Establishment = []
    registered_capital = []
    contributed_capital = []
    Approved_date = []
    Unified_social_credit_code = []
    Organization_Code = []
    companyNo = []
    Taxpayer_Identification_Number = []
    sub_Industry = []
    enterprise_type = []
    Business_Term = []
    Registration_Authority = []
    staff_size = []
    Number_of_participants = []
    sub_area = []
    company_adress = []
    Business_Scope = []
    list_companys.append(company)
    Registration_status.append(message.split('registration status')[1].split('\n')[1].split('date of establishment')[0].replace(' ', ''))
    Date_of_Establishment.append(message.split('date of establishment')[1].split('\n')[1].replace(' ', ''))
    registered_capital.append(message.split('registered capital')[1].split('RMB')[0].replace(' ', ''))
    contributed_capital.append(message.split('paid-up capital')[1].split('RMB')[0].replace(' ', ''))
    Approved_date.append(message.split('approval date')[1].split('\n')[1].replace(' ', ''))
    try:
        credit = message.split('unified social credit code')[1].split('\n')[1].replace(' ', '')
        Unified_social_credit_code.append(credit)
    except IndexError:
        credit = message.split('unified social credit code')[3].split('\n')[1].replace(' ', '')
        Unified_social_credit_code.append(credit)
    Organization_Code.append(message.split('organization code')[1].split('\n')[1].replace(' ', ''))
    companyNo.append(message.split('industrial and commercial registration number')[1].split('\n')[1].replace(' ', ''))
    Taxpayer_Identification_Number.append(message.split('taxpayer identification number')[1].split('\n')[1].replace(' ', ''))
    try:
        sub = message.split('industry')[1].split('\n')[1].replace(' ', '')
        sub_Industry.append(sub)
    except IndexError:
        sub = message.split('industry')[1].split('yes')[1].split(' ')[0]
        sub_Industry.append(sub)
    enterprise_type.append(message.split('enterprise type')[1].split('\n')[1].replace(' ', ''))
    Business_Term.append(message.split('business term')[1].split('registration authority')[0].split('\n')[-1].replace(' ', ''))
    Registration_Authority.append(message.split('registration authority')[1].split('\n')[1].replace(' ', ''))
    staff_size.append(message.split('personnel size')[1].split('people')[0].split('\n')[-1].replace(' ', ''))
    Number_of_participants.append(message.split('number of insured')[1].split('region')[0].replace(' ', '').split('\n')[2])
    sub_area.append(message.split('region')[1].split('\n')[1].replace(' ', ''))
    try:
        adress = message.split('business scope')[0].split('business address')[1].split('view map')[0].split('\n')[2].replace(' ', '')
        company_adress.append(adress)
    except IndexError:
        adress = message.split('business scope')[1].split('business address')[1].split()[0]
        company_adress.append(adress)
    Business_Scope.append(message.split('business scope')[1].split('\n')[1].replace(' ', ''))
    df = pd.DataFrame({
        'company': list_companys,
        'registration status': Registration_status,
        'date of establishment': Date_of_Establishment,
        'registered capital': registered_capital,
        'contributed capital': contributed_capital,
        'approval date': Approved_date,
        'unified social credit code': Unified_social_credit_code,
        'organization code': Organization_Code,
        'industrial and commercial registration number': companyNo,
        'taxpayer identification number': Taxpayer_Identification_Number,
        'industry': sub_Industry,
        'enterprise type': enterprise_type,
        'business term': Business_Term,
        'registration authority': Registration_Authority,
        'personnel size': staff_size,
        'number of insured': Number_of_participants,
        'region': sub_area,
        'business address': company_adress,
        'business scope': Business_Scope})
    return df

# For testing:
companys = ['Shenzhen Tencent Computer System Co., Ltd.', 'Alibaba (China) Co., Ltd.']
# Actual use:
# df_companys = pd.read_csv('absolute path of your own directory/so-and-so.csv')
# companys = df_companys['company name'].tolist()

for company in companys:
    try:
        messages = get_company_message(company)
    except:  # skip companies whose page could not be fetched
        pass
    else:
        df = message_to_df(messages, company)
        if company == companys[0]:
            df.to_csv('absolute path to your own directory/xxx.csv', index=False, header=True)
        else:
            df.to_csv('absolute path to your own directory/xxx.csv', mode='a+', index=False, header=False)
    time.sleep(1)

At this point, I believe you have a deeper understanding of how to crawl company information from the qcc.com website with Python. Why not try it out in practice yourself?
