This article is about how to implement a bs4-based crawler for AI-related jobs on Lagou. The editor thinks it is very practical, so it is shared here for you to learn from. I hope you get something out of it after reading. Let's take a look.
At the beginning of the year many of us are job-hopping, and watching the people around you leave one by one is a little sad. Everyone has their own ambitions, so no further comment. This article is mainly about how to scrape AI-related job data from Lagou. The principle is the same for scraping any other kind of position: once you understand this, you can scrape everything else. The whole thing takes less than 100 lines of code. The main fields captured are "position name", "monthly salary", "company name", "company industry", "basic job requirements (experience, education)" and "job description". The positions covered include "natural language processing", "machine learning", "deep learning", "artificial intelligence", "data mining", "algorithm engineer", "machine vision", "speech recognition", "image processing" and so on.
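To make the "same principle for other jobs" point concrete, here is a minimal sketch of how the listing URL is built for any keyword and page number, based on the URL pattern used in the crawler below (the keyword slug here is only an illustration):

    # Lagou listing pages follow https://www.lagou.com/zhaopin/<keyword>/<page>/
    keyword = 'shenduxuexi'   # pinyin slug for "deep learning" (illustrative)
    for page in range(1, 4):  # first three listing pages
        rooturl = 'https://www.lagou.com/zhaopin/{}/{}/'.format(keyword, page)
        print(rooturl)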
Here's a random screenshot showing the information we want.
Next, let's see where the information we need sits on the page.
Then the position detail page: its URL is in that href, so the key is simply to grab that href and we are done.
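As a quick illustration of pulling that href out with BeautifulSoup (the HTML snippet here is made up for the example, but the position_link class matches what the crawler below looks for):

    from bs4 import BeautifulSoup

    # Hypothetical fragment of a listing page, for illustration only.
    snippet = '<a class="position_link" href="https://www.lagou.com/jobs/123456.html"><h4>NLP Engineer</h4></a>'
    soup = BeautifulSoup(snippet, "lxml")
    link = soup.find('a', attrs={'class': 'position_link'})['href']
    print(link)  # https://www.lagou.com/jobs/123456.html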
Let's go straight to the code.
First of all, we need a way to determine whether a URL is valid; that is the isurl method.
The urlhelper method fetches the HTML content of a URL and logs a warning message when an exception occurs.
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import requests
from collections import OrderedDict
from tqdm import tqdm, trange
from urllib import error
import logging

logging.basicConfig(level=logging.WARNING)


def isurl(url):
    # a URL is considered valid if it answers with HTTP 200
    if requests.get(url).status_code == 200:
        return True
    else:
        return False


def urlhelper(url):
    # fetch the raw HTML of a page, logging a warning on failure
    try:
        req = urllib.request.Request(url)
        req.add_header("User-Agent",
                       "Mozilla/5.0 (Windows NT 6.1; WOW64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/45.0.2454.101 Safari/537.36")
        req.add_header("Accept", "*/*")
        req.add_header("Accept-Language", "zh-CN,zh;q=0.8")
        data = urllib.request.urlopen(req)
        html = data.read().decode('utf-8')
        return html
    except error.URLError as e:
        logging.warning("{}".format(e))
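A quick usage sketch of the two helpers (the listing URL here is just an example; any reachable page would do):

    testurl = 'https://www.lagou.com/zhaopin/shenduxuexi/1/'  # example listing page
    if isurl(testurl):
        html = urlhelper(testurl)
        if html:
            print(len(html))  # rough sanity check that some HTML came back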
The following is the main program of the crawler. Pay attention to the exception handling; it is very important, because if the crawl dies halfway and everything scraped so far has not been saved, it is a tragedy. The other thing to say is that the BeautifulSoup class is really convenient; using it skillfully saves a lot of time.
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import requests
from collections import OrderedDict
from tqdm import tqdm, trange
from urllib import error
import logging

# pinyin slugs for: NLP, machine learning, deep learning, artificial intelligence,
# data mining, algorithm engineer, machine vision, speech recognition, image processing
names = ['ziranyuyanchuli', 'jiqixuexi', 'shenduxuexi', 'rengongzhineng',
         'shujuwajue', 'suanfagongchengshi', 'jiqishijue', 'yuyinshibie',
         'tuxiangchuli']

for name in tqdm(names):
    savedata = []
    page_number = 0
    for page in range(1, 31):
        page_number += 1
        if page_number % 5 == 0:
            print(page_number)
        rooturl = 'https://www.lagou.com/zhaopin/{}/{}/'.format(name, page)
        if not isurl(rooturl):
            continue
        html = urlhelper(rooturl)
        soup = BeautifulSoup(html, "lxml")
        resp = soup.findAll('div', attrs={'class': 's_position_list'})
        resp = resp[0]
        resp = resp.findAll('li', attrs={'class': 'con_list_item default_list'})
        for i in trange(len(resp)):
            position_link = resp[i].findAll('a', attrs={'class': 'position_link'})
            link = position_link[0]['href']
            if isurl(link):
                htmlnext = urlhelper(link)
                soup = BeautifulSoup(htmlnext, "lxml")
                try:
                    # job description
                    job_bt = soup.findAll('dd',
                                          attrs={'class': 'job_bt'})[0].text
                except:
                    continue
                try:
                    # job name
                    jobname = position_link[0].find('h4').get_text()
                except:
                    continue
                try:
                    # basic job requirements (experience, education)
                    p_bot = resp[i].findAll('div',
                                            attrs={'class': 'p_bot'})[0].text
                except:
                    continue
                try:
                    # monthly salary
                    money = resp[i].findAll('span',
                                            attrs={'class': 'money'})[0].text
                except:
                    continue
                try:
                    # company industry
                    industry = resp[i].findAll('div',
                                               attrs={'class': 'industry'})[0].text
                except:
                    continue
                try:
                    # company name
                    company_name = resp[i].findAll(
                        'div', attrs={'class': 'company_name'})[0].text
                except:
                    continue
                rows = OrderedDict()
                rows["jobname"] = jobname.replace(" ", "")
                rows["money"] = money
                rows["company_name"] = company_name.replace("\n", "")
                rows["p_bot"] = p_bot.strip().replace(" ", "") \
                    .replace("\n", ",").replace("/", ",")
                rows["industry"] = industry.strip() \
                    .replace("\t", "").replace("\n", "")
                rows["job_bt"] = job_bt
                savedata.append(rows)
    # save each keyword's results to a local CSV
    df = pd.DataFrame(savedata)
    df.to_csv("./datasets/lagou/{}.csv".format(name), index=None)
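Once the crawl finishes, each keyword ends up in its own CSV under ./datasets/lagou/. Here is a minimal sketch, assuming the files were written as above, for loading them back into a single DataFrame for analysis:

    import pandas as pd

    frames = []
    for name in names:  # the same pinyin keyword list used by the crawler above
        try:
            df = pd.read_csv("./datasets/lagou/{}.csv".format(name))
            df["keyword"] = name      # remember which search the rows came from
            frames.append(df)
        except FileNotFoundError:
            continue                  # this keyword produced no file (e.g. all pages failed)

    alljobs = pd.concat(frames, ignore_index=True)
    print(alljobs.shape)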
The above is how to implement a bs4-based crawler for AI-related jobs on Lagou. The editor believes some of these points may well come up in daily work, and hopes you can learn more from this article.