This article is about how to implement a bs4-based crawler for AI-related jobs on Lagou. The editor thinks it is very practical, so it is shared here for you to learn from. I hope you get something out of it after reading. Let's take a look.
At the beginning of the year many of us are job-hopping, and watching the people around you leave one by one is a little sad. Everyone has their own ambitions, so no further comment. This article is mainly about how to scrape AI-related job data from Lagou. The principle is the same for scraping any other kind of position: once you understand this, you can scrape everything else. The whole thing takes less than 100 lines of code. The main fields captured are "position name", "monthly salary", "company name", "company industry", "basic job requirements (experience, education)" and "job description". The positions covered include "natural language processing", "machine learning", "deep learning", "artificial intelligence", "data mining", "algorithm engineer", "machine vision", "speech recognition", "image processing" and so on.
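To make the "same principle for other jobs" point concrete, here is a minimal sketch of how the listing URL is built for any keyword and page number, based on the URL pattern used in the crawler below (the keyword slug here is only an illustration):

    # Lagou listing pages follow https://www.lagou.com/zhaopin/<keyword>/<page>/
    keyword = 'shenduxuexi'   # pinyin slug for "deep learning" (illustrative)
    for page in range(1, 4):  # first three listing pages
        rooturl = 'https://www.lagou.com/zhaopin/{}/{}/'.format(keyword, page)
        print(rooturl)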
Here's a random screenshot showing the information we want.
Next, let's see where the information we need sits on the page.
Then the position detail page: its URL is in that href, so the key is simply to grab that href and we are done.
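As a quick illustration of pulling that href out with BeautifulSoup (the HTML snippet here is made up for the example, but the position_link class matches what the crawler below looks for):

    from bs4 import BeautifulSoup

    # Hypothetical fragment of a listing page, for illustration only.
    snippet = '<a class="position_link" href="https://www.lagou.com/jobs/123456.html"><h4>NLP Engineer</h4></a>'
    soup = BeautifulSoup(snippet, "lxml")
    link = soup.find('a', attrs={'class': 'position_link'})['href']
    print(link)  # https://www.lagou.com/jobs/123456.html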
Let's go straight to the code.
First of all, we need a way to determine whether a URL is valid; that is the isurl method.
The urlhelper method fetches the HTML content of a URL and logs a warning message when an exception occurs.
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import requests
from collections import OrderedDict
from tqdm import tqdm, trange
from urllib import error
import logging

logging.basicConfig(level=logging.WARNING)


def isurl(url):
    # a URL is considered valid if it answers with HTTP 200
    if requests.get(url).status_code == 200:
        return True
    else:
        return False


def urlhelper(url):
    # fetch the raw HTML of a page, logging a warning on failure
    try:
        req = urllib.request.Request(url)
        req.add_header("User-Agent",
                       "Mozilla/5.0 (Windows NT 6.1; WOW64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/45.0.2454.101 Safari/537.36")
        req.add_header("Accept", "*/*")
        req.add_header("Accept-Language", "zh-CN,zh;q=0.8")
        data = urllib.request.urlopen(req)
        html = data.read().decode('utf-8')
        return html
    except error.URLError as e:
        logging.warning("{}".format(e))
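A quick usage sketch of the two helpers (the listing URL here is just an example; any reachable page would do):

    testurl = 'https://www.lagou.com/zhaopin/shenduxuexi/1/'  # example listing page
    if isurl(testurl):
        html = urlhelper(testurl)
        if html:
            print(len(html))  # rough sanity check that some HTML came back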
The following is the main program of the crawler. Pay attention to the exception handling; it is very important, because if the crawl dies halfway and everything scraped so far has not been saved, it is a tragedy. The other thing to say is that the BeautifulSoup class is really convenient; using it skillfully saves a lot of time.
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import requests
from collections import OrderedDict
from tqdm import tqdm, trange
from urllib import error
import logging

# pinyin slugs for: NLP, machine learning, deep learning, artificial intelligence,
# data mining, algorithm engineer, machine vision, speech recognition, image processing
names = ['ziranyuyanchuli', 'jiqixuexi', 'shenduxuexi', 'rengongzhineng',
         'shujuwajue', 'suanfagongchengshi', 'jiqishijue', 'yuyinshibie',
         'tuxiangchuli']

for name in tqdm(names):
    savedata = []
    page_number = 0
    for page in range(1, 31):
        page_number += 1
        if page_number % 5 == 0:
            print(page_number)
        rooturl = 'https://www.lagou.com/zhaopin/{}/{}/'.format(name, page)
        if not isurl(rooturl):
            continue
        html = urlhelper(rooturl)
        soup = BeautifulSoup(html, "lxml")
        resp = soup.findAll('div', attrs={'class': 's_position_list'})
        resp = resp[0]
        resp = resp.findAll('li', attrs={'class': 'con_list_item default_list'})
        for i in trange(len(resp)):
            position_link = resp[i].findAll('a', attrs={'class': 'position_link'})
            link = position_link[0]['href']
            if isurl(link):
                htmlnext = urlhelper(link)
                soup = BeautifulSoup(htmlnext, "lxml")
                try:
                    # job description
                    job_bt = soup.findAll('dd',
                                          attrs={'class': 'job_bt'})[0].text
                except:
                    continue
                try:
                    # job name
                    jobname = position_link[0].find('h4').get_text()
                except:
                    continue
                try:
                    # basic job requirements (experience, education)
                    p_bot = resp[i].findAll('div',
                                            attrs={'class': 'p_bot'})[0].text
                except:
                    continue
                try:
                    # monthly salary
                    money = resp[i].findAll('span',
                                            attrs={'class': 'money'})[0].text
                except:
                    continue
                try:
                    # company industry
                    industry = resp[i].findAll('div',
                                               attrs={'class': 'industry'})[0].text
                except:
                    continue
                try:
                    # company name
                    company_name = resp[i].findAll(
                        'div', attrs={'class': 'company_name'})[0].text
                except:
                    continue
                rows = OrderedDict()
                rows["jobname"] = jobname.replace(" ", "")
                rows["money"] = money
                rows["company_name"] = company_name.replace("\n", "")
                rows["p_bot"] = p_bot.strip().replace(" ", "") \
                    .replace("\n", ",").replace("/", ",")
                rows["industry"] = industry.strip() \
                    .replace("\t", "").replace("\n", "")
                rows["job_bt"] = job_bt
                savedata.append(rows)
    # save each keyword's results to a local CSV
    df = pd.DataFrame(savedata)
    df.to_csv("./datasets/lagou/{}.csv".format(name), index=None)
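Once the crawl finishes, each keyword ends up in its own CSV under ./datasets/lagou/. Here is a minimal sketch, assuming the files were written as above, for loading them back into a single DataFrame for analysis:

    import pandas as pd

    frames = []
    for name in names:  # the same pinyin keyword list used by the crawler above
        try:
            df = pd.read_csv("./datasets/lagou/{}.csv".format(name))
            df["keyword"] = name      # remember which search the rows came from
            frames.append(df)
        except FileNotFoundError:
            continue                  # this keyword produced no file (e.g. all pages failed)

    alljobs = pd.concat(frames, ignore_index=True)
    print(alljobs.shape)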
The above is how to implement a bs4-based crawler for AI-related jobs on Lagou. The editor believes some of these points may well come up in daily work, and hopes you can learn more from this article.