
How to use Python to write a script for collecting subdomain names

2025-03-29 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains in detail how to use Python to write a script that collects subdomain names. The content is shared here for reference; I hope you will have a solid understanding of the topic after reading it.

0x00 Preface

Task:

Use scripts to collect website subdomain name information with the help of search engines.

Prepare the tool:

The Python installation package, pip, the HTTP request library requests, and the regular-expression library re.

A subdomain is defined relative to a site's primary domain. For example, Baidu's primary domain is baidu.com, a first-level domain. A subdomain prepends another label to it, separated by a dot: zhidao.baidu.com is a second-level domain. Extending the hostname further, club.user.baidu.com is a third-level domain, and so on.
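As a quick illustration (this snippet is not from the original article), splitting a hostname on dots makes these levels explicit:

```python
# Split a hostname into its labels; each extra label on the left
# adds one level below the primary domain.
host = "club.user.baidu.com"
labels = host.split(".")
primary = ".".join(labels[-2:])  # the primary (first-level) domain
level = len(labels) - 1          # baidu.com -> 1, zhidao.baidu.com -> 2, ...
print(primary)  # baidu.com
print(level)    # 3
```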

0x01 Main body

What is the process of manually collecting sub-domain names?

For example, suppose we want to collect, for the primary domain qq.com, all the subdomains that the Baidu search engine can find.

First, search with the domain-search syntax:

Search domain syntax: site:qq.com

The search results then contain the subdomain information we want; we can right-click, inspect the element, and copy it out.

How do we replace this tedious manual operation with Python?

In fact, we just automate the manual collection steps with code.

Let's start building the collector:

1. Initiate a search http request

For the request we use Python's third-party HTTP library, requests.

If it is not installed yet, you can install it with pip: pip install requests

Basic use of requests-example:

```python
help(requests)   # view the help manual for requests
dir(requests)    # list all attributes and methods of the requests module
requests.get('http://www.baidu.com')  # initiate a GET request
```

All right, with the basics covered, let's initiate a request and get the contents of the response.

```python
# -*- coding: utf-8 -*-
import requests  # import the requests library

url = 'http://www.baidu.com/s?wd=site:qq.com'  # set the request URL
response = requests.get(url).text  # GET request; .text is the response body
print(response)
```
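One practical note: Baidu often serves a bot-check or stripped-down page to clients without a browser-like User-Agent. A hedged sketch of adding one (the header value here is illustrative and not from the article):

```python
import requests

def fetch(url):
    # Spoof a browser User-Agent; without it Baidu may refuse the request
    # or return a stripped page. The header value is an assumption.
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    return requests.get(url, headers=headers, timeout=10).text

# html = fetch('http://www.baidu.com/s?wd=site:qq.com')
```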

There is a lot of content in the response; we need to find the subdomains we want and extract them.

From looking at the elements, we can see that the subdomain name is wrapped in a piece of code, as follows:

```
style="text-decoration:none;">chuangshi.qq.com/
```

2. Regular expressions: (.*?) makes its shining debut

Regex pattern: style="text-decoration:none;">(.*?)/

Are regular expressions difficult? They are. Are they complicated? They are.

In the simplest case, however, we just use (.*?) to stand for the data we want.
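The trailing ? is what makes the quantifier non-greedy. A small sketch (illustrative strings, not from the article) contrasting greedy (.*) with lazy (.*?):

```python
import re

s = "xxaxx123xxbxx"
print(re.findall("xx(.*)xx", s))   # greedy: ['axx123xxb']
print(re.findall("xx(.*?)xx", s))  # lazy:   ['a', 'b']
```

The greedy form swallows everything up to the last delimiter, while the lazy form stops at the first one, which is why the article's pattern uses (.*?).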

Basic use of re-example:

Suppose we want to extract "I", "Like", and "Study" from the string '123xxIxx123xxLikexx123xxStudyxx'; we could write:

```python
import re

eg = '123xxIxx123xxLikexx123xxStudyxx'
print(re.findall('xx(.*?)xx', eg))  # prints ['I', 'Like', 'Study']
```

Following the same pattern as the example above, we can obtain the subdomain names too.

```python
# -*- coding: utf-8 -*-
import requests  # import the requests library
import re        # import the re library

url = 'http://www.baidu.com/s?wd=site:qq.com'  # set the request URL
response = requests.get(url).text  # GET request; .text is the response body
# key point: extract the subdomains with the regex below
subdomain = re.findall('style="text-decoration:none;">(.*?)/', response)
print(subdomain)
```

Results:

```
['chuangshi.qq.com', '1314.qq.com', 'lol.qq.com', 'tgp.qq.com', 'open.qq.com', ..., 'ac.qq.com']
```

3. Handling pagination

The subdomains obtained above come only from the first page of results. How do we get the results from every page?

```python
key = "qq.com"
# add a page-number parameter to the URL:
url = "http://www.baidu.com.cn/s?wd=site:" + key + "&cl=3&pn=0"   # pn=0  -> page 1
url = "http://www.baidu.com.cn/s?wd=site:" + key + "&cl=3&pn=10"  # pn=10 -> page 2
url = "http://www.baidu.com.cn/s?wd=site:" + key + "&cl=3&pn=20"  # pn=20 -> page 3 ...
```

Oh my God, a hundred pages. Do I have to write a hundred URLs? Of course not; a loop solves the problem.

```python
for i in range(100):  # suppose there are 100 pages
    i = i * 10
    url = "http://www.baidu.com.cn/s?wd=site:" + key + "&cl=3&pn=%s" % i
```
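As a side note, the same loop can be written with range's step argument, which yields pn = 0, 10, 20, ... directly (an equivalent sketch, not the article's code):

```python
key = "qq.com"
# range(start, stop, step) steps by 10, so no i = i * 10 is needed
urls = ["http://www.baidu.com.cn/s?wd=site:" + key + "&cl=3&pn=%s" % i
        for i in range(0, 100, 10)]
print(len(urls))  # 10 URLs, one per page
print(urls[1])
```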

4. Too many duplicates? Want to deduplicate?

Basics:

Data type of python: set

A set holds a series of elements, but the elements of a set are unique and unordered.

You create a set by calling set() and passing in a list; the elements of the list become the elements of the set.

```python
sites = list(set(sites))  # use set() to deduplicate
```

The result of a regex findall is a list, which can be deduplicated by passing it to set().
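A minimal sketch of the deduplication step, on illustrative data:

```python
sites = ["a.qq.com", "b.qq.com", "a.qq.com", "b.qq.com"]
deduped = list(set(sites))  # duplicates removed; order is not preserved
print(sorted(deduped))      # ['a.qq.com', 'b.qq.com']
```

Note that set() does not preserve the original order; sort the result if a stable ordering matters.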

5. Complete Code & & Summary

The following is the complete code for crawling subdomain names via the Baidu search engine.

```python
# -*- coding: utf-8 -*-
import requests
import re

key = "qq.com"
sites = []
match = 'style="text-decoration:none;">(.*?)/'

for i in range(48):  # crawl 48 result pages
    i = i * 10
    url = "http://www.baidu.com.cn/s?wd=site:" + key + "&cl=3&pn=%s" % i
    response = requests.get(url).text
    subdomains = re.findall(match, response)
    sites += list(subdomains)

site = list(set(sites))  # set() removes duplicates
print(site)
print("The number of sites is %d" % len(site))
for i in site:
    print(i)
```

Result Screenshot:

0x02 Summary

Subdomain mining is really just a small crawler. Here we crawled with Baidu's engine, but the Bing engine typically returns more data for this task than Baidu, so Bing is recommended; the code is much the same as the Baidu version. One small tip: Bing's domain-search syntax is domain:qq.com.
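A hedged sketch of building the corresponding Bing URLs, assuming Bing's domain: query syntax and its first offset parameter for paging (both assumptions should be verified against the live site before relying on them):

```python
key = "qq.com"
# Bing is assumed to page with 'first' = 1, 11, 21, ... (result offset)
urls = ["https://www.bing.com/search?q=domain:" + key + "&first=%d" % (1 + i * 10)
        for i in range(5)]
print(urls[0])
print(urls[2])
```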

That's all on how to use Python to write a subdomain-collection script. I hope the content above is helpful and that you can learn more from it. If you think the article is good, you can share it for more people to see.
