Python url acquisition

2025-01-18 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

This script uses Python and Baidu search to collect URLs.

Install the dependencies (the script uses requests, BeautifulSoup, and lxml):

pip install requests beautifulsoup4 lxml

Syntax: python url_collection.py -h prints the help message

python url_collection.py <keyword> -p <page count> -t <thread count> -o <output file>
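For illustration, Baidu paginates results with a pn parameter that advances by 10 per page, so the script builds one search URL per requested page. The keyword and page count below are hypothetical stand-ins for the command-line arguments:

```python
# Hypothetical stand-ins for the script's command-line arguments.
keyword = 'inurl:.php?id='
pagecount = 3

# Baidu paginates with pn=0,10,20,... (10 results per page).
urls = ['https://www.baidu.com/s?wd=' + keyword + '&pn=' + str(i)
        for i in range(0, pagecount * 10, 10)]
print(urls)
```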

Create a new file: touch url_collection.py

Then write the main part of the code:

# coding: utf-8
# Collect candidate URLs from Baidu search results.
import re
import threading
from argparse import ArgumentParser
from queue import Queue  # Python 3 name; the module was "Queue" in Python 2

import requests
from bs4 import BeautifulSoup as bs

logo = """
url collect
By : Snow wolf
"""
print(logo)

arg = ArgumentParser(description='baidu_url_collect py-script by snowwolf')
arg.add_argument('keyword', help='keyword like inurl:.?id= for searching sqli sites')
arg.add_argument('-p', '--page', help='page count', dest='pagecount', type=int, default=1)
arg.add_argument('-t', '--thread', help='thread count', dest='thread_count', type=int, default=10)
arg.add_argument('-o', '--outfile', help='file to save the results', dest='outfile', default='result.txt')
result = arg.parse_args()

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}

class Bd_url(threading.Thread):
    def __init__(self, que):
        threading.Thread.__init__(self)
        self._que = que

    def run(self):
        while not self._que.empty():
            url = self._que.get()
            try:
                self.bd_url_collect(url)
            except Exception as e:
                print(e)

    def bd_url_collect(self, url):
        r = requests.get(url, headers=headers, timeout=3)
        soup = bs(r.content, 'lxml', from_encoding='utf-8')
        # Result links carry a data-click attribute and no class.
        bqs = soup.find_all(name='a', attrs={'data-click': re.compile(r'.'), 'class': None})
        for bq in bqs:
            # Follow Baidu's redirect link to resolve the real target URL.
            r = requests.get(bq['href'], headers=headers, timeout=3)
            if r.status_code == 200:
                print(r.url)
                with open(result.outfile, 'a') as f:
                    f.write(r.url + '\n')

def main():
    threads = []
    que = Queue()
    # Baidu's pn parameter advances by 10 per result page.
    for i in range(0, result.pagecount * 10, 10):
        que.put('https://www.baidu.com/s?wd=' + result.keyword + '&pn=' + str(i))
    for _ in range(result.thread_count):
        threads.append(Bd_url(que))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == '__main__':
    main()
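The selection rule in bd_url_collect (keep anchors that have a data-click attribute but no class) can be exercised offline without requests or bs4. The sketch below re-implements that filter with the standard library's html.parser against made-up HTML that mimics Baidu's result anchors; ResultLinkParser and the example markup are hypothetical:

```python
from html.parser import HTMLParser

class ResultLinkParser(HTMLParser):
    """Collect hrefs of <a> tags that carry data-click but no class attribute."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        d = dict(attrs)
        if 'data-click' in d and 'class' not in d:
            self.links.append(d['href'])

# Hypothetical markup mimicking Baidu's result page.
html = (
    '<a data-click="{}" href="http://example.com/a">hit</a>'
    '<a data-click="{}" class="c-more" href="http://example.com/b">skip: has class</a>'
    '<a href="http://example.com/c">skip: no data-click</a>'
)
p = ResultLinkParser()
p.feed(html)
print(p.links)  # -> ['http://example.com/a']
```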

End of the code.
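The script's concurrency model, a shared Queue drained by several worker threads, can be reduced to a minimal, network-free sketch. Worker and the squaring task here are hypothetical stand-ins for Bd_url and bd_url_collect:

```python
import threading
from queue import Queue, Empty

results = []
lock = threading.Lock()

class Worker(threading.Thread):
    """Drains the shared queue until it is empty, like Bd_url."""
    def __init__(self, que):
        threading.Thread.__init__(self)
        self._que = que

    def run(self):
        while True:
            try:
                n = self._que.get_nowait()
            except Empty:
                break
            with lock:  # protect the shared result list
                results.append(n * n)

que = Queue()
for n in [1, 2, 3, 4, 5]:
    que.put(n)

workers = [Worker(que) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(results))  # -> [1, 4, 9, 16, 25]
```

Note the use of get_nowait() rather than the empty()/get() pair: between a thread's empty() check and its get(), another thread may take the last item, leaving get() blocked forever. Catching queue.Empty avoids that race.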

© 2024 shulou.com SLNews company. All rights reserved.
