In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-18 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/03 Report--
Python uses Baidu to do url collection
pip install tableprint
paramiko==2.0.8
Syntax: python url_collection.py -h Output help information
python url_collection.py Information to collect-p Number of pages-t Number of processes-o File name and format saved
New file touch url_collection.py
Write the formal part of the code
#coding: utf-8
import requests
from bs4 import BeautifulSoup as bs
import re
from Queue import Queue
import threading
from argparse import ArgumentParser
logo="""
u u l | ccccc ooooo l l eeeeee cccccc ttttttt
u u r rr l | c o o l l e c t
u u r r r l | c o o l l eeeeee c t
u u r l | c o o l l e c t
u u u r l | c o o l l e c t
uuuuuuuu u r lllll | ccccc ooooo llllll lllll eeeeee cccccc t
By : Snow wolf
"""
print(logo)
arg = ArgumentParser(description='baidu_url_collect py-script by snowwolf')
arg.add_argument('keyword',help='keyword like inurl:.? id= for searching sqli site')
arg.add_argument('-p','--page', help='page count', dest='pagecount', type=int)
arg.add_argument('-t','--thread', help='the thread_count', dest='thread_count', type=int, default=10)
arg.add_argument('-o','--outfile', help='the file save result', dest='outfile', default='result.txt')
result = arg.parse_args()
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
class Bd_url(threading.Thread):
def init(self, que):
threading.Thread.init(self)
self._ que = que
def run(self): while not self._ que.empty(): URL = self._ que.get() try: self.bd_url_collect(URL) except Exception,e: print e passdef bd_url_collect(self, url): r = requests.get(url, headers=headers, timeout=3) soup = bs(r.content, 'lxml', from_encoding='utf-8') bqs = soup.find_all(name='a', attrs={'data-click':re.compile(r'. '), 'class':None}) for bq in bqs: r = requests.get(bq['href'], headers=headers, timeout=3) if r.status_code == 200: print r.url with open(result.outfile, 'a') as f: f.write(r.url + '\n')
def main():
thread = []
thread_count = result.thread_count
que = Queue()
for i in range(0,(result.pagecount-1)*10,10):
que.put('https://www.baidu.com/s? wd=' + result.keyword + '&pn=' + str(i))
for i in range(thread_count): thread.append(Bd_url(que))for i in thread: i.start()for i in thread: i.join()
if name == 'main':
main()
code end on
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.