How to use jsoup to build an image-grabbing crawler
This article covers how to use jsoup to build an image-grabbing crawler. In real cases many people run into the same dilemmas, so let the editor walk you through how to handle these situations. I hope you read it carefully and get something out of it!
First version:
ThreadPoolExecutor executor = new ThreadPoolExecutor(6, 6, 0, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
for (int j = 1; j <= totalPageSize; j++) {
    executor.execute(() -> {
        // 1. grab the web page and extract the picture urls
        // 2. save the pictures according to their urls
        // 3. after saving, record success/failure information to a local txt file
    });
}
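The jsoup part itself is never shown in the article; here is a minimal sketch of step 1, assuming the site lists pictures in ordinary <img> tags (the page url pattern and the selector are placeholders, not from the article):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PageGrabber {
    // Fetch one listing page and collect the absolute urls of its images.
    static List<String> grabImageUrls(int pageNo) throws IOException {
        Document doc = Jsoup.connect("https://example.com/list?page=" + pageNo) // hypothetical url
                .timeout(10_000)
                .get();
        List<String> urls = new ArrayList<>();
        for (Element img : doc.select("img[src]")) {
            urls.add(img.absUrl("src")); // resolve relative src against the page url
        }
        return urls;
    }
}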
The program seemed fine. Only 6 threads were opened; at first I didn't dare open too many, for fear of being blocked by the website.
But it ran far too slowly, crawling only a little over 10 GB in a whole night. Analysis turned up two main problems:
1. Concurrent writes to the local txt file slow down each individual task
2. The threads are not fully utilized
First, take a look at the file-writing method, which comes from NIO:
Files.write(log, attr.getBytes("utf8"), StandardOpenOption.APPEND);
Looking at the JDK source, this method constructs an OutputStream and calls its write method, and that write method is marked synchronized, so under multithreaded use it inevitably inflates into a heavyweight lock.
So if you want to log, it is best to arrange things so the threads never compete over the file at all.
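One way to arrange that, and a sketch consistent with what the article does later: worker threads only append to an in-memory queue, and a single thread flushes that queue to disk, so the synchronized write inside Files.write is never contended (the class and file names here are assumptions):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ConcurrentLinkedQueue;

public class SingleWriterLog {
    private final ConcurrentLinkedQueue<String> logQueue = new ConcurrentLinkedQueue<>();
    private final Path log = Paths.get("crawl.log"); // hypothetical log path

    // Called from any worker thread: just an in-memory offer, no file I/O.
    void record(String line) {
        logQueue.offer(line);
    }

    // Called from exactly one thread, e.g. once at the end of the run.
    void flush() throws IOException {
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = logQueue.poll()) != null) {
            sb.append(line).append(System.lineSeparator());
        }
        Files.write(log, sb.toString().getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}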
Then optimize the multithreading itself. Downloading a picture is definitely slower than fetching its url, so would it be better to collect all the urls first and then download the pictures from them?
First optimization:
// queue used to record all the urls
Queue<String> queue = new ConcurrentLinkedQueue<>();
// queue used to record all the logs
Queue<String> logQueue = new ConcurrentLinkedQueue<>();
// all the url-grabbing tasks
List<Consumer<Void>> allTasks = new ArrayList<>();
for (int j = 1; j <= totalPageSize; j++) {
    allTasks.add(v -> {
        // grab the page, extract the urls, put them into queue
    });
}
// use ForkJoin to execute the url-grabbing tasks in parallel
// (BatchTaskRunner is the author's helper; a sketch follows the step list below)
BatchTaskRunner.execute(allTasks, taskPerThread, tasks -> {
    tasks.forEach(t -> t.accept(null));
});
// download all the urls in parallel
List<String> list = queue.stream().collect(Collectors.toList());
BatchTaskRunner.execute(list, taskPerThread, tasks -> {
    tasks.forEach(url -> {
        // 1. download the file
        // 2. put the url's success or failure into logQueue
    });
});
// finally, the log
logQueue.forEach(line -> {
    // save all the logs to the local txt file
});
There are three main steps here:
1. Execute the grabbing tasks in parallel, putting each url into queue
2. Execute the downloads in parallel, taking the urls from queue
3. Save the logs from logQueue to the local file
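As noted above, BatchTaskRunner is the author's own helper and its source is not shown. A plausible sketch, assuming it simply splits the list into chunks of taskPerThread and hands each chunk to the handler on the common ForkJoinPool:

import java.util.List;
import java.util.function.Consumer;
import java.util.stream.IntStream;

public class BatchTaskRunner {
    // Hypothetical reconstruction: partition `all` into chunks of `chunkSize`
    // and process the chunks in parallel on the common ForkJoinPool.
    static <T> void execute(List<T> all, int chunkSize, Consumer<List<T>> handler) {
        int chunks = (all.size() + chunkSize - 1) / chunkSize;
        IntStream.range(0, chunks)
                .parallel() // parallel streams run on the common ForkJoinPool
                .forEach(i -> {
                    List<T> chunk = all.subList(i * chunkSize,
                            Math.min((i + 1) * chunkSize, all.size()));
                    handler.accept(chunk);
                });
    }
}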
Analysis: first grab all the urls, then download them in parallel; the log writing is pushed to the very end, since logging after the pictures are saved costs nothing extra. But at run time I found there was still a problem:
Wait, why do we have to collect every url first and only then process them! If we consume the urls while the grabbing tasks are still producing them, the remaining work runs in parallel, and that should be faster.
All right, with this idea in mind, let's just do it:
Second optimization:
/** logic added in the second optimization: start **/
// controls when the main thread proceeds
CountDownLatch countDownLatch = new CountDownLatch(totalPageSize);
// the thread pool used to consume queue
ThreadPoolExecutor executor = new ThreadPoolExecutor(12, 12, 0, TimeUnit.SECONDS, new SynchronousQueue<>());
// switch for the spin loop (a field in the real code, since volatile applies only to fields)
volatile boolean flag = false;
/** logic added in the second optimization: end **/

// queue used to record all the urls
Queue<String> queue = new ConcurrentLinkedQueue<>();
// queue used to record all the logs
Queue<String> logQueue = new ConcurrentLinkedQueue<>();
// all the url-grabbing tasks
List<Consumer<Void>> allTasks = new ArrayList<>();
for (int j = 1; j <= totalPageSize; j++) {
    allTasks.add(v -> {
        // grab the page, extract the urls, put them into queue,
        // then call countDownLatch.countDown()
    });
}
// run the grabbing on its own thread, mainly so it happens asynchronously
new Thread(() -> {
    BatchTaskRunner.execute(allTasks, taskPerThread, tasks -> {
        tasks.forEach(t -> t.accept(null));
    });
}).start();
// consume while grabbing
for (int i = 0; i < 12; i++) {
    executor.execute(() -> {
        try {
            // take urls from queue and consume them;
            // when the latch reaches zero, takeQueue sets flag to true
            takeQueue();
        } catch (InterruptedException e) {
        }
    });
}
for (;;) {
    if (flag) {
        break;
    }
    Thread.sleep(10000);
}
countDownLatch.await();
executor.shutdownNow();
// nothing left to execute
if (queue.size() == 0) {
    return;
}
// download whatever urls are still in the queue, in parallel
List<String> list = queue.stream().collect(Collectors.toList());
BatchTaskRunner.execute(list, taskPerThread, tasks -> {
    tasks.forEach(url -> {
        // 1. download the file
        // 2. put the url's success or failure into logQueue
    });
});
// finally, the log
logQueue.forEach(line -> {
    // save all the logs to the local txt file
});
The logic of the takeQueue method:
void takeQueue() throws InterruptedException {
    for (;;) {
        long count = countDownLatch.getCount();
        // keep consuming as long as the latch has not reached zero
        if (count > 0) {
            String poll = queue.poll();
            if (poll != null) {
                consumer.accept(poll); // download according to the url
            } else {
                Thread.sleep(3000);
            }
        } else {
            flag = true;
            return;
        }
    }
}
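The consumer that does the actual download is not shown either; a minimal sketch using jsoup to fetch the raw bytes (ignoreContentType is needed because jsoup rejects non-HTML responses by default; the target directory and naming scheme are assumptions):

import org.jsoup.Jsoup;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Consumer;

public class Downloader {
    // Hypothetical download consumer: fetch the image bytes and save them locally.
    static final Consumer<String> consumer = url -> {
        try {
            byte[] bytes = Jsoup.connect(url)
                    .ignoreContentType(true) // allow image/* responses
                    .maxBodySize(0)          // 0 means no limit on the body size
                    .execute()
                    .bodyAsBytes();
            String name = url.substring(url.lastIndexOf('/') + 1);
            Path target = Paths.get("images", name); // hypothetical target directory
            Files.createDirectories(target.getParent());
            Files.write(target, bytes);
            // here the url's success would be offered to logQueue
        } catch (IOException e) {
            // record the failed url rather than crash the worker
        }
    };
}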
That is roughly the whole logic; the log handling is no longer the important part.
The main thread spins while url collection and downloading run concurrently. If urls are still left in the queue when the collecting logic finishes, they are downloaded in parallel at the end.
Seeing all the threads in use, I felt much better.
Even with the consumers running, queue kept accumulating more and more objects: the urls were being produced faster than they could be downloaded.
This is the end of "how to use jsoup to build an image-grabbing crawler". Thank you for reading. If you want to learn more, you can follow this site, where the editor will keep publishing practical articles!