How to crawl Baidu Tieba data with JavaScript

Updated: 2025-03-17. From: SLTechnology News&Howtos (shulou)

Shulou (Shulou.com), 06/02

This article shows how to crawl Baidu Tieba data with JavaScript, and how to use a lesser-known Chrome tool to diagnose a request that fails along the way.

The JavaScript function used to crawl posts is as follows:

    function getPostByAJAX(requestURL) {
        // Synchronous request: responseText is available as soon as $.ajax returns.
        var html = $.ajax({ url: requestURL, async: false }).responseText;
        return html;
    }

It is a very simple synchronous AJAX request.

The value passed in as the requestURL parameter is: http://tieba.baidu.com/i/i/my_tie

I can open this URL directly in the browser without problems; it returns 47.2 KB of data.

However, when I requested the same URL through the AJAX function, the Chrome developer tools reported an error.

The error came with no detailed information, so I had no clue where to start.

This is a good opportunity to show a hidden feature of Chrome.

Open in the Chrome address bar: chrome://net-internals

Click the Events tab.

Go back to the Baidu Tieba crawler page that issues the AJAX request, press F5 to refresh it so that a new request is sent, and then return to the chrome://net-internals page.

The AJAX request is now recorded in detail. Find the URL I care about: http://tieba.baidu.com/i/i/my_tie

The chrome://net-internals page shows web requests in far more detail than the Network tab of the developer tools.

The response headers contain some clues about the cause of the error.

The HTTP response status is 302 and the Location field is "http://static.tieba.baidu.com/tb/error.html?ErrType=1". These two clues give me a hint: the error most likely has to do with Baidu login status, because the URL I am requesting does not support anonymous access.

I was able to open the URL in the browser successfully because the browser sent my login cookie with the request.
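The reasoning above can be sketched as a small check. This is only an illustration: the helper names looksLikeLoginRedirect and getErrType are hypothetical, and treating a redirect to error.html as a sign of a missing login is the inference made in this article, not documented Baidu behavior.

```javascript
// Hypothetical helper: does this response look like the
// "redirected to an error page" case described in the article?
// Assumption: a 302 whose Location path ends in /error.html means the
// request was rejected, here most likely because no login cookie was sent.
// Assumption: the Location header is an absolute URL, as in the article.
function looksLikeLoginRedirect(status, locationHeader) {
  if (status !== 302 || !locationHeader) {
    return false;
  }
  var target = new URL(locationHeader); // URL is a global in browsers and Node.js
  return target.pathname.endsWith("/error.html");
}

// Hypothetical helper: read the ErrType query parameter, or null if absent.
function getErrType(locationHeader) {
  return new URL(locationHeader).searchParams.get("ErrType");
}

var loc = "http://static.tieba.baidu.com/tb/error.html?ErrType=1";
console.log(looksLikeLoginRedirect(302, loc)); // true
console.log(getErrType(loc));                  // "1"
```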

After some Googling I found a solution. Add the following to the settings passed to the AJAX request:

    xhrFields: { withCredentials: true }

This way, my cookie is sent to the Baidu server along with the AJAX request.

With this parameter added, the request returns the expected response.
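Putting the pieces together, the request function might look like the sketch below. The jQuery object is passed in as a parameter (named jq here, an assumption made for this example) so that the snippet can also be run outside a browser with a stand-in object. The helper name buildTiebaRequestOptions is likewise hypothetical.

```javascript
// Build the settings object for jQuery's $.ajax, including the
// xhrFields entry that makes the browser send cookies with the request.
function buildTiebaRequestOptions(requestURL) {
  return {
    url: requestURL,
    async: false,                        // synchronous, as in the original function
    xhrFields: { withCredentials: true } // copied onto the underlying XMLHttpRequest
  };
}

// jq is the jQuery object ($ in the page); passing it in is an assumption
// of this sketch so that it can be exercised with a stand-in.
function getPostByAJAX(jq, requestURL) {
  return jq.ajax(buildTiebaRequestOptions(requestURL)).responseText;
}

// Stand-in for jQuery that only echoes the credentials flag it received.
var fakeJq = {
  ajax: function (options) {
    return { responseText: "withCredentials=" + options.xhrFields.withCredentials };
  }
};
console.log(getPostByAJAX(fakeJq, "http://tieba.baidu.com/i/i/my_tie")); // withCredentials=true
```

In the page itself the call would be getPostByAJAX($, "http://tieba.baidu.com/i/i/my_tie"). Note that for a cross-origin request the browser only exposes the response to the script if the server explicitly allows credentialed requests.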

With this hidden feature of Chrome, we can also observe other details that are usually hard to see.

For example, my AJAX request is made through a local copy of the jQuery library: my HTML references the local file jquery1.7.1.js directly, and at runtime that file has to be loaded into memory.

With chrome://net-internals I can see that jquery1.7.1.js is read into memory in chunks: each full chunk appears as a URL_REQUEST_JOB_BYTES_READ event with byte_count = 32768. Eight chunks were read in total, and the last one contained only the remaining 22285 bytes, which is less than 32768.

The 8 chunks add up to 251661 bytes, which is exactly the size of jquery1.7.1.js. This again shows that chrome://net-internals exposes more detail than the Network tab.
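The chunk arithmetic can be checked with a few lines of code. This is just a sketch of the calculation, not of how Chrome reads files internally; the function name chunkSizes is made up for the example.

```javascript
// Split totalBytes into reads of at most chunkSize bytes each and
// return the size of every read, in order.
function chunkSizes(totalBytes, chunkSize) {
  var sizes = [];
  var remaining = totalBytes;
  while (remaining > 0) {
    var n = Math.min(chunkSize, remaining);
    sizes.push(n);
    remaining -= n;
  }
  return sizes;
}

var sizes = chunkSizes(251661, 32768);
console.log(sizes.length);            // 8
console.log(sizes[sizes.length - 1]); // 22285
console.log(7 * 32768 + 22285);       // 251661
```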

That concludes this walkthrough of how to crawl Baidu Tieba data with JavaScript. I hope it clears up your questions; the best way to learn is to try it yourself.

© 2024 shulou.com SLNews company. All rights reserved.