How does Golddata collect data that requires login / session?

2025-03-11 Update From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/03

Summary

This article shows how to use GoldData's semi-automatic login feature to collect data from websites that require login. Semi-automatic login means that the login is performed by a script; when a CAPTCHA or other content must be entered manually, the login process is completed by sending and receiving email.

Download example

For ease of explanation, we use the word data of mydict to demonstrate collecting data from a website that requires login. The mydict sample program can be downloaded from its open source release pages (https://github.com/TheGoldData/mydict/releases or https://gitee.com/golddata/mydict/attach_files).

After downloading, open a command line and run the following command to start the example program.

java -jar mydict.war

After startup, open http://localhost:8080/ in a browser to reach the login page.

Enter the user name and password (both admin) to open the word list home page.

Write the login and session-check scripts

Under "Collection Management", open website administration and click the "add" button to add a site named mydict.

Next, configure the login script and the session-check script: click "set semi-automatic login" to open the site's semi-automatic login configuration page.

The login script is as follows:

// Send an ajax request for the verification code
var va = $ajax('http://localhost:8080/code/vcode?timestamp=1554001708730', {encoding: false});
var arg_ = {label: site.name + "verification code", type: 1, content: va.content};
// The built-in waitForInput function sends an email and waits for input
// (a reply to the email, or input on the GoldData platform),
// and returns the input as the verification code.
var code = waitForInput(arg_);
var data = "username=admin&password=admin&vcode=" + code;
var m = new Map();
m.put('Cookie', va.cookie);
// Send an ajax request to perform the login
var content = $ajax('http://localhost:8080/doLogin', {method: 'POST', headers: m, data: data});
// On success, return status 1 (login successful) and the headers to GoldData;
// otherwise return status 0 (login failed).
if (content.headers) {
  m.putAll(content.headers);
}
var ret = {status: 1, headers: m};
if (content.status != 200) {
  ret.status = 0;
}
ret
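The login script above concatenates the manually entered verification code straight into the POST body. If the entered value could ever contain characters such as & or =, URL-encoding each field may be safer. The following is a minimal sketch in plain JavaScript, assuming the doLogin endpoint accepts a standard form-encoded body; buildLoginBody is a hypothetical helper, not part of GoldData, and whether encodeURIComponent is available depends on the script engine GoldData embeds.

```javascript
// Hypothetical helper (not part of GoldData): builds a form-encoded body,
// URL-encoding every key and value so special characters cannot break it.
function buildLoginBody(fields) {
  return Object.keys(fields)
    .map(function (key) {
      return encodeURIComponent(key) + '=' + encodeURIComponent(String(fields[key]));
    })
    .join('&');
}

// Example: the same three fields the login script sends.
var body = buildLoginBody({ username: 'admin', password: 'admin', vcode: 'qcxe' });
console.log(body);
```

For simple alphanumeric codes such as "qcxe" the result is identical to the string built in the login script; the encoding only matters for input containing reserved characters.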

The check script is as follows:

var ret = true;
if (html.contains("my word-login")) {
  ret = false;
}
ret
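The check script treats a page as "not logged in" when it contains the login page's marker text. The same idea can be tried outside GoldData as a small function. This is a sketch under assumptions: isSessionValid and LOGIN_MARKER are hypothetical names, and indexOf is used because plain JavaScript strings have no contains method (the check script presumably receives html as a Java string).

```javascript
// Assumption: the login page contains this marker text, as in the check script.
var LOGIN_MARKER = 'my word-login';

// Returns true when the fetched page does not look like the login page,
// i.e. the session is presumably still valid.
function isSessionValid(html) {
  return String(html).indexOf(LOGIN_MARKER) === -1;
}

console.log(isSessionValid('<title>my word-login</title>'));
console.log(isSessionValid('<ul><li>apple</li></ul>'));
```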

After the configuration, go back to the website administration page and click "start login" to begin the automatic login. Then click the "query" button to refresh the page, and the site shows the status "waiting for input".

At this point, the notification mailbox you configured should also receive an email. Open the email, or click the "enter waiting for input" button on the page, to view the pending input request.

Following the instructions in the email, reply with "{{qcxe}}" to let the program continue. Entering "qcxe" on the GoldData page has the same effect. Execution then resumes at waitForInput(), which returns the input.

After replying, click "query" on the GoldData page to refresh it, and the login status of mydict changes to "logged in".
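The email reply wraps the input in double braces, while the page input is the bare value. How GoldData actually parses replies is not described here; the sketch below only illustrates the double-brace convention, and extractReplyInput is a hypothetical helper, not a GoldData function.

```javascript
// Hypothetical helper: pulls the first {{...}} value out of a reply body.
// Returns null when the reply contains no double-brace value.
function extractReplyInput(replyText) {
  var match = /\{\{([^{}]*)\}\}/.exec(String(replyText));
  return match ? match[1].trim() : null;
}

console.log(extractReplyInput('Re: mydict verification code\n{{qcxe}}'));
```

Wrapping the value in a delimiter lets it be picked out even when the mail client adds quoted text or a signature around it.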

Next, we can define the crawling rules.

Define crawling rules

Before adding rules, we also need to define a dataset, which is similar to a table structure.

Next, click "Collection Management", choose to add a rule, and open the add rules page.

The script for crawling rules is as follows:

[
  {
    __sample: http://localhost:8080/word/index?pageNum=2
    match0: http\:\/\/localhost\:8080\/word\/index(\?pageNum=\d+)?
    fields0:
    {
      __model: true
      __dataset: word
      __node: "#content ul > li"
      sn:
      {
        expr: ""
        attr: ""
        js: md5(item.name)
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: S1
      }
      name:
      {
        expr: h6
        attr: ""
        js: ""
        __label: ""
        __showOnList: true
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: S1
      }
      uk:
      {
        expr: li span.uk
        attr: ""
        js: source.replace("uk:", '')
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: S1
      }
      us:
      {
        expr: li span.us
        attr: ""
        js: source.replace("us:", '')
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: S1
      }
    }
    fields1:
    {
      __node: .pagination a
      href:
      {
        expr: a
        attr: abs:href
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: S1
      }
    }
  }
]
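Two parts of the rule can be checked in isolation: the match0 URL pattern and the js expression that strips the "uk:" or "us:" prefix. The sketch below rewrites match0 as a JavaScript regular expression; anchoring it with ^ and $ is an assumption about how GoldData applies the pattern, and ruleMatches and stripPrefix are hypothetical helper names.

```javascript
// The match0 pattern from the rule, as a JavaScript regular expression.
// Assumption: the pattern must match the whole URL, so it is anchored here.
var match0 = /^http:\/\/localhost:8080\/word\/index(\?pageNum=\d+)?$/;

// Returns true when the rule would apply to the given URL.
function ruleMatches(url) {
  return match0.test(url);
}

// Same transformation as the js expression of the uk and us fields:
// removes the first occurrence of the prefix from the extracted text.
function stripPrefix(source, prefix) {
  return source.replace(prefix, '');
}

console.log(ruleMatches('http://localhost:8080/word/index?pageNum=2'));
console.log(stripPrefix('uk:abc', 'uk:'));
```

The optional group (\?pageNum=\d+)? is what lets the same rule cover both the first page of the word list and every paginated page reached through the fields1 links.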

Then click "Test" to run a test crawl. The data is captured as expected.

Configure the crawler

As before, set the crawler to crawl the site "mydict" and click to start crawling. The crawled data can then be viewed in data management.

Conclusion

The essence of GoldData semi-automatic login is a framework that allows manual intervention to obtain the session asynchronously. You can call an AI interface to achieve fully automatic login, or send cookie or token information directly to the GoldData platform by email, so that however complex the CAPTCHA is, GoldData can continue collecting data.
