An Example Analysis of Difficult Problems in Web Data Scraping with R

2025-04-02 Update From: SLTechnology News&Howtos

This article walks through an example analysis of web data scraping problems in R. The content is fairly detailed; interested readers can use it as a reference, and I hope it helps.

Purely in terms of the logic of fetching data (leaving aside the frameworks available for engineering use), I personally feel that the existing request libraries in R, RCurl and httr, are fully comparable to urllib and requests in Python (although Python is admittedly more mature in error handling and parsing frameworks).

The web scraping tasks we usually run into come down to two kinds:

either forge browser requests,

or drive a real browser to make the requests.

For forged browser requests, although HTTP defines many request methods, in practice a crawler uses little more than GET and POST.

Driving a browser has almost no learning curve: it is WYSIWYG, and RSelenium/Rwebdriver in R or Selenium in Python can do the job (although configuration is more troublesome).
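As a rough illustration of the browser-driving approach, the sketch below wraps an RSelenium session in a function. It assumes a Selenium server is already running on localhost port 4444 and that the RSelenium package is installed; the function name fetch_rendered_page and its defaults are hypothetical and not taken from the original article.

```r
# Sketch only: assumes a Selenium server is listening on localhost:4444
# and that the RSelenium package is installed.
# fetch_rendered_page is a hypothetical helper name.
fetch_rendered_page <- function(url, port = 4444L, browser = "chrome") {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = "localhost",
    port = port,
    browserName = browser
  )
  remDr$open(silent = TRUE)            # start a browser session
  on.exit(remDr$close(), add = TRUE)   # close the session when the function exits
  remDr$navigate(url)                  # load the page as a real browser would
  remDr$getPageSource()[[1]]           # HTML after JavaScript has run
}
```

Calling fetch_rendered_page() with a page URL would return the rendered HTML as a single string, which can then be parsed like any other response.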

Data visualization of crawler

GET request parameters may be written directly into the URL, but when there are many parameters, spelling out the URL by hand is inelegant, so both RCurl and httr provide alternative ways to submit GET parameters. In RCurl, getURL() is usually used for GET requests without parameters (or with the parameters already spelled into the URL), while getForm() is usually used for GET requests with parameters, which are passed in the .params argument.
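The two RCurl styles can be put side by side in a minimal sketch. The endpoint and parameter names below are placeholders chosen for illustration, and the network calls are wrapped in functions so that nothing is sent until they are invoked.

```r
library("RCurl")

# Placeholder endpoint and parameters, for illustration only.
base_url <- "https://httpbin.org/get"
params   <- list(q = "R language", page = "1")

# Style 1: spell the parameters into the URL by hand, then use getURL().
query    <- paste(names(params), curlEscape(unlist(params)),
                  sep = "=", collapse = "&")
full_url <- paste0(base_url, "?", query)

fetch_with_getURL <- function() {
  getURL(full_url)
}

# Style 2: hand the named parameters to getForm() via .params
# and let RCurl build the query string.
fetch_with_getForm <- function() {
  getForm(base_url, .params = params)
}
```

Both functions should request the same resource; the second keeps the parameters as data rather than as part of a hand-built string, which is easier to maintain once there are many of them.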

Python series: scraping interesting live courses in practice

R crawler in practice: scraping Zhihu live course data

The GET function in httr likewise performs GET requests; request parameters are passed through its query argument (or, optionally, written directly into the URL).
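A minimal httr counterpart might look like the following. The endpoint and parameters are again placeholders; modify_url() is used only to preview a URL composed from the same named list without touching the network, and the actual request sits inside a function.

```r
library("httr")

# Placeholder endpoint and parameters, for illustration only.
base_url <- "https://httpbin.org/get"
params   <- list(q = "R language", page = "1")

# Preview the URL that results from attaching the query list (no network access).
preview_url <- modify_url(base_url, query = params)

# The request itself: parameters go in `query`, not in the URL string.
fetch_with_httr <- function() {
  resp <- GET(base_url, query = params)
  stop_for_status(resp)                            # raise an R error on HTTP 4xx/5xx
  content(resp, as = "text", encoding = "UTF-8")   # response body as a string
}
```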

POST is the usual request method for APIs (although some APIs are also served over GET). POST requests are more complex: their parameters must be carried in the request body, and the encoding of that body must be declared (via Content-Type in the request header) before the parameters are sent.

There are four common encodings:

application/x-www-form-urlencoded

application/json

multipart/form-data

text/xml
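To make the difference between the first two encodings concrete, the sketch below serializes the same hypothetical parameters both ways. The parameter names and values are invented for illustration; only base R and jsonlite are used, and nothing is sent over the network.

```r
library("jsonlite")

# Hypothetical parameters, for illustration only.
params <- list(keyword = "data analysis", page = 1)

# application/x-www-form-urlencoded: key=value pairs joined by "&",
# with reserved characters percent-encoded.
form_body <- paste(
  names(params),
  vapply(params,
         function(v) URLencode(as.character(v), reserved = TRUE),
         character(1)),
  sep = "=", collapse = "&"
)

# application/json: the same parameters serialized as one JSON object.
json_body <- as.character(toJSON(params, auto_unbox = TRUE))

cat(form_body, "\n")
cat(json_body, "\n")
```

The server needs the Content-Type header to know which of these two shapes to expect, which is why the encoding has to be declared alongside the body.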

To understand these four encodings in depth, see the two articles below, or consult material dedicated to the HTTP protocol and browsers.

http://www.cnblogs.com/111testing/p/6079565.html

https://bbs.125.la/thread-13743350-1-1.html

Of these four encodings, I have only used the first two in practice; the third is needed for file uploads, which I have not run into, and the fourth is rare. The POST function in the RCurl package, postForm, only declares the first and third explicitly, through its style argument (style = "POST" and style = "HTTPPOST" respectively); the second and fourth are not listed as style options. httr is much friendlier in how it handles parameters and lets you specify the common encodings directly.
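The contrast can be sketched as follows. The endpoint, parameter names, and wrapper function names are hypothetical; the calls are wrapped in functions so that no request is sent when the block is loaded. In httr the encoding is chosen with the encode argument of POST(), while a raw XML string is sent as-is with an explicit content type.

```r
library("httr")

# Placeholder endpoint and parameters, for illustration only.
api_url <- "https://httpbin.org/post"
params  <- list(keyword = "data analysis", page = "1")

# RCurl: only the two styles below are offered by postForm().
rcurl_post_urlencoded <- function() {
  RCurl::postForm(api_url, .params = params, style = "POST")      # urlencoded
}
rcurl_post_multipart <- function() {
  RCurl::postForm(api_url, .params = params, style = "HTTPPOST")  # multipart/form-data
}

# httr: the encoding is a single argument.
httr_post_form <- function() {
  POST(api_url, body = params, encode = "form")
}
httr_post_json <- function() {
  POST(api_url, body = params, encode = "json")
}
httr_post_multipart <- function(path) {
  POST(api_url, body = list(file = upload_file(path)), encode = "multipart")
}
httr_post_xml <- function(xml_string) {
  POST(api_url, body = xml_string, content_type("text/xml"))
}
```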

Left hand R, right hand Python series: simulating login to an educational administration system

R crawler in practice: scraping the data analysis course section of NetEase Cloud Classroom

Given how common it is for today's web front ends to use JSON as the payload format of their APIs, this problem had been bothering me for a long time. I even once thought that the POST method in RCurl did not support sending JSON parameters at all. That seemed unlikely, though: RCurl binds directly to libcurl, the general-purpose C library for URL transfers, and httr was originally built on top of RCurl, so anything httr can do, RCurl should naturally be able to do as well.

The author must have tucked away the way to send JSON parameters somewhere, or had not yet had time to wrap it in a high-level function and left it at the lower level; otherwise it could not be explained. Then today, while browsing an essay written by an expert on LinkedIn, inspiration struck; I tried it right away, and it worked! This confirmed my earlier guess: when RCurl first appeared, JSON had probably not yet become mainstream, so JSON parameter passing was never given an explicit place among the style options of the POST method. The httr package, by contrast, neatly declares how each kind of POST parameter is encoded (Hadley is one step ahead of humanity).

http://www.linkedin.com/pulse/web-data-acquisition-structure-rcurl-request-part-2-roberto-palloni

The purpose of this article is to share a worked case, with code, of constructing a POST request with the RCurl package and submitting a JSON string as its parameters. Compared with httr, RCurl sits at a lower level, with many functions that are cumbersome to use; httr is more nimble, portable, and concise. The relationship is very similar to that between urllib and requests in Python.

Build headers and query parameters:

library("RCurl")
library("jsonlite")
library("magrittr")
headers
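The listing above breaks off where the headers are defined. As a rough sketch of the kind of request the article describes, and not the author's original code, one way to send a JSON body with RCurl is to set the Content-Type header yourself and pass the serialized string through libcurl's postfields option. The endpoint, the payload fields, and the function name post_json_with_rcurl are all assumptions made for illustration.

```r
library("RCurl")
library("jsonlite")
library("magrittr")

# Placeholder endpoint, for illustration only.
api_url <- "https://httpbin.org/post"

# Declare the body encoding explicitly, since postForm() has no JSON style.
headers <- c(
  "Content-Type" = "application/json",
  "Accept"       = "application/json"
)

# Hypothetical parameters, serialized to a single JSON string.
payload <- list(keyword = "data analysis", page = 1) %>%
  toJSON(auto_unbox = TRUE) %>%
  as.character()

# Setting `postfields` makes libcurl issue a POST with that string as the body.
post_json_with_rcurl <- function(url = api_url) {
  reader <- basicTextGatherer()        # collects the response text
  curlPerform(
    url           = url,
    httpheader    = headers,
    postfields    = payload,
    writefunction = reader$update
  )
  reader$value()                       # raw response body as a string
}
```

If the server replies with JSON, the string returned by post_json_with_rcurl() can be passed to fromJSON() for parsing. With httr the same request would reduce to a single POST() call with encode = "json".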
