How to Implement a Crawler Program in C#


This article introduces how to implement a crawler program in C#. The content is detailed and easy to follow, and the steps are simple and quick, so it should have some practical reference value. I hope you will get something out of it; let's take a look.

Figure 1

As shown in figure 1, whether it is a platform website or a corporate website, there is usually a news section. One day the product manager told us that the promotion team wanted to grab the hot-news section of Baidu News to improve the site's ranking on Baidu. To crawl the Baidu hot-news section, we first need to understand the request headers (Request headers) of the site https://news.baidu.com/.

Why do we need to know the request header (Request headers) information?

Because, using the request header information, we can disguise the crawler as a normal HTTP request rather than an obvious man-made crawler, which helps evade blocking by the site and lets us successfully obtain the response data (Response data).

How do we view the request header information of the Baidu News URL?

Figure 2

As shown in figure 2, we can open the developer tools of Google Chrome or another browser (press F12) to view the request headers for this site. You can see from the figure that Baidu News accepts data types such as text/html, the language is Chinese, and the browser identifies itself as Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36, among other header fields. These are carried along when we initiate an HTTP request. Of course, not every header field has to be carried; sending only part of them can already make the request succeed.
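Purely as an illustration, here is a minimal sketch of how such captured values can be carried on an HttpWebRequest (the Accept-Language value is my assumption; the other values come from figure 2, and the full request method later in the article sets the same fields through its options parameter):

// requires "using System.Net;"
var request = (HttpWebRequest)WebRequest.Create("https://news.baidu.com/");
request.Accept = "text/html";                                             // accepted data type seen in the browser request
request.Headers.Add(HttpRequestHeader.AcceptLanguage, "zh-CN,zh;q=0.9");  // language: Chinese (exact value assumed)
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"; // browser string from figure 2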

So what is response data (Response data)?

Figure 3

As shown in figure 3, the response data (Response data) can also be viewed in the developer tools of Google Chrome or another browser (press F12). The response may be json data or DOM-tree (HTML) data, and knowing which one it is makes it easier for us to parse the data later.

Of course, a crawler can be developed in any language: C#, NodeJs, Python, Java, C++.

This article, however, focuses on developing a crawler in C#. Microsoft provides the HttpWebRequest and HttpWebResponse objects for HTTP requests, so that we can send a request and get the data back. The C# HTTP request code is shown below:

private string RequestAction(RequestOptions options)
{
    string result = string.Empty;
    IWebProxy proxy = GetProxy();
    var request = (HttpWebRequest)WebRequest.Create(options.Uri);
    request.Accept = options.Accept;
    // When curl sends a POST whose body is larger than 1024 bytes, it does not POST directly. Instead it splits the
    // request into two steps: it first sends a request carrying "Expect: 100-continue" to ask whether the server
    // will accept the data, and only POSTs the body after receiving "100-continue". Not every server answers
    // 100-continue correctly; lighttpd, for example, returns 417 "Expectation Failed", which breaks the logic,
    // so the behavior is disabled here.
    request.ServicePoint.Expect100Continue = false;
    request.ServicePoint.UseNagleAlgorithm = false;            // disable the Nagle algorithm to speed up loading
    if (!string.IsNullOrEmpty(options.XHRParams))
    {
        request.AllowWriteStreamBuffering = true;
    }
    else
    {
        request.AllowWriteStreamBuffering = false;             // disable buffering to speed up loading
    }
    request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate"); // declare support for gzip-compressed pages
    request.ContentType = options.ContentType;                 // document type and encoding
    request.AllowAutoRedirect = options.AllowAutoRedirect;     // whether to follow redirects automatically
    request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"; // set User-Agent, disguised as Google Chrome
    request.Timeout = options.Timeout;                         // request timeout in milliseconds
    request.KeepAlive = options.KeepAlive;                     // enable persistent connection
    if (!string.IsNullOrEmpty(options.Referer))
        request.Referer = options.Referer;                     // referer of the previous page
    request.Method = options.Method;                           // request method, GET or POST
    if (proxy != null)
        request.Proxy = proxy;                                 // set the proxy server, disguising the request address
    if (!string.IsNullOrEmpty(options.RequestCookies))
        request.Headers[HttpRequestHeader.Cookie] = options.RequestCookies;
    request.ServicePoint.ConnectionLimit = options.ConnectionLimit; // maximum number of connections
    if (options.WebHeader != null && options.WebHeader.Count > 0)
        request.Headers.Add(options.WebHeader);                // add extra header information
    if (!string.IsNullOrEmpty(options.XHRParams))              // if it is a POST request, write the POST data
    {
        byte[] buffer = Encoding.UTF8.GetBytes(options.XHRParams);
        if (buffer != null)
        {
            request.ContentLength = buffer.Length;
            request.GetRequestStream().Write(buffer, 0, buffer.Length);
        }
    }
    using (var response = (HttpWebResponse)request.GetResponse()) // get the response
    {
        //foreach (Cookie cookie in response.Cookies)
        //    options.CookiesContainer.Add(cookie);             // add cookies to the container to keep the login state
        if (response.ContentEncoding.ToLower().Contains("gzip"))          // gzip decompression
        {
            using (GZipStream stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress))
            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
            {
                result = reader.ReadToEnd();
            }
        }
        else if (response.ContentEncoding.ToLower().Contains("deflate"))  // deflate decompression
        {
            using (DeflateStream stream = new DeflateStream(response.GetResponseStream(), CompressionMode.Decompress))
            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
            {
                result = reader.ReadToEnd();
            }
        }
        else                                                               // uncompressed response
        {
            using (Stream stream = response.GetResponseStream())
            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
            {
                result = reader.ReadToEnd();
            }
        }
    }
    request.Abort();
    return result;
}

There is also a custom object for passing parameters. Of course, both the incoming and outgoing objects should be defined according to your actual business needs:

public class RequestOptions
{
    // request method, GET or POST
    public string Method { get; set; }
    // request URL
    public Uri Uri { get; set; }
    // referer of the previous page
    public string Referer { get; set; }
    // timeout in milliseconds
    public int Timeout = 15000;
    // enable persistent connection
    public bool KeepAlive = true;
    // whether to follow redirects automatically
    public bool AllowAutoRedirect = false;
    // maximum number of connections
    public int ConnectionLimit = int.MaxValue;
    // number of request attempts
    public int RequestNum = 3;
    // accepted data types
    public string Accept = "*/*";
    // content type of the request body
    public string ContentType = "application/x-www-form-urlencoded";
    // backing field for the header collection
    private WebHeaderCollection header = new WebHeaderCollection();
    // extra header information
    public WebHeaderCollection WebHeader
    {
        get { return header; }
        set { header = value; }
    }
    // request Cookie string
    public string RequestCookies { get; set; }
    // asynchronous (POST/XHR) parameter data
    public string XHRParams { get; set; }
}

As the code shows, many of the Request headers parameters are already encapsulated in the HttpWebRequest object. We can set them on the HttpWebRequest object provided by Microsoft according to the Request headers of the target website (see the comments on the parameters in the code; if I have misunderstood anything, please let me know, thank you), then send the request and parse the Response data that comes back.
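A minimal usage sketch of how the two pieces above might be wired together (this call is my assumption and is not shown verbatim in the article; RequestAction is private, so it would be called from inside the same class):

var options = new RequestOptions
{
    Uri = new Uri("https://news.baidu.com/"),   // target page
    Method = "GET",                             // request method
    Accept = "text/html",                       // accepted data type
    Timeout = 5000                              // timeout in milliseconds
};
string html = RequestAction(options);           // returns the decompressed response body
Console.WriteLine(html.Substring(0, Math.Min(200, html.Length)));  // preview the first part of the page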

In addition, it is better for a crawler to use proxy IPs; this reduces the probability of being blocked and improves crawling efficiency. Proxy IPs, however, come in different quality levels, and some HTTPS sites may require a higher-quality proxy IP to get through. I will not go off topic here; I will write a separate article explaining my views on proxy IP quality levels in detail.

How does the C# code use a proxy IP?

The Microsoft .NET Framework also provides the System.Net.WebProxy object for using a proxy IP. The code is as follows:

private System.Net.WebProxy GetProxy()
{
    System.Net.WebProxy webProxy = null;
    try
    {
        // proxy address and port
        string proxyHost = "192.168.1.1";
        string proxyPort = "9030";
        // account and password for proxy authentication
        //string proxyUser = "xxx";
        //string proxyPass = "xxx";
        // create the proxy server object
        webProxy = new System.Net.WebProxy();
        // set the proxy address and port
        webProxy.Address = new Uri(string.Format("http://{0}:{1}", proxyHost, proxyPort));
        // if the proxy only needs an IP and port (for example 192.168.1.1:80), leave the credentials line below
        // commented out; uncomment it only when the proxy server requires an account and password for authentication
        //webProxy.Credentials = new System.Net.NetworkCredential(proxyUser, proxyPass);
    }
    catch (Exception ex)
    {
        Console.WriteLine("get proxy information exception: {0}, {1}", DateTime.Now.ToString(), ex.Message);
    }
    return webProxy;
}

The parameters of the System.Net.WebProxy object are also explained in the comments in the code.
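As a side note that is not from the original article: besides being attached per request (request.Proxy = proxy), the same proxy can also be applied process-wide through the static WebRequest.DefaultWebProxy property, which is convenient when every request should go through it:

// Assumption: reuse the GetProxy() helper above and apply it to all subsequent WebRequest instances.
System.Net.WebRequest.DefaultWebProxy = GetProxy();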

If the Response data you get back is in json, xml, or another structured format, I will not go into the details of parsing it here; please try it yourself (a small json sketch is added after the HTML parsing code below). This article mainly covers parsing DOM-tree HTML data. Some people parse this kind of data with regular expressions, others use components; of course, as long as you can get the data you want, any approach works. What I mainly want to show is the parsing component I often use, HtmlAgilityPack (referenced with using HtmlAgilityPack). The parsing code is as follows:

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(simpleCrawlResult.Contents);
HtmlNodeCollection liNodes = htmlDoc.DocumentNode
    .SelectSingleNode("//div[@id='pane-news']")
    .SelectSingleNode("div[1]/ul[1]")
    .SelectNodes("li");
if (liNodes != null && liNodes.Count > 0)
{
    for (int i = 0; i < liNodes.Count; i++)
    {
        string title = liNodes[i].SelectSingleNode("strong[1]/a[1]").InnerText.Trim();
        string href = liNodes[i].SelectSingleNode("strong[1]/a[1]").GetAttributeValue("href", "");
        Console.WriteLine("News title: " + title + ", link: " + href);
    }
}
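As mentioned above, json parsing is left to the reader; purely as an illustrative sketch (the article does not name a json library, so Newtonsoft.Json and the field name here are my assumptions), a json response could be parsed like this:

// requires the Newtonsoft.Json package and "using Newtonsoft.Json.Linq;"
JObject obj = JObject.Parse(jsonResult);   // jsonResult: a json string returned by RequestAction
string title = (string)obj["title"];       // "title" is a hypothetical field name
Console.WriteLine(title);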

Finally, the crawling results are shown below.

Figure 4

As shown in figure 4, that is the crawling result; a simple crawler program is completed just like this.

This is the end of the article on how to implement a crawler program in C#. Thank you for reading! I hope you now have a basic understanding of how to implement a crawler program. If you want to learn more, you are welcome to follow the industry information channel.
