This article focuses on "how to use .NET Core to implement a simple crawler"; interested readers may wish to take a look. The method introduced here is simple, fast and practical. Let's learn how to use .NET Core to implement a simple crawler.
I. Introducing an Http request framework: HttpCode.Core
HttpCode.Core is derived from HttpCode. The difference is that HttpCode.Core targets .NET Standard 2.0: it removes the coupling between HttpCode and Windows-specific APIs and reworks the asynchronous implementation; the rest of the features are exactly the same as in HttpCode. If you run into any problems while using it, you can check the online documentation.
HttpCode.Core is completely open source and has been uploaded to github, address: https://github.com/stulzq/HttpCode.Core
To make it easy to use, the package has also been published to NuGet: https://www.nuget.org/packages/HttpCode.Core/. Search for HttpCode.Core on NuGet or execute the command Install-Package HttpCode.Core to install it.
For specific usage, you can consult the online documentation, or check github.
A simple, easy-to-use and efficient open source .net Http request framework!
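To get a feel for the API before we start, here is a minimal, hedged sketch of a plain GET request. The class and member names (HttpHelpers, HttpItems, HttpResults, Url, Method, GetHtml, Html) are the same ones used in the POST example later in this article; treating "Get" as a valid value for Method is an assumption you should verify against the documentation.
// a minimal GET sketch with HttpCode.Core
// assumption: Method accepts "Get" the same way it accepts "Post" (check the docs)
HttpHelpers http = new HttpHelpers();
HttpItems req = new HttpItems();
req.Url = "https://www.cnblogs.com/"; // any page you want to fetch
req.Method = "Get";                   // assumed value, see the documentation
HttpResults res = http.GetHtml(req);
Console.WriteLine(res.Html);          // raw html of the response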
II. Analyze the crawl address
First, use the Chrome developer tools to capture the requests made by the cnblogs home page and find the address that returns the list of blog posts:
From the captured request we can determine:
1. Request address: https://www.cnblogs.com/mvc/AggSite/PostList.aspx
2. Request method: Post
3. Request data:
{
    "CategoryType": "SiteHome",
    "ParentCategoryId": 0,
    "CategoryId": 808,
    "PageIndex": 3,
    "TotalPostCount": 4000,
    "ItemListActionName": "PostList"
}
In the request data, the parameter we care about is PageIndex, which is the page number; by changing its value we can get the data of different pages.
Let's first try to get the data using HttpCode.Core:
int pageIndex = 1; // page number
HttpHelpers httpHelpers = new HttpHelpers();
HttpItems items = new HttpItems();
items.Url = "https://www.cnblogs.com/mvc/AggSite/PostList.aspx"; // request address
items.Method = "Post"; // request method: post
items.Postdata = "{\"CategoryType\":\"SiteHome\"," +
    "\"ParentCategoryId\":0," +
    "\"CategoryId\":808," +
    "\"PageIndex\":" + pageIndex + "," +
    "\"TotalPostCount\":4000," +
    "\"ItemListActionName\":\"PostList\"}"; // request data
HttpResults hr = httpHelpers.GetHtml(items);
Console.WriteLine(hr.Html);
Console.ReadKey();
Screenshot of the running result:
We can see that we have successfully obtained the data, which proves that our analysis is correct.
III. Parse the returned data
The test call we just made shows that the interface returns a chunk of html. We only want the title, author, address and similar information for each blog post and have no use for the surrounding html, so let's use HtmlAgilityPack, a component for parsing web pages, to extract the data we want.
There are already plenty of posts on cnblogs about how to use this component, which you can search for and read. Using it requires some knowledge of XPath, so I won't describe that in detail here; a minimal warm-up sketch follows right after the installation step below.
1. First install the HtmlAgilityPack component through NuGet
Open the Package Manager Console
Execute the command Install-Package HtmlAgilityPack -Version 1.5.2-beta6
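Before parsing the real response, here is a minimal sketch of how HtmlAgilityPack and XPath work together. The html string below is made up purely for illustration; the API calls (HtmlDocument, LoadHtml, SelectSingleNode, InnerText, GetAttributeValue) are the same ones used in the parsing code later in this article.
// requires: using HtmlAgilityPack;
// a made-up html fragment that mimics the structure we are about to analyze
string sampleHtml = "<div class='post_item'><div class='post_item_body'>" +
    "<h4><a href='https://example.com/post/1'>Hello</a></h4></div></div>";
HtmlDocument sample = new HtmlDocument();
sample.LoadHtml(sampleHtml); // load the html string
// select the <a> under <h4> inside the post body via xpath
HtmlNode link = sample.DocumentNode.SelectSingleNode("//div[@class='post_item_body']/h4/a");
Console.WriteLine(link.InnerText);                  // prints: Hello
Console.WriteLine(link.GetAttributeValue("href", "")); // prints: https://example.com/post/1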
2. Parse the returned data
Here is part of the returned data (one post item, shown here with its html tags stripped):
0
Realize the rights management of the website (September 3, 2017)
Every enterprise management website now has to manage the permissions of its login accounts, and this matters a lot: what each account can see differs greatly. Here is one way to implement this feature. Requirements: a permission is the ability to use a functional module of the system, such as the "role management", "tariff management" and "bill management" modules; by assigning permissions, a user's actions can be limited to a specified scope.
Loseheart
Posted at 21:34 on 2017-09-03
Comments (0) Read (354)
It is not difficult to see that each post item is wrapped in a div with class=post_item, and the blog address and title we want live in the div with class=post_item_body inside it, and so on:
Blog title: h4 -> a -> inner text
Blog address: h4 -> a -> href attribute
... and so on
Because HtmlAgilityPack locates nodes with XPath, we now have to write XPath expressions for the paths analyzed above. If you do not know XPath, you can learn it on w3cschool; it is very simple.
The following is the code I wrote to parse the title, address and author of each blog post; for the other fields you can follow the same pattern and try it yourself:
// parse the data
HtmlDocument doc = new HtmlDocument();
// load the html
doc.LoadHtml(hr.Html);
// get the list of class=post_item_body divs
HtmlNodeCollection itemNodes = doc.DocumentNode.SelectNodes("div[@class='post_item']/div[@class='post_item_body']");
// loop over each div and parse the data we want
foreach (var item in itemNodes)
{
    // get the a tag that contains the title and address of the blog post
    var nodeA = item.SelectSingleNode("h4/a");
    // get the title of the blog post
    string title = nodeA.InnerText;
    // get the href attribute of the a tag, i.e. the blog address
    string url = nodeA.GetAttributeValue("href", "");
    // get the a tag that contains the author's name
    var nodeAuthor = item.SelectSingleNode("div[@class='post_item_foot']/a[@class='lightblue']");
    string author = nodeAuthor.InnerText;
    Console.WriteLine($"title: {title} | author: {author} | address: {url}");
}
Screenshot of the running result:
IV. Loop to crawl multiple pages
Earlier we determined that PageIndex in the request parameters is the page number, and we have already written the code for parsing a single page, so we can increase the page number in a loop to crawl the data of different pages.
Here is the complete code:
int maxPageIndex = 10; // maximum number of pages
HttpHelpers httpHelpers = new HttpHelpers();
for (int i = 0; i < maxPageIndex; i++)
{
    HttpItems items = new HttpItems();
    items.Url = "https://www.cnblogs.com/mvc/AggSite/PostList.aspx"; // request address
    items.Method = "Post"; // request method: post
    items.Postdata = "{\"CategoryType\":\"SiteHome\"," +
        "\"ParentCategoryId\":0," +
        "\"CategoryId\":808," +
        "\"PageIndex\":" + (i + 1) + "," + // i starts at 0, so add 1 here
        "\"TotalPostCount\":4000," +
        "\"ItemListActionName\":\"PostList\"}"; // request data
    HttpResults hr = httpHelpers.GetHtml(items);
    // parse the data
    HtmlDocument doc = new HtmlDocument();
    // load the html
    doc.LoadHtml(hr.Html);
    // get the list of class=post_item_body divs
    HtmlNodeCollection itemNodes = doc.DocumentNode.SelectNodes("div[@class='post_item']/div[@class='post_item_body']");
    Console.WriteLine($"Page {i + 1} data:");
    // loop over each div and parse the data we want
    foreach (var item in itemNodes)
    {
        // get the a tag that contains the title and address of the blog post
        var nodeA = item.SelectSingleNode("h4/a");
        // get the title of the blog post
        string title = nodeA.InnerText;
        // get the href attribute of the a tag, i.e. the blog address
        string url = nodeA.GetAttributeValue("href", "");
        // get the a tag that contains the author's name
        var nodeAuthor = item.SelectSingleNode("div[@class='post_item_foot']/a[@class='lightblue']");
        string author = nodeAuthor.InnerText;
        // output the data
        Console.WriteLine($"title: {title} | author: {author} | address: {url}");
    }
    // pause for three seconds after each page is crawled
    Thread.Sleep(3000);
}
Console.ReadKey();
Screenshot of the running result:
A simple crawler implemented with .NET Core is complete!
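As a side note, this article uses HttpCode.Core for the HTTP layer, but the same request can also be issued with the HttpClient class built into .NET Core. The following is only a hedged alternative sketch for comparison, not the method used above; it assumes C# 7.1+ for the async Main entry point and that the endpoint accepts an application/json body, which matches the request data captured earlier.
// alternative sketch: the same POST request with the built-in HttpClient
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        using (HttpClient client = new HttpClient())
        {
            string body = "{\"CategoryType\":\"SiteHome\",\"ParentCategoryId\":0,\"CategoryId\":808," +
                "\"PageIndex\":1,\"TotalPostCount\":4000,\"ItemListActionName\":\"PostList\"}";
            // content type is assumed to be application/json, as seen in the captured request
            StringContent content = new StringContent(body, Encoding.UTF8, "application/json");
            HttpResponseMessage resp = await client.PostAsync("https://www.cnblogs.com/mvc/AggSite/PostList.aspx", content);
            string html = await resp.Content.ReadAsStringAsync();
            Console.WriteLine(html); // the html fragment can then be fed to HtmlAgilityPack as above
        }
    }
}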
At this point, I believe you have a deeper understanding of how to use .NET Core to implement a simple crawler. You might as well try it out in practice.