This article focuses on "how to use .NET Core to implement a simple crawler"; interested readers may wish to take a look. The method introduced here is simple, fast and practical. Let's learn how to use .NET Core to implement a simple crawler.
I. Introducing an Http request framework: HttpCode.Core
HttpCode.Core is derived from HttpCode. The difference is that HttpCode.Core targets .NET Standard 2.0: it removes the coupling between HttpCode and Windows-specific APIs and reworks the asynchronous implementation; the rest of the features are exactly the same as in HttpCode. If you run into any problems while using it, you can check the online documentation.
HttpCode.Core is completely open source and has been uploaded to github, address: https://github.com/stulzq/HttpCode.Core
To make it easy to use, the package has also been published to NuGet: https://www.nuget.org/packages/HttpCode.Core/. Search for HttpCode.Core on NuGet or execute the command Install-Package HttpCode.Core to install it.
For specific usage, you can consult the online documentation, or check github.
A simple, easy-to-use and efficient open source .net Http request framework!
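To get a feel for the API before we start, here is a minimal, hedged sketch of a plain GET request. The class and member names (HttpHelpers, HttpItems, HttpResults, Url, Method, GetHtml, Html) are the same ones used in the POST example later in this article; treating "Get" as a valid value for Method is an assumption you should verify against the documentation.
// a minimal GET sketch with HttpCode.Core
// assumption: Method accepts "Get" the same way it accepts "Post" (check the docs)
HttpHelpers http = new HttpHelpers();
HttpItems req = new HttpItems();
req.Url = "https://www.cnblogs.com/"; // any page you want to fetch
req.Method = "Get";                   // assumed value, see the documentation
HttpResults res = http.GetHtml(req);
Console.WriteLine(res.Html);          // raw html of the response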
II. Analyze the crawl address
First, use the Chrome developer tools to capture the requests made by the cnblogs home page and find the address that returns the list of blog posts:
From the captured request we can determine:
1. Request address: https://www.cnblogs.com/mvc/AggSite/PostList.aspx
2. Request method: Post
3. Request data:
{
    "CategoryType": "SiteHome",
    "ParentCategoryId": 0,
    "CategoryId": 808,
    "PageIndex": 3,
    "TotalPostCount": 4000,
    "ItemListActionName": "PostList"
}
In the request data, the parameter we care about is PageIndex, which is the page number; by changing its value we can get the data of different pages.
Let's first try to get the data using HttpCode.Core:
int pageIndex = 1; // page number
HttpHelpers httpHelpers = new HttpHelpers();
HttpItems items = new HttpItems();
items.Url = "https://www.cnblogs.com/mvc/AggSite/PostList.aspx"; // request address
items.Method = "Post"; // request method: post
items.Postdata = "{\"CategoryType\":\"SiteHome\"," +
    "\"ParentCategoryId\":0," +
    "\"CategoryId\":808," +
    "\"PageIndex\":" + pageIndex + "," +
    "\"TotalPostCount\":4000," +
    "\"ItemListActionName\":\"PostList\"}"; // request data
HttpResults hr = httpHelpers.GetHtml(items);
Console.WriteLine(hr.Html);
Console.ReadKey();
Screenshot of the running result:
We can see that we have successfully obtained the data, which proves that our analysis is correct.
III. Parse the returned data
The test call we just made shows that the interface returns a chunk of html. We only want the title, author, address and similar information for each blog post and have no use for the surrounding html, so let's use HtmlAgilityPack, a component for parsing web pages, to extract the data we want.
There are already plenty of posts on cnblogs about how to use this component, which you can search for and read. Using it requires some knowledge of XPath, so I won't describe that in detail here; a minimal warm-up sketch follows right after the installation step below.
1. First install the HtmlAgilityPack component through NuGet
Open the Package Manager Console
Execute the command Install-Package HtmlAgilityPack -Version 1.5.2-beta6
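Before parsing the real response, here is a minimal sketch of how HtmlAgilityPack and XPath work together. The html string below is made up purely for illustration; the API calls (HtmlDocument, LoadHtml, SelectSingleNode, InnerText, GetAttributeValue) are the same ones used in the parsing code later in this article.
// requires: using HtmlAgilityPack;
// a made-up html fragment that mimics the structure we are about to analyze
string sampleHtml = "<div class='post_item'><div class='post_item_body'>" +
    "<h4><a href='https://example.com/post/1'>Hello</a></h4></div></div>";
HtmlDocument sample = new HtmlDocument();
sample.LoadHtml(sampleHtml); // load the html string
// select the <a> under <h4> inside the post body via xpath
HtmlNode link = sample.DocumentNode.SelectSingleNode("//div[@class='post_item_body']/h4/a");
Console.WriteLine(link.InnerText);                  // prints: Hello
Console.WriteLine(link.GetAttributeValue("href", "")); // prints: https://example.com/post/1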
2. Parse the returned data
Here is part of the returned data (one post item, shown here with its html tags stripped):
0
Realize the rights management of the website (September 3, 2017)
Every enterprise management website now has to manage the permissions of its login accounts, and this matters a lot: what each account can see differs greatly. Here is one way to implement this feature. Requirements: a permission is the ability to use a functional module of the system, such as the "role management", "tariff management" and "bill management" modules; by assigning permissions, a user's actions can be limited to a specified scope.
Loseheart
Posted at 21:34 on 2017-09-03
Comments (0) Read (354)
It is not difficult to see that each post item is wrapped in a div with class=post_item, and the blog address and title we want live in the div with class=post_item_body inside it, and so on:
Blog title: h4 -> a -> inner text
Blog address: h4 -> a -> href attribute
... and so on
Because HtmlAgilityPack locates nodes with XPath, we now have to write XPath expressions for the paths analyzed above. If you do not know XPath, you can learn it on w3cschool; it is very simple.
The following is the code I wrote to parse the title, address and author of each blog post; for the other fields you can follow the same pattern and try it yourself:
// parse the data
HtmlDocument doc = new HtmlDocument();
// load the html
doc.LoadHtml(hr.Html);
// get the list of class=post_item_body divs
HtmlNodeCollection itemNodes = doc.DocumentNode.SelectNodes("div[@class='post_item']/div[@class='post_item_body']");
// loop over each div and parse the data we want
foreach (var item in itemNodes)
{
    // get the a tag that contains the title and address of the blog post
    var nodeA = item.SelectSingleNode("h4/a");
    // get the title of the blog post
    string title = nodeA.InnerText;
    // get the href attribute of the a tag, i.e. the blog address
    string url = nodeA.GetAttributeValue("href", "");
    // get the a tag that contains the author's name
    var nodeAuthor = item.SelectSingleNode("div[@class='post_item_foot']/a[@class='lightblue']");
    string author = nodeAuthor.InnerText;
    Console.WriteLine($"title: {title} | author: {author} | address: {url}");
}
Screenshot of the running result:
IV. Loop to crawl multiple pages
Earlier we determined that PageIndex in the request parameters is the page number, and we have already written the code for parsing a single page, so we can increase the page number in a loop to crawl the data of different pages.
Here is the complete code:
int maxPageIndex = 10; // maximum number of pages
HttpHelpers httpHelpers = new HttpHelpers();
for (int i = 0; i < maxPageIndex; i++)
{
    HttpItems items = new HttpItems();
    items.Url = "https://www.cnblogs.com/mvc/AggSite/PostList.aspx"; // request address
    items.Method = "Post"; // request method: post
    items.Postdata = "{\"CategoryType\":\"SiteHome\"," +
        "\"ParentCategoryId\":0," +
        "\"CategoryId\":808," +
        "\"PageIndex\":" + (i + 1) + "," + // i starts at 0, so add 1 here
        "\"TotalPostCount\":4000," +
        "\"ItemListActionName\":\"PostList\"}"; // request data
    HttpResults hr = httpHelpers.GetHtml(items);
    // parse the data
    HtmlDocument doc = new HtmlDocument();
    // load the html
    doc.LoadHtml(hr.Html);
    // get the list of class=post_item_body divs
    HtmlNodeCollection itemNodes = doc.DocumentNode.SelectNodes("div[@class='post_item']/div[@class='post_item_body']");
    Console.WriteLine($"Page {i + 1} data:");
    // loop over each div and parse the data we want
    foreach (var item in itemNodes)
    {
        // get the a tag that contains the title and address of the blog post
        var nodeA = item.SelectSingleNode("h4/a");
        // get the title of the blog post
        string title = nodeA.InnerText;
        // get the href attribute of the a tag, i.e. the blog address
        string url = nodeA.GetAttributeValue("href", "");
        // get the a tag that contains the author's name
        var nodeAuthor = item.SelectSingleNode("div[@class='post_item_foot']/a[@class='lightblue']");
        string author = nodeAuthor.InnerText;
        // output the data
        Console.WriteLine($"title: {title} | author: {author} | address: {url}");
    }
    // pause for three seconds after each page is crawled
    Thread.Sleep(3000);
}
Console.ReadKey();
Screenshot of the running result:
A simple crawler implemented with .NET Core is complete!
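As a side note, this article uses HttpCode.Core for the HTTP layer, but the same request can also be issued with the HttpClient class built into .NET Core. The following is only a hedged alternative sketch for comparison, not the method used above; it assumes C# 7.1+ for the async Main entry point and that the endpoint accepts an application/json body, which matches the request data captured earlier.
// alternative sketch: the same POST request with the built-in HttpClient
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        using (HttpClient client = new HttpClient())
        {
            string body = "{\"CategoryType\":\"SiteHome\",\"ParentCategoryId\":0,\"CategoryId\":808," +
                "\"PageIndex\":1,\"TotalPostCount\":4000,\"ItemListActionName\":\"PostList\"}";
            // content type is assumed to be application/json, as seen in the captured request
            StringContent content = new StringContent(body, Encoding.UTF8, "application/json");
            HttpResponseMessage resp = await client.PostAsync("https://www.cnblogs.com/mvc/AggSite/PostList.aspx", content);
            string html = await resp.Content.ReadAsStringAsync();
            Console.WriteLine(html); // the html fragment can then be fed to HtmlAgilityPack as above
        }
    }
}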
At this point, I believe you have a deeper understanding of how to use .NET Core to implement a simple crawler. You might as well try it out in practice.