How to write a crawler with .NET Core to scrape Movie Paradise (dy2018)

2025-01-16 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/03

This article shows how to use .NET Core to write a crawler that scrapes the Movie Paradise site (dy2018.com).

After my last project was migrated from .NET to .NET Core, it took a month before the new version officially went live.

I recently started a new side project: a crawler that scrapes the movie resources on dy2018 (Movie Paradise). This article also gives a brief introduction to writing a crawler on .NET Core.

Preparatory work (.NET Core preparation)

First, install .NET Core. Download and installation tutorials: https://www.jb51.net/article/87907.htm https://www.jb51.net/article/88735.htm. It works on Windows, Linux, and macOS.

My environment is: Windows 10 + VS2015 Community Update 3 + .NET Core 1.1.0 SDK + .NET Core 1.0.1 tools Preview 2.

In theory, you only need to install the .NET Core 1.1.0 SDK to develop a .NET Core program, and it doesn't matter what tool you use to write code.

After installing the above tools, the .NET Core templates appear in the New Project dialog in VS2015.

For simplicity, we create the project directly from the template that ships with the VS .NET Core tools.

Analyzing the web page: a crawler's self-cultivation

Before writing the crawler, we first need to understand how the data on the page we want to crawl is structured.

For a web page, that means working out which tags or attributes mark the data in the HTML, and then using them to extract the data. In my case, I mostly rely on the id and CSS class attributes of the HTML tags.

Take dy2018.com, the site this article crawls, as an example to briefly describe the process, starting from its home page.

In Chrome, press F12 to open the developer tools, use the element picker to select the relevant part of the page, and then analyze the HTML structure of that part.

After a simple analysis of the HTML, we come to the following conclusions:

The movie data on the home page of www.dy2018.com is stored in div tags with the class co_content222.

Each movie details link is an a tag: the text it displays is the movie name, and its href is the URL of the details page.

To sum up, our job is to find the div tags with class='co_content222' and extract all the a tags inside them.

Start writing code...

I have used the AngleSharp library in an earlier project. It is a .NET (C#) component developed specifically for parsing (X)HTML source code.

The AngleSharp home page is here: https://anglesharp.github.io/

Details: https://www.jb51.net/article/99082.htm

NuGet package: AngleSharp. Installation command: Install-Package AngleSharp
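As a warm-up before the full crawler code, here is a minimal sketch of the extraction step described in the analysis above, run against an HTML string instead of the live site. It assumes the AngleSharp 0.9.x API that this article uses (HtmlParser.Parse in the AngleSharp.Parser.Html namespace; later AngleSharp releases renamed these). The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Parser.Html; // assumption: AngleSharp 0.9.x namespace

public static class LinkExtractor
{
    // Returns the href of every a tag inside div.co_content222
    // whose href contains "/i/" (the pattern of movie detail pages).
    public static List<string> ExtractDetailLinks(string html)
    {
        var dom = new HtmlParser().Parse(html);
        return dom.QuerySelectorAll("div.co_content222 a")
                  .Select(a => a.GetAttribute("href"))
                  .Where(href => href != null && href.Contains("/i/"))
                  .ToList();
    }
}
```

Passing a CSS selector to QuerySelectorAll keeps the lookup in one place; the null check covers a tags that have no href attribute.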

Get movie list data

    private static HtmlParser htmlParser = new HtmlParser();
    private ConcurrentDictionary<string, MovieInfo> _cdMovieInfo = new ConcurrentDictionary<string, MovieInfo>();

    private void AddToHotMovieList()
    {
        // This operation must not block other operations, so it runs in a Task.
        // _cdMovieInfo is a thread-safe dictionary that stores all movie data collected so far.
        Task.Factory.StartNew(() =>
        {
            try
            {
                // Fetch the HTML by URL
                var htmlDoc = HTTPHelper.GetHTMLByURL("http://www.dy2018.com/");
                // Parse the HTML into an IDocument
                var dom = htmlParser.Parse(htmlDoc);
                // Extract all div tags with class='co_content222' from the DOM.
                // QuerySelectorAll accepts CSS selector syntax.
                var lstDivInfo = dom.QuerySelectorAll("div.co_content222");
                if (lstDivInfo != null)
                {
                    // The first three divs hold the new movies
                    foreach (var divInfo in lstDivInfo.Take(3))
                    {
                        // Take all a tags in the div whose href contains "/i/".
                        // The filter is needed because testing showed that some a tags
                        // in this div are advertising links.
                        divInfo.QuerySelectorAll("a")
                            .Where(a => a.GetAttribute("href")?.Contains("/i/") == true)
                            .ToList()
                            .ForEach(a =>
                            {
                                // Build the complete link
                                var onlineURL = "http://www.dy2018.com" + a.GetAttribute("href");
                                // Check whether it already exists in the collected data
                                if (!_cdMovieInfo.ContainsKey(onlineURL))
                                {
                                    // Fetch the movie details
                                    MovieInfo movieInfo = FillMovieInfoFormWeb(a, onlineURL);
                                    // Only add entries that have at least one download link
                                    if (movieInfo.XunLeiDownLoadURLList != null && movieInfo.XunLeiDownLoadURLList.Count != 0)
                                    {
                                        _cdMovieInfo.TryAdd(movieInfo.Dy2018OnlineUrl, movieInfo);
                                    }
                                }
                            });
                    }
                }
            }
            catch (Exception ex)
            {
                // Exceptions are swallowed here, as in the original code
            }
        });
    }
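The ContainsKey check above mainly avoids fetching a details page twice. The TryAdd call is itself atomic and returns false when the key already exists, so two concurrent runs cannot overwrite each other's entry. A minimal sketch of that behavior, with hypothetical names and string values standing in for MovieInfo:

```csharp
using System.Collections.Concurrent;

public static class SeenUrls
{
    private static readonly ConcurrentDictionary<string, string> Seen =
        new ConcurrentDictionary<string, string>();

    // Returns true only for the first caller that registers a given URL;
    // later calls with the same URL leave the stored value untouched.
    public static bool TryRegister(string url, string title)
    {
        return Seen.TryAdd(url, title);
    }

    // Returns the stored title, or null if the URL was never registered.
    public static string TitleOf(string url)
    {
        string title;
        return Seen.TryGetValue(url, out title) ? title : null;
    }
}
```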

Get movie details

    private MovieInfo FillMovieInfoFormWeb(AngleSharp.Dom.IElement a, string onlineURL)
    {
        var movieHTML = HTTPHelper.GetHTMLByURL(onlineURL);
        var movieDoc = htmlParser.Parse(movieHTML);
        // Example page: http://www.dy2018.com/i/97462.html (the analysis process is the same as above)
        // The detailed introduction of the movie is in the tag with id "Zoom"
        var zoom = movieDoc.GetElementById("Zoom");
        // Download links are in td cells with bgcolor='#fdfddf'; there may be several
        var lstDownLoadURL = movieDoc.QuerySelectorAll("[bgcolor='#fdfddf']");
        // The release time is in the span tag with class='updatetime'
        var updatetime = movieDoc.QuerySelector("span.updatetime");
        var pubDate = DateTime.Now;
        if (updatetime != null && !string.IsNullOrEmpty(updatetime.InnerHtml))
        {
            // The content carries a "release time:" label (the literal is shown translated
            // in this copy; use the exact label text from the page). Strip it, then convert.
            // A failed conversion does not affect the flow: pubDate keeps its default.
            DateTime parsed;
            if (DateTime.TryParse(updatetime.InnerHtml.Replace("release time:", ""), out parsed))
            {
                pubDate = parsed;
            }
        }
        var movieInfo = new MovieInfo()
        {
            // InnerHtml may also contain a font tag. The original chained four extra
            // Replace(...) calls here to strip it; their arguments did not survive in this copy.
            MovieName = a.InnerHtml,
            Dy2018OnlineUrl = onlineURL,
            // There may be no introduction, although that seems unlikely
            MovieIntro = zoom != null ? WebUtility.HtmlEncode(zoom.InnerHtml) : "No introduction yet.",
            // There may be no download link
            XunLeiDownLoadURLList = lstDownLoadURL != null ? lstDownLoadURL.Select(d => d.FirstElementChild.InnerHtml).ToList() : null,
            PubDate = pubDate,
        };
        return movieInfo;
    }
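The release-time step can also be isolated into a small helper, sketched below under a few assumptions: the helper name, the label parameter, and the use of the invariant culture are illustrative and not part of the original code. One point worth knowing is that DateTime.TryParse resets its out argument to DateTime.MinValue on failure, so the helper returns an explicit fallback instead of relying on a value assigned beforehand.

```csharp
using System;
using System.Globalization;

public static class PubDateParser
{
    // Strips a label such as "release time:" from the raw text and parses the rest.
    // Returns fallback when the text is empty or cannot be parsed.
    public static DateTime Parse(string raw, string label, DateTime fallback)
    {
        if (string.IsNullOrEmpty(raw)) return fallback;

        var text = raw;
        if (!string.IsNullOrEmpty(label))
        {
            text = text.Replace(label, "");
        }

        DateTime parsed;
        return DateTime.TryParse(text.Trim(), CultureInfo.InvariantCulture, DateTimeStyles.None, out parsed)
            ? parsed
            : fallback;
    }
}
```

A caller would pass DateTime.Now as the fallback to match the behavior the article describes.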

HTTPHelper

There is a small pitfall here. The dy2018 pages are encoded in GB2312, and .NET Core does not support GB2312 by default, so calling Encoding.GetEncoding("GB2312") throws an exception.

The solution is to manually install the System.Text.Encoding.CodePages package (Install-Package System.Text.Encoding.CodePages).

Then add Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) to the Configure method of Startup.cs, after which Encoding.GetEncoding("GB2312") works normally.
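Outside an ASP.NET Core Startup class (a console app, for example), the same registration can be done once before the first lookup. A minimal sketch, assuming the System.Text.Encoding.CodePages package is referenced; the wrapper class name is hypothetical:

```csharp
using System.Text;

public static class Gb2312
{
    // The static constructor runs once, before the first use of the class,
    // so the code-pages provider is registered before any lookup.
    static Gb2312()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Returns the GB2312 encoding made available by the registered provider.
    public static Encoding Get()
    {
        return Encoding.GetEncoding("GB2312");
    }
}
```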

    using System;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Text;

    namespace Dy2018Crawler
    {
        public class HTTPHelper
        {
            public static HttpClient Client { get; } = new HttpClient();

            public static string GetHTMLByURL(string url)
            {
                try
                {
                    System.Net.WebRequest wRequest = System.Net.WebRequest.Create(url);
                    wRequest.ContentType = "text/html; charset=gb2312";
                    wRequest.Method = "get";
                    wRequest.UseDefaultCredentials = true;
                    // Get the response instance.
                    var task = wRequest.GetResponseAsync();
                    System.Net.WebResponse wResp = task.Result;
                    System.IO.Stream respStream = wResp.GetResponseStream();
                    // The dy2018 site is encoded in GB2312
                    using (System.IO.StreamReader reader = new System.IO.StreamReader(respStream, Encoding.GetEncoding("GB2312")))
                    {
                        return reader.ReadToEnd();
                    }
                }
                catch (Exception ex)
                {
                    Console.WriteLine(ex.ToString());
                    return string.Empty;
                }
            }
        }
    }
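The helper above declares an HttpClient but performs the request with WebRequest. As an alternative, here is a sketch of the same fetch done with HttpClient: it downloads the raw bytes and decodes them as GB2312. This is a variation, not the article's implementation; the class and method names are hypothetical, and it assumes the System.Text.Encoding.CodePages package is referenced.

```csharp
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class Gb2312Fetcher
{
    private static readonly HttpClient Client = new HttpClient();

    static Gb2312Fetcher()
    {
        // Make GB2312 available before the first decode
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Pure decoding step, kept separate so it can be checked without a network call.
    public static string Decode(byte[] body)
    {
        return Encoding.GetEncoding("GB2312").GetString(body);
    }

    // Downloads the response body as bytes and decodes it as GB2312.
    public static async Task<string> GetHtmlAsync(string url)
    {
        byte[] body = await Client.GetByteArrayAsync(url);
        return Decode(body);
    }
}
```

Reading bytes rather than calling GetStringAsync keeps the choice of encoding in the caller's hands, which matters when the server's declared charset cannot be relied on.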

Implementing scheduled tasks

I use Pomelo.AspNetCore.TimedJob for scheduled tasks.

Pomelo.AspNetCore.TimedJob is a scheduled-job library for .NET Core. It supports millisecond-level intervals, reading the schedule configuration from a database, synchronous and asynchronous jobs, and so on.

It is developed and maintained by AmamiyaYuuko, a .NET Core community expert and former Microsoft MVP (he gave up the MVP title after joining Microsoft). It does not appear to be open source yet; I will ask later whether it can be opened.

Several versions are available on NuGet; pick the one you need. Address: https://www.nuget.org/packages/Pomelo.AspNetCore.TimedJob/1.1.0-rtm-10026

The author's own introduction: Timed Job - Pomelo extension package series

Startup.cs related code

To use it, first install the corresponding package: Install-Package Pomelo.AspNetCore.TimedJob -Pre

Then register the service in the ConfigureServices method of Startup.cs and enable it in the Configure method.

    // This method gets called by the runtime. Use this method to add services to the container.
    public void ConfigureServices(IServiceCollection services)
    {
        // Add framework services.
        services.AddMvc();
        // Add TimedJob services
        services.AddTimedJob();
    }

    public void Configure(IApplicationBuilder app, IHostingEnvironment env, ILoggerFactory loggerFactory)
    {
        // Use TimedJob
        app.UseTimedJob();

        if (env.IsDevelopment())
        {
            app.UseDeveloperExceptionPage();
            app.UseBrowserLink();
        }
        else
        {
            app.UseExceptionHandler("/Home/Error");
        }

        app.UseStaticFiles();

        app.UseMvc(routes =>
        {
            routes.MapRoute(
                name: "default",
                template: "{controller=Home}/{action=Index}/{id?}");
        });

        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

Job related code

Then create a new class, named XXXJob.cs for clarity, reference the namespace with using Pomelo.AspNetCore.TimedJob, have XXXJob inherit from Job, and add the following code.

    public class AutoGetMovieListJob : Job
    {
        // Begin: start time. Interval: execution interval in milliseconds;
        // the format below is recommended, and here it is 3 hours.
        // SkipWhileExecuting: whether to wait for the previous run to finish; true means wait.
        [Invoke(Begin = "2016-11-29 22:10", Interval = 1000 * 3600 * 3, SkipWhileExecuting = true)]
        public void Run()
        {
            // Job logic code
            // LogHelper.Info("Start crawling");
            // AddToLatestMovieList(100);
            // AddToHotMovieList();
            // LogHelper.Info("Finish crawling");
        }
    }
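The interval is plain millisecond arithmetic: 1000 ms per second, 3600 seconds per hour, 3 hours. Attribute arguments in C# must be compile-time constants, which is why the value is spelled out as a product rather than built from a TimeSpan. A small sketch (the class and member names are hypothetical) that names the constant and cross-checks it:

```csharp
using System;

public static class JobIntervals
{
    // 1000 ms * 3600 s * 3 h = 10,800,000 ms. A const can be used inside an attribute.
    public const int ThreeHoursMs = 1000 * 3600 * 3;

    // Runtime cross-check of the same value via TimeSpan.
    public static bool MatchesTimeSpan()
    {
        return ThreeHoursMs == (int)TimeSpan.FromHours(3).TotalMilliseconds;
    }
}
```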

Project publishing: add a runtimes node

In a project created from the new VS2015 template, project.json has no runtimes node by default.

To publish to a non-Windows platform, this node has to be configured manually so that the corresponding output is generated.

    "runtimes": {
      "win7-x64": {},
      "win7-x86": {},
      "osx.10.10-x64": {},
      "osx.10.11-x64": {},
      "ubuntu.14.04-x64": {}
    }

Delete or comment out the scripts node

During the build, a Node.js-based script is called to build the front-end code, and there is no guarantee that bower exists in every environment, so simply comment the node out.

    //"scripts": {
    //  "prepublish": [ "bower install", "dotnet bundle" ],
    //  "postpublish": [ "dotnet publish-iis --publish-folder %publish:OutputPath% --framework %publish:FullTargetFramework%" ]
    //}

Delete or comment out the type in the dependencies node

    "dependencies": {
      "Microsoft.NETCore.App": {
        "version": "1.1.0"
        //"type": "platform"
      }

The configuration instructions for project.json can be found in this official document: Project.json-file

Or see Zhang Shanyou's article ".NET Core Series: 2. What kind of medicine is project.json selling in its gourd?"

Development, compilation and release

    // restore the package files
    dotnet restore
    // publish to the C:\code\website\Dy2018Crawler folder
    dotnet publish -r ubuntu.14.04-x64 -c Release -o "C:\code\website\Dy2018Crawler"

Finally, open source as usual.
