
How to construct a crawler spider program in C#


This article introduces how to construct a crawler (spider) program in C#. The content is detailed, easy to understand, and quick to follow, and should have some reference value. I believe that after reading it, you will gain something. Let's take a look.

C# is particularly suitable for building spider programs because it has built-in HTTP access and multithreading capabilities, both of which are critical to spiders. Here are the key problems to be solved in constructing a spider program (a skeleton sketch follows the list):

⑴ HTML analysis: some kind of HTML parser is needed to analyze every page the spider encounters.

⑵ Page processing: each downloaded page needs to be processed. The content may be saved to disk or analyzed further.

⑶ Multithreading: only with multithreading can a spider program be truly efficient.

⑷ Determining when to finish: don't underestimate this problem; it is not easy to determine whether a task has been completed, especially in a multithreaded environment.
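Before the details, here is a minimal skeleton of how these four concerns might fit together. All names here are illustrative assumptions, not the article's actual implementation (in particular, the article's own ObtainWork blocks until a URL is available, while this sketch simply returns null when the queue is empty):

using System;
using System.Collections.Generic;

public class SpiderSketch
{
    private readonly Queue<Uri> m_workload = new Queue<Uri>(); // URLs waiting to be downloaded
    public bool Quit;                                          // set to true when the user cancels

    // (1) + (2): workers parse each downloaded page and queue the links they find.
    public void AddWorkload(Uri uri) { lock (m_workload) m_workload.Enqueue(uri); }

    // (3): worker threads call this to fetch the next URL to download.
    public Uri ObtainWork() { lock (m_workload) return m_workload.Count > 0 ? m_workload.Dequeue() : null; }

    // (4): completion is tracked by a separate Done object, shown in section IV.
}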

I. HTML parsing

The HTML parser used in this article is implemented by the ParseHTML class and is very convenient to use: first create an instance of the class, then set its Source property to the HTML document to be parsed:

ParseHTML parse = new ParseHTML();
parse.Source = "<p>Hello World</p>";

You can then loop through all the text and markup contained in the HTML document. Typically, the inspection starts with a while loop that tests the Eof method:

while (!parse.Eof())
{
    char ch = parse.Parse();

The Parse method returns the characters contained in the HTML document. It returns only characters that are not part of an HTML tag; if an HTML tag is encountered, Parse returns 0, indicating that a tag has been reached. After we encounter a tag, we can use the GetTag() method to deal with it:

    if (ch == 0)
    {
        HTMLTag tag = parse.GetTag();
    }
}   // end of the while loop

Generally speaking, one of the most important tasks of a spider is to find the various HREF attributes, which can be done with C#'s indexer. For example, the following code extracts the value of the HREF attribute, if it exists:

Attribute href = tag["HREF"];
string link = href.Value;

After you get the Attribute object, you can read the attribute's value through Attribute.Value. A complete link-extraction loop built from these pieces is sketched below.
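Putting the parsing pieces together: here is a minimal sketch of a full link-extraction loop built from the ParseHTML, HTMLTag, and Attribute classes used above. The null check is an assumption, since the article does not show how a missing HREF attribute is reported:

ParseHTML parse = new ParseHTML();
parse.Source = page;                        // the downloaded HTML text
while (!parse.Eof())
{
    char ch = parse.Parse();
    if (ch == 0)                            // 0 means a tag was encountered
    {
        HTMLTag tag = parse.GetTag();
        Attribute href = tag["HREF"];
        if (href != null)
            Console.WriteLine(href.Value);  // or hand the link to the spider
    }
}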

II. Processing HTML pages

Let's take a look at how to process HTML pages. The first thing to do, of course, is to download the page, which can be done through the HttpWebRequest class provided by C#:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(m_uri);
response = request.GetResponse();
stream = response.GetResponseStream();

Next we obtain a response stream from the request. Before performing any other processing, we need to determine whether the file is binary or text, since the two types are handled differently. The following code checks whether the content is binary:

if (!response.ContentType.ToLower().StartsWith("text/"))
{
    SaveBinaryFile(response);
    return null;
}
string buffer = "", line;

If the content is not text, we read it as a binary file. If it is a text file, we first create a StreamReader from the stream and then append the contents line by line to the buffer:

reader = new StreamReader(stream);
while ((line = reader.ReadLine()) != null)
{
    buffer += line + "\r\n";
}

After the entire file is loaded, we save it as a text file:

SaveTextFile(buffer);
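Assembled into a single method, the download logic above might read as follows. This is a minimal sketch: the method name GetPage matches the call in the DocumentWorker loop of section III, and m_uri and the Save* helpers match the article; the rest is an assumption:

private string GetPage()
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(m_uri);
    WebResponse response = request.GetResponse();
    Stream stream = response.GetResponseStream();

    // Binary content is saved to disk; it contains no links to follow.
    if (!response.ContentType.ToLower().StartsWith("text/"))
    {
        SaveBinaryFile(response);
        return null;
    }

    // Text content is accumulated line by line, saved, and returned for parsing.
    string buffer = "", line;
    StreamReader reader = new StreamReader(stream);
    while ((line = reader.ReadLine()) != null)
    {
        buffer += line + "\r\n";
    }
    SaveTextFile(buffer);
    return buffer;
}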

Let's take a look at how these two different types of files are stored.

The content type of a binary file does not start with "text/". The spider saves binary files directly to disk without further processing, because binary content contains no HTML, and therefore no links for the spider to follow. Here are the steps for writing a binary file.

First, prepare a buffer to temporarily hold the contents of the binary file:

byte[] buffer = new byte[1024];

The next step is to determine the path and name under which the file is saved locally. Suppose we are downloading the contents of the myhost.com site into a local c:\test folder: if the remote path of a binary file is, say, /images/logo.gif, the local file becomes c:\test\images\logo.gif, so we must make sure the images subdirectory has been created under c:\test. This part of the task is accomplished by the convertFilename method.

string filename = convertFilename(response.ResponseUri);

The convertFilename method splits the HTTP address and creates the corresponding directory structure (a sketch of it appears at the end of this section). After determining the name and path of the output file, you can open the input stream that reads the Web page and the output stream that writes to the local file:

Stream outStream = File.Create(filename);
Stream inStream = response.GetResponseStream();

The contents of the Web file can then be read and written to the local file with a simple loop:

int l;
do
{
    l = inStream.Read(buffer, 0, buffer.Length);
    if (l > 0)
        outStream.Write(buffer, 0, l);
} while (l > 0);
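Neither convertFilename nor SaveTextFile is shown in the article. Here is a minimal sketch of what they might look like, assuming the c:\test target folder from the example above; the method names match the article's calls, everything else is an assumption:

// Hypothetical: map http://host/path/file onto c:\test\path\file,
// creating any missing subdirectories (such as images) along the way.
private string convertFilename(Uri uri)
{
    string relative = uri.AbsolutePath.Replace('/', '\\').TrimStart('\\');
    string filename = Path.Combine(@"c:\test", relative);
    Directory.CreateDirectory(Path.GetDirectoryName(filename));
    return filename;
}

// Hypothetical: store a downloaded text page using the same path mapping.
private void SaveTextFile(string buffer)
{
    File.WriteAllText(convertFilename(m_uri), buffer);
}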

III. Multithreading

We use the DocumentWorker class to encapsulate the downloading of a URL. Each instance of DocumentWorker enters a loop and waits for the next URL to process. Here is the main loop of DocumentWorker:

while (!m_spider.Quit)
{
    m_uri = m_spider.ObtainWork();

    m_spider.SpiderDone.WorkerBegin();
    string page = GetPage();
    if (page != null)
        ProcessPage(page);
    m_spider.SpiderDone.WorkerEnd();
}

The loop runs until the Quit flag is set to true (when the user clicks the "Cancel" button, the Quit flag is set). Within the loop, we call ObtainWork to get a URL. ObtainWork waits until a URL becomes available, which happens only when another thread has parsed a document and found a link. The Done class uses the WorkerBegin and WorkerEnd methods to determine when the entire download operation has been completed.

As figure 1 shows, the spider program lets the user choose the number of threads to use. In practice, the optimal number of threads depends on many factors. If your machine has high performance, or has two processors, you can set more threads; conversely, if network bandwidth and machine performance are limited, adding threads will not necessarily improve performance.
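The article does not show the startup code, but launching the workers is straightforward. A minimal sketch (using System.Threading), assuming DocumentWorker exposes a void Process() method containing the loop above and takes the spider in its constructor:

// Hypothetical startup: launch the user-chosen number of worker threads.
int threadCount = 5; // e.g., taken from the user interface in figure 1
for (int i = 0; i < threadCount; i++)
{
    DocumentWorker worker = new DocumentWorker(spider);
    Thread thread = new Thread(new ThreadStart(worker.Process));
    thread.Start();
}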

IV. Has the task been completed?

Using multiple threads to download files at the same time effectively improves performance, but it also brings thread-management problems. One of the most complex questions is: when has the spider finished its job? Here we answer it with the help of a special class, Done.

First of all, we need to pin down what "finished" means. The spider's work is complete only when there is no URL waiting to be downloaded in the system and all worker threads have finished their processing. In other words, the job is done when no URL is waiting to be downloaded or being downloaded.

The Done class provides a WaitDone method that waits until the Done object detects that the spider has finished its work. Here is the code for the WaitDone method:

public void WaitDone()
{
    Monitor.Enter(this);
    while (m_activeThreads > 0)
    {
        Monitor.Wait(this);
    }
    Monitor.Exit(this);
}

The WaitDone method waits until there are no more active threads. Note, however, that there are no active threads at the very beginning of the download, so it would be easy for the spider to stop as soon as it starts. To solve this problem, we need another method, WaitBegin, that waits for the spider to enter its "formal" working phase. The general call order is: first call WaitBegin, then call WaitDone; WaitDone waits for the spider to finish its work. Here is the code for WaitBegin:

public void WaitBegin()
{
    Monitor.Enter(this);
    while (!m_started)
    {
        Monitor.Wait(this);
    }
    Monitor.Exit(this);
}

The WaitBegin method waits until the m_started flag is set. The m_started flag is set by the WorkerBegin method. When a worker thread starts processing a URL, it calls WorkerBegin; when it finishes, it calls WorkerEnd. The WorkerBegin and WorkerEnd methods help the Done object keep track of the current working state. Here is the code for the WorkerBegin method:

public void WorkerBegin()
{
    Monitor.Enter(this);
    m_activeThreads++;
    m_started = true;
    Monitor.Pulse(this);
    Monitor.Exit(this);
}

The WorkerBegin method first increments the number of currently active threads, then sets the m_started flag, and finally calls the Pulse method to notify any thread waiting for the worker threads to start. As mentioned earlier, the method that may be waiting on the Done object here is WaitBegin. The WorkerEnd method is called after each URL is processed:


public void WorkerEnd()
{
    Monitor.Enter(this);
    m_activeThreads--;
    Monitor.Pulse(this);
    Monitor.Exit(this);
}

The WorkerEnd method decrements the m_activeThreads counter and calls Pulse to release any thread that may be waiting on the Done object; as mentioned earlier, that would be the WaitDone method. The complete class, assembled from these pieces, is sketched below.
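For reference, here is the whole Done class assembled from the pieces above into one compilable unit; the field declarations are implied by the article rather than shown, and the usage comment summarizes the call order described earlier:

using System.Threading;

public class Done
{
    private int m_activeThreads;  // worker threads currently processing a URL
    private bool m_started;       // set once the first worker begins

    // Typical call order from the main thread:
    //   done.WaitBegin();  // wait until the spider has actually started
    //   done.WaitDone();   // then wait until all work is finished
    public void WaitBegin()
    {
        Monitor.Enter(this);
        while (!m_started)
        {
            Monitor.Wait(this);
        }
        Monitor.Exit(this);
    }

    public void WaitDone()
    {
        Monitor.Enter(this);
        while (m_activeThreads > 0)
        {
            Monitor.Wait(this);
        }
        Monitor.Exit(this);
    }

    public void WorkerBegin()
    {
        Monitor.Enter(this);
        m_activeThreads++;
        m_started = true;
        Monitor.Pulse(this);
        Monitor.Exit(this);
    }

    public void WorkerEnd()
    {
        Monitor.Enter(this);
        m_activeThreads--;
        Monitor.Pulse(this);
        Monitor.Exit(this);
    }
}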

This is the end of this article on "how to construct a crawler spider program in C#". Thank you for reading! I believe you now have some understanding of how to construct a crawler spider program in C#. If you want to learn more, you are welcome to follow the industry information channel.
