
How to build a multi-process CommandlineFu crawler using Shell

2025-01-16 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article shows how to build a multi-process CommandlineFu crawler with Shell. Many readers may not be familiar with the technique, so I hope it serves as a useful reference.

CommandlineFu is a website that collects script snippets, each with a description of what it does and a set of tags. What I want to do is write a multi-process crawler in shell that records these snippets in an org file.

Parameter definition

The script needs to accept the number of concurrent crawler processes through the -n argument (defaulting to the number of CPU cores) and the path of the org file to save to through -f (defaulting to stdout).

    #!/usr/bin/env bash

    proc_num=$(nproc)          # default: one crawler per CPU core
    store_file=/dev/stdout     # default: print the org content to stdout
    while getopts :n:f: OPT; do
        case $OPT in
            n|+n)
                proc_num="$OPTARG"
                ;;
            f|+f)
                store_file="$OPTARG"
                ;;
            *)
                echo "usage: ${0##*/} [+-n proc_num] [+-f org_file] [--]"
                exit 2
        esac
    done
    shift $(( OPTIND - 1 ))
    OPTIND=1

Parse the command browse page

We need one process that extracts the URL of each script snippet from CommandlineFu's browse list and stores the extracted URLs in a queue. Each crawler process then reads a URL from the queue, extracts the corresponding code snippet, description, and tags, and writes them to the org file.

This raises three problems:


How to implement a queue for communication between processes

How to extract the URL, code snippet, description, tags, and other information from a page

How to avoid garbled data when multiple processes read and write the same file

Implement the communication queue between processes

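This problem is easy to solve with a named pipe:

    queue=$(mktemp --dry-run)            # generate a unique name without creating the file
    mkfifo ${queue}                      # create the named pipe under that name
    exec 99<>${queue}                    # open it read-write on file descriptor 99
    trap "rm ${queue} 2>/dev/null" EXIT  # remove the pipe when the script exits

Extract the desired information from the page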

There are two main ways to extract element content from a page:


For simple HTML pages, we can extract information by matching regular expressions with tools such as sed, grep, and awk.

Use hxselect from the html-xml-utils toolset to extract the relevant elements with a CSS selector.
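As a rough illustration of the first approach, the sketch below pulls href values out of a simple HTML fragment using only grep and sed. The sample markup and the function name extract_hrefs are made up for this example and do not reflect the real CommandlineFu page structure. Regular expressions like this break easily on real-world HTML, which is one reason to prefer a selector-based tool.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the value of every href="..." attribute read from stdin.
extract_hrefs() {
    grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}

# Made-up sample markup, not the real CommandlineFu page.
sample='<ul><li><a href="/commands/view/1/foo">foo</a></li><li><a href="/commands/view/2/bar">bar</a></li></ul>'

# Turn the relative links into absolute URLs, one per line.
printf '%s\n' "${sample}" | extract_hrefs | sed 's@^@https://www.commandlinefu.com@'
```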

Here we use the html-xml-utils tool to extract:

    # Print the URL of every snippet listed on a browse page.
    function extract_views_from_browse_page()
    {
        if [[ $# -eq 0 ]]; then
            local html=$(cat -)    # no arguments: read the HTML from stdin
        else
            local html="$*"
        fi
        echo "${html}" | hxclean | hxselect -c -s "\n" "li.list-group-item > div:nth-child(1) > div:nth-child(1) > a:nth-child(1)::attr(href)" | sed 's@^@https://www.commandlinefu.com/@'
    }

    # Print the URL of the next browse page, if there is one.
    function extract_nextpage_from_browse_page()
    {
        if [[ $# -eq 0 ]]; then
            local html=$(cat -)    # no arguments: read the HTML from stdin
        else
            local html="$*"
        fi
        echo "${html}" | hxclean | hxselect -s "\n" "li.list-group-item:nth-child(26) > a" | grep '>' | hxselect -c "::attr(href)" | sed 's@^@https://www.commandlinefu.com/@'
    }

Note that hxselect requires its input to follow strict XML rules, so the HTML has to be normalized with hxclean before hxselect parses it. In addition, to keep a large HTML document from exceeding the maximum length of the argument list, the functions also accept the HTML through a pipe.
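The "arguments or pipe" pattern can be tried in isolation. The sketch below uses a hypothetical count_chars function (not part of the crawler) that takes its input from the arguments when there are any and from stdin otherwise.

```shell
#!/usr/bin/env bash

# Hypothetical helper: count the characters of its input, taking the input
# from the arguments when given and from stdin otherwise.
count_chars() {
    local text
    if [[ $# -eq 0 ]]; then
        text=$(cat -)      # no arguments: read everything from the pipe
    else
        text="$*"          # arguments given: join them with spaces
    fi
    printf '%s' "${text}" | wc -c | tr -d ' '
}

count_chars "hello"            # input passed as an argument
printf 'hello' | count_chars   # same input passed through a pipe
```

Passing large documents through the pipe avoids the operating system's limit on the total size of command-line arguments.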

Loop through the browse pages, extracting snippet URLs and writing them to the queue

What needs to be solved here is the third problem mentioned above: how do we keep the data from getting mixed up when multiple processes read and write the pipe? To do this, we lock the file before writing and unlock it once the write is finished. In shell, flock can be used to lock a file. For the usage and caveats of flock, see another blog post on Linux shell flock file locks.

Because extract_views_from_browse_page is called inside the child process that flock starts, the function has to be exported first:

    export -f extract_views_from_browse_page

Because of network problems, fetching content with curl may fail, so the request has to be retried until it succeeds:
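To make the locking idea concrete, here is a minimal sketch, assuming the flock utility from util-linux is installed. The function name locked_append_demo and the separate ".lock" file are choices made for this illustration only. Each background writer appends two lines while holding the lock, so the pairs are never interleaved.

```shell
#!/usr/bin/env bash

# Hypothetical demo: several background writers append to one file,
# each taking an exclusive lock on a separate lock file first.
locked_append_demo() {
    local out_file="$1" lock_file="$1.lock" writers="$2" i
    : > "${out_file}"
    for ((i = 0; i < writers; i++)); do
        # flock blocks until it holds the lock, runs the command, then releases it.
        flock "${lock_file}" -c "echo 'writer ${i} begin' >> '${out_file}'; echo 'writer ${i} end' >> '${out_file}'" &
    done
    wait    # let every writer finish
}

demo_file=$(mktemp)
locked_append_demo "${demo_file}" 4
cat "${demo_file}"
rm -f "${demo_file}" "${demo_file}.lock"
```

The order in which the writers run is not fixed, but each "begin" line is always directly followed by the matching "end" line.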
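The effect of export -f can be checked with a small experiment. The greet function below is a made-up example. A child bash started with bash -c only sees the function once it has been exported.

```shell
#!/usr/bin/env bash

# Hypothetical function used only for this demonstration.
greet() {
    echo "hello from $1"
}

# Before exporting, a child bash process cannot see the function.
bash -c 'greet child' 2>/dev/null || echo "greet is not defined in the child"

# After exporting, the child inherits the function through the environment.
export -f greet
bash -c 'greet child'
```

Note that export -f is a bash feature, so the child process must also be bash for the function to be visible.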

    # Download a URL, retrying until curl succeeds.
    function fetch()
    {
        local url="$1"
        while ! curl -L "${url}" 2>/dev/null; do
            :
        done
    }
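The loop above retries forever, which can hang the crawler if a URL never becomes reachable. As a possible variant (not the article's method), the sketch below bounds the number of attempts and pauses between them. The retry function and its parameters are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash

# Hypothetical variant of fetch: run a command up to max_tries times,
# sleeping between attempts, and report failure if every attempt fails.
retry() {
    local max_tries="$1" delay="$2" attempt
    shift 2
    for ((attempt = 1; attempt <= max_tries; attempt++)); do
        if "$@"; then
            return 0
        fi
        # pause before the next attempt, but not after the last one
        if ((attempt < max_tries)); then
            sleep "${delay}"
        fi
    done
    return 1
}

# Usage sketch: give up after 5 attempts, waiting 1 second between them.
# retry 5 1 curl -fsSL "https://www.commandlinefu.com/commands/browse"
```

The -f flag in the usage line makes curl return a non-zero status on HTTP error responses, so those are retried as well.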

The collector starts from the seed URL, grabs the URLs to be crawled, and writes them to the pipe file, which also serves as the lock file during the write:

    function collector()
    {
        url="$*"
        while [[ -n ${url} ]]; do
            echo "Extracting from $url"
            html=$(fetch "${url}")
            # hold the lock on the queue while the extracted URLs are written to it
            echo "${html}" | flock ${queue} -c "extract_views_from_browse_page >${queue}"
            url=$(echo "${html}" | extract_nextpage_from_browse_page)
        done
        # Let the crawler processes that parse the code snippets exit normally instead of blocking.
        for ((i = 0; i < ${proc_num}; i++)); do
            echo >${queue}
        done
    }

Note that once no next-page URL can be found, a for loop writes proc_num blank lines to the queue. This lets the crawler processes that parse the code snippets exit normally instead of blocking forever.
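The blank-line stop signal can be seen in a self-contained experiment. This is a minimal sketch, assuming Linux with flock from util-linux. The function run_queue_demo, the worker loop, and the separate lock file are illustrative choices and not the article's crawler code. Each worker takes a lock while it reads one line, so lines are never split between readers, and it exits as soon as it reads a blank line.

```shell
#!/usr/bin/env bash

# Hypothetical demo: one producer feeds items to several workers through a
# named pipe, then writes one blank line per worker as a stop signal.
# Usage: run_queue_demo <workers> <out_file> <item>...
run_queue_demo() {
    local workers="$1" out_file="$2" queue lock item i
    shift 2
    queue=$(mktemp -u)           # -u is the short form of --dry-run
    lock="${queue}.lock"
    mkfifo "${queue}"
    exec 99<>"${queue}"          # keep the pipe open for reading and writing

    for ((i = 0; i < workers; i++)); do
        (
            exec 98>"${lock}"    # each worker opens its own lock descriptor
            while :; do
                flock 98                     # lock while reading one full line
                IFS= read -r item <&99
                flock -u 98
                [[ -z "${item}" ]] && break  # blank line: no more work
                echo "done ${item}" >> "${out_file}"
            done
        ) &
    done

    for item in "$@"; do echo "${item}" >&99; done        # producer
    for ((i = 0; i < workers; i++)); do echo >&99; done   # one stop signal per worker
    wait
    exec 99>&-
    rm -f "${queue}" "${lock}"
}

result_file=$(mktemp)
run_queue_demo 3 "${result_file}" a b c d e
sort "${result_file}"
rm -f "${result_file}"
```

Without the stop signals, every worker would block on read forever once the queue is empty, because the pipe is still held open for writing.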

Parse script snippet page

We need to extract the title, code snippet, description, and tag information from the page of a script snippet and write it to the storage file in org-mode format.

    function view_page_handler()
    {
        local url="$1"
        local html="$(fetch "${url}")"
        # headline
        local headline="$(echo "${html}" | hxclean | hxselect -c -s "\n" ".col-md-8 > h2:nth-child(1)")"
        # command
        local command="$(echo "${html}" | hxclean | hxselect -c -s "\n" ".col-md-8 > div:nth-child(2) > span:nth-child(2)" | pandoc -f html -t org)"
        # description
        local description="$(echo "${html}" | hxclean | hxselect -c -s "\n" ".col-md-8 > div.description" | pandoc -f html -t org)"
        # tags
        local tags="$(echo "${html}" | hxclean | hxselect -c -s ":" ".clients > a")"
        if [[ -n "${tags}" ]]; then
            tags=":${tags}"
        fi
        # build org content
        cat
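The listing above breaks off at the cat command that builds the org content. As a rough idea of what such a step could look like, the sketch below appends one org-mode entry to the storage file while holding an exclusive lock. The function name append_org_entry, the entry layout, and the separate ".lock" file are all assumptions made for this example, not the article's original template.

```shell
#!/usr/bin/env bash

# Hypothetical helper: append one org-mode entry to a file under an exclusive lock.
# The entry layout is an assumption, not the article's original template.
append_org_entry() {
    local store_file="$1" headline="$2" tags="$3" url="$4" description="$5" command="$6"
    # flock holds the lock while tee appends the here-document to the file
    cat <<EOF | flock -x "${store_file}.lock" tee -a "${store_file}" > /dev/null
* ${headline}    ${tags}
:PROPERTIES:
:URL: ${url}
:END:

${description}

#+begin_src shell
${command}
#+end_src

EOF
}

org_file=$(mktemp)
append_org_entry "${org_file}" "List files" ":ls:files:" "https://example.com/1" "Lists files in long format." "ls -l"
cat "${org_file}"
rm -f "${org_file}" "${org_file}.lock"
```

Because the whole entry is written while the lock is held, entries produced by different crawler processes do not get mixed together in the file.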
