This article explains how to use the wget command on a Linux system. The method described here is simple, fast and practical, so interested readers are encouraged to follow along and learn how to use the wget command under Linux.
I. A brief introduction to Linux wget
Wget is a command-line download tool for Linux. It is free software released under the GPL license. Wget supports the HTTP and FTP protocols, proxy servers, and resuming interrupted downloads; it can automatically recurse through the directories of a remote host, find the files that match your criteria and download them to the local disk. If required, wget will also rewrite the hyperlinks in downloaded pages so that a browsable mirror is produced locally. Because it has no interactive interface, wget can run in the background and ignores HANGUP signals, so it keeps running after the user logs out. Typically, wget is used to download files from Internet sites in bulk, or to create mirrors of remote sites.
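Before the examples, here is a minimal sketch of running wget unattended in the background; the URL and log file name are placeholders of my own, not taken from the article.
The code is as follows:
# start the download in the background (-b) and write messages to a log file (-o)
$ wget -b -o download.log http://example.com/big-file.iso
# the transfer keeps running after you log out; follow its progress with:
$ tail -f download.log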
II. Examples
Download the home page of 192.168.1.168 and display the download information: wget -d http://192.168.1.168
Download the home page of 192.168.1.168 without displaying any information: wget -q http://192.168.1.168
Download all the files whose links are listed in filelist.txt: wget -i filelist.txt
Download to a specified directory: wget -P /tmp ftp://user:passwd@url/file downloads the file to the /tmp directory.
Wget is a command-line download tool that many Linux users rely on almost every day. Below are some useful wget tips that can help you use it more efficiently and flexibly.
*
The code is as follows:
$ wget -r -np -nd http://example.com/packages/
This command downloads all the files in the packages directory on the example.com website. The -np option keeps wget from ascending into the parent directory, and -nd means the remote directory structure is not recreated locally.
*
The code is as follows:
$ wget -r -np -nd --accept=iso http://example.com/centos-5/i386/
This is similar to the previous command, but adds the --accept=iso option, which instructs wget to download only the files in the i386 directory whose extension is iso. You can also specify multiple extensions, separated by commas, as shown below.
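For instance, a minimal sketch that accepts two extensions at once (the extensions and URL here are only illustrative, not from the original):
The code is as follows:
$ wget -r -np -nd --accept=iso,img http://example.com/centos-5/i386/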
*
The code is as follows:
$ wget -i filename.txt
This command is often used for bulk downloads. Put the addresses of all the files you need to download into filename.txt, and wget will download them all for you automatically. A small sketch follows.
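As a minimal sketch (the URLs are placeholders of my own, not from the article), filename.txt holds one address per line and -i feeds the list to wget:
The code is as follows:
$ cat filename.txt
http://example.com/file1.iso
http://example.com/file2.iso
$ wget -i filename.txt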
*
The code is as follows:
$ wget -c http://example.com/really-big-file.iso
The -c option specified here resumes an interrupted download from where it left off.
*
The code is as follows:
$ wget -m -k (-H) http://www.example.com/
This command can be used to mirror a website; wget converts the links so the mirror is browsable locally. If the site's images are hosted on another host, add the -H option so that wget may span hosts, as sketched below.
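A minimal sketch of such a cross-host mirror (the domain names are placeholders; restricting the spanned hosts with -D is an extra precaution of mine, not from the original):
The code is as follows:
# mirror the site, convert links for local browsing, and allow spanning to the image host only
$ wget -m -k -H -D example.com,images.example.com http://www.example.com/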
III. Parameters
The code is as follows:
$ wget --help
GNU Wget 1.9.1, a non-interactive network file download tool.
Usage: wget [OPTION]... [URL]...
Mandatory arguments to long options are mandatory for short options too.
Startup:
-V, --version  display the version of Wget and exit.
-h, --help  print this help.
-b, --background  go to the background after startup.
-e, --execute=COMMAND  execute a `.wgetrc'-style command.
Logging and input file:
-o, --output-file=FILE  write log messages to FILE.
-a, --append-output=FILE  append log messages to the end of FILE.
-d, --debug  print debug output.
-q, --quiet  quiet mode (no output).
-v, --verbose  verbose output mode (this is the default).
-nv, --non-verbose  turn off verbose output, without entering quiet mode.
-i, --input-file=FILE  download the URLs found in FILE.
-F, --force-html  treat the input file as HTML.
-B, --base=URL  prepend URL to relative links when using the -F -i FILE options.
Download:
-t, --tries=NUMBER  set the number of retries (0 means unlimited).
--retry-connrefused  retry even if the connection is refused.
-O, --output-document=FILE  write the data to FILE.
-nc, --no-clobber  do not clobber existing files; do not write new copies with a .# suffix (# is a number) appended to the file name.
-c, --continue  resume getting a partially downloaded file.
--progress=TYPE  select how the download progress is displayed.
-N, --timestamping  do not re-retrieve the remote file unless it is newer than the local copy.
-S, --server-response  display the server response headers.
--spider  do not download anything.
-T, --timeout=SECONDS  set the read timeout, in seconds.
-w, --wait=SECONDS  wait the given number of seconds between retrievals.
--waitretry=SECONDS  wait between retries of a retrieval (from 1 second up to SECONDS).
--random-wait  wait a random time between retrievals (from 0 to 2*WAIT seconds).
-Y, --proxy=on/off  turn the proxy on or off.
-Q, --quota=NUMBER  set the download quota (limit on the amount of data retrieved).
--bind-address=ADDRESS  bind to the given local address (hostname or IP).
--limit-rate=RATE  limit the download rate.
--dns-cache=off  disable caching of DNS lookups.
--restrict-file-names=OS  restrict the characters in file names to those allowed by the specified operating system.
Directories:
-nd, --no-directories  do not create directories.
-x, --force-directories  force the creation of directories.
-nH, --no-host-directories  do not create a directory named after the remote host.
-P, --directory-prefix=PREFIX  save files under the directory PREFIX/... before writing them.
--cut-dirs=NUMBER  ignore NUMBER remote directory components.
HTTP options:
--http-user=USER  set the HTTP user name.
--http-passwd=PASSWORD  set the HTTP password.
-C, --cache=on/off  (dis)allow server-side cached data (allowed by default).
-E, --html-extension  save all documents of MIME type text/html with an .html extension.
--ignore-length  ignore the "Content-Length" header field.
--header=STRING  insert STRING into the request headers.
--proxy-user=USER  set the proxy server user name.
--proxy-passwd=PASSWORD  set the proxy server password.
--referer=URL  include a "Referer: URL" header in the HTTP request.
-s, --save-headers  save the HTTP headers to the file.
-U, --user-agent=AGENT  identify as AGENT instead of Wget/VERSION.
--no-http-keep-alive  disable HTTP keep-alive (persistent connections).
--cookies=off  disable cookies.
--load-cookies=FILE  load cookies from FILE before the session starts.
--save-cookies=FILE  save cookies to FILE after the session ends.
--post-data=STRING  send STRING using the POST method.
--post-file=FILE  send the contents of FILE using the POST method.
HTTPS (SSL) options:
--sslcertfile=FILE  optional client certificate.
--sslcertkey=KEYFILE  optional key file for this certificate.
--egd-file=FILE  file name of the EGD socket.
--sslcadir=DIR  directory where the CA hash list is stored.
--sslcafile=FILE  file containing a bundle of CAs.
--sslcerttype=0/1  client certificate type: 0=PEM (default), 1=ASN1 (DER).
--sslcheckcert=0/1  check the server certificate against the provided CAs.
--sslprotocol=0-3  choose the SSL protocol: 0=automatic, 1=SSLv2, 2=SSLv3, 3=TLSv1.
FTP options:
-nr, --dont-remove-listing  do not remove the ".listing" files.
-g, --glob=on/off  turn file-name globbing (wildcard expansion) on or off.
--passive-ftp  use the "passive" transfer mode.
--retr-symlinks  in recursive mode, download the files pointed to by symbolic links (other than links to directories).
Recursive download:
-r, --recursive  recursive download.
-l, --level=NUMBER  maximum recursion depth (inf or 0 for infinite).
--delete-after  delete the downloaded files locally after retrieving them.
-k, --convert-links  convert absolute links to relative links.
-K, --backup-converted  before converting file X, back it up as X.orig.
-m, --mirror  shortcut equivalent to the options -r -N -l inf -nr.
-p, --page-requisites  download all the files needed to display a complete web page, such as images.
--strict-comments  turn on strict SGML handling of HTML comments.
Options for accepting/rejecting during recursive download:
-A, --accept=LIST  comma-separated list of accepted file patterns.
-R, --reject=LIST  comma-separated list of rejected file patterns.
-D, --domains=LIST  comma-separated list of accepted domains.
--exclude-domains=LIST  comma-separated list of rejected domains.
--follow-ftp  follow FTP links found in HTML documents.
--follow-tags=LIST  comma-separated list of HTML tags to follow.
-G, --ignore-tags=LIST  comma-separated list of HTML tags to ignore.
-H, --span-hosts  go to other hosts when recursing.
-L, --relative  follow relative links only.
-I, --include-directories=LIST  list of directories to download.
-X, --exclude-directories=LIST  list of directories to exclude.
-np, --no-parent  do not ascend to the parent directory.
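To see how several of these switches work together, here is a minimal sketch of my own (the URL and patterns are placeholders, not from the original) that recursively fetches only .iso files, never ascends past the starting directory, and flattens the saved paths:
The code is as follows:
# -r recurse, -np never ascend to the parent, -A keep only *.iso,
# -nH skip the host-named directory, --cut-dirs=2 drop two leading path components
$ wget -r -np -A iso -nH --cut-dirs=2 http://example.com/pub/isos/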
IV. Example: batch downloading files on remote FTP servers with Wget
Yesterday I bought a VPS and migrated my virtual host to it. Migration means moving data, and my old approach to virtual-host migration was very inefficient: pack and download everything on the old host, then upload and unpack it on the new one. With the very limited bandwidth of a home network and ADSL's perpetual 512 kbps uplink, migrating a site that way was pure manual labor.
Now, with a VPS and shell access, the process is extremely simple: thanks to the large bandwidth of the data centers, letting the two machines transfer files directly to each other is a pleasure.
All right, here's how to do it:
1. On the old virtual host, pack the whole site into a single backup archive, site.tar.gz.
2. In the shell of the VPS, use wget to download site.tar.gz from the old virtual host over the FTP protocol.
The code is as follows:
wget --ftp-user=username --ftp-password=password -m -nH ftp://xxx.xxx.xxx.xxx/xxx/xxx/site.tar.gz
wget --ftp-user=username --ftp-password=password -r -m -nH ftp://xxx.xxx.xxx.xxx/xxx/xxx/*
Those are the commands; the FTP username and password parameters need no explanation.
-r is optional and indicates a recursive download; it is required if you want to download an entire directory.
-m turns on mirroring, which also needs no further explanation.
-nH keeps wget from creating a directory named after the remote host, so the downloaded files land directly under the current directory. A very handy option.
After that comes the FTP address; the * after the final slash means every file in that directory is downloaded. If you only want one file, give its name directly.
V. Tips
a. All major Linux distributions ship with the wget download tool.
bash$ wget http://place.your.url/here
It can also use FTP to download the entire directory tree of a web site; of course, if you are not careful, you may end up downloading the whole site plus every site it links to.
bash$ wget -m http://target.web.site/subdirectory
Because this tool has such strong download capabilities, it can be used on a server as a web-site mirroring tool. Let it follow the rules in "robots.txt". There are many parameters that control how it mirrors correctly; you can limit the types of links followed, the types of files downloaded, and so on. For example, to follow only relative links and skip GIF images:
The code is as follows:
bash$ wget -m -L --reject=gif http://target.web.site/subdirectory
wget can also resume interrupted downloads (the -c option), which of course needs to be supported by the remote server.
The code is as follows:
bash$ wget -c http://the.url.of/incomplete/file
Resuming can be combined with mirroring, so that a site containing a large number of files can keep being mirrored even if the transfer has been interrupted many times before (see the sketch after this paragraph). How to automate this is discussed further below.
If you are worried that repeated downloading will get in the way of your work, you can limit the number of times wget retries.
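In the spirit of the crontab example later in this article, a minimal sketch of combining the two (the URL is a placeholder and behaviour may vary between wget versions):
The code is as follows:
bash$ wget -m -c http://target.web.site/subdirectory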
The code is as follows:
bash$ wget -t 5 http://place.your.url/here
This gives up after five retries. Use the "-t inf" parameter to never give up and keep retrying indefinitely.
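A one-line sketch of never giving up, with a pause between attempts (the URL is a placeholder and adding --waitretry is my own suggestion, not from the original):
The code is as follows:
bash$ wget -t inf --waitretry=10 http://place.your.url/here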
b. What about proxies? You can use the http_proxy environment variable, or specify in the .wgetrc configuration file that downloads should go through a proxy. There is one problem, though: resuming a download through a proxy may fail several times. If a download through a proxy is interrupted, the proxy server caches an incomplete copy of the file; when you then use "wget -c" to fetch the rest, the proxy consults its cache and mistakenly reports that you already have the whole file, so the wrong signal is sent. You can push the proxy server into bypassing its cache by adding a specific request header:
The code is as follows:
bash$ wget -c --header="Pragma: no-cache" http://place.your.url/here
The "--header" parameter can be repeated any number of times with various contents: it lets us modify the headers sent to the web server or proxy server. Some sites do not serve files linked from the outside; content is delivered only when it is requested from another page on the same site. In that case you can add a "Referer:" header: bash$ wget --header="Referer: http://coming.from.this/page" http://surfing.to.this/page. Some particularly fussy websites only serve a specific browser; in that case you can change the "User-Agent:" header:
The code is as follows:
Bash$ wget-header= "User-Agent: Mozilla/4.0 (compatible; MSIE 5.0 Taiwan windows NT; DigExt)" http://msie.only.url/here
C. how do I set the download time?
If you need to download large files on your office computer through a shared connection with other colleagues, and you hope that your colleagues will not be affected by the slowing down of the Internet, then you should try to avoid rush hours. Of course, you don't have to wait for everyone to leave in the office, and you don't have to think about downloading it online once after dinner at home. Working hours can be well customized with at: bash$ at 23:00warning: commands will be executed using / bin/shat > wget http://place.your.url/hereat> press Ctrl-D so we set the download to take place at 11:00 in the evening. In order for this arrangement to work properly, make sure that the atd daemon is running.
d. What if the download takes a very long time?
When you need to download a large amount of data and you do not have enough bandwidth, you will often find that the next day's work is about to start before your scheduled download has finished.
As a considerate colleague, you can only stop those tasks and start another job, and then you would have to rerun "wget -c" again and again to finish the download. That is far too tedious, so it is better to automate it with crontab. Create a plain text file named "crontab.txt" with the following contents:
0 23 * * 1-5 wget -c -N http://place.your.url/here
0 6 * * 1-5 killall wget
This crontab file specifies which tasks are executed periodically. The first five columns state when the command runs, and the remainder of each line tells crontab what to execute.
The first two columns specify that the wget download starts at 23:00 in the evening and that all wget downloads are stopped at 6:00 in the morning. The * in the third and fourth columns means the task runs every day of every month. The fifth column specifies on which days of the week to run the program; "1-5" means Monday to Friday. So on every working day the download starts at 23:00 in the evening, and by 6:00 in the morning any running wget task is stopped. You can install this schedule with the following command:
The code is as follows:
bash$ crontab crontab.txt
The "-N" parameter makes wget check the timestamp of the target file; if it matches, the download stops, because it means the whole file has already been transferred. Use "crontab -r" to remove this schedule. I have used this approach many times to download a lot of ISO images over a shared dial-up line, and it is quite practical.
e. How do I download dynamic web pages?
Some web pages change several times a day on demand, so technically the target is no longer a file and has no fixed length; the "-c" parameter therefore makes no sense. For example, the Linux weekend news page, generated by PHP and changing constantly:
The code is as follows:
bash$ wget http://lwn.net/bigpage.php3
The network conditions in my office are often very poor, which causes plenty of trouble for my downloads, so I wrote a simple script that checks whether the dynamic page has been retrieved completely.
The code is as follows:
#!/bin/bash
# create it if absent
touch bigpage.php3
# check if we got the whole thing
while ! grep -qi '</html>' bigpage.php3
do
  rm -f bigpage.php3
  # download LWN in one big page
  wget http://lwn.net/bigpage.php3
done
This script keeps downloading the page until "</html>" appears in it, which means the file has been retrieved completely.
f. What about SSL and cookies?
If you want to access resources over SSL, the address has to start with "https://". In that case you need another download tool, called curl, which is easy to obtain. Some websites force visitors to use cookies while browsing, so you have to obtain the "Cookie:" header value from the cookie the site gave your browser; only then will the download request be correct. For the cookie file formats of lynx and Mozilla, use the following:
The code is as follows:
bash$ cookie=$( grep nytimes ~/.lynx_cookies | awk '{printf("%s=%s;",$6,$7)}' )
This builds the request cookie needed to download content from http://www.nytimes.com, assuming you have already registered on the site with that browser. w3m uses a different, smaller cookie file format:
The code is as follows:
bash$ cookie=$( grep nytimes ~/.w3m/cookie | awk '{printf("%s=%s;",$2,$3)}' )
You can now download it in this way:
The code is as follows:
Bash$ wget-header= "Cookie: $cookie" http://www.nytimes.com/reuters/technology/tech-tech-supercomput.html
Or use the curl tool:
The code is as follows:
bash$ curl -v -b $cookie -o supercomp.html http://www.nytimes.com/reuters/technology/tech-tech-supercomput.htm
g. How do I create an address list?
So far we have downloaded single files or entire websites. Sometimes we need to download a large number of files linked from a web page without mirroring the whole site; for example, the first 20 songs out of a sequence of 100. The "--accept" and "--reject" parameters will not help here, because they operate only on file names. Instead, use the "lynx -dump" approach.
The code is as follows:
bash$ lynx -dump ftp://ftp.ssc.com/pub/lg/ | grep 'gz$' | tail -10 | awk '{print $2}' > urllist.txt
The output of lynx can be filtered with the various GNU text-processing tools. In the example above we take the link addresses ending in "gz" and write the last 10 of them to the file urllist.txt. We can then write a simple bash script to download every target listed in that file automatically:
The code is as follows:
bash$ for x in $(cat urllist.txt)
> do
> wget $x
> done
In this way we successfully download the latest 10 issues from the Linux Gazette site (ftp://ftp.ssc.com/pub/lg/).
h. Expanding the bandwidth used
If you download a file from a server that limits bandwidth, the transfer will be slow because of that server-side restriction. The following trick can shorten the download considerably, but it requires curl and a file that is available from several mirrors. For example, suppose you want to download Mandrake 8.0 from the following three addresses:
The code is as follows:
url1=http://ftp.eecs.umich.edu/pub/linux/mandrake/iso/Mandrake80-inst.iso
url2=http://ftp.rpmfind.net/linux/Mandrake/iso/Mandrake80-inst.iso
url3=http://ftp.wayne.edu/linux/mandrake/iso/Mandrake80-inst.iso
The file is 677281792 bytes long, so use the curl program with the "--range" parameter to start three simultaneous downloads:
The code is as follows:
bash$ curl -r 0-199999999 -o mdk-iso.part1 $url1 &
bash$ curl -r 200000000-399999999 -o mdk-iso.part2 $url2 &
bash$ curl -r 400000000- -o mdk-iso.part3 $url3 &
This creates three background processes, each transferring a different part of the ISO file from a different server. The "-r" parameter specifies the byte range of the target file. When all three processes have finished, join the three pieces with a simple cat command: cat mdk-iso.part? > mdk-80.iso (checking the md5 before burning is strongly recommended, as sketched below).
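For the md5 check, a minimal sketch (the file name mdk-80.iso comes from the cat command above; the reference checksum is whatever the mirror publishes):
The code is as follows:
bash$ md5sum mdk-80.iso
# compare the printed hash against the checksum published by the mirror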
You can also add the "--verbose" parameter so that each curl process reports its own transfer progress.
At this point, I believe you have a deeper understanding of how to use the wget command under the Linux system. You might as well put it into practice, and keep exploring the related material to continue learning.