wget – the Universal Web Retrieval Tool

If you spend a lot of time on the web researching information, you have probably wished you could store some of the HTML pages you find locally on your machine. Sometimes the site you are visiting is really slow, but you have to consult it frequently; other times you know you won’t be able to get to the web but need the information on the road; and in the third and worst case, your connectivity goes down and the diagnostic information you need is only available online.

Wget solves this problem. In its easiest incarnation, you can simply download, say, the Yahoo! homepage by typing:

wget http://www.yahoo.com

You’ll get a file called index.html with the current content (as of download time) of the page you asked for.

Invocation

Wget runs from the command line. Its behavior is determined by the command-line arguments and by two initialization files: /usr/local/etc/wgetrc for global initialization and ~/.wgetrc for user-specific settings. Options specified on the command line override those in the initialization files.
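For example, a user’s ~/.wgetrc might set a couple of defaults. The values below are purely illustrative; see the wget manual for the full list of recognized wgetrc commands:

```
# Illustrative ~/.wgetrc -- adjust to taste.
# Retry each download up to three times.
tries = 3
# Pause one second between successive retrievals.
wait = 1
```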

Wget is fairly fast and always tries to do the ‘right thing’: it continues where it was interrupted, and it follows redirects to retrieve files even when they are not exactly where the URL said they would be.
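For instance, if a long download is cut off halfway, the -c (--continue) option picks it up where it stopped instead of starting over. The host and file name here are just placeholders:

```shell
# Resume a partially downloaded file instead of restarting it.
# The URL is a placeholder, not a real download.
wget -c http://www.example.com/downloads/big-archive.tar.gz
```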

Use Case 1: Downloading a single URL

Note: see below for a complete reference.

As mentioned above, to download a single URL you just type wget followed by the full URL. You can skip the protocol part (http://) if the protocol is HTTP. Supported protocols are HTTP, HTTPS, and FTP.

Use Case 2: Downloading a whole directory

Not much more difficult: you simply specify the name of the directory, along with the options -r -l 1 -np. These three options turn on recursion, set the recursion depth to 1, and disallow ascending to the parent directory (-np stands for "no parent"). Note that the web server must be able to return a directory listing for this to work.

Example: wget -r -l 1 -np http://diva.homelinux.org/images/icons

Use Case 3: Downloading a whole site

This is actually fairly easy. You specify that you want infinite recursion (-r -l inf) and start at the top. There is a handy short version of this, the option -m (for ‘mirror’).

Example: wget -m http://msdb.microsoft.com/

Use Case 4: Make a local copy of a site

You know the problem: your ISP has a wonderful help site where they explain what to do if your connection is down. Unfortunately, if your connection is down, you can’t get to the site. So it’s good to have a local copy of the site for your reference. Wget makes that easy with the option --convert-links, which rewrites the links in the downloaded pages so that they point to your local copies; combine it with recursion to fetch the whole site.

Example: wget -r --convert-links http://help.earthlink.net/connections

Use Case 5: Checking bookmarks

If you are diligent, like me, after a while you have hundreds of bookmarks or favorites. When it gets to that point, you soon find out that some of the bookmarks point to nowhere, and it’s time to clean them up. Wget can help there. If you export your bookmarks to a file, say bookmarks.html, you can check every URL in that file, which gives you a good idea of what still works and what doesn’t. To do so, you combine the -i option (read URLs from an input file; add --force-html if wget does not recognize the file as HTML) with the --spider option. The latter checks whether each URL is still available, without actually downloading anything.

Example: wget --spider -i bookmarks.html
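If you would rather feed wget a plain list of URLs than the exported HTML file, you can pull the href targets out first. A rough sketch, assuming the export uses simple double-quoted href attributes (the bookmarks file below is fabricated just for illustration):

```shell
# Create a tiny bookmarks file just for illustration.
cat > bookmarks.html <<'EOF'
<a href="http://www.example.com/one.html">One</a>
<a href="http://www.example.com/two.html">Two</a>
EOF

# Keep only the URL inside each href="..." attribute:
# grep -o prints just the matching part, sed strips the wrapper.
grep -o 'href="[^"]*"' bookmarks.html \
  | sed -e 's/^href="//' -e 's/"$//' > urls.txt

cat urls.txt
```

The resulting urls.txt can then be handed to wget --spider -i urls.txt exactly as above.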

Use Case 6: Incremental backup of a web site

If you have a web site that needs to be backed up locally once in a while, you probably want to do incremental backups and download only the changes. Wget offers two options to do so, -nc (no clobber) and -N (timestamping). The former will simply refuse to download a page if it is already present on disk. The latter will compare the remote and local copies and download the remote one only if it is newer than the local copy.

By using -nc you get a minimum-effort backup: nothing already on disk is ever transferred again, even if it has changed on the server. The option -N, on the other hand, gives you an up-to-date picture of your web site, at the cost of a timestamp check for every file.

Example: wget -N -r -l inf http://diva.homelinux.org

Use Case 7: Downloading from an authenticated site

If you have a site that requires authentication, you have two options with wget:

  1. Work with web server authentication using basic or digest authentication methods
  2. Use cookies as identifiers

The two methods are implemented completely differently; we’ll discuss the first one here and the second in the next use case.
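A quick sketch of the first method: wget accepts the credentials on the command line via --http-user and --http-password, and answers with whichever of basic or digest authentication the server’s challenge requests. The user name, password, and URL below are placeholders:

```shell
# Placeholder credentials and URL -- substitute your own.
wget --http-user=alice --http-password=secret \
     http://www.example.com/members/index.html
```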
