Pyods: advanced open directory scraper

I'm a long-time reader of and participant in the subreddits r/Datahoarder and r/OpenDirectories. Since both communities deal with the retrieval and archival of data, wget is a wildly popular tool among their members.

For the unfamiliar, an "open directory" is a directory on a website with no index file. Since there's no index page to serve, most web servers instead generate a listing of the files in that directory. Wget is a command-line tool for making various types of HTTP and FTP requests, and it specializes in content mirroring, which makes it a natural fit for open directories: it follows links at the target URL to discover additional files, with the end goal of producing an identical local mirror of the remote site.
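To make that concrete, a typical wget invocation for mirroring an open directory looks something like this (the URL is a placeholder; -r recurses into links, -np refuses to ascend to the parent directory, and -nH drops the hostname from the local path):

$ wget -r -np -nH http://example.com/some_open_directory/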

For the most part, wget excels at this. It has been around for decades, is supported on nearly every operating system, is easy to find information about online, and is generally bug-free.

However, wget suffers from one design flaw: it is single-threaded, so it downloads only one file at a time.

Meet Pyods, the modern Python Open Directory Scraper:

$ pyods --help
usage: pyods [-h] [-u URL] [-o OUTPUT_DIR] [-p PARALLEL] [-c] [-d DELAY]
             [-e EXCLUDE [EXCLUDE ...]] [-f EXCLUDE_FROM] [-v]

Open directory scraper

optional arguments:
  -h, --help            show this help message and exit
  -u URL, --url URL     url to scrape
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        destination for downloaded files
  -p PARALLEL, --parallel PARALLEL
                        number of downloads to execute in parallel
  -c, --clobber         clobber existing files instead of resuming
  -d DELAY, --delay DELAY
                        delay between requests
  -e EXCLUDE [EXCLUDE ...], --exclude EXCLUDE [EXCLUDE ...]
                        exclude patterns
  -f EXCLUDE_FROM, --exclude-from EXCLUDE_FROM
                        exclude patterns from file
  -v, --verbose         enable info logging

Pyods was written in Python 3 from the ground up with speed and mirroring in mind. In contrast with wget, Pyods downloads URLs in parallel. The options in the help text above show off Pyods' other capabilities, such as excluding URLs and adding a delay between requests.
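Pyods' internals aren't reproduced here, but the core parallel-download technique can be sketched in a few lines of Python using the standard library. Everything below, including the download helper and the example URLs, is illustrative rather than Pyods' actual code:

# Illustrative sketch only -- not Pyods' actual implementation.
# Demonstrates downloading a batch of URLs with a bounded pool of
# worker threads, which is what lets a scraper avoid wget's
# one-file-at-a-time bottleneck.
import concurrent.futures
import os
import time
import urllib.request

def download(url, output_dir, delay=0):
    """Fetch one URL into output_dir, sleeping `delay` seconds first."""
    time.sleep(delay)
    dest = os.path.join(output_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, dest)
    return dest

# Hypothetical URLs; in a real scraper these come from parsing the
# directory listing pages.
urls = [
    "http://example.com/some_open_directory/file1.iso",
    "http://example.com/some_open_directory/file2.iso",
]
os.makedirs("./output", exist_ok=True)
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(download, u, "./output", delay=5) for u in urls]
    for f in concurrent.futures.as_completed(futures):
        print("finished", f.result())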

An example invocation might look like:

$ pyods -u http://example.com/some_open_directory/ -o ./output -p 10 --delay 5

This example would use 10 parallel downloads with a delay of 5 seconds between HTTP requests.
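Exclusions work the same way. As a hypothetical example (the exact pattern-matching semantics depend on Pyods itself, and excludes.txt is just a made-up filename), the following would skip URLs matching .jpg and .png while reading further patterns from a file:

$ pyods -u http://example.com/some_open_directory/ -o ./output -e .jpg .png -f excludes.txt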

I intend to release this on PyPI eventually, but for now Pyods is available on my git site. Happy scraping!
