Pyods: advanced open directory scraper
I'm a long-time reader of, and participant in, the subreddits r/Datahoarder and r/OpenDirectories. Since both communities deal with the retrieval and archival of data, wget is a wildly popular tool among them.
For the unfamiliar, an "open directory" is a directory on a website with no index file. Since there is no index page to serve, most web servers instead generate a page listing the files in that directory. Wget is a command-line tool for making various types of HTTP and FTP requests, and it specializes in content mirroring, which makes it work nicely with open directories: it follows links at the target URL to discover further links to download, with the end goal of producing an identical mirror of the remote site.
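To make the link-following idea concrete, here is a minimal sketch of how a scraper can discover links in an open-directory listing. The HTML below is a hypothetical example of a typical autoindex page (Apache and nginx produce similar anchor-based listings); the `discover` function and its filtering rules are illustrative, not Pyods' actual code.

```python
# Sketch: discovering links in an open-directory listing page.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href attributes from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def discover(base_url, html):
    """Return absolute URLs found in a listing, skipping parent-directory links."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links
            if href not in ("../", "/")]

# Hypothetical autoindex page for demonstration.
listing = """<html><body><h1>Index of /files/</h1>
<a href="../">Parent Directory</a>
<a href="subdir/">subdir/</a>
<a href="movie.mkv">movie.mkv</a>
</body></html>"""

print(discover("http://example.com/files/", listing))
# -> ['http://example.com/files/subdir/', 'http://example.com/files/movie.mkv']
```

A real scraper would fetch each discovered directory URL, parse it the same way, and recurse until only file URLs remain.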
For the most part, wget excels at this. It has been around for decades, is supported on nearly every operating system, is easy to find information about online, and is generally bug-free.
However, wget suffers from one design flaw - it is single-threaded, so it downloads files one at a time.
Meet Pyods - the modern Python Open Directory Scraper:
$ pyods --help
usage: pyods [-h] [-u URL] [-o OUTPUT_DIR] [-p PARALLEL] [-c] [-d DELAY]
             [-e EXCLUDE [EXCLUDE ...]] [-f EXCLUDE_FROM] [-v]

Open directory scraper

optional arguments:
  -h, --help            show this help message and exit
  -u URL, --url URL     url to scrape
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        destination for downloaded files
  -p PARALLEL, --parallel PARALLEL
                        number of downloads to execute in parallel
  -c, --clobber         clobber existing files instead of resuming
  -d DELAY, --delay DELAY
                        delay between requests
  -e EXCLUDE [EXCLUDE ...], --exclude EXCLUDE [EXCLUDE ...]
                        exclude patterns
  -f EXCLUDE_FROM, --exclude-from EXCLUDE_FROM
                        exclude patterns from file
  -v, --verbose         enable info logging
Pyods was written from the ground up in Python 3 with speed and mirroring in mind. In contrast with wget, Pyods downloads URLs in parallel. The options in the help text above show off Pyods' other capabilities, such as excluding URLs and adding a delay between requests.
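As a rough illustration of the parallel-download idea, the sketch below uses a thread pool plus glob-style exclude patterns. This is not Pyods' actual implementation - the function names and the assumption of `fnmatch`-style patterns are mine - but it shows the shape of the technique.

```python
# Sketch: parallel downloads with exclude patterns (illustrative only).
import fnmatch
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def excluded(url, patterns):
    """True if the URL matches any glob-style exclude pattern."""
    return any(fnmatch.fnmatch(url, pat) for pat in patterns)

def fetch(url, output_dir):
    """Download one URL into output_dir, keeping the remote file name."""
    dest = Path(output_dir) / url.rstrip("/").rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as fh:
        shutil.copyfileobj(resp, fh)
    return dest

def scrape(urls, output_dir, parallel=10, exclude=()):
    """Download URLs concurrently, skipping any that match an exclude pattern."""
    targets = [u for u in urls if not excluded(u, exclude)]
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(lambda u: fetch(u, output_dir), targets))

print(excluded("http://example.com/files/sample.iso", ["*.iso"]))  # True
print(excluded("http://example.com/files/readme.txt", ["*.iso"]))  # False
```

With one worker this degrades to wget-style sequential downloading; raising `max_workers` is what lets many transfers proceed at once.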
An example invocation might look like:
pyods -u http://example.com/some_open_directory/ -o ./output -p 10 --delay 5
This example would use 10 parallel downloads with a delay of 5 seconds between HTTP requests.
I intend to release this on PyPI eventually, but for now Pyods is available on my git site. Happy scraping!