Wget Module

The Rezo Wget module is a full-featured wget clone for Node.js that provides recursive site downloading, DOM-based asset extraction, link conversion for offline browsing, and comprehensive filtering. It uses the Rezo HTTP client internally for all network requests, giving you access to cookies, proxies, stealth profiles, and every other Rezo feature during downloads.

import { Wget } from 'rezo/wget';

The Wget Class

Wget is the primary entry point. It accepts nested configuration options and exposes a fluent (method-chaining) API for building download jobs.

Constructor Options

import { Wget } from 'rezo/wget';

const wg = new Wget({
  recursive: {
    enabled: true,
    depth: 3,
    pageRequisites: true,
    convertLinks: true,
  },
  filter: {
    domains: ['example.com'],
    acceptAssetTypes: ['document', 'stylesheet', 'image', 'font', 'favicon'],
    followTags: ['link', 'img', 'style'],
    noParent: true,
  },
  download: {
    outputDir: './mirror',
    adjustExtension: true,
    wait: 0.5,
    timeout: 30,
    tries: 3,
  },
  organizeAssets: true,
  extractInternalStyles: true,
});

Fluent API

Every configuration option has a corresponding chainable method. This lets you build download jobs incrementally:

const stats = await new Wget()
  .concurrency(10)
  .convertLinks()
  .removeJavascript(true)
  .pageRequisites()
  .noRobots()
  .domains('example.com', 'cdn.example.com')
  .outputDir('./output')
  .organizeAssets(true)
  .extractInternalStyles(true)
  .get('https://example.com/');

Core Methods

.get(url) / .getAll(urls)

Download a single URL or multiple URLs. Returns a WgetStats object when the download completes.

// Single page and its requisites
const stats = await wg.get('https://example.com/page.html');

// Multiple starting URLs
const stats = await wg.getAll([
  'https://example.com/',
  'https://example.com/about',
  'https://example.com/docs',
]);

.concurrency(n)

Set the number of parallel downloads. The Downloader uses a RezoQueue internally for workflow orchestration while Rezo’s built-in queue handles HTTP-level concurrency.

const wg = new Wget().concurrency(10);

.convertLinks()

After downloading, rewrite all URLs in HTML and CSS files to relative paths so the site works offline. Implements wget’s -k / --convert-links flag.

await new Wget()
  .convertLinks()
  .pageRequisites()
  .outputDir('./offline')
  .get('https://example.com/');

// Result: <link href="https://example.com/style.css">
//      -> <link href="./style.css">
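The core of that rewrite step can be sketched with plain path arithmetic. This is a hypothetical helper, not the LinkConverter implementation: given the local path of the HTML file and the local path where an asset was saved, compute the relative href between them.

```typescript
import { posix } from 'node:path';

// Sketch: map a locally saved asset to a path relative to the
// referencing HTML file (hypothetical helper, for illustration only).
function toRelativeHref(fromHtmlPath: string, toAssetPath: string): string {
  const rel = posix.relative(posix.dirname(fromHtmlPath), toAssetPath);
  // Prefix same-directory paths with './' so they read as relative hrefs.
  return rel.startsWith('.') ? rel : './' + rel;
}

toRelativeHref('mirror/index.html', 'mirror/style.css');      // './style.css'
toRelativeHref('mirror/docs/page.html', 'mirror/style.css');  // '../style.css'
```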

.pageRequisites()

Download all assets required to render each page: stylesheets, images, fonts, scripts, favicons, and manifests. Equivalent to wget’s -p flag. This also allows downloads from sibling subdomains (e.g. cdn.example.com when crawling example.com).

await new Wget()
  .pageRequisites()
  .get('https://example.com/');

.removeJavascript(enabled)

Strip all <script> tags and inline event handlers (onclick, onload, etc.) from downloaded HTML. Useful for creating clean offline archives.

await new Wget()
  .removeJavascript(true)
  .get('https://example.com/');

.noRobots()

Ignore robots.txt restrictions. By default the Downloader respects robots.txt via the RobotsHandler class; call this method to disable that check.

await new Wget().noRobots().get('https://example.com/');

.domains(...domains)

Restrict downloads to the listed domains. Supports subdomain matching — specifying example.com also allows www.example.com and cdn.example.com.

await new Wget()
  .domains('example.com', 'static.example.com')
  .get('https://example.com/');
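The subdomain rule described above can be sketched as a simple hostname predicate. This is a hypothetical helper for illustration, not the actual UrlFilter code:

```typescript
// Sketch of subdomain matching: 'example.com' matches itself and any
// subdomain such as 'www.example.com' or 'cdn.example.com', but not
// lookalike hosts such as 'evil-example.com'.
function matchesDomain(hostname: string, allowed: string[]): boolean {
  return allowed.some(
    (domain) => hostname === domain || hostname.endsWith('.' + domain)
  );
}

matchesDomain('cdn.example.com', ['example.com']);  // true
matchesDomain('evil-example.com', ['example.com']); // false
```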

.outputDir(path)

Set the output directory where all downloaded files are saved. Defaults to the current working directory.

await new Wget()
  .outputDir('./site-mirror')
  .get('https://example.com/');

.organizeAssets(enabled)

When enabled, the AssetOrganizer sorts downloaded assets into categorized folders:

output/
  css/          # Stylesheets
  js/           # JavaScript
  images/       # Images (PNG, JPG, SVG, WebP, etc.)
  fonts/        # Web fonts (WOFF, WOFF2, TTF, etc.)
  audio/        # Audio files
  video/        # Video files
  assets/       # Everything else
  index.html    # HTML documents keep their path structure

HTML documents are never reorganized — they keep their original URL-based path.

await new Wget()
  .organizeAssets(true)
  .get('https://example.com/');
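The folder layout above can be sketched as an extension-to-folder mapping. This is a hypothetical helper, not the AssetOrganizer implementation (which may also consult MIME types):

```typescript
// Sketch: pick a category folder from a filename's extension,
// mirroring the layout shown above. Unrecognized types fall
// through to 'assets'.
function assetFolder(filename: string): string {
  const ext = filename.split('.').pop()?.toLowerCase() ?? '';
  if (ext === 'css') return 'css';
  if (['js', 'mjs'].includes(ext)) return 'js';
  if (['png', 'jpg', 'jpeg', 'gif', 'svg', 'webp'].includes(ext)) return 'images';
  if (['woff', 'woff2', 'ttf', 'otf'].includes(ext)) return 'fonts';
  if (['mp3', 'ogg', 'wav'].includes(ext)) return 'audio';
  if (['mp4', 'webm'].includes(ext)) return 'video';
  return 'assets';
}

assetFolder('logo.svg');   // 'images'
assetFolder('font.woff2'); // 'fonts'
assetFolder('data.json');  // 'assets'
```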

.extractInternalStyles(enabled)

Extract inline <style> tags into separate CSS files and replace them with <link> references. The StyleExtractor names files based on the style tag’s id, name, or class attribute, falling back to a numeric index.

await new Wget()
  .extractInternalStyles(true)
  .organizeAssets(true)
  .get('https://example.com/');

// Before: <style id="theme">body { color: red; }</style>
// After:  <link rel="stylesheet" href="./css/internal.theme.css">
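The naming rule can be sketched as follows. This is a hypothetical helper, not the StyleExtractor code; the `internal.` prefix is taken from the example above:

```typescript
// Sketch: derive a CSS filename from a <style> tag's attributes,
// preferring id, then name, then the first class, else a numeric index.
function styleFileName(
  attrs: { id?: string; name?: string; class?: string },
  index: number
): string {
  const base =
    attrs.id || attrs.name || attrs.class?.split(/\s+/)[0] || String(index);
  return `internal.${base}.css`;
}

styleFileName({ id: 'theme' }, 0);        // 'internal.theme.css'
styleFileName({ class: 'dark wide' }, 1); // 'internal.dark.css'
styleFileName({}, 3);                     // 'internal.3.css'
```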

Events

The Wget class emits events throughout the download lifecycle. Attach handlers with .on():

const wg = new Wget()
  .outputDir('./mirror')
  .on('start', (event) => {
    console.log(`Starting download of ${event.urls.length} URLs`);
  })
  .on('progress', (event) => {
    console.log(`${event.percent}% - ${event.url}`);
  })
  .on('download', (event) => {
    console.log(`Downloaded: ${event.url} -> ${event.localPath}`);
  })
  .on('skip', (event) => {
    console.log(`Skipped: ${event.url} (${event.reason})`);
  })
  .on('error', (event) => {
    console.log(`Failed: ${event.url}`, event.error);
  })
  .on('complete', (event) => {
    console.log(`Done! ${event.stats.filesWritten} files written`);
  });

await wg.get('https://example.com/');

Event Types

Event              Description
start              Fired when the download job begins
progress           Periodic progress updates with percentage and current URL
download           Fired after each file is successfully saved
skip               Fired when a URL is skipped (filtered, duplicate, robots, etc.)
error              Fired on download failures (the job continues)
complete           Fired when all downloads finish
link-conversion    Fired during the link conversion phase
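The skip event's reason field can be used to audit why URLs were excluded from a crawl. A minimal sketch, assuming reason is a short string such as 'filtered' or 'robots':

```typescript
// Sketch: tally skip reasons collected from 'skip' event payloads.
type SkipEvent = { url: string; reason: string };

function tallySkips(events: SkipEvent[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    counts.set(e.reason, (counts.get(e.reason) ?? 0) + 1);
  }
  return counts;
}

tallySkips([
  { url: 'https://example.com/a', reason: 'robots' },
  { url: 'https://example.com/b', reason: 'robots' },
  { url: 'https://example.com/c', reason: 'filtered' },
]);
// Map { 'robots' => 2, 'filtered' => 1 }
```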

Stats Tracking

Every download returns a WgetStats object:

interface WgetStats {
  urlsDiscovered: number;    // Total URLs found
  urlsDownloaded: number;    // Successfully downloaded
  urlsSkipped: number;       // Filtered or duplicated
  urlsFailed: number;        // Failed downloads
  filesWritten: number;      // Files saved to disk
  bytesDownloaded: number;   // Total bytes transferred
  startTime: number;         // Timestamp when started
  endTime: number;           // Timestamp when finished
  duration: number;          // Total time in milliseconds
}

const stats = await new Wget()
  .pageRequisites()
  .outputDir('./mirror')
  .get('https://example.com/');

console.log(`Downloaded ${stats.filesWritten} files`);
console.log(`Total: ${(stats.bytesDownloaded / 1024 / 1024).toFixed(2)} MB`);
console.log(`Duration: ${(stats.duration / 1000).toFixed(1)}s`);
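Derived metrics are plain arithmetic on these fields; for example, average throughput in MB/s (a sketch using only the bytesDownloaded and duration fields defined above):

```typescript
// Sketch: compute average throughput in MB/s from WgetStats fields.
function throughputMBps(stats: {
  bytesDownloaded: number;
  duration: number;
}): number {
  const megabytes = stats.bytesDownloaded / 1024 / 1024;
  const seconds = stats.duration / 1000; // duration is in milliseconds
  return megabytes / seconds;
}

throughputMBps({ bytesDownloaded: 2 * 1024 * 1024, duration: 2000 }); // 1
```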

Complete Example

Mirror a website for offline browsing with organized assets, extracted styles, and no JavaScript:

import { Wget } from 'rezo/wget';

const stats = await new Wget({
  organizeAssets: true,
  extractInternalStyles: true,
  download: {
    adjustExtension: true,
  },
  filter: {
    acceptAssetTypes: ['document', 'stylesheet', 'image', 'font', 'favicon'],
    followTags: ['link', 'img', 'style'],
  },
})
  .concurrency(10)
  .convertLinks()
  .removeJavascript(true)
  .pageRequisites()
  .noRobots()
  .domains('example.com')
  .outputDir('./offline-site')
  .on('error', (e) => console.error(`Failed: ${e.url}`, e.error))
  .on('progress', (e) => console.log(`${e.percent}%`))
  .getAll([
    'https://example.com/',
    'https://example.com/docs',
  ]);

console.log(`Mirror complete: ${stats.filesWritten} files, ${stats.urlsFailed} errors`);

Internal Architecture

The Wget module is composed of several specialized classes:

Class              File                           Role
Wget               src/wget/index.ts              Public API, fluent builder, event routing
Downloader         src/wget/downloader.ts         Core engine: queue, retry, orchestration
AssetExtractor     src/wget/asset-extractor.ts    DOM-based URL extraction from HTML/CSS/XML/JS
AssetOrganizer     src/wget/asset-organizer.ts    Folder categorization and hash deduplication
UrlFilter          src/wget/url-filter.ts         Domain, pattern, depth, directory filtering
LinkConverter      src/wget/link-converter.ts     Post-download URL rewriting for offline use
StyleExtractor     src/wget/style-extractor.ts    Inline style extraction to CSS files
FileWriter         src/wget/file-writer.ts        Disk I/O with collision handling
RobotsHandler      src/wget/robots.ts             robots.txt parsing and enforcement
ResumeHandler      src/wget/resume.ts             Partial download resumption
ProgressReporter   src/wget/progress.ts           Progress calculation and events
DownloadCache      src/wget/download-cache.ts     In-memory download deduplication