Wget Module

The Rezo Wget module is a full-featured wget clone for Node.js that provides recursive site downloading, DOM-based asset extraction, link conversion for offline browsing, and comprehensive filtering. It uses the Rezo HTTP client internally for all network requests, giving you access to cookies, proxies, stealth profiles, and every other Rezo feature during downloads.

import { Wget } from 'rezo/wget';

The Wget Class

Wget is the primary entry point. It accepts nested configuration options and exposes a fluent (method-chaining) API for building download jobs.

Constructor Options

import { Wget } from 'rezo/wget';

const wg = new Wget({
  recursive: {
    enabled: true,
    depth: 3,
    pageRequisites: true,
    convertLinks: true,
  },
  filter: {
    domains: ['example.com'],
    acceptAssetTypes: ['document', 'stylesheet', 'image', 'font', 'favicon'],
    followTags: ['link', 'img', 'style'],
    noParent: true,
  },
  download: {
    outputDir: './mirror',
    adjustExtension: true,
    wait: 0.5,
    timeout: 30,
    tries: 3,
  },
  organizeAssets: true,
  extractInternalStyles: true,
});

Fluent API

Every configuration option has a corresponding chainable method. This lets you build download jobs incrementally:

const stats = await new Wget()
  .concurrency(10)
  .convertLinks()
  .removeJavascript(true)
  .pageRequisites()
  .noRobots()
  .domains('example.com', 'cdn.example.com')
  .outputDir('./output')
  .organizeAssets(true)
  .extractInternalStyles(true)
  .get('https://example.com/');

Core Methods

.get(url) / .getAll(urls)

Download a single URL or multiple URLs. Returns a WgetStats object when the download completes.

// Single page and its requisites
const stats = await wg.get('https://example.com/page.html');

// Multiple starting URLs
const stats = await wg.getAll([
  'https://example.com/',
  'https://example.com/about',
  'https://example.com/docs',
]);

.concurrency(n)

Set the number of parallel downloads. The Downloader uses a RezoQueue internally for workflow orchestration while Rezo’s built-in queue handles HTTP-level concurrency.

const wg = new Wget().concurrency(10);

.convertLinks()

After downloading, rewrite all URLs in HTML and CSS files to relative paths so the site works offline. Implements wget’s -k / --convert-links flag.

await new Wget()
  .convertLinks()
  .pageRequisites()
  .outputDir('./offline')
  .get('https://example.com/');

// Result: <link href="https://example.com/style.css">
//      -> <link href="./style.css">
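The core of that rewrite step can be sketched with plain path arithmetic. This is a hypothetical helper, not the LinkConverter implementation: given the local path of the HTML file and the local path where an asset was saved, compute the relative href between them.

```typescript
import { posix } from 'node:path';

// Sketch: map a locally saved asset to a path relative to the
// referencing HTML file (hypothetical helper, for illustration only).
function toRelativeHref(fromHtmlPath: string, toAssetPath: string): string {
  const rel = posix.relative(posix.dirname(fromHtmlPath), toAssetPath);
  // Prefix same-directory paths with './' so they read as relative hrefs.
  return rel.startsWith('.') ? rel : './' + rel;
}

toRelativeHref('mirror/index.html', 'mirror/style.css');      // './style.css'
toRelativeHref('mirror/docs/page.html', 'mirror/style.css');  // '../style.css'
```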

.pageRequisites()

Download all assets required to render each page: stylesheets, images, fonts, scripts, favicons, and manifests. Equivalent to wget’s -p flag. This also allows downloads from sibling subdomains (e.g. cdn.example.com when crawling example.com).

await new Wget()
  .pageRequisites()
  .get('https://example.com/');

.removeJavascript(enabled)

Strip all <script> tags and inline event handlers (onclick, onload, etc.) from downloaded HTML. Useful for creating clean offline archives.

await new Wget()
  .removeJavascript(true)
  .get('https://example.com/');

.noRobots()

Ignore robots.txt restrictions. By default the Downloader respects robots.txt via the RobotsHandler class; call this method to disable that check.

await new Wget().noRobots().get('https://example.com/');

.domains(...domains)

Restrict downloads to the listed domains. Supports subdomain matching — specifying example.com also allows www.example.com and cdn.example.com.

await new Wget()
  .domains('example.com', 'static.example.com')
  .get('https://example.com/');
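The subdomain rule described above can be sketched as a simple hostname predicate. This is a hypothetical helper for illustration, not the actual UrlFilter code:

```typescript
// Sketch of subdomain matching: 'example.com' matches itself and any
// subdomain such as 'www.example.com' or 'cdn.example.com', but not
// lookalike hosts such as 'evil-example.com'.
function matchesDomain(hostname: string, allowed: string[]): boolean {
  return allowed.some(
    (domain) => hostname === domain || hostname.endsWith('.' + domain)
  );
}

matchesDomain('cdn.example.com', ['example.com']);  // true
matchesDomain('evil-example.com', ['example.com']); // false
```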

.outputDir(path)

Set the output directory where all downloaded files are saved. Defaults to the current working directory.

await new Wget()
  .outputDir('./site-mirror')
  .get('https://example.com/');

.organizeAssets(enabled)

When enabled, the AssetOrganizer sorts downloaded assets into categorized folders:

output/
  css/          # Stylesheets
  js/           # JavaScript
  images/       # Images (PNG, JPG, SVG, WebP, etc.)
  fonts/        # Web fonts (WOFF, WOFF2, TTF, etc.)
  audio/        # Audio files
  video/        # Video files
  assets/       # Everything else
  index.html    # HTML documents keep their path structure

HTML documents are never reorganized — they keep their original URL-based path.

await new Wget()
  .organizeAssets(true)
  .get('https://example.com/');
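The folder layout above can be sketched as an extension-to-folder mapping. This is a hypothetical helper, not the AssetOrganizer implementation (which may also consult MIME types):

```typescript
// Sketch: pick a category folder from a filename's extension,
// mirroring the layout shown above. Unrecognized types fall
// through to 'assets'.
function assetFolder(filename: string): string {
  const ext = filename.split('.').pop()?.toLowerCase() ?? '';
  if (ext === 'css') return 'css';
  if (['js', 'mjs'].includes(ext)) return 'js';
  if (['png', 'jpg', 'jpeg', 'gif', 'svg', 'webp'].includes(ext)) return 'images';
  if (['woff', 'woff2', 'ttf', 'otf'].includes(ext)) return 'fonts';
  if (['mp3', 'ogg', 'wav'].includes(ext)) return 'audio';
  if (['mp4', 'webm'].includes(ext)) return 'video';
  return 'assets';
}

assetFolder('logo.svg');   // 'images'
assetFolder('font.woff2'); // 'fonts'
assetFolder('data.json');  // 'assets'
```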

.extractInternalStyles(enabled)

Extract inline <style> tags into separate CSS files and replace them with <link> references. The StyleExtractor names files based on the style tag’s id, name, or class attribute, falling back to a numeric index.

await new Wget()
  .extractInternalStyles(true)
  .organizeAssets(true)
  .get('https://example.com/');

// Before: <style id="theme">body { color: red; }</style>
// After:  <link rel="stylesheet" href="./css/internal.theme.css">
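The naming rule can be sketched as follows. This is a hypothetical helper, not the StyleExtractor code; the `internal.` prefix is taken from the example above:

```typescript
// Sketch: derive a CSS filename from a <style> tag's attributes,
// preferring id, then name, then the first class, else a numeric index.
function styleFileName(
  attrs: { id?: string; name?: string; class?: string },
  index: number
): string {
  const base =
    attrs.id || attrs.name || attrs.class?.split(/\s+/)[0] || String(index);
  return `internal.${base}.css`;
}

styleFileName({ id: 'theme' }, 0);        // 'internal.theme.css'
styleFileName({ class: 'dark wide' }, 1); // 'internal.dark.css'
styleFileName({}, 3);                     // 'internal.3.css'
```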

Events

The Wget class emits events throughout the download lifecycle. Attach handlers with .on():

const wg = new Wget()
  .outputDir('./mirror')
  .on('start', (event) => {
    console.log(`Starting download of ${event.urls.length} URLs`);
  })
  .on('progress', (event) => {
    console.log(`${event.percent}% - ${event.url}`);
  })
  .on('download', (event) => {
    console.log(`Downloaded: ${event.url} -> ${event.localPath}`);
  })
  .on('skip', (event) => {
    console.log(`Skipped: ${event.url} (${event.reason})`);
  })
  .on('error', (event) => {
    console.log(`Failed: ${event.url}`, event.error);
  })
  .on('complete', (event) => {
    console.log(`Done! ${event.stats.filesWritten} files written`);
  });

await wg.get('https://example.com/');

Event Types

Event              Description
start              Fired when the download job begins
progress           Periodic progress updates with percentage and current URL
download           Fired after each file is successfully saved
skip               Fired when a URL is skipped (filtered, duplicate, robots, etc.)
error              Fired on download failures (the job continues)
complete           Fired when all downloads finish
link-conversion    Fired during the link conversion phase
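The skip event's reason field can be used to audit why URLs were excluded from a crawl. A minimal sketch, assuming reason is a short string such as 'filtered' or 'robots':

```typescript
// Sketch: tally skip reasons collected from 'skip' event payloads.
type SkipEvent = { url: string; reason: string };

function tallySkips(events: SkipEvent[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    counts.set(e.reason, (counts.get(e.reason) ?? 0) + 1);
  }
  return counts;
}

tallySkips([
  { url: 'https://example.com/a', reason: 'robots' },
  { url: 'https://example.com/b', reason: 'robots' },
  { url: 'https://example.com/c', reason: 'filtered' },
]);
// Map { 'robots' => 2, 'filtered' => 1 }
```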

Stats Tracking

Every download returns a WgetStats object:

interface WgetStats {
  urlsDiscovered: number;    // Total URLs found
  urlsDownloaded: number;    // Successfully downloaded
  urlsSkipped: number;       // Filtered or duplicated
  urlsFailed: number;        // Failed downloads
  filesWritten: number;      // Files saved to disk
  bytesDownloaded: number;   // Total bytes transferred
  startTime: number;         // Timestamp when started
  endTime: number;           // Timestamp when finished
  duration: number;          // Total time in milliseconds
}

const stats = await new Wget()
  .pageRequisites()
  .outputDir('./mirror')
  .get('https://example.com/');

console.log(`Downloaded ${stats.filesWritten} files`);
console.log(`Total: ${(stats.bytesDownloaded / 1024 / 1024).toFixed(2)} MB`);
console.log(`Duration: ${(stats.duration / 1000).toFixed(1)}s`);
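Derived metrics are plain arithmetic on these fields; for example, average throughput in MB/s (a sketch using only the bytesDownloaded and duration fields defined above):

```typescript
// Sketch: compute average throughput in MB/s from WgetStats fields.
function throughputMBps(stats: {
  bytesDownloaded: number;
  duration: number;
}): number {
  const megabytes = stats.bytesDownloaded / 1024 / 1024;
  const seconds = stats.duration / 1000; // duration is in milliseconds
  return megabytes / seconds;
}

throughputMBps({ bytesDownloaded: 2 * 1024 * 1024, duration: 2000 }); // 1
```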

Complete Example

Mirror a website for offline browsing with organized assets, extracted styles, and no JavaScript:

import { Wget } from 'rezo/wget';

const stats = await new Wget({
  organizeAssets: true,
  extractInternalStyles: true,
  download: {
    adjustExtension: true,
  },
  filter: {
    acceptAssetTypes: ['document', 'stylesheet', 'image', 'font', 'favicon'],
    followTags: ['link', 'img', 'style'],
  },
})
  .concurrency(10)
  .convertLinks()
  .removeJavascript(true)
  .pageRequisites()
  .noRobots()
  .domains('example.com')
  .outputDir('./offline-site')
  .on('error', (e) => console.error(`Failed: ${e.url}`, e.error))
  .on('progress', (e) => console.log(`${e.percent}%`))
  .getAll([
    'https://example.com/',
    'https://example.com/docs',
  ]);

console.log(`Mirror complete: ${stats.filesWritten} files, ${stats.urlsFailed} errors`);

Internal Architecture

The Wget module is composed of several specialized classes:

Class              File                           Role
Wget               src/wget/index.ts              Public API, fluent builder, event routing
Downloader         src/wget/downloader.ts         Core engine: queue, retry, orchestration
AssetExtractor     src/wget/asset-extractor.ts    DOM-based URL extraction from HTML/CSS/XML/JS
AssetOrganizer     src/wget/asset-organizer.ts    Folder categorization and hash deduplication
UrlFilter          src/wget/url-filter.ts         Domain, pattern, depth, directory filtering
LinkConverter      src/wget/link-converter.ts     Post-download URL rewriting for offline use
StyleExtractor     src/wget/style-extractor.ts    Inline style extraction to CSS files
FileWriter         src/wget/file-writer.ts        Disk I/O with collision handling
RobotsHandler      src/wget/robots.ts             robots.txt parsing and enforcement
ResumeHandler      src/wget/resume.ts             Partial download resumption
ProgressReporter   src/wget/progress.ts           Progress calculation and events
DownloadCache      src/wget/download-cache.ts     In-memory download deduplication