# Wget Module
The Rezo Wget module is a full-featured wget clone for Node.js that provides recursive site downloading, DOM-based asset extraction, link conversion for offline browsing, and comprehensive filtering. It uses the Rezo HTTP client internally for all network requests, giving you access to cookies, proxies, stealth profiles, and every other Rezo feature during downloads.
```typescript
import { Wget } from 'rezo/wget';
```

## The Wget Class
Wget is the primary entry point. It accepts nested configuration options and exposes a fluent (method-chaining) API for building download jobs.
### Constructor Options
```typescript
import { Wget } from 'rezo/wget';

const wg = new Wget({
  recursive: {
    enabled: true,
    depth: 3,
    pageRequisites: true,
    convertLinks: true,
  },
  filter: {
    domains: ['example.com'],
    acceptAssetTypes: ['document', 'stylesheet', 'image', 'font', 'favicon'],
    followTags: ['link', 'img', 'style'],
    noParent: true,
  },
  download: {
    outputDir: './mirror',
    adjustExtension: true,
    wait: 0.5,
    timeout: 30,
    tries: 3,
  },
  organizeAssets: true,
  extractInternalStyles: true,
});
```

### Fluent API
Every configuration option has a corresponding chainable method. This lets you build download jobs incrementally:
```typescript
const stats = await new Wget()
  .concurrency(10)
  .convertLinks()
  .removeJavascript(true)
  .pageRequisites()
  .noRobots()
  .domains('example.com', 'cdn.example.com')
  .outputDir('./output')
  .organizeAssets(true)
  .extractInternalStyles(true)
  .get('https://example.com/');
```

## Core Methods
### .get(url) / .getAll(urls)
Download a single URL or multiple URLs. Returns a WgetStats object when the download completes.
```typescript
// Single page and its requisites
const stats = await wg.get('https://example.com/page.html');

// Multiple starting URLs
const allStats = await wg.getAll([
  'https://example.com/',
  'https://example.com/about',
  'https://example.com/docs',
]);
```

### .concurrency(n)
Set the number of parallel downloads. The Downloader uses a RezoQueue internally for workflow orchestration while Rezo’s built-in queue handles HTTP-level concurrency.
```typescript
const wg = new Wget().concurrency(10);
```

### .convertLinks()
After downloading, rewrite all URLs in HTML and CSS files to relative paths so the site works offline. Implements wget’s -k / --convert-links flag.
```typescript
await new Wget()
  .convertLinks()
  .pageRequisites()
  .outputDir('./offline')
  .get('https://example.com/');

// Result: <link href="https://example.com/style.css">
//      -> <link href="./style.css">
```

### .pageRequisites()
Download all assets required to render each page: stylesheets, images, fonts, scripts, favicons, and manifests. Equivalent to wget’s -p flag. This also allows downloads from sibling subdomains (e.g. cdn.example.com when crawling example.com).
```typescript
await new Wget()
  .pageRequisites()
  .get('https://example.com/');
```

### .removeJavascript(enabled)
Strip all <script> tags and inline event handlers (onclick, onload, etc.) from downloaded HTML. Useful for creating clean offline archives.
```typescript
await new Wget()
  .removeJavascript(true)
  .get('https://example.com/');
```

### .noRobots()
Ignore robots.txt restrictions. By default the Downloader respects robots.txt via the RobotsHandler class; call this method to disable that behavior.
```typescript
await new Wget().noRobots().get('https://example.com/');
```

### .domains(...domains)
Restrict downloads to the listed domains. Supports subdomain matching — specifying example.com also allows www.example.com and cdn.example.com.
```typescript
await new Wget()
  .domains('example.com', 'static.example.com')
  .get('https://example.com/');
```

### .outputDir(path)
Set the output directory where all downloaded files are saved. Defaults to the current working directory.
```typescript
await new Wget()
  .outputDir('./site-mirror')
  .get('https://example.com/');
```

### .organizeAssets(enabled)
When enabled, the AssetOrganizer sorts downloaded assets into categorized folders:
```
output/
  css/        # Stylesheets
  js/         # JavaScript
  images/     # Images (PNG, JPG, SVG, WebP, etc.)
  fonts/      # Web fonts (WOFF, WOFF2, TTF, etc.)
  audio/      # Audio files
  video/      # Video files
  assets/     # Everything else
  index.html  # HTML documents keep their path structure
```

HTML documents are never reorganized; they keep their original URL-based path.
```typescript
await new Wget()
  .organizeAssets(true)
  .get('https://example.com/');
```

### .extractInternalStyles(enabled)
Extract inline <style> tags into separate CSS files and replace them with <link> references. The StyleExtractor names files based on the style tag’s id, name, or class attribute, falling back to a numeric index.
```typescript
await new Wget()
  .extractInternalStyles(true)
  .organizeAssets(true)
  .get('https://example.com/');

// Before: <style id="theme">body { color: red; }</style>
// After:  <link rel="stylesheet" href="./css/internal.theme.css">
```

## Events
The Wget class emits events throughout the download lifecycle. Attach handlers with .on():
```typescript
const wg = new Wget()
  .outputDir('./mirror')
  .on('start', (event) => {
    console.log(`Starting download of ${event.urls.length} URLs`);
  })
  .on('progress', (event) => {
    console.log(`${event.percent}% - ${event.url}`);
  })
  .on('download', (event) => {
    console.log(`Downloaded: ${event.url} -> ${event.localPath}`);
  })
  .on('skip', (event) => {
    console.log(`Skipped: ${event.url} (${event.reason})`);
  })
  .on('error', (event) => {
    console.log(`Failed: ${event.url}`, event.error);
  })
  .on('complete', (event) => {
    console.log(`Done! ${event.stats.filesWritten} files written`);
  });

await wg.get('https://example.com/');
```

### Event Types
| Event | Description |
|---|---|
| start | Fired when the download job begins |
| progress | Periodic progress updates with percentage and current URL |
| download | Fired after each file is successfully saved |
| skip | Fired when a URL is skipped (filtered, duplicate, robots, etc.) |
| error | Fired on download failures (the job continues) |
| complete | Fired when all downloads finish |
| link-conversion | Fired during the link conversion phase |
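These events can be folded into simple aggregate logging. The sketch below is plain TypeScript, independent of Rezo, and uses only the event names from the table; it builds counting handlers that could be attached with .on():

```typescript
// Event names from the table above.
type WgetEventName =
  | 'start' | 'progress' | 'download' | 'skip'
  | 'error' | 'complete' | 'link-conversion';

// Tally how often each event fires. handler(name) returns a listener
// that increments the counter for that event name.
function makeEventTally() {
  const counts: Record<string, number> = {};
  return {
    handler(name: WgetEventName) {
      return () => { counts[name] = (counts[name] ?? 0) + 1; };
    },
    counts,
  };
}

// With Wget this might look like (sketch, not run here):
//   const tally = makeEventTally();
//   const wg = new Wget();
//   for (const name of ['download', 'skip', 'error'] as const) {
//     wg.on(name, tally.handler(name));
//   }

// Standalone demonstration:
const tally = makeEventTally();
const onDownload = tally.handler('download');
onDownload();
onDownload();
console.log(tally.counts['download']); // 2
```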
## Stats Tracking
Every download returns a WgetStats object:
```typescript
interface WgetStats {
  urlsDiscovered: number;   // Total URLs found
  urlsDownloaded: number;   // Successfully downloaded
  urlsSkipped: number;      // Filtered or duplicated
  urlsFailed: number;       // Failed downloads
  filesWritten: number;     // Files saved to disk
  bytesDownloaded: number;  // Total bytes transferred
  startTime: number;        // Timestamp when started
  endTime: number;          // Timestamp when finished
  duration: number;         // Total time in milliseconds
}
```

```typescript
const stats = await new Wget()
  .pageRequisites()
  .outputDir('./mirror')
  .get('https://example.com/');

console.log(`Downloaded ${stats.filesWritten} files`);
console.log(`Total: ${(stats.bytesDownloaded / 1024 / 1024).toFixed(2)} MB`);
console.log(`Duration: ${(stats.duration / 1000).toFixed(1)}s`);
```
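For logging, the fields above fold naturally into a one-line summary. The helper below is not part of Rezo; it simply mirrors the WgetStats shape documented above:

```typescript
// Local mirror of the documented WgetStats shape.
interface WgetStats {
  urlsDiscovered: number;
  urlsDownloaded: number;
  urlsSkipped: number;
  urlsFailed: number;
  filesWritten: number;
  bytesDownloaded: number;
  startTime: number;
  endTime: number;
  duration: number;
}

// Format a stats object as a single summary line.
function summarizeStats(s: WgetStats): string {
  const mb = (s.bytesDownloaded / 1024 / 1024).toFixed(2);
  const secs = (s.duration / 1000).toFixed(1);
  return `${s.filesWritten} files, ${mb} MB in ${secs}s (${s.urlsFailed} failed)`;
}

const sample: WgetStats = {
  urlsDiscovered: 120, urlsDownloaded: 110, urlsSkipped: 8, urlsFailed: 2,
  filesWritten: 110, bytesDownloaded: 5 * 1024 * 1024,
  startTime: 0, endTime: 4200, duration: 4200,
};

console.log(summarizeStats(sample)); // 110 files, 5.00 MB in 4.2s (2 failed)
```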
## Complete Example

Mirror a website for offline browsing with organized assets, extracted styles, and no JavaScript:
```typescript
import { Wget } from 'rezo/wget';

const stats = await new Wget({
  organizeAssets: true,
  extractInternalStyles: true,
  download: {
    adjustExtension: true,
  },
  filter: {
    acceptAssetTypes: ['document', 'stylesheet', 'image', 'font', 'favicon'],
    followTags: ['link', 'img', 'style'],
  },
})
  .concurrency(10)
  .convertLinks()
  .removeJavascript(true)
  .pageRequisites()
  .noRobots()
  .domains('example.com')
  .outputDir('./offline-site')
  .on('error', (e) => console.error(`Failed: ${e.url}`, e.error))
  .on('progress', (e) => console.log(`${e.percent}%`))
  .getAll([
    'https://example.com/',
    'https://example.com/docs',
  ]);

console.log(`Mirror complete: ${stats.filesWritten} files, ${stats.urlsFailed} errors`);
```

## Internal Architecture
The Wget module is composed of several specialized classes:
| Class | File | Role |
|---|---|---|
| Wget | src/wget/index.ts | Public API, fluent builder, event routing |
| Downloader | src/wget/downloader.ts | Core engine: queue, retry, orchestration |
| AssetExtractor | src/wget/asset-extractor.ts | DOM-based URL extraction from HTML/CSS/XML/JS |
| AssetOrganizer | src/wget/asset-organizer.ts | Folder categorization and hash deduplication |
| UrlFilter | src/wget/url-filter.ts | Domain, pattern, depth, directory filtering |
| LinkConverter | src/wget/link-converter.ts | Post-download URL rewriting for offline use |
| StyleExtractor | src/wget/style-extractor.ts | Inline style extraction to CSS files |
| FileWriter | src/wget/file-writer.ts | Disk I/O with collision handling |
| RobotsHandler | src/wget/robots.ts | robots.txt parsing and enforcement |
| ResumeHandler | src/wget/resume.ts | Partial download resumption |
| ProgressReporter | src/wget/progress.ts | Progress calculation and events |
| DownloadCache | src/wget/download-cache.ts | In-memory download deduplication |
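The subdomain-matching rule that UrlFilter enforces for .domains() (example.com also admitting www.example.com and cdn.example.com) can be illustrated with a small standalone check. This is a sketch of the documented behavior, not the actual UrlFilter implementation:

```typescript
// Illustrative domain check (not Rezo's UrlFilter code): a hostname is
// allowed if it equals an allowed domain exactly or is a subdomain of
// one. The '.' prefix on the suffix check prevents false matches such
// as evil-example.com against example.com.
function matchesDomain(hostname: string, allowed: string[]): boolean {
  return allowed.some(
    (d) => hostname === d || hostname.endsWith('.' + d),
  );
}

console.log(matchesDomain('www.example.com', ['example.com']));   // true
console.log(matchesDomain('cdn.example.com', ['example.com']));   // true
console.log(matchesDomain('evil-example.com', ['example.com']));  // false
```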