URL Filtering
The Wget module provides comprehensive URL filtering through the UrlFilter class. Every URL discovered during crawling is evaluated against your filter rules before being added to the download queue. This lets you precisely control what gets downloaded.
UrlFilter
UrlFilter implements all of wget’s URL filtering logic: domain restrictions, glob patterns, regex matching, directory inclusion/exclusion, depth limits, and parent-directory restrictions.
```typescript
import { Wget } from 'rezo/wget';

await new Wget({
  filter: {
    domains: ['example.com', 'cdn.example.com'],
    noParent: true,
    accept: ['*.html', '*.css', '*.png'],
    reject: ['*tracking*', '*analytics*'],
    excludeDirectories: ['/admin/', '/private/'],
    excludeExtensions: ['.exe', '.zip', '.tar.gz'],
  },
  recursive: {
    depth: 3,
  },
}).get('https://example.com/docs/');
```
Domain Filtering
Allowed Domains
Restrict downloads to specific domains. Supports subdomain matching — specifying example.com also matches www.example.com and cdn.example.com.
```typescript
await new Wget({
  filter: {
    domains: ['example.com'],
  },
}).get('https://example.com/');
// Downloads from: example.com, www.example.com, cdn.example.com
// Blocks: other-site.com, evil.com
```

You can pass a comma-separated string or an array:

```typescript
// Both are equivalent:
domains: ['example.com', 'partner.com']
domains: 'example.com,partner.com'
```
Excluded Domains
Block specific domains while allowing everything else:
```typescript
await new Wget({
  filter: {
    excludeDomains: ['ads.example.com', 'tracking.example.com'],
  },
}).get('https://example.com/');
```
Cross-Host Behavior
By default, the Wget module only follows links to the same host as the starting URL. Two options modify this:
- `spanHosts: true` — Allow downloads from any domain (domain filters still apply)
- `pageRequisites: true` — Allow assets from sibling subdomains (same root domain)
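As a simplified model of these rules (the helper names below are illustrative, not part of the module's API), the host decision can be sketched as:

```typescript
// Simplified sketch of the host check: an explicit domain list governs the
// decision (with subdomain matching); otherwise spanHosts decides whether
// the crawler may leave the starting host.
function matchesDomain(host: string, domain: string): boolean {
  return host === domain || host.endsWith('.' + domain); // subdomains match
}

function allowHost(
  host: string,
  startHost: string,
  opts: { spanHosts?: boolean; domains?: string[] } = {},
): boolean {
  if (opts.domains && opts.domains.length > 0) {
    return opts.domains.some((d) => matchesDomain(host, d));
  }
  return opts.spanHosts ? true : host === startHost;
}
```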
```typescript
// Allow all hosts but filter by domain
await new Wget({
  filter: {
    spanHosts: true,
    domains: ['example.com', 'cdn.partner.com'],
  },
}).get('https://example.com/');
```
Pattern Filtering
Glob Patterns (accept/reject)
Filter URLs using glob patterns with * (any characters) and ? (single character) wildcards:
```typescript
await new Wget({
  filter: {
    accept: ['*.html', '*.css', '*.png', '*.jpg'],
    reject: ['*thumbnail*', '*temp*'],
  },
}).get('https://example.com/');
```

Patterns are matched against both the full URL and the filename portion.
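A minimal sketch of that matching, assuming a straightforward glob-to-regex translation (the helper names are hypothetical, not the module's actual code):

```typescript
// Convert a wget-style glob (* = any run of characters, ? = one character)
// into an anchored RegExp.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&'); // escape regex metachars
  const pattern = escaped.replace(/\*/g, '.*').replace(/\?/g, '.');
  return new RegExp('^' + pattern + '$');
}

// A URL passes if the pattern matches the full URL or the filename portion.
function matchesGlob(url: string, glob: string): boolean {
  const re = globToRegExp(glob);
  const filename = url.slice(url.lastIndexOf('/') + 1);
  return re.test(url) || re.test(filename);
}
```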
Regex Patterns
For more precise control, use regular expressions:
```typescript
await new Wget({
  filter: {
    acceptRegex: /\.(html|css|js|png|jpg|svg)$/i,
    rejectRegex: /\/(api|admin|login)\//,
  },
}).get('https://example.com/');
```

You can pass either a RegExp object or a string pattern:

```typescript
// These are equivalent:
acceptRegex: /\.html$/
acceptRegex: '\\.html$'
```
Extension Exclusions
Exclude files by extension:
```typescript
await new Wget({
  filter: {
    excludeExtensions: ['.exe', '.zip', '.dmg', '.iso', '.mp4'],
  },
}).get('https://example.com/');
```
Directory Filtering
Include Directories
Only download URLs whose paths start with specified directories:
```typescript
await new Wget({
  filter: {
    includeDirectories: ['/docs/', '/api/public/'],
  },
}).get('https://example.com/');
// Downloads: /docs/guide.html, /docs/images/fig1.png
// Blocks: /blog/post.html, /about.html
```
Exclude Directories
Block URLs in specified directories:
```typescript
await new Wget({
  filter: {
    excludeDirectories: ['/private/', '/admin/', '/tmp/'],
  },
}).get('https://example.com/');
```
No Parent
Prevent the crawler from ascending above the starting URL’s directory:
```typescript
await new Wget({
  filter: { noParent: true },
}).get('https://example.com/docs/v2/');
// Downloads: /docs/v2/guide.html, /docs/v2/api/ref.html
// Blocks: /docs/v1/old.html, /about.html
```
Depth Limiting
Control how deep the recursive crawler goes:
```typescript
await new Wget({
  recursive: {
    enabled: true,
    depth: 2, // Only follow links 2 levels deep
  },
}).get('https://example.com/');
// Depth 0: https://example.com/
// Depth 1: https://example.com/about (linked from depth 0)
// Depth 2: https://example.com/team (linked from depth 1)
// BLOCKED: https://example.com/bios (would be depth 3)
```

Special values:
- `depth: 0` or `depth: Infinity` — Unlimited depth
- `mirror: true` — Implies unlimited depth
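One way to picture the depth rule is a breadth-first queue where each page carries its distance from the start URL (an illustrative sketch, not the module's actual queue code):

```typescript
// Each queued page carries the depth at which it was discovered; links it
// contains inherit depth + 1 and are enqueued only while within the limit.
interface QueueItem { url: string; depth: number }

function expand(item: QueueItem, links: string[], maxDepth: number): QueueItem[] {
  const limit = maxDepth === 0 ? Infinity : maxDepth; // 0 means unlimited
  if (item.depth + 1 > limit) return []; // children would exceed the limit
  return links.map((url) => ({ url, depth: item.depth + 1 }));
}
```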
followTags Configuration
The followTags option controls which HTML tags the asset extractor processes. This is different from URL filtering — it determines which URLs are discovered in the first place.
```typescript
await new Wget({
  filter: {
    followTags: ['link', 'img', 'style'],
    // Only extract URLs from <link>, <img>, and <style> tags
    // Ignores <a>, <script>, <iframe>, etc.
  },
}).get('https://example.com/');
```

To exclude specific tags while keeping the rest:
```typescript
await new Wget({
  filter: {
    ignoreTags: ['script', 'iframe', 'video'],
  },
}).get('https://example.com/');
```
Asset Type Filtering
Filter by classified asset type (determined by tag, extension, and MIME type):
```typescript
await new Wget({
  filter: {
    acceptAssetTypes: ['document', 'stylesheet', 'image', 'font', 'favicon'],
    // Blocks: script, video, audio, iframe, object, data, other
  },
}).get('https://example.com/');
```

Or exclude specific types:
```typescript
await new Wget({
  filter: {
    rejectAssetTypes: ['video', 'audio'],
  },
}).get('https://example.com/');
```
Predefined Filter Lists
The filter-lists.ts module provides comprehensive predefined extension lists for common filtering scenarios:
```typescript
import {
  EXECUTABLE_EXTENSIONS,
  ARCHIVE_EXTENSIONS,
} from 'rezo/wget';
```
Available Lists
| List | Description |
|---|---|
| EXECUTABLE_EXTENSIONS | Windows/macOS/Linux executables, scripts, packages (.exe, .sh, .dmg, .apk, etc.) |
| ARCHIVE_EXTENSIONS | Archives and compressed files (.zip, .tar.gz, .7z, .rar, etc.) |
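Note that these lists contain compound suffixes like .tar.gz, so extension exclusion cannot be a simple last-segment check; conceptually it is a case-insensitive suffix match on the URL path (a sketch using a hypothetical helper, not the module's implementation):

```typescript
// Case-insensitive suffix match against the URL's path, so multi-part
// extensions such as ".tar.gz" are caught as well.
function hasExcludedExtension(url: string, excluded: string[]): boolean {
  const path = new URL(url).pathname.toLowerCase();
  return excluded.some((ext) => path.endsWith(ext.toLowerCase()));
}
```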
Usage with Wget
```typescript
import { Wget, EXECUTABLE_EXTENSIONS, ARCHIVE_EXTENSIONS } from 'rezo/wget';

await new Wget({
  filter: {
    excludeExtensions: [
      ...EXECUTABLE_EXTENSIONS,
      ...ARCHIVE_EXTENSIONS,
    ],
  },
}).get('https://example.com/');
```
Filter Result
Every URL evaluation returns a FilterResult explaining why it was allowed or blocked:
```typescript
interface FilterResult {
  allowed: boolean;
  reason?: SkipReason;
  message?: string;
}
```

Skip reasons include:
| Reason | Description |
|---|---|
| invalid-url | URL could not be parsed |
| unsupported-protocol | Not HTTP(S) |
| depth-exceeded | Beyond maximum recursion depth |
| cross-host | Different host without spanHosts |
| domain-excluded | Not in allowed domains or in excluded domains |
| parent-directory | Above starting URL path with noParent |
| directory-excluded | In excluded directory or not in included directory |
| pattern-rejected | Matched reject pattern or extension exclusion |
| pattern-not-accepted | Did not match required accept pattern |
You can observe these via the skip event:
```typescript
new Wget()
  .on('skip', (event) => {
    console.log(`Skipped ${event.url}: ${event.reason}`);
  })
  .get('https://example.com/');
```
Complete Example
Comprehensive filtering for a documentation mirror:
```typescript
import { Wget, EXECUTABLE_EXTENSIONS } from 'rezo/wget';

await new Wget({
  recursive: {
    enabled: true,
    depth: 5,
  },
  filter: {
    domains: ['docs.example.com'],
    noParent: true,
    acceptAssetTypes: ['document', 'stylesheet', 'image', 'font', 'favicon'],
    ignoreTags: ['iframe', 'video', 'audio'],
    excludeDirectories: ['/api/', '/admin/'],
    excludeExtensions: [...EXECUTABLE_EXTENSIONS],
    rejectRegex: /\/(login|signup|oauth)\//,
  },
})
  .concurrency(5)
  .convertLinks()
  .pageRequisites()
  .outputDir('./docs-offline')
  .on('skip', (e) => {
    if (e.reason !== 'cross-host') {
      console.log(`Filtered: ${e.url} (${e.reason})`);
    }
  })
  .get('https://docs.example.com/v3/');
```
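Conceptually, all of the filters on this page compose into a single ordered predicate: the first rule that blocks a URL supplies the skip reason. A condensed sketch of that pipeline (the check functions here are illustrative, not the module's source):

```typescript
type Check = (url: URL) => string | null; // returns a skip reason, or null to allow

// Run checks in order and report the first reason that blocks the URL,
// mirroring the FilterResult shape described above.
function evaluate(raw: string, checks: Check[]): { allowed: boolean; reason?: string } {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return { allowed: false, reason: 'invalid-url' };
  }
  for (const check of checks) {
    const reason = check(url);
    if (reason) return { allowed: false, reason };
  }
  return { allowed: true };
}

// Two example checks, in the order the skip-reason table suggests:
const checks: Check[] = [
  (u) => (u.protocol === 'http:' || u.protocol === 'https:' ? null : 'unsupported-protocol'),
  (u) => (u.pathname.startsWith('/private/') ? 'directory-excluded' : null),
];
```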