Wget

URL Filtering

The Wget module provides comprehensive URL filtering through the UrlFilter class. Every URL discovered during crawling is evaluated against your filter rules before being added to the download queue. This lets you precisely control what gets downloaded.

UrlFilter

UrlFilter implements all of wget’s URL filtering logic: domain restrictions, glob patterns, regex matching, directory inclusion/exclusion, depth limits, and parent-directory restrictions.

import { Wget } from 'rezo/wget';

await new Wget({
  filter: {
    domains: ['example.com', 'cdn.example.com'],
    noParent: true,
    accept: ['*.html', '*.css', '*.png'],
    reject: ['*tracking*', '*analytics*'],
    excludeDirectories: ['/admin/', '/private/'],
    excludeExtensions: ['.exe', '.zip', '.tar.gz'],
  },
  recursive: {
    depth: 3,
  },
}).get('https://example.com/docs/');

Domain Filtering

Allowed Domains

Restrict downloads to specific domains. Supports subdomain matching — specifying example.com also matches www.example.com and cdn.example.com.

await new Wget({
  filter: {
    domains: ['example.com'],
  },
}).get('https://example.com/');

// Downloads from: example.com, www.example.com, cdn.example.com
// Blocks:        other-site.com, evil.com
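Conceptually, subdomain matching reduces to a suffix check on the hostname. A minimal sketch of that idea (matchesDomain is a hypothetical helper for illustration, not part of the module's API):

```typescript
// Hypothetical helper: a host is allowed if it equals an allowed domain
// exactly, or ends with "." followed by that domain. The "." guard
// prevents lookalike hosts such as evil-example.com from matching.
function matchesDomain(host: string, domains: string[]): boolean {
  const h = host.toLowerCase();
  return domains.some((d) => {
    const dom = d.toLowerCase();
    return h === dom || h.endsWith('.' + dom);
  });
}

matchesDomain('cdn.example.com', ['example.com']);  // true
matchesDomain('evil-example.com', ['example.com']); // false
```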

You can pass a comma-separated string or an array:

// Both are equivalent:
domains: ['example.com', 'partner.com']
domains: 'example.com,partner.com'
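Normalizing the comma-separated form is straightforward; a sketch of that normalization (normalizeDomains is a hypothetical helper, shown only to illustrate the equivalence):

```typescript
// Hypothetical normalization: accept string or string[], return string[].
// Trimming tolerates spaces after commas, e.g. 'a.com, b.com'.
function normalizeDomains(input: string | string[]): string[] {
  return Array.isArray(input) ? input : input.split(',').map((s) => s.trim());
}
```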

Excluded Domains

Block specific domains while allowing everything else:

await new Wget({
  filter: {
    excludeDomains: ['ads.example.com', 'tracking.example.com'],
  },
}).get('https://example.com/');

Cross-Host Behavior

By default, the Wget module only follows links to the same host as the starting URL. Two options modify this:

  • spanHosts: true — Allow downloads from any domain (domain filters still apply)
  • pageRequisites: true — Allow assets from sibling subdomains (same root domain)

// Allow all hosts but filter by domain
await new Wget({
  filter: {
    spanHosts: true,
    domains: ['example.com', 'cdn.partner.com'],
  },
}).get('https://example.com/');

Pattern Filtering

Glob Patterns (accept/reject)

Filter URLs using glob patterns with * (any characters) and ? (single character) wildcards:

await new Wget({
  filter: {
    accept: ['*.html', '*.css', '*.png', '*.jpg'],
    reject: ['*thumbnail*', '*temp*'],
  },
}).get('https://example.com/');

Patterns are matched against both the full URL and the filename portion.
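A common way to implement this dual matching is to compile each glob into an anchored regular expression and test it against both strings. A hypothetical sketch, assuming this glob semantics (globToRegExp and globMatch are illustrative, not the module's actual implementation):

```typescript
// Compile a glob (* = any run of characters, ? = one character) into a
// RegExp anchored to the whole string; other regex metacharacters are
// escaped so they match literally.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*').replace(/\?/g, '.') + '$');
}

// A pattern accepts a URL if it matches either the full URL or just
// the filename portion of the path.
function globMatch(pattern: string, url: string): boolean {
  const re = globToRegExp(pattern);
  const filename = new URL(url).pathname.split('/').pop() ?? '';
  return re.test(url) || re.test(filename);
}

globMatch('*.html', 'https://example.com/docs/guide.html'); // true
```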

Regex Patterns

For more precise control, use regular expressions:

await new Wget({
  filter: {
    acceptRegex: /\.(html|css|js|png|jpg|svg)$/i,
    rejectRegex: /\/(api|admin|login)\//,
  },
}).get('https://example.com/');

You can pass either a RegExp object or a string pattern:

// These are equivalent:
acceptRegex: /\.html$/
acceptRegex: '\\.html$'  // in a string pattern, the backslash itself must be escaped
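Normalizing the two forms comes down to passing string patterns through the RegExp constructor. A minimal sketch (toRegExp is a hypothetical helper, not the module's API):

```typescript
// Hypothetical normalization: accept a RegExp or a string pattern and
// always return a RegExp. String patterns go through the RegExp
// constructor, which is why their backslashes must be double-escaped.
function toRegExp(pattern: RegExp | string): RegExp {
  return typeof pattern === 'string' ? new RegExp(pattern) : pattern;
}

toRegExp('\\.html$').test('index.html'); // true
```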

Extension Exclusions

Exclude files by extension:

await new Wget({
  filter: {
    excludeExtensions: ['.exe', '.zip', '.dmg', '.iso', '.mp4'],
  },
}).get('https://example.com/');
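Note that the list above includes multi-part extensions like .tar.gz. A suffix check on the URL path handles these naturally; a hypothetical sketch (hasExcludedExtension is illustrative, not the module's implementation):

```typescript
// Hypothetical check: a URL is excluded if its path ends with any
// listed extension. endsWith also covers multi-part extensions such
// as ".tar.gz" without special-casing.
function hasExcludedExtension(url: string, exts: string[]): boolean {
  const path = new URL(url).pathname.toLowerCase();
  return exts.some((ext) => path.endsWith(ext.toLowerCase()));
}

hasExcludedExtension('https://example.com/dl/backup.tar.gz', ['.tar.gz']); // true
```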

Directory Filtering

Include Directories

Only download URLs whose paths start with specified directories:

await new Wget({
  filter: {
    includeDirectories: ['/docs/', '/api/public/'],
  },
}).get('https://example.com/');

// Downloads: /docs/guide.html, /docs/images/fig1.png
// Blocks:    /blog/post.html, /about.html

Exclude Directories

Block URLs in specified directories:

await new Wget({
  filter: {
    excludeDirectories: ['/private/', '/admin/', '/tmp/'],
  },
}).get('https://example.com/');

No Parent

Prevent the crawler from ascending above the starting URL’s directory:

await new Wget({
  filter: { noParent: true },
}).get('https://example.com/docs/v2/');

// Downloads: /docs/v2/guide.html, /docs/v2/api/ref.html
// Blocks:    /docs/v1/old.html, /about.html
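The noParent rule can be pictured as a prefix check against the starting URL's directory. A minimal sketch, assuming same-host URLs (underParent is a hypothetical helper, not the module's API):

```typescript
// Hypothetical check: a candidate URL passes noParent if it lives at
// or below the directory of the starting URL.
function underParent(startUrl: string, candidateUrl: string): boolean {
  const base = new URL(startUrl);
  const cand = new URL(candidateUrl);
  if (cand.host !== base.host) return false;
  // Directory of the starting URL: everything up to the last '/'.
  const baseDir = base.pathname.slice(0, base.pathname.lastIndexOf('/') + 1);
  return cand.pathname.startsWith(baseDir);
}

underParent('https://example.com/docs/v2/', 'https://example.com/docs/v2/guide.html'); // true
underParent('https://example.com/docs/v2/', 'https://example.com/about.html');         // false
```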

Depth Limiting

Control how deep the recursive crawler goes:

await new Wget({
  recursive: {
    enabled: true,
    depth: 2,  // Only follow links 2 levels deep
  },
}).get('https://example.com/');

// Depth 0: https://example.com/
// Depth 1: https://example.com/about  (linked from depth 0)
// Depth 2: https://example.com/team   (linked from depth 1)
// BLOCKED: https://example.com/bios   (would be depth 3)

Special values:

  • depth: 0 or depth: Infinity — Unlimited depth
  • mirror: true — Implies unlimited depth
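The depth accounting above can be sketched as a breadth-first traversal that carries a depth counter and stops following links past the limit. This is an illustrative model only (crawlOrder and the links map are hypothetical, not the module's internals):

```typescript
// Hypothetical BFS sketch over a static link graph: discovered links
// are enqueued at depth + 1, and nothing beyond maxDepth is followed.
type QueueItem = { url: string; depth: number };

function crawlOrder(links: Record<string, string[]>, start: string, maxDepth: number): string[] {
  const visited = new Set<string>([start]);
  const queue: QueueItem[] = [{ url: start, depth: 0 }];
  const order: string[] = [];
  while (queue.length > 0) {
    const { url, depth } = queue.shift()!;
    order.push(url);
    if (depth >= maxDepth) continue; // depth-exceeded: stop following links here
    for (const next of links[url] ?? []) {
      if (!visited.has(next)) {
        visited.add(next);
        queue.push({ url: next, depth: depth + 1 });
      }
    }
  }
  return order;
}
```

With the example graph from above (/ links to /about, which links to /team, which links to /bios) and depth 2, the crawl visits /, /about, and /team, and never reaches /bios.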

followTags Configuration

The followTags option controls which HTML tags the asset extractor processes. This is different from URL filtering — it determines which URLs are discovered in the first place.

await new Wget({
  filter: {
    followTags: ['link', 'img', 'style'],
    // Only extract URLs from <link>, <img>, and <style> tags
    // Ignores <a>, <script>, <iframe>, etc.
  },
}).get('https://example.com/');

To exclude specific tags while keeping the rest:

await new Wget({
  filter: {
    ignoreTags: ['script', 'iframe', 'video'],
  },
}).get('https://example.com/');

Asset Type Filtering

Filter by classified asset type (determined by tag, extension, and MIME type):

await new Wget({
  filter: {
    acceptAssetTypes: ['document', 'stylesheet', 'image', 'font', 'favicon'],
    // Blocks: script, video, audio, iframe, object, data, other
  },
}).get('https://example.com/');

Or exclude specific types:

await new Wget({
  filter: {
    rejectAssetTypes: ['video', 'audio'],
  },
}).get('https://example.com/');

Predefined Filter Lists

The filter-lists.ts module provides comprehensive predefined extension lists for common filtering scenarios:

import {
  EXECUTABLE_EXTENSIONS,
  ARCHIVE_EXTENSIONS,
} from 'rezo/wget';

Available Lists

List                     Description
EXECUTABLE_EXTENSIONS    Windows/macOS/Linux executables, scripts, packages (.exe, .sh, .dmg, .apk, etc.)
ARCHIVE_EXTENSIONS       Archives and compressed files (.zip, .tar.gz, .7z, .rar, etc.)

Usage with Wget

import { Wget, EXECUTABLE_EXTENSIONS, ARCHIVE_EXTENSIONS } from 'rezo/wget';

await new Wget({
  filter: {
    excludeExtensions: [
      ...EXECUTABLE_EXTENSIONS,
      ...ARCHIVE_EXTENSIONS,
    ],
  },
}).get('https://example.com/');

Filter Result

Every URL evaluation returns a FilterResult explaining why it was allowed or blocked:

interface FilterResult {
  allowed: boolean;
  reason?: SkipReason;
  message?: string;
}

Skip reasons include:

Reason                  Description
invalid-url             URL could not be parsed
unsupported-protocol    Not HTTP(S)
depth-exceeded          Beyond maximum recursion depth
cross-host              Different host without spanHosts
domain-excluded         Not in allowed domains, or in excluded domains
parent-directory        Above starting URL path with noParent
directory-excluded      In an excluded directory, or not in an included directory
pattern-rejected        Matched a reject pattern or extension exclusion
pattern-not-accepted    Did not match a required accept pattern

You can observe these via the skip event:

new Wget()
  .on('skip', (event) => {
    console.log(`Skipped ${event.url}: ${event.reason}`);
  })
  .get('https://example.com/');

Complete Example

Comprehensive filtering for a documentation mirror:

import { Wget, EXECUTABLE_EXTENSIONS } from 'rezo/wget';

await new Wget({
  recursive: {
    enabled: true,
    depth: 5,
  },
  filter: {
    domains: ['docs.example.com'],
    noParent: true,
    acceptAssetTypes: ['document', 'stylesheet', 'image', 'font', 'favicon'],
    ignoreTags: ['iframe', 'video', 'audio'],
    excludeDirectories: ['/api/', '/admin/'],
    excludeExtensions: [...EXECUTABLE_EXTENSIONS],
    rejectRegex: /\/(login|signup|oauth)\//,
  },
})
  .concurrency(5)
  .convertLinks()
  .pageRequisites()
  .outputDir('./docs-offline')
  .on('skip', (e) => {
    if (e.reason !== 'cross-host') {
      console.log(`Filtered: ${e.url} (${e.reason})`);
    }
  })
  .get('https://docs.example.com/v3/');