Asset Extraction

The Wget module uses DOM-based parsing (via linkedom) rather than regular expressions to extract asset URLs from web documents. This approach handles edge cases that regex-based extractors miss: nested quotes, encoded attributes, <base> tags, and dynamically generated markup.

AssetExtractor

AssetExtractor is the core extraction engine. It scans HTML, CSS, XML/SVG, and JavaScript files for asset URLs, classifying each by type.

import { AssetExtractor } from 'rezo/wget';

const extractor = new AssetExtractor();

Extracting from HTML

extractFromHTML() parses an HTML document and returns every asset URL found in tags, attributes, inline styles, and <style> blocks:

const html = `
  <html>
    <head>
      <link rel="stylesheet" href="/css/main.css">
      <link rel="icon" href="/favicon.ico">
    </head>
    <body>
      <img src="/images/logo.png" srcset="/images/logo-2x.png 2x">
      <script src="/js/app.js"></script>
      <div style="background: url('/images/bg.jpg')"></div>
      <style>
        @import url('/css/theme.css');
        body { background: url('/images/pattern.svg'); }
      </style>
    </body>
  </html>
`;

const assets = extractor.extractFromHTML(html, 'https://example.com/page.html');

for (const asset of assets) {
  console.log(`${asset.type}: ${asset.url}`);
}
// Output:
// stylesheet: https://example.com/css/main.css
// favicon: https://example.com/favicon.ico
// image: https://example.com/images/logo.png
// image: https://example.com/images/logo-2x.png
// script: https://example.com/js/app.js
// image: https://example.com/images/bg.jpg
// stylesheet: https://example.com/css/theme.css
// image: https://example.com/images/pattern.svg

Supported HTML Tags and Attributes

The extractor scans these tags for URL-containing attributes:

Tag	Attributes
`a`, `area`, `link`, `base`	`href`
`img`	`src`, `srcset`, `data-src`, `data-srcset`, `data-lazy-src`
`source`	`src`, `srcset`
`video`	`src`, `poster`
`audio`, `track`, `iframe`, `frame`, `embed`	`src`
`script`	`src`
`object`	`data`, `codebase`
`form`	`action`
`input`	`src` (for `type="image"`)
`meta`	`content` (for og:image, twitter:image, etc.)

It also processes:

Inline style attributes on any element for url() references
<style> tag contents for @import and url() functions
srcset attributes with multi-resolution image candidates
<base href> tags for correct relative URL resolution
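
To make the srcset and <base href> handling concrete, here is a minimal sketch of those two steps in isolation. It is not the module's implementation: parseSrcset and resolveAll are hypothetical helpers, and the srcset parsing is simplified (it assumes URLs contain no commas or whitespace).

```typescript
// Hypothetical helper: split a srcset value into its candidate URLs.
// Each candidate looks like "<url> <descriptor>", e.g. "logo-2x.png 2x".
function parseSrcset(srcset: string): string[] {
  return srcset
    .split(',')
    .map((candidate) => candidate.trim().split(/\s+/)[0] ?? '')
    .filter((url) => url.length > 0);
}

// Hypothetical helper: resolve URLs against the document URL, honoring an
// optional <base href> value (which may itself be relative to the document).
function resolveAll(urls: string[], documentUrl: string, baseHref?: string): string[] {
  const base = baseHref ? new URL(baseHref, documentUrl).href : documentUrl;
  return urls.map((url) => new URL(url, base).href);
}

const candidates = parseSrcset('img/logo.png 1x, img/logo-2x.png 2x');
const resolved = resolveAll(candidates, 'https://example.com/blog/page.html', '/static/');
console.log(resolved);
// [ 'https://example.com/static/img/logo.png',
//   'https://example.com/static/img/logo-2x.png' ]
```

Without the <base href> value, the same candidates would resolve under https://example.com/blog/ instead, which is why the base tag has to be read before any relative URL is resolved.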

Tag Filtering

You can limit extraction to specific tags:

// Only extract from specific tags:
const stylingAssets = extractor.extractFromHTML(html, baseUrl, {
  followTags: ['link', 'img', 'style'],  // Only these tags
});

// Or exclude specific tags:
const staticAssets = extractor.extractFromHTML(html, baseUrl, {
  ignoreTags: ['script', 'iframe'],       // Skip these tags
});

Extracting from CSS

extractFromCSS() finds all URL references in stylesheets:

const css = `
  @import url('base.css');
  @import 'theme.css';
  body {
    background: url('/images/bg.png');
  }
  @font-face {
    font-family: 'Custom';
    src: url('../fonts/custom.woff2');
  }
`;

const assets = extractor.extractFromCSS(css, 'https://example.com/css/main.css');
// Returns: base.css, theme.css, /images/bg.png, ../fonts/custom.woff2
// All resolved to absolute URLs against the base URL

Data URLs (data:...) are automatically skipped. Fragment-only URLs (#section) are ignored.
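
The skipping rules above can be illustrated with a small standalone sketch. It uses regular expressions purely for brevity and is an assumption about the general technique, not a description of how extractFromCSS() works internally; extractCssUrls is a hypothetical name, and a real CSS parser would also handle comments and escaped characters.

```typescript
// Hypothetical helper: collect url(...) and @import "..." references from CSS,
// skipping data URLs and fragment-only references.
function extractCssUrls(css: string, baseUrl: string): string[] {
  const found: string[] = [];
  // Matches url('x'), url("x"), and url(x).
  const urlPattern = /url\(\s*(['"]?)([^'")]*)\1\s*\)/g;
  // Matches @import 'x' and @import "x" (the url() form is caught above).
  const importPattern = /@import\s+(['"])([^'"]+)\1/g;

  for (const pattern of [urlPattern, importPattern]) {
    for (const match of css.matchAll(pattern)) {
      const raw = (match[2] ?? '').trim();
      if (raw === '' || raw.startsWith('data:') || raw.startsWith('#')) continue;
      const absolute = new URL(raw, baseUrl).href;
      if (!found.includes(absolute)) found.push(absolute);
    }
  }
  return found;
}

const sampleCss = [
  "@import url('base.css');",
  "@import 'theme.css';",
  "body { background: url('/images/bg.png'); }",
  '.icon { background: url(data:image/png;base64,AAAA); mask: url(#clip); }',
].join('\n');

const cssUrls = extractCssUrls(sampleCss, 'https://example.com/css/main.css');
console.log(cssUrls);
```

Relative references such as base.css resolve against the stylesheet's own URL, so they land under /css/, while /images/bg.png resolves against the site root.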

Extracting from XML/SVG

extractFromXML() handles SVG and XML documents, scanning for href, src, and xlink:href attributes:

const svg = `
  <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
    <image href="logo.png" />
    <use xlink:href="icons.svg#home" />
  </svg>
`;

const assets = extractor.extractFromXML(svg, 'https://example.com/assets/icon.svg');

Extracting from JavaScript

extractFromJS() uses pattern matching to find URL-like strings in JavaScript. Extraction is best-effort: it catches common patterns such as fetch() URLs and asset paths, but it may produce false positives:

const js = `
  fetch('/api/data.json');
  const img = './images/hero.png';
  loadScript('https://cdn.example.com/lib.js');
`;

const assets = extractor.extractFromJS(js, 'https://example.com/js/app.js');

Only URLs with recognized extensions (.js, .css, .png, .jpg, .json, .html, etc.) are included.
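
As a rough illustration of extension-gated pattern matching, the sketch below scans string literals and keeps only those whose resolved path ends in a known extension. Everything here is an assumption made for the example: extractJsUrls and tryResolve are hypothetical names, and the extension list covers only the extensions named above, not the module's full list.

```typescript
// Assumed allow-list; the module's actual list is longer.
const KNOWN_EXTENSIONS = ['js', 'css', 'png', 'jpg', 'json', 'html'];

// Returns a parsed URL, or undefined when the candidate cannot be resolved.
function tryResolve(candidate: string, baseUrl: string): URL | undefined {
  try {
    return new URL(candidate, baseUrl);
  } catch {
    return undefined;
  }
}

// Hypothetical helper: find quoted, whitespace-free strings that resolve to
// a URL whose path ends in a recognized extension.
function extractJsUrls(js: string, baseUrl: string): string[] {
  const results: string[] = [];
  const stringLiteral = /(['"`])([^'"`\s]+)\1/g;
  for (const match of js.matchAll(stringLiteral)) {
    const resolved = tryResolve(match[2] ?? '', baseUrl);
    if (!resolved) continue;
    const lastSegment = resolved.pathname.split('/').pop() ?? '';
    const dot = lastSegment.lastIndexOf('.');
    const extension = dot === -1 ? '' : lastSegment.slice(dot + 1).toLowerCase();
    if (KNOWN_EXTENSIONS.includes(extension) && !results.includes(resolved.href)) {
      results.push(resolved.href);
    }
  }
  return results;
}

const sampleJs = [
  "fetch('/api/data.json');",
  "const img = './images/hero.png';",
  "loadScript('https://cdn.example.com/lib.js');",
  "const mode = 'strict';",
].join('\n');

const jsUrls = extractJsUrls(sampleJs, 'https://example.com/js/app.js');
console.log(jsUrls);
```

The 'strict' literal is dropped because it has no recognized extension. A literal such as 'notes.html' inside an unrelated string would still be reported, which is the kind of false positive this approach cannot rule out.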

Auto-Detection with `extract()`

The extract() method routes to the correct extraction method based on MIME type:

const htmlAssets = extractor.extract(content, 'text/html', baseUrl);
const cssAssets = extractor.extract(content, 'text/css', baseUrl);
const svgAssets = extractor.extract(content, 'image/svg+xml', baseUrl);
const jsAssets = extractor.extract(content, 'application/javascript', baseUrl);
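
MIME-based routing of this kind usually comes down to normalizing the Content-Type value and mapping it to a handler. The sketch below shows one way to do that; sourceKindFor is a hypothetical function, and the set of MIME types beyond the four shown above is an assumption.

```typescript
type SourceKind = 'html' | 'css' | 'js' | 'svg' | 'xml';

// Hypothetical routing function: MIME type in, extraction strategy out.
// Returns undefined for content that has no extractor (e.g. images).
function sourceKindFor(mimeType: string): SourceKind | undefined {
  // Drop parameters such as "; charset=utf-8" and normalize case.
  const essence = (mimeType.split(';')[0] ?? '').trim().toLowerCase();

  if (essence === 'text/html' || essence === 'application/xhtml+xml') return 'html';
  if (essence === 'text/css') return 'css';
  if (essence === 'image/svg+xml') return 'svg';
  if (essence === 'application/javascript' || essence === 'text/javascript') return 'js';
  // Checked last so the more specific +xml types above win.
  if (essence === 'application/xml' || essence === 'text/xml' || essence.endsWith('+xml')) {
    return 'xml';
  }
  return undefined;
}

console.log(sourceKindFor('text/html; charset=utf-8')); // html
console.log(sourceKindFor('image/png'));                // undefined
```

Stripping parameters first matters in practice, because servers commonly send values like text/html; charset=utf-8 rather than the bare type.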

Asset Types

Every extracted asset is classified into one of these types:

type AssetType =
  | 'document'    // HTML, XHTML, PHP, ASP, JSP
  | 'stylesheet'  // CSS
  | 'script'      // JavaScript
  | 'image'       // PNG, JPG, SVG, WebP, AVIF, etc.
  | 'font'        // WOFF, WOFF2, TTF, OTF, EOT
  | 'video'       // MP4, WebM, OGG
  | 'audio'       // MP3, WAV, FLAC, AAC
  | 'favicon'     // Icons linked via rel="icon"
  | 'manifest'    // Web manifests
  | 'iframe'      // Embedded frames
  | 'object'      // Embedded objects
  | 'data'        // JSON, XML
  | 'other';      // Unrecognized
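
For URLs that carry no tag context, classification can fall back to the file extension. The following sketch assumes an extension table derived from the comments in the union above; classifyByExtension is a hypothetical helper, and context-dependent types such as favicon, manifest, iframe, and object cannot be derived this way.

```typescript
type AssetType =
  | 'document' | 'stylesheet' | 'script' | 'image' | 'font' | 'video'
  | 'audio' | 'favicon' | 'manifest' | 'iframe' | 'object' | 'data' | 'other';

// Assumed extension table (a subset; OGG is omitted because it can be audio or video).
const TYPE_EXTENSIONS: Array<[AssetType, string[]]> = [
  ['document', ['html', 'xhtml', 'php', 'asp', 'jsp']],
  ['stylesheet', ['css']],
  ['script', ['js']],
  ['image', ['png', 'jpg', 'svg', 'webp', 'avif']],
  ['font', ['woff', 'woff2', 'ttf', 'otf', 'eot']],
  ['video', ['mp4', 'webm']],
  ['audio', ['mp3', 'wav', 'flac', 'aac']],
  ['data', ['json', 'xml']],
];

const EXTENSION_TYPES = new Map<string, AssetType>();
for (const [type, extensions] of TYPE_EXTENSIONS) {
  for (const extension of extensions) EXTENSION_TYPES.set(extension, type);
}

// Hypothetical helper: classify an absolute URL by the extension of its path.
// The query string and fragment are ignored because only pathname is inspected.
function classifyByExtension(url: string): AssetType {
  const lastSegment = new URL(url).pathname.split('/').pop() ?? '';
  const dot = lastSegment.lastIndexOf('.');
  if (dot === -1) return 'other';
  const extension = lastSegment.slice(dot + 1).toLowerCase();
  return EXTENSION_TYPES.get(extension) ?? 'other';
}

console.log(classifyByExtension('https://example.com/fonts/a.woff2?v=3')); // font
console.log(classifyByExtension('https://example.com/about'));            // other
```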

The `ExtractedAsset` Interface

interface ExtractedAsset {
  url: string;           // Resolved absolute URL
  type: AssetType;       // Classified asset type
  source: 'html' | 'css' | 'js' | 'svg' | 'xml';
  tag?: string;          // HTML tag that contained the URL
  attribute?: string;    // Attribute name (href, src, etc.)
  required: boolean;     // Is this a page requisite?
  inline: boolean;       // Was this from an inline style/style tag?
}

Filtering with `acceptAssetTypes`

The Wget module uses AssetExtractor.filterAssets() to apply type-based filtering, configured through the filter option:

import { Wget } from 'rezo/wget';

await new Wget({
  filter: {
    acceptAssetTypes: ['document', 'stylesheet', 'image', 'font', 'favicon'],
  },
}).get('https://example.com/');

The filterAssets() method supports:

acceptAssetTypes — Only keep assets of these types
rejectAssetTypes — Exclude assets of these types
followTags — Only extract from these HTML tags
ignoreTags — Skip these HTML tags
accept / reject — Glob patterns against the full URL
acceptRegex / rejectRegex — Regex patterns against the URL
excludeExtensions — Exclude specific file extensions
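
To show how a few of these options might combine, here is a simplified, standalone filter. It is a sketch under stated assumptions, not the filterAssets() implementation: filterAssetsSketch is a hypothetical function, it covers only three of the options, and both the precedence (reject wins over accept) and the tolerance for a leading dot in excludeExtensions are guesses.

```typescript
interface AssetLike {
  url: string;
  type: string;
}

interface FilterSketch {
  acceptAssetTypes?: string[];
  rejectAssetTypes?: string[];
  excludeExtensions?: string[];
}

// Hypothetical filter: keep an asset only if it passes every configured rule.
function filterAssetsSketch(assets: AssetLike[], filter: FilterSketch): AssetLike[] {
  // Accept both "gif" and ".gif" spellings (an assumption).
  const excluded = (filter.excludeExtensions ?? []).map((ext) =>
    ext.replace(/^\./, '').toLowerCase(),
  );

  return assets.filter((asset) => {
    if (filter.acceptAssetTypes && !filter.acceptAssetTypes.includes(asset.type)) return false;
    if (filter.rejectAssetTypes && filter.rejectAssetTypes.includes(asset.type)) return false;
    const lastSegment = new URL(asset.url).pathname.split('/').pop() ?? '';
    const dot = lastSegment.lastIndexOf('.');
    const extension = dot === -1 ? '' : lastSegment.slice(dot + 1).toLowerCase();
    return !excluded.includes(extension);
  });
}

const sampleAssets: AssetLike[] = [
  { url: 'https://example.com/index.html', type: 'document' },
  { url: 'https://example.com/css/main.css', type: 'stylesheet' },
  { url: 'https://example.com/js/app.js', type: 'script' },
  { url: 'https://example.com/images/logo.png', type: 'image' },
  { url: 'https://example.com/images/spinner.gif', type: 'image' },
];

const kept = filterAssetsSketch(sampleAssets, {
  acceptAssetTypes: ['document', 'stylesheet', 'image'],
  excludeExtensions: ['.gif'],
});
console.log(kept.map((asset) => asset.url));
```

Here the script is dropped by acceptAssetTypes and the GIF by excludeExtensions, leaving the document, the stylesheet, and the PNG.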

AssetOrganizer

AssetOrganizer categorizes downloaded assets into logical folders and performs hash-based deduplication.

Default Folder Structure

Category	Folder	File Types
Stylesheets	`css/`	`.css`
JavaScript	`js/`	`.js`, `.mjs`, `.cjs`
Images	`images/`	`.png`, `.jpg`, `.gif`, `.webp`, `.svg`, `.ico`, `.avif`
Fonts	`fonts/`	`.woff`, `.woff2`, `.ttf`, `.otf`, `.eot`
Audio	`audio/`	`.mp3`, `.wav`, `.ogg`, `.aac`, `.flac`
Video	`video/`	`.mp4`, `.webm`, `.ogv`, `.mov`
Other	`assets/`	Everything else

HTML documents are never reorganized — they keep their URL-based path structure.

Custom Folder Names

const wg = new Wget({
  organizeAssets: true,
  assetFolders: {
    css: 'styles',
    js: 'scripts',
    images: 'img',
  },
});
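
One way to picture how overrides interact with the defaults is a lookup that merges the two maps before choosing a folder. The sketch below is illustrative only: folderFor is a hypothetical helper, the css, js, and images keys come from the example above while the remaining keys are assumed, and HTML documents (which the organizer leaves in place) are outside its scope.

```typescript
type Category = 'css' | 'js' | 'images' | 'fonts' | 'audio' | 'video' | 'other';

// Defaults taken from the folder table; key names past css/js/images are assumed.
const DEFAULT_FOLDERS: Record<Category, string> = {
  css: 'css', js: 'js', images: 'images', fonts: 'fonts',
  audio: 'audio', video: 'video', other: 'assets',
};

const CATEGORY_EXTENSIONS: Record<Exclude<Category, 'other'>, string[]> = {
  css: ['.css'],
  js: ['.js', '.mjs', '.cjs'],
  images: ['.png', '.jpg', '.gif', '.webp', '.svg', '.ico', '.avif'],
  fonts: ['.woff', '.woff2', '.ttf', '.otf', '.eot'],
  audio: ['.mp3', '.wav', '.ogg', '.aac', '.flac'],
  video: ['.mp4', '.webm', '.ogv', '.mov'],
};

// Hypothetical helper: choose the target folder for a non-HTML file name.
function folderFor(fileName: string, overrides: Partial<Record<Category, string>> = {}): string {
  const folders: Record<Category, string> = { ...DEFAULT_FOLDERS, ...overrides };
  const dot = fileName.lastIndexOf('.');
  const extension = dot === -1 ? '' : fileName.slice(dot).toLowerCase();
  for (const [category, extensions] of Object.entries(CATEGORY_EXTENSIONS)) {
    if (extensions.includes(extension)) return folders[category as Category];
  }
  return folders.other;
}

const custom = { css: 'styles', js: 'scripts', images: 'img' };
console.log(folderFor('main.css', custom));   // styles
console.log(folderFor('custom.woff2', custom)); // fonts
```

Categories that are not overridden, such as fonts here, keep their default folder names.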

Hash-Based Deduplication

The organizer computes an MD5 hash of each downloaded file’s content. If two different URLs serve the same bytes, only one copy is stored on disk:

// Both URLs serve the same image
// https://example.com/images/logo.png
// https://cdn.example.com/logo.png
// -> Only one file saved: images/logo.png
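
The idea behind content-hash deduplication can be sketched with Node's built-in crypto module. This is a minimal in-memory illustration, not the organizer's code: DedupIndex and its register() method are hypothetical, and only the use of MD5 comes from the description above.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical index: content hash -> path of the first stored copy.
class DedupIndex {
  private readonly pathsByHash = new Map<string, string>();

  // Returns the path to use for this content and whether it was already stored.
  register(content: Uint8Array, proposedPath: string): { path: string; duplicate: boolean } {
    const hash = createHash('md5').update(content).digest('hex');
    const existing = this.pathsByHash.get(hash);
    if (existing !== undefined) {
      return { path: existing, duplicate: true };
    }
    this.pathsByHash.set(hash, proposedPath);
    return { path: proposedPath, duplicate: false };
  }
}

const dedup = new DedupIndex();
const encoder = new TextEncoder();

// Two downloads with identical bytes, proposed under different names.
const first = dedup.register(encoder.encode('same image bytes'), 'images/logo.png');
const second = dedup.register(encoder.encode('same image bytes'), 'images/logo-cdn.png');

console.log(first);  // { path: 'images/logo.png', duplicate: false }
console.log(second); // { path: 'images/logo.png', duplicate: true }
```

Because the second registration returns the first file's path, links that pointed at either URL can be rewritten to the single stored copy.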

StyleExtractor

StyleExtractor post-processes downloaded HTML files to extract inline <style> tags into standalone CSS files.

How It Works

Parses the HTML with linkedom
Finds all <style> tags with content
Writes each to a separate .css file
Replaces the <style> tag with a <link rel="stylesheet"> reference
Preserves media attributes on the generated <link> tag

Naming Convention

CSS files are named using the first available attribute on the <style> tag:

Priority	Attribute	Resulting Filename
1	`id`	`internal.{id}.css`
2	`name`	`internal.{name}.css`
3	`class` (first)	`internal.{class}.css`
4	(none)	`internal.{index}.css`
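
The priority order in the table can be expressed as a short fallback chain. The sketch below is an illustration only: styleFileName and StyleTagAttributes are hypothetical names, the caller is assumed to supply the index, and sanitizing attribute values for use in file names is not covered.

```typescript
// Attributes read from a <style> tag (hypothetical shape for this example).
interface StyleTagAttributes {
  id?: string;
  name?: string;
  className?: string; // raw class attribute, possibly several space-separated classes
}

// Hypothetical helper: derive the CSS file name from the first available attribute.
function styleFileName(attrs: StyleTagAttributes, index: number): string {
  const firstClass = (attrs.className ?? '').trim().split(/\s+/)[0] ?? '';
  // Empty strings are falsy, so an empty id falls through to the next candidate.
  const label = attrs.id || attrs.name || firstClass || String(index);
  return `internal.${label}.css`;
}

console.log(styleFileName({ id: 'theme' }, 0));               // internal.theme.css
console.log(styleFileName({ className: 'dark compact' }, 2)); // internal.dark.css
console.log(styleFileName({}, 1));                            // internal.1.css
```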

Example

Before:

<head>
  <style id="theme">
    body { background: #1a1a2e; color: #e0e0e0; }
    .header { border-bottom: 1px solid #333; }
  </style>
  <style>
    .sidebar { width: 300px; }
  </style>
</head>

After (with organizeAssets: true):

<head>
  <link rel="stylesheet" href="./css/internal.theme.css">
  <link rel="stylesheet" href="./css/internal.1.css">
</head>

Usage with Wget

await new Wget({
  extractInternalStyles: true,
  organizeAssets: true,
  download: { outputDir: './mirror' },
})
  .convertLinks()
  .pageRequisites()
  .get('https://example.com/');

When removeJavascript is also enabled, StyleExtractor removes all <script> tags and inline event handlers in the same pass.