Asset Extraction
The Wget module uses DOM-based parsing (via linkedom) rather than regex to extract asset URLs from web documents. This approach handles edge cases that regex-based extractors miss: nested quotes, encoded attributes, <base> tags, and dynamically generated markup.
AssetExtractor
AssetExtractor is the core extraction engine. It scans HTML, CSS, XML/SVG, and JavaScript files for asset URLs, classifying each by type.
import { AssetExtractor } from 'rezo/wget';
const extractor = new AssetExtractor();
Extracting from HTML
extractFromHTML() parses an HTML document and returns every asset URL found in tags, attributes, inline styles, and <style> blocks:
const html = `
<html>
<head>
<link rel="stylesheet" href="/css/main.css">
<link rel="icon" href="/favicon.ico">
</head>
<body>
<img src="/images/logo.png" srcset="/images/logo-2x.png 2x">
<script src="/js/app.js"></script>
<div style="background: url('/images/bg.jpg')"></div>
<style>
@import url('/css/theme.css');
body { background: url('/images/pattern.svg'); }
</style>
</body>
</html>
`;
const assets = extractor.extractFromHTML(html, 'https://example.com/page.html');
for (const asset of assets) {
console.log(`${asset.type}: ${asset.url}`);
// stylesheet: https://example.com/css/main.css
// favicon: https://example.com/favicon.ico
// image: https://example.com/images/logo.png
// image: https://example.com/images/logo-2x.png
// script: https://example.com/js/app.js
// image: https://example.com/images/bg.jpg
// stylesheet: https://example.com/css/theme.css
// image: https://example.com/images/pattern.svg
}
Supported HTML Tags and Attributes
The extractor scans these tags for URL-containing attributes:
| Tag | Attributes |
|---|---|
| a, area, link, base | href |
| img | src, srcset, data-src, data-srcset, data-lazy-src |
| source | src, srcset |
| video | src, poster |
| audio, track, iframe, frame, embed | src |
| script | src |
| object | data, codebase |
| form | action |
| input | src (for type="image") |
| meta | content (for og:image, twitter:image, etc.) |
It also processes:
- Inline style attributes on any element for url() references
- <style> tag contents for @import and url() functions
- srcset attributes with multi-resolution image candidates
- <base href> tags for correct relative URL resolution
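As a rough illustration of the last two items, the sketch below splits a srcset value into its candidate URLs and resolves them against an effective base URL using the standard URL constructor. The helper names parseSrcset and resolveAll are hypothetical and not part of the rezo API; the real extractor may handle more edge cases.

```typescript
// Hypothetical helpers for illustration only; not part of the rezo API.

/**
 * Split a srcset value into candidate URLs, dropping descriptors such as "2x" or "480w".
 * Simplification: assumes the URLs themselves contain no commas.
 */
function parseSrcset(srcset: string): string[] {
  return srcset
    .split(",")
    .map((candidate) => candidate.trim().split(/\s+/)[0] ?? "")
    .filter((url) => url.length > 0);
}

/**
 * Resolve URLs against the effective base: a <base href> value, when present,
 * is itself resolved against the page URL and then takes precedence.
 */
function resolveAll(urls: string[], pageUrl: string, baseHref?: string): string[] {
  const base = baseHref ? new URL(baseHref, pageUrl).href : pageUrl;
  return urls.map((u) => new URL(u, base).href);
}

const candidates = parseSrcset("/images/logo.png 1x, /images/logo-2x.png 2x");
console.log(resolveAll(candidates, "https://example.com/page.html"));
console.log(resolveAll(["img/a.png"], "https://example.com/page.html", "/static/"));
```

Resolving the <base href> value first matters because it may itself be relative to the page URL.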
Tag Filtering
You can limit extraction to specific tags:
const assets = extractor.extractFromHTML(html, baseUrl, {
followTags: ['link', 'img', 'style'], // Only these tags
});
// Or exclude specific tags:
const withoutEmbeds = extractor.extractFromHTML(html, baseUrl, {
ignoreTags: ['script', 'iframe'], // Skip these tags
});
Extracting from CSS
extractFromCSS() finds all URL references in stylesheets:
const css = `
@import url('base.css');
@import 'theme.css';
body {
background: url('/images/bg.png');
}
@font-face {
font-family: 'Custom';
src: url('../fonts/custom.woff2');
}
`;
const assets = extractor.extractFromCSS(css, 'https://example.com/css/main.css');
// Returns: base.css, theme.css, /images/bg.png, ../fonts/custom.woff2
// All resolved to absolute URLs against the base URL
Data URLs (data:...) are automatically skipped. Fragment-only URLs (#section) are ignored.
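The behavior described above can be approximated in a few lines. The following is a minimal sketch that uses a regular expression rather than a real CSS parser; extractCssUrls is a hypothetical name, and the library's own implementation may work differently.

```typescript
// Hypothetical sketch; extractCssUrls is not a rezo function.
function extractCssUrls(css: string, baseUrl: string): string[] {
  const found = new Set<string>();
  // First alternative: url(...) with optional quotes.
  // Second alternative: @import "..." written without url().
  const pattern = /url\(\s*(['"]?)([^'")]*)\1\s*\)|@import\s+(['"])([^'"]+)\3/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(css)) !== null) {
    const raw = (match[2] ?? match[4] ?? "").trim();
    // Skip empty references, data URLs, and fragment-only URLs.
    if (raw === "" || raw.startsWith("data:") || raw.startsWith("#")) continue;
    // Resolve relative references against the stylesheet's own URL.
    found.add(new URL(raw, baseUrl).href);
  }
  return Array.from(found);
}

const sampleCss = `
@import url('base.css');
@import 'theme.css';
body { background: url('/images/bg.png'); }
.icon { background: url(data:image/png;base64,AAAA); mask: url(#clip); }
`;
console.log(extractCssUrls(sampleCss, "https://example.com/css/main.css"));
```

Relative paths resolve against the stylesheet URL, not the page that linked it, which is why the base URL passed in is the CSS file's own location.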
Extracting from XML/SVG
extractFromXML() handles SVG and XML documents, scanning for href, src, and xlink:href attributes:
const svg = `
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<image href="logo.png" />
<use xlink:href="icons.svg#home" />
</svg>
`;
const assets = extractor.extractFromXML(svg, 'https://example.com/assets/icon.svg');
Extracting from JavaScript
extractFromJS() uses pattern matching to find URL-like strings in JavaScript. This is best-effort: it catches common patterns such as fetch URLs and asset paths, but may produce false positives:
const js = `
fetch('/api/data.json');
const img = './images/hero.png';
loadScript('https://cdn.example.com/lib.js');
`;
const assets = extractor.extractFromJS(js, 'https://example.com/js/app.js');
Only URLs with recognized extensions (.js, .css, .png, .jpg, .json, .html, etc.) are included.
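To make the extension check concrete, here is a minimal sketch of one way such pattern matching could work: scan string literals, keep those whose path ends in a known extension, and resolve them against the script URL. The function name, the regular expression, and the short extension list are all assumptions for illustration, not the library's actual rules.

```typescript
// Hypothetical sketch; the extension list and function name are assumptions.
const KNOWN_EXTENSIONS = new Set(["js", "css", "png", "jpg", "json", "html"]);

function extractJsUrls(js: string, baseUrl: string): string[] {
  const urls: string[] = [];
  // Capture the contents of single-, double-, or backtick-quoted literals
  // that contain no backslashes or newlines.
  const literal = /(['"`])((?:(?!\1)[^\\\n])+)\1/g;
  let match: RegExpExecArray | null;
  while ((match = literal.exec(js)) !== null) {
    const raw = match[2] ?? "";
    // Ignore any query string or fragment when checking the extension.
    const path = raw.split(/[?#]/)[0] ?? "";
    const dot = path.lastIndexOf(".");
    const ext = dot >= 0 ? path.slice(dot + 1).toLowerCase() : "";
    if (!KNOWN_EXTENSIONS.has(ext)) continue;
    try {
      urls.push(new URL(raw, baseUrl).href);
    } catch {
      // The literal looked like a path but is not a valid URL; skip it.
    }
  }
  return urls;
}

const sampleJs = `
fetch('/api/data.json');
const img = './images/hero.png';
const greeting = 'hello';
`;
console.log(extractJsUrls(sampleJs, "https://example.com/js/app.js"));
```

Any string that merely ends in a listed extension passes this check, which is one source of the false positives mentioned above.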
Auto-Detection with extract()
The extract() method routes to the correct extraction method based on MIME type:
const htmlAssets = extractor.extract(content, 'text/html', baseUrl);
const cssAssets = extractor.extract(content, 'text/css', baseUrl);
const svgAssets = extractor.extract(content, 'image/svg+xml', baseUrl);
const jsAssets = extractor.extract(content, 'application/javascript', baseUrl);
Asset Types
Every extracted asset is classified into one of these types:
type AssetType =
| 'document' // HTML, XHTML, PHP, ASP, JSP
| 'stylesheet' // CSS
| 'script' // JavaScript
| 'image' // PNG, JPG, SVG, WebP, AVIF, etc.
| 'font' // WOFF, WOFF2, TTF, OTF, EOT
| 'video' // MP4, WebM, OGG
| 'audio' // MP3, WAV, FLAC, AAC
| 'favicon' // Icons linked via rel="icon"
| 'manifest' // Web manifests
| 'iframe' // Embedded frames
| 'object' // Embedded objects
| 'data' // JSON, XML
| 'other'; // Unrecognized
The ExtractedAsset Interface
interface ExtractedAsset {
url: string; // Resolved absolute URL
type: AssetType; // Classified asset type
source: 'html' | 'css' | 'js' | 'svg' | 'xml';
tag?: string; // HTML tag that contained the URL
attribute?: string; // Attribute name (href, src, etc.)
required: boolean; // Is this a page requisite?
inline: boolean; // Was this from an inline style/style tag?
}
Filtering with acceptAssetTypes
The Wget module integrates AssetExtractor.filterAssets() to apply type-based filtering:
import { Wget } from 'rezo/wget';
await new Wget({
filter: {
acceptAssetTypes: ['document', 'stylesheet', 'image', 'font', 'favicon'],
},
}).get('https://example.com/');
The filterAssets() method supports:
- acceptAssetTypes: Only keep assets of these types
- rejectAssetTypes: Exclude assets of these types
- followTags: Only extract from these HTML tags
- ignoreTags: Skip these HTML tags
- accept / reject: Glob patterns against the full URL
- acceptRegex / rejectRegex: Regex patterns against the URL
- excludeExtensions: Exclude specific file extensions
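The sketch below shows how a few of these options might combine into a single predicate. It is a simplified stand-in, not the real filterAssets(): the Asset and FilterOptions shapes, the order in which rules are applied, and the assumption that extensions are written with a leading dot are all hypothetical.

```typescript
// Hypothetical types and function; a simplified stand-in for the real filterAssets().
interface Asset {
  url: string;
  type: string;
}

interface FilterOptions {
  acceptAssetTypes?: string[];
  rejectAssetTypes?: string[];
  rejectRegex?: RegExp; // assumed to be non-global, so test() keeps no state between calls
  excludeExtensions?: string[]; // assumed to include the leading dot, e.g. ".gif"
}

function filterAssetsSketch(assets: Asset[], opts: FilterOptions): Asset[] {
  return assets.filter((asset) => {
    // Type-based rules: the accept list narrows first, then the reject list removes.
    if (opts.acceptAssetTypes && !opts.acceptAssetTypes.includes(asset.type)) return false;
    if (opts.rejectAssetTypes && opts.rejectAssetTypes.includes(asset.type)) return false;
    // URL-based rule.
    if (opts.rejectRegex && opts.rejectRegex.test(asset.url)) return false;
    // Extension-based rule, computed from the last path segment.
    const segment = new URL(asset.url).pathname.split("/").pop() ?? "";
    const dot = segment.lastIndexOf(".");
    const ext = dot >= 0 ? segment.slice(dot).toLowerCase() : "";
    if (opts.excludeExtensions && opts.excludeExtensions.includes(ext)) return false;
    return true;
  });
}

const sampleAssets: Asset[] = [
  { url: "https://example.com/css/main.css", type: "stylesheet" },
  { url: "https://example.com/js/app.js", type: "script" },
  { url: "https://example.com/img/a.png", type: "image" },
  { url: "https://example.com/img/b.gif", type: "image" },
];
console.log(
  filterAssetsSketch(sampleAssets, {
    acceptAssetTypes: ["stylesheet", "image"],
    excludeExtensions: [".gif"],
  }).map((a) => a.url)
);
```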
AssetOrganizer
AssetOrganizer categorizes downloaded assets into logical folders and performs hash-based deduplication.
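To illustrate the deduplication half of that description, the sketch below keeps an in-memory map from MD5 content hash to the first path stored for those bytes. DedupStore is a hypothetical class written for this example; the real organizer works with files on disk and may track more metadata.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory sketch; the real organizer writes files to disk.
class DedupStore {
  // Maps a content hash to the path of the first file stored with those bytes.
  private byHash = new Map<string, string>();

  /** Returns the path to use for this content, reusing an earlier path when the bytes match. */
  add(path: string, content: Uint8Array): { path: string; duplicate: boolean } {
    const hash = createHash("md5").update(content).digest("hex");
    const existing = this.byHash.get(hash);
    if (existing !== undefined) {
      return { path: existing, duplicate: true };
    }
    this.byHash.set(hash, path);
    return { path, duplicate: false };
  }
}

const store = new DedupStore();
const logoBytes = new TextEncoder().encode("pretend these are PNG bytes");
console.log(store.add("images/logo.png", logoBytes)); // first copy is stored
console.log(store.add("images/logo-cdn.png", logoBytes)); // same bytes, so the first path is reused
```

Keying on the content hash instead of the URL is what lets two different URLs collapse into one stored file.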
Default Folder Structure
| Category | Folder | File Types |
|---|---|---|
| Stylesheets | css/ | .css |
| JavaScript | js/ | .js, .mjs, .cjs |
| Images | images/ | .png, .jpg, .gif, .webp, .svg, .ico, .avif |
| Fonts | fonts/ | .woff, .woff2, .ttf, .otf, .eot |
| Audio | audio/ | .mp3, .wav, .ogg, .aac, .flac |
| Video | video/ | .mp4, .webm, .ogv, .mov |
| Other | assets/ | Everything else |
HTML documents are never reorganized; they keep their URL-based path structure.
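The table can be read as a lookup from file extension to folder. A minimal sketch of that mapping follows; folderFor and its isHtmlDocument parameter are hypothetical, and the actual organizer may classify files by other signals such as MIME type.

```typescript
// Hypothetical lookup built from the table above; not the library's actual code.
const FOLDER_BY_EXTENSION: Record<string, string> = {
  ".css": "css",
  ".js": "js", ".mjs": "js", ".cjs": "js",
  ".png": "images", ".jpg": "images", ".gif": "images", ".webp": "images",
  ".svg": "images", ".ico": "images", ".avif": "images",
  ".woff": "fonts", ".woff2": "fonts", ".ttf": "fonts", ".otf": "fonts", ".eot": "fonts",
  ".mp3": "audio", ".wav": "audio", ".ogg": "audio", ".aac": "audio", ".flac": "audio",
  ".mp4": "video", ".webm": "video", ".ogv": "video", ".mov": "video",
};

/**
 * Pick the target folder for a downloaded asset.
 * Returns null for HTML documents, which keep their URL-based path.
 */
function folderFor(url: string, isHtmlDocument = false): string | null {
  if (isHtmlDocument) return null;
  // Look only at the last path segment so dots in directory names are ignored.
  const segment = new URL(url).pathname.split("/").pop() ?? "";
  const dot = segment.lastIndexOf(".");
  const ext = dot >= 0 ? segment.slice(dot).toLowerCase() : "";
  return FOLDER_BY_EXTENSION[ext] ?? "assets";
}

console.log(folderFor("https://example.com/static/app.mjs"));
console.log(folderFor("https://example.com/files/report.pdf"));
console.log(folderFor("https://example.com/index.html", true));
```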
Custom Folder Names
const wg = new Wget({
organizeAssets: true,
assetFolders: {
css: 'styles',
js: 'scripts',
images: 'img',
},
});
Hash-Based Deduplication
The organizer computes an MD5 hash of each downloaded file’s content. If two different URLs serve the same bytes, only one copy is stored on disk:
// Both URLs serve the same image
// https://example.com/images/logo.png
// https://cdn.example.com/logo.png
// -> Only one file saved: images/logo.png
StyleExtractor
StyleExtractor post-processes downloaded HTML files to extract inline <style> tags into standalone CSS files.
How It Works
- Parses the HTML with linkedom
- Finds all <style> tags with content
- Writes each to a separate .css file
- Replaces the <style> tag with a <link rel="stylesheet"> reference
- Preserves media attributes on the generated <link> tag
Naming Convention
CSS files are named using the first available attribute on the <style> tag:
| Priority | Attribute | Resulting Filename |
|---|---|---|
| 1 | id | internal.{id}.css |
| 2 | name | internal.{name}.css |
| 3 | class (first) | internal.{class}.css |
| 4 | (none) | internal.{index}.css |
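The priority order in the table can be expressed as a short fallback chain. In this sketch, styleFileName and StyleTagAttrs are hypothetical names, and treating an empty attribute as absent is an assumption.

```typescript
// Hypothetical helper mirroring the priority table above.
interface StyleTagAttrs {
  id?: string;
  name?: string;
  class?: string;
}

function styleFileName(attrs: StyleTagAttrs, index: number): string {
  // Only the first class name is considered when several are present.
  const firstClass = attrs.class?.trim().split(/\s+/)[0];
  // Empty strings are falsy, so an empty attribute falls through to the next candidate.
  const label = attrs.id || attrs.name || firstClass || String(index);
  return `internal.${label}.css`;
}

console.log(styleFileName({ id: "theme" }, 0)); // internal.theme.css
console.log(styleFileName({ class: "dark wide" }, 2)); // internal.dark.css
console.log(styleFileName({}, 1)); // internal.1.css
```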
Example
Before:
<head>
<style id="theme">
body { background: #1a1a2e; color: #e0e0e0; }
.header { border-bottom: 1px solid #333; }
</style>
<style>
.sidebar { width: 300px; }
</style>
</head>
After (with organizeAssets: true):
<head>
<link rel="stylesheet" href="./css/internal.theme.css">
<link rel="stylesheet" href="./css/internal.1.css">
</head>
Usage with Wget
await new Wget({
extractInternalStyles: true,
organizeAssets: true,
download: { outputDir: './mirror' },
})
.convertLinks()
.pageRequisites()
.get('https://example.com/');
When removeJavascript is also enabled, StyleExtractor removes all <script> tags and inline event handlers in the same pass.