Crawler Overview
Rezo includes a full-featured web crawler designed for production workloads. It is imported from rezo/crawler and provides queue-based URL processing, automatic caching, URL deduplication, robots.txt compliance, memory management, health monitoring, and optional stealth integration.
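Among those features, robots.txt compliance means each URL's path is checked against the site's rules before it is queued. As a rough illustration (a simplified sketch, not Rezo's RobotsTxt class), an allow check reduces to:

```typescript
// Simplified robots.txt allow check (illustrative only; not Rezo's parser).
// Handles only `User-agent: *` groups and plain `Disallow:` path prefixes.
function isAllowed(robotsTxt: string, path: string): boolean {
  const disallowed: string[] = [];
  let inStarGroup = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.split('#')[0].trim(); // strip comments
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(key.trim())) {
      inStarGroup = value === '*';
    } else if (inStarGroup && /^disallow$/i.test(key.trim()) && value) {
      disallowed.push(value);
    }
  }
  return !disallowed.some((prefix) => path.startsWith(prefix));
}

const robots = 'User-agent: *\nDisallow: /admin/\nDisallow: /tmp';
isAllowed(robots, '/admin/users'); // false
isAllowed(robots, '/blog/post');   // true
```

A real parser also handles Allow directives, wildcards, and per-agent groups, which the crawler's RobotsTxt component is responsible for.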
import { Crawler, CrawlerOptions } from 'rezo/crawler';

Core Architecture
The crawler uses a queue-based architecture with separate queues for page crawling and scraping tasks:
Crawler
├── Main Queue (RezoQueue) ---- processes page URLs with configurable concurrency
├── Scraper Queue (RezoQueue) - handles email extraction and external site parsing
├── FileCacher (SQLite) ----- caches HTTP responses to avoid re-fetching
├── UrlStore (SQLite) ------- tracks visited URLs with TTL and deduplication
├── NavigationHistory ------- persists crawl state for resumable sessions
├── RobotsTxt --------------- parses and enforces robots.txt rules
├── MemoryMonitor ----------- tracks V8 heap usage, throttles on pressure
├── HealthMetrics ----------- real-time RPS, success rate, p95 latency
└── Rezo HTTP Client -------- executes requests with adapter selection

Basic Usage
import { Crawler } from 'rezo/crawler';
const crawler = new Crawler({
  baseUrl: 'https://example.com',
  concurrency: 20,
  maxDepth: 3,
  maxUrls: 1000,
  timeout: 15000
});
// Handle each page's HTML document
crawler.onDocument(async function (document, response) {
  const title = document.querySelector('title')?.textContent;
  console.log(`${response.url}: ${title}`);
});
// Start crawling
await crawler.visit('https://example.com');
// Wait for all queued work to complete
await crawler.done();

Constructor
const crawler = new Crawler(crawlerOptions: ICrawlerOptions, http?: Rezo);

Parameters:
crawlerOptions — an ICrawlerOptions object or a CrawlerOptions builder instance that defines all crawler behavior
http (optional) — an existing Rezo HTTP client instance. If omitted, the crawler creates its own Rezo instance configured from the options.
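For instance, a preconfigured client can be shared between the crawler and other code by passing it as the second argument. This is a configuration sketch: the root `rezo` import path and the `Rezo` constructor options shown here are assumptions, so adjust them to your setup.

```typescript
import { Rezo } from 'rezo';            // assumed root export; adjust to your setup
import { Crawler } from 'rezo/crawler';

// Reuse an existing client so the crawler shares its configuration
// (timeouts, interceptors, cookies) with the rest of the application.
const http = new Rezo({ timeout: 15000 });

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  concurrency: 10
}, http);
```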
Event Handlers
The crawler provides a set of callback-based event handlers:
// Called for every HTML document fetched
crawler.onDocument(async function (document, response) {
  // document: parsed DOM (linkedom)
  // response: { url, status, headers, data, contentType, finalUrl }
});
// Called for each anchor element found on a page
crawler.onAnchor(async function (anchor) {
  // anchor: { element, href, linkText, rel, isNofollow }
});
// Called for each text node matching a CSS selector
crawler.onText('h1', async function (text) {
  // text: string content
  // `this` is the DOM element (use function syntax, not arrow)
});
// Called when an email address is discovered
crawler.onEmailDiscovered(async function (event) {
  // event: { email, discoveredAt, timestamp, metadata }
});
// Called when a redirect is detected
crawler.onRedirect(async function (event) {
  // event: { originalUrl, finalUrl, redirectCount, statusCode }
});

Visiting URLs
// Visit a single URL
await crawler.visit('https://example.com/page');
// Visit with HTTP method and body
await crawler.visit('https://example.com/api', 'POST', { query: 'data' });
// Visit through Oxylabs proxy
await crawler.visitOxylabs('https://example.com/page', { geoLocation: 'United States' });
// Visit through Decodo proxy
await crawler.visitDecodo('https://example.com/page', { country: 'Germany' });

Stealth Integration
Pass a RezoStealth instance to the crawler for browser-like fingerprinting:
import { Crawler } from 'rezo/crawler';
import { RezoStealth } from 'rezo/stealth';
const crawler = new Crawler({
  baseUrl: 'https://protected-site.com',
  stealth: RezoStealth.chrome(),
  concurrency: 10
});

With stealth enabled, every HTTP request uses the stealth profile's TLS fingerprint, header ordering, client hints, and User-Agent, overriding useRndUserAgent if set.
AutoThrottle
AutoThrottle dynamically adjusts request delay based on server response times.
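Conceptually, an auto-throttle of this kind blends the current delay with the latest observed latency and clamps the result. The function below is an illustrative model, not Rezo's actual algorithm:

```typescript
// Illustrative auto-throttle model (an assumption, not Rezo's source):
// slow responses pull the delay up; fast ones let it relax toward the target.
interface ThrottleOptions {
  targetDelay: number; // desired steady-state delay between requests (ms)
  minDelay: number;    // lower bound on the delay (ms)
  maxDelay: number;    // upper bound on the delay (ms)
}

function nextDelay(current: number, observedLatency: number, opts: ThrottleOptions): number {
  // Blend toward the observed latency, but never below the target pace.
  const proposed = (current + Math.max(observedLatency, opts.targetDelay)) / 2;
  // Clamp into the configured [minDelay, maxDelay] window.
  return Math.min(opts.maxDelay, Math.max(opts.minDelay, proposed));
}

// A slow server (3s responses) pushes a 1s delay up to 2s:
const opts = { targetDelay: 1000, minDelay: 100, maxDelay: 60000 };
nextDelay(1000, 3000, opts); // 2000
```

In the crawler itself, the equivalent bounds are set through these options: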
const crawler = new Crawler({
  baseUrl: 'https://example.com',
  autoThrottle: true,
  autoThrottleTargetDelay: 1000, // aim for ~1 request per second
  autoThrottleMinDelay: 100,     // delay between requests never drops below 100ms
  autoThrottleMaxDelay: 60000    // delay between requests never exceeds 60s
});

Signal Handlers
Enable graceful shutdown that persists crawl state on SIGINT/SIGTERM:
const crawler = new Crawler({
  baseUrl: 'https://example.com',
  enableSignalHandlers: true,
  enableNavigationHistory: true,
  sessionId: 'my-crawl-session'
});
// If the process is interrupted (Ctrl+C), the crawler:
// 1. Pauses the queue
// 2. Saves the session state to SQLite
// 3. Exits cleanly
// On next run with the same sessionId, it resumes from where it left off

Adapter Selection
The crawler supports multiple HTTP adapters:
const crawler = new Crawler({
  baseUrl: 'https://example.com',
  adapter: 'http' // Standard Node.js HTTP (default)
  // adapter: 'http2' // HTTP/2 with session pooling
  // adapter: 'curl' // cURL for maximum compatibility
  // adapter: 'fetch' // Fetch API
});

Export Results
Export collected data in JSON, JSONL, or CSV format:
// Collect results
const results = [];
crawler.onDocument(async function (doc, res) {
  results.push({ url: res.url, title: doc.querySelector('title')?.textContent });
});
await crawler.visit('https://example.com');
await crawler.done();
// Export
await crawler.export(results, './output.jsonl', 'jsonl');
await crawler.export(results, './output.csv', 'csv');

Complete Example
import { Crawler, CrawlerOptions } from 'rezo/crawler';
import { RezoStealth } from 'rezo/stealth';
const options = new CrawlerOptions({
  baseUrl: 'https://example.com',
  concurrency: 30,
  scraperConcurrency: 10,
  maxDepth: 5,
  maxUrls: 5000,
  timeout: 20000,
  enableCache: true,
  cacheTTL: 86400000, // 24 hours in ms
  respectRobotsTxt: true,
  autoThrottle: true,
  enableNavigationHistory: true,
  enableSignalHandlers: true
});
options.addProxy({
  domain: 'example.com',
  proxy: { host: 'proxy.example.com', port: 8080 }
});
options.addLimiter({
  domain: 'example.com',
  options: { concurrency: 5, interval: 1000, intervalCap: 3 }
});
const crawler = new Crawler(options);
crawler.onDocument(async function (doc, res) {
  console.log(`Crawled: ${res.url}`);
});
await crawler.visit('https://example.com');
await crawler.done();