Crawler

Crawler Overview

Rezo includes a full-featured web crawler designed for production workloads. It is imported from rezo/crawler and provides queue-based URL processing, automatic caching, URL deduplication, robots.txt compliance, memory management, health monitoring, and optional stealth integration.

import { Crawler, CrawlerOptions } from 'rezo/crawler';

The crawler depends on linkedom for HTML parsing. linkedom ships as a regular dependency of rezo, so installing rezo is enough and no separate install step is required.

Core Architecture

The crawler uses a queue-based architecture with separate queues for page crawling and scraping tasks:

Crawler
  ├── Main Queue (RezoQueue) ------ processes page URLs with configurable concurrency
  ├── Scraper Queue (RezoQueue) --- handles email extraction and external site parsing
  ├── FileCacher (SQLite) --------- caches HTTP responses to avoid re-fetching
  ├── UrlStore (SQLite) ----------- tracks visited URLs with TTL and deduplication
  ├── NavigationHistory ----------- persists crawl state for resumable sessions
  ├── RobotsTxt ------------------- parses and enforces robots.txt rules
  ├── MemoryMonitor --------------- tracks V8 heap usage, throttles on pressure
  ├── HealthMetrics --------------- real-time RPS, success rate, p95 latency
  └── Rezo HTTP Client ------------ executes requests with adapter selection
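
To make the "visited URLs with TTL and deduplication" idea concrete, here is a minimal in-memory sketch. It is not Rezo's UrlStore (which is SQLite-backed and whose schema is not described here); the class name VisitedUrls, its methods, and the normalization rule are assumptions chosen for illustration only.

```typescript
// Hypothetical in-memory stand-in for a TTL-based visited-URL store.
// It only illustrates the idea; it is not how Rezo's UrlStore is implemented.
class VisitedUrls {
  private readonly seenAt = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Normalize so trivially different spellings collapse to one key.
  // The URL parser already lowercases the host; we also drop the fragment,
  // which is never sent to the server.
  static normalize(rawUrl: string): string {
    const url = new URL(rawUrl);
    url.hash = '';
    return url.toString();
  }

  // Returns true (and records the visit) when the URL is new or its
  // previous visit is older than the TTL; returns false for a fresh duplicate.
  markIfNew(rawUrl: string, now: number = Date.now()): boolean {
    const key = VisitedUrls.normalize(rawUrl);
    const last = this.seenAt.get(key);
    if (last !== undefined && now - last < this.ttlMs) return false;
    this.seenAt.set(key, now);
    return true;
  }
}
```

Passing the clock in as a parameter keeps the sketch deterministic; a persistent store would do the same lookup against a table instead of a Map.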

Basic Usage

import { Crawler } from 'rezo/crawler';

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  concurrency: 20,
  maxDepth: 3,
  maxUrls: 1000,
  timeout: 15000
});

// Handle each page's HTML document
crawler.onDocument(async function (document) {
  const title = document.querySelector('title')?.textContent;
  console.log(`${document.location?.href}: ${title}`);
});

// Start crawling
await crawler.visit('https://example.com');

// Wait for all queued work to complete
await crawler.done();

Constructor

new Crawler(crawlerOptions: ICrawlerOptions | CrawlerOptions, http?: Rezo)

Parameters:

  • crawlerOptions — an ICrawlerOptions object or a CrawlerOptions builder instance that defines all crawler behavior
  • http (optional) — an existing Rezo HTTP client instance. If omitted, the crawler creates its own Rezo instance configured from the options.

Event Handlers

The crawler provides a set of callback-based event handlers:

// Called for every HTML document fetched. The handler receives the
// parsed `Document` (linkedom). Use `document.location.href` if you
// need the URL, or capture the URL from the surrounding `visit()` call.
crawler.onDocument(async function (document) {
  const title = document.querySelector('title')?.textContent;
  // ...
});

// Called for each <a> on the page. The handler receives the live
// HTMLAnchorElement directly — read `.href`, `.textContent`,
// `.rel`, etc. straight off the element.
crawler.onAnchor(async function (anchor) {
  const href = anchor.href;
  const text = anchor.textContent?.trim();
  const isNofollow = (anchor.rel || '').split(/\s+/).includes('nofollow');
});

// Called with the text content of each element matching a CSS selector
crawler.onText('h1', async function (text) {
  // text: string content
  // `this` is the DOM element (use function syntax, not arrow)
});

// Called when an email address is discovered
crawler.onEmailDiscovered(async function (event) {
  // event: { email, discoveredAt, timestamp, metadata }
});

// Called when a redirect is detected
crawler.onRedirect(async function (event) {
  // event: { originalUrl, finalUrl, redirectCount, statusCode }
});
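
The rel check inside an onAnchor handler is easy to get subtly wrong (case, repeated whitespace, partial matches), so it can help to pull it into a small helper. The sketch below is independent of Rezo; AnchorLike and hasRelToken are hypothetical names, and the helper only assumes the element exposes a rel string, as HTMLAnchorElement does.

```typescript
// Minimal structural type: anything with an optional `rel` string works,
// including a real HTMLAnchorElement. (Hypothetical name, for illustration.)
interface AnchorLike {
  rel?: string | null;
}

// True when `token` appears as a whole, case-insensitive,
// whitespace-separated entry in the anchor's rel attribute.
function hasRelToken(anchor: AnchorLike, token: string): boolean {
  return (anchor.rel ?? '')
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean)
    .includes(token.toLowerCase());
}
```

Inside a handler this would read `if (hasRelToken(anchor, 'nofollow')) return;`, leaving the decision about whether to skip such links to your own crawl policy.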

Visiting URLs

// Visit a single URL
await crawler.visit('https://example.com/page');

// Visit with HTTP method and body
await crawler.visit('https://example.com/api', 'POST', { query: 'data' });

For Oxylabs or Decodo proxy services, configure them on the CrawlerOptions builder before constructing the crawler — then every matching visit() call routes through them automatically. See the proxy services page for the full configuration shape.

Stealth Integration

Pass a RezoStealth instance to the crawler for browser-like fingerprinting:

import { Crawler } from 'rezo/crawler';
import { RezoStealth } from 'rezo/stealth';

const crawler = new Crawler({
  baseUrl: 'https://protected-site.com',
  stealth: RezoStealth.chrome(),
  concurrency: 10
});

With stealth enabled, every HTTP request uses the stealth profile’s TLS fingerprint, header ordering, client hints, and User-Agent. The profile’s User-Agent takes precedence over useRndUserAgent when both are set.

AutoThrottle

AutoThrottle dynamically adjusts request delay based on server response times:

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  autoThrottle: true,
  autoThrottleTargetDelay: 1000,  // Aim for about 1000 ms between requests (~1 request/second)
  autoThrottleMinDelay: 100,      // Never less than 100 ms between requests
  autoThrottleMaxDelay: 60000     // Never more than 60 s between requests
});
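
This page does not spell out how the delay is adjusted. As a rough illustration of the general idea only, and not Rezo's actual algorithm, the sketch below moves the delay halfway toward the latest observed response time and clamps the result between the minimum and maximum. The names ThrottleLimits and nextDelay are hypothetical.

```typescript
// Hypothetical limits, mirroring autoThrottleMinDelay / autoThrottleMaxDelay.
interface ThrottleLimits {
  minDelay: number; // milliseconds
  maxDelay: number; // milliseconds
}

// Assumed update rule: average the current delay with the observed latency,
// then clamp. Slow responses push the delay up; fast ones pull it down.
function nextDelay(
  currentDelayMs: number,
  observedLatencyMs: number,
  limits: ThrottleLimits
): number {
  const proposed = (currentDelayMs + observedLatencyMs) / 2;
  return Math.min(limits.maxDelay, Math.max(limits.minDelay, proposed));
}
```

Under this assumed rule, a crawler would start at the target delay and call nextDelay after each response, so a server that slows down is contacted less often without any manual tuning.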

Signal Handlers

Enable graceful shutdown that persists crawl state on SIGINT/SIGTERM:

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  enableSignalHandlers: true,
  enableNavigationHistory: true,
  sessionId: 'my-crawl-session'
});

// If the process is interrupted (Ctrl+C), the crawler:
// 1. Pauses the queue
// 2. Saves the session state to SQLite
// 3. Exits cleanly
// On next run with the same sessionId, it resumes from where it left off
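
The pause, save, and resume steps above boil down to serializing whatever work is still pending and filtering it against what was already visited on the next run. The sketch below shows that round trip with plain JSON. The SessionSnapshot shape and both function names are assumptions for illustration; Rezo stores its session state in SQLite and its real layout is not documented on this page.

```typescript
// Hypothetical snapshot shape, not Rezo's actual session schema.
interface SessionSnapshot {
  sessionId: string;
  pending: string[];
  visited: string[];
  savedAt: number;
}

// Called from a shutdown path after the queue has been paused.
function saveSnapshot(
  sessionId: string,
  pending: Iterable<string>,
  visited: Iterable<string>,
  now: number
): string {
  const state: SessionSnapshot = {
    sessionId,
    pending: Array.from(pending),
    visited: Array.from(visited),
    savedAt: now,
  };
  return JSON.stringify(state);
}

// Returns the URLs that still need crawling, or an empty list when the
// snapshot belongs to a different session.
function resumeFrom(serialized: string, sessionId: string): string[] {
  const state = JSON.parse(serialized) as SessionSnapshot;
  if (state.sessionId !== sessionId) return [];
  const done = new Set(state.visited);
  return state.pending.filter((url) => !done.has(url));
}
```

Keying the snapshot by sessionId is what lets a second run with the same identifier pick up the remaining work while unrelated crawls start clean.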

Adapter Selection

The crawler supports multiple HTTP adapters:

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  adapter: 'http'   // Standard Node.js HTTP (default)
  // adapter: 'http2'  // HTTP/2 with session pooling
  // adapter: 'curl'   // cURL for maximum compatibility
  // adapter: 'fetch'  // Fetch API
});

Export Results

The crawler’s internal navigation history can be exported in JSON, JSONL, or CSV format via exportData(filePath, format?). The format defaults to 'json'.

await crawler.visit('https://example.com');
await crawler.done();

// Exports the crawler's collected page/URL state to disk
await crawler.exportData('./output.jsonl', 'jsonl');
await crawler.exportData('./output.csv',   'csv');

If you want to collect your own custom records (titles, prices, anything you compute inside onDocument), keep them in your own array and write them out yourself — exportData only knows about the crawler’s internal state.
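
For custom records, a pair of small serializers is usually all that is needed. The sketch below assumes a hypothetical PageRecord shape with url and title fields; toJsonl, toCsv, and csvField are illustrative helpers, not part of Rezo.

```typescript
// Hypothetical record shape; use whatever fields you collect in onDocument.
type PageRecord = { url: string; title: string };

// One JSON object per line, with a trailing newline when there is any data.
function toJsonl(records: PageRecord[]): string {
  if (records.length === 0) return '';
  return records.map((r) => JSON.stringify(r)).join('\n') + '\n';
}

// Quote a CSV field only when it contains a comma, quote, or line break,
// doubling any embedded quotes (RFC 4180 style).
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Header row followed by one row per record.
function toCsv(records: PageRecord[]): string {
  const rows = records.map((r) => [r.url, r.title].map(csvField).join(','));
  return ['url,title', ...rows].join('\n') + '\n';
}
```

Push a record into an array from your onDocument handler, then after crawler.done() write the resulting string to disk, for example with writeFile from node:fs/promises.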

Complete Example

import { Crawler, CrawlerOptions } from 'rezo/crawler';
import { RezoStealth } from 'rezo/stealth';

const options = new CrawlerOptions({
  baseUrl: 'https://example.com',
  concurrency: 30,
  scraperConcurrency: 10,
  maxDepth: 5,
  maxUrls: 5000,
  timeout: 20000,
  enableCache: true,
  cacheTTL: 86400000,
  respectRobotsTxt: true,
  autoThrottle: true,
  enableNavigationHistory: true,
  enableSignalHandlers: true
});

options.addProxy({
  domain: 'example.com',
  proxy: { host: 'proxy.example.com', port: 8080 }
});

options.addLimiter({
  domain: 'example.com',
  options: { concurrency: 5, interval: 1000, intervalCap: 3 }
});

const crawler = new Crawler(options);

crawler.onDocument(async function (doc) {
  console.log(`Crawled: ${doc.location?.href}`);
});

await crawler.visit('https://example.com');
await crawler.done();
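
The limiter options in this example are not defined on this page. Assuming interval and intervalCap behave like the same-named options in p-queue (at most intervalCap task starts per interval milliseconds), the sketch below models that rule as a fixed-window counter. IntervalCap is a hypothetical class written for illustration and is not RezoQueue's implementation.

```typescript
// Fixed-window start limiter: allows at most `cap` starts per `intervalMs`.
// Assumed semantics, modelled on p-queue's interval / intervalCap options.
class IntervalCap {
  private windowStart = -Infinity;
  private startedInWindow = 0;
  private readonly intervalMs: number;
  private readonly cap: number;

  constructor(intervalMs: number, cap: number) {
    this.intervalMs = intervalMs;
    this.cap = cap;
  }

  // Returns true if a task may start at time `now` (milliseconds).
  tryStart(now: number): boolean {
    // Open a fresh window once the current one has elapsed.
    if (now - this.windowStart >= this.intervalMs) {
      this.windowStart = now;
      this.startedInWindow = 0;
    }
    if (this.startedInWindow >= this.cap) return false;
    this.startedInWindow += 1;
    return true;
  }
}
```

With interval: 1000 and intervalCap: 3, this model admits three requests to example.com per second; the separate concurrency: 5 setting would then bound how many of those may be in flight at once.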