Crawler

Crawler Overview

Rezo includes a full-featured web crawler designed for production workloads. It is imported from rezo/crawler and provides queue-based URL processing, automatic caching, URL deduplication, robots.txt compliance, memory management, health monitoring, and optional stealth integration.

import { Crawler, CrawlerOptions } from 'rezo/crawler';

Core Architecture

The crawler uses a queue-based architecture with separate queues for page crawling and scraping tasks:

Crawler
  ├── Main Queue (RezoQueue) ---- processes page URLs with configurable concurrency
  ├── Scraper Queue (RezoQueue) - handles email extraction and external site parsing
  ├── FileCacher (SQLite) ------- caches HTTP responses to avoid re-fetching
  ├── UrlStore (SQLite) --------- tracks visited URLs with TTL and deduplication
  ├── NavigationHistory --------- persists crawl state for resumable sessions
  ├── RobotsTxt ----------------- parses and enforces robots.txt rules
  ├── MemoryMonitor ------------- tracks V8 heap usage, throttles on pressure
  ├── HealthMetrics ------------- real-time RPS, success rate, p95 latency
  └── Rezo HTTP Client ---------- executes requests with adapter selection
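The two queues are ordinary concurrency-limited task queues. A minimal sketch of that pattern in plain TypeScript (this is an illustration of the idea, not Rezo's actual RezoQueue implementation):

```typescript
// Minimal concurrency-limited queue, illustrating the pattern behind the
// crawler's main and scraper queues. Sketch only; not Rezo's RezoQueue.
class TaskQueue {
  private active = 0;
  private pending: Array<() => Promise<void>> = [];
  private idleResolvers: Array<() => void> = [];

  constructor(private concurrency: number) {}

  add(task: () => Promise<void>): void {
    this.pending.push(task);
    this.drain();
  }

  // Resolves once every queued task has finished (like crawler.done()).
  onIdle(): Promise<void> {
    if (this.active === 0 && this.pending.length === 0) return Promise.resolve();
    return new Promise((resolve) => this.idleResolvers.push(resolve));
  }

  private drain(): void {
    // Start tasks until the concurrency limit is reached.
    while (this.active < this.concurrency && this.pending.length > 0) {
      const task = this.pending.shift()!;
      this.active++;
      task().finally(() => {
        this.active--;
        if (this.active === 0 && this.pending.length === 0) {
          this.idleResolvers.splice(0).forEach((resolve) => resolve());
        }
        this.drain();
      });
    }
  }
}
```

The crawler's concurrency and scraperConcurrency options bound the number of in-flight tasks per queue in the same way the constructor argument does here.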

Basic Usage

import { Crawler } from 'rezo/crawler';

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  concurrency: 20,
  maxDepth: 3,
  maxUrls: 1000,
  timeout: 15000
});

// Handle each page's HTML document
crawler.onDocument(async function (document, response) {
  const title = document.querySelector('title')?.textContent;
  console.log(`${response.url}: ${title}`);
});

// Start crawling
await crawler.visit('https://example.com');

// Wait for all queued work to complete
await crawler.done();

Constructor

new Crawler(crawlerOptions: ICrawlerOptions | CrawlerOptions, http?: Rezo)

Parameters:

  • crawlerOptions — an ICrawlerOptions object or a CrawlerOptions builder instance that defines all crawler behavior
  • http (optional) — an existing Rezo HTTP client instance. If omitted, the crawler creates its own Rezo instance configured from the options.

Event Handlers

The crawler provides a set of callback-based event handlers:

// Called for every HTML document fetched
crawler.onDocument(async function (document, response) {
  // document: parsed DOM (linkedom)
  // response: { url, status, headers, data, contentType, finalUrl }
});

// Called for each anchor element found on a page
crawler.onAnchor(async function (anchor) {
  // anchor: { element, href, linkText, rel, isNofollow }
});

// Called for each text node matching a CSS selector
crawler.onText('h1', async function (text) {
  // text: string content
  // `this` is the DOM element (use function syntax, not arrow)
});

// Called when an email address is discovered
crawler.onEmailDiscovered(async function (event) {
  // event: { email, discoveredAt, timestamp, metadata }
});

// Called when a redirect is detected
crawler.onRedirect(async function (event) {
  // event: { originalUrl, finalUrl, redirectCount, statusCode }
});
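The note about using function syntax rather than an arrow matters because the crawler invokes selector handlers with `this` bound to the matched element. A standalone illustration of that binding behavior (the dispatcher below is a stand-in for demonstration, not Rezo's internals):

```typescript
// Why onText needs `function` rather than an arrow: the handler is called
// with `this` bound to the matched element. The dispatcher here is a
// hypothetical stand-in that mimics that call style.
type ElementLike = { tagName: string };
type TextHandler = (this: ElementLike, text: string) => void;

function dispatch(element: ElementLike, text: string, handler: TextHandler): void {
  handler.call(element, text); // binds `this` to the element, as the crawler does
}

let seenTag = '';
dispatch({ tagName: 'H1' }, 'Welcome', function (text) {
  seenTag = this.tagName; // works: `this` is the element
});
// An arrow function would capture the enclosing `this` instead, so
// `this.tagName` would not refer to the element.
```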

Visiting URLs

// Visit a single URL
await crawler.visit('https://example.com/page');

// Visit with HTTP method and body
await crawler.visit('https://example.com/api', 'POST', { query: 'data' });

// Visit through Oxylabs proxy
await crawler.visitOxylabs('https://example.com/page', { geoLocation: 'United States' });

// Visit through Decodo proxy
await crawler.visitDecodo('https://example.com/page', { country: 'Germany' });

Stealth Integration

Pass a RezoStealth instance to the crawler for browser-like fingerprinting:

import { Crawler } from 'rezo/crawler';
import { RezoStealth } from 'rezo/stealth';

const crawler = new Crawler({
  baseUrl: 'https://protected-site.com',
  stealth: RezoStealth.chrome(),
  concurrency: 10
});

With stealth enabled, every HTTP request uses the stealth profile's TLS fingerprint, header ordering, client hints, and User-Agent; this overrides useRndUserAgent if it is also set.

AutoThrottle

AutoThrottle dynamically adjusts request delay based on server response times:

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  autoThrottle: true,
  autoThrottleTargetDelay: 1000,  // Target 1 request/second
  autoThrottleMinDelay: 100,      // Never faster than 100ms between requests
  autoThrottleMaxDelay: 60000     // Never slower than 60s between requests
});
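One plausible shape for the underlying update rule, loosely modeled on Scrapy-style AutoThrottle (this is an assumption for illustration; Rezo's exact formula is not documented here, and in this sketch autoThrottleTargetDelay would seed the initial delay):

```typescript
// Blend the current inter-request delay with the latest observed response
// latency (a 50/50 moving average), then clamp to the configured bounds.
// Hypothetical sketch of an AutoThrottle update step, not Rezo's actual code.
function nextDelay(
  currentDelay: number, // ms of delay currently applied between requests
  latency: number,      // ms the latest response took
  minDelay: number,     // autoThrottleMinDelay
  maxDelay: number      // autoThrottleMaxDelay
): number {
  const proposed = (currentDelay + latency) / 2;
  return Math.min(Math.max(proposed, minDelay), maxDelay);
}
```

The averaging means a slow server pushes the delay up gradually rather than in one jump, and the clamp keeps a single outlier response from stalling the crawl.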

Signal Handlers

Enable graceful shutdown that persists crawl state on SIGINT/SIGTERM:

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  enableSignalHandlers: true,
  enableNavigationHistory: true,
  sessionId: 'my-crawl-session'
});

// If the process is interrupted (Ctrl+C), the crawler:
// 1. Pauses the queue
// 2. Saves the session state to SQLite
// 3. Exits cleanly
// On next run with the same sessionId, it resumes from where it left off
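The three shutdown steps can be sketched as a small reusable handler. The helper and its names below are hypothetical illustrations of the pattern, not Rezo's internals; the exit callback is injectable only to make the sketch testable:

```typescript
// Sketch of a graceful-shutdown handler: pause, persist, exit.
type ShutdownSteps = {
  pause: () => void;              // stop pulling new work from the queue
  saveState: () => Promise<void>; // persist session state (e.g. to SQLite)
  exit?: (code: number) => void;  // injectable for testing; defaults to process.exit
};

function makeShutdownHandler(steps: ShutdownSteps): () => Promise<void> {
  let shuttingDown = false;
  return async () => {
    if (shuttingDown) return;        // ignore a second Ctrl+C mid-save
    shuttingDown = true;
    steps.pause();                   // 1. pause the queue
    await steps.saveState();         // 2. save the session state
    (steps.exit ?? process.exit)(0); // 3. exit cleanly
  };
}

// Wiring it to the signals the crawler listens for:
// const handler = makeShutdownHandler({ pause, saveState });
// process.once('SIGINT', handler);
// process.once('SIGTERM', handler);
```

Guarding against re-entry matters because users often press Ctrl+C twice; without the flag, the second signal could interrupt the state save it is waiting on.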

Adapter Selection

The crawler supports multiple HTTP adapters:

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  adapter: 'http'   // Standard Node.js HTTP (default)
  // adapter: 'http2'  // HTTP/2 with session pooling
  // adapter: 'curl'   // cURL for maximum compatibility
  // adapter: 'fetch'  // Fetch API
});

Export Results

Export collected data in JSON, JSONL, or CSV format:

// Collect results
const results = [];
crawler.onDocument(async function (doc, res) {
  results.push({ url: res.url, title: doc.querySelector('title')?.textContent });
});

await crawler.visit('https://example.com');
await crawler.done();

// Export
await crawler.export(results, './output.jsonl', 'jsonl');
await crawler.export(results, './output.csv', 'csv');
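For reference, this is what the two line-oriented formats contain, sketched without the crawler (crawler.export handles the file writing; the helpers below only show the serialization and are not part of Rezo's API):

```typescript
// JSONL: one JSON object per line, easy to stream and append.
// CSV: header row plus one escaped row per record (RFC 4180 style quoting).
type Row = Record<string, string | number | null | undefined>;

function toJsonl(rows: Row[]): string {
  return rows.map((row) => JSON.stringify(row)).join('\n');
}

function toCsv(rows: Row[]): string {
  if (rows.length === 0) return '';
  const headers = Object.keys(rows[0]);
  const escape = (value: unknown): string => {
    const s = value == null ? '' : String(value);
    // Quote fields containing commas, quotes, or newlines; double any quotes.
    return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  const lines = [headers.join(',')];
  for (const row of rows) {
    lines.push(headers.map((h) => escape(row[h])).join(','));
  }
  return lines.join('\n');
}
```

JSONL is usually the better choice for large crawls, since each record is independent and the file can be appended to mid-crawl; CSV requires a stable column set up front.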

Complete Example

import { Crawler, CrawlerOptions } from 'rezo/crawler';
import { RezoStealth } from 'rezo/stealth';

const options = new CrawlerOptions({
  baseUrl: 'https://example.com',
  concurrency: 30,
  scraperConcurrency: 10,
  maxDepth: 5,
  maxUrls: 5000,
  timeout: 20000,
  enableCache: true,
  cacheTTL: 86400000,
  respectRobotsTxt: true,
  autoThrottle: true,
  enableNavigationHistory: true,
  enableSignalHandlers: true
});

options.addProxy({
  domain: 'example.com',
  proxy: { host: 'proxy.example.com', port: 8080 }
});

options.addLimiter({
  domain: 'example.com',
  options: { concurrency: 5, interval: 1000, intervalCap: 3 }
});

const crawler = new Crawler(options);

crawler.onDocument(async function (doc, res) {
  console.log(`Crawled: ${res.url}`);
});

await crawler.visit('https://example.com');
await crawler.done();