Crawler Configuration

The crawler is configured through the ICrawlerOptions interface or the CrawlerOptions builder class. The builder provides fluent methods for adding proxies, rate limiters, headers, proxy service integrations (Oxylabs and Decodo), and per-domain stealth profiles.

import { CrawlerOptions, type ICrawlerOptions } from 'rezo/crawler';

ICrawlerOptions Interface

Request Settings

interface ICrawlerOptions {
  /** Starting URL for crawl operations */
  baseUrl: string;

  /** HTTP adapter: 'http' | 'http2' | 'curl' | 'fetch' (default: 'http') */
  adapter?: CrawlerAdapterType;

  /** Reject unauthorized SSL certificates (default: true) */
  rejectUnauthorized?: boolean;

  /** Custom User-Agent string */
  userAgent?: string;

  /** Randomize User-Agent per request (default: false) */
  useRndUserAgent?: boolean;

  /** Request timeout in ms (default: 30000) */
  timeout?: number;

  /** Maximum redirects to follow (default: 10) */
  maxRedirects?: number;

Retry Logic

  /** Max retry attempts for failed requests (default: 3) */
  maxRetryAttempts?: number;

  /** Delay between retries in ms (default: 0) */
  retryDelay?: number;

  /** Status codes that trigger retry (default: [408, 429, 500, 502, 503, 504]) */
  retryOnStatusCode?: number[];

  /** Status codes that trigger retry without proxy (default: [407, 403]) */
  retryWithoutProxyOnStatusCode?: number[];

  /** Retry on proxy errors (default: true) */
  retryOnProxyError?: boolean;

  /** Max retries for proxy errors (default: 3) */
  maxRetryOnProxyError?: number;
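
The retry fields above can be read as a small decision table. The following is a minimal sketch of one plausible reading, not the library's implementation: `RetrySettings` and `decideRetry` are hypothetical names, and checking the proxy-bypass codes before the ordinary retry codes is an assumption.

```typescript
// Hypothetical stand-in for the retry-related fields of ICrawlerOptions.
interface RetrySettings {
  maxRetryAttempts: number;
  retryOnStatusCode: number[];
  retryWithoutProxyOnStatusCode: number[];
}

// Defaults copied from the interface comments above.
const defaultRetrySettings: RetrySettings = {
  maxRetryAttempts: 3,
  retryOnStatusCode: [408, 429, 500, 502, 503, 504],
  retryWithoutProxyOnStatusCode: [407, 403],
};

type RetryDecision = 'retry' | 'retry-without-proxy' | 'give-up';

// attemptsMade is the number of retries already performed for this request.
function decideRetry(
  statusCode: number,
  attemptsMade: number,
  settings: RetrySettings = defaultRetrySettings,
): RetryDecision {
  if (attemptsMade >= settings.maxRetryAttempts) {
    return 'give-up';
  }
  // Assumption: proxy-bypass codes take precedence over ordinary retry codes.
  if (settings.retryWithoutProxyOnStatusCode.some((code) => code === statusCode)) {
    return 'retry-without-proxy';
  }
  if (settings.retryOnStatusCode.some((code) => code === statusCode)) {
    return 'retry';
  }
  return 'give-up';
}
```

Under these assumptions, a 503 on the first attempt is retried, a 403 is retried without the proxy, and a 404 is never retried.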

Caching

  /** Enable response caching (default: true) */
  enableCache?: boolean;

  /** Cache TTL in ms (default: 604800000 = 7 days) */
  cacheTTL?: number;

  /** Cache storage directory (default: './cache') */
  cacheDir?: string;
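
The TTL is expressed in milliseconds, so the default works out to 7 × 24 × 60 × 60 × 1000. A small sketch of a freshness check follows; `isCacheEntryFresh` is a hypothetical helper for illustration, and how rezo actually stores and expires entries is not described here.

```typescript
// Documented default: 604800000 ms = 7 days.
const DEFAULT_CACHE_TTL_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical helper: true while a cached entry is still younger than its TTL.
function isCacheEntryFresh(
  storedAtMs: number,
  nowMs: number,
  ttlMs: number = DEFAULT_CACHE_TTL_MS,
): boolean {
  return nowMs - storedAtMs < ttlMs;
}
```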

Crawl Limits

  /** Max concurrent crawler requests (default: 100) */
  concurrency?: number;

  /** Max concurrent scraper requests, separate queue (default: same as concurrency) */
  scraperConcurrency?: number;

  /** Max crawl depth from start URL, 0 = unlimited (default: 0) */
  maxDepth?: number;

  /** Max total URLs to crawl, 0 = unlimited (default: 0) */
  maxUrls?: number;

  /** Max response body size in bytes, 0 = unlimited (default: 0) */
  maxResponseSize?: number;

  /** Respect robots.txt rules (default: false) */
  respectRobotsTxt?: boolean;

  /** Follow rel="nofollow" links (default: false) */
  followNofollow?: boolean;
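
For maxDepth and maxUrls, 0 means "no limit" rather than "crawl nothing". The sketch below shows that convention; `CrawlLimits` and `withinCrawlLimits` are hypothetical names, and treating the start URL as depth 0 with a URL at exactly maxDepth still allowed is an assumption.

```typescript
// Hypothetical stand-in for the limit fields of ICrawlerOptions.
interface CrawlLimits {
  maxDepth: number; // 0 = unlimited
  maxUrls: number;  // 0 = unlimited
}

// Returns true when another URL at the given depth may still be crawled.
function withinCrawlLimits(
  depth: number,
  urlsCrawled: number,
  limits: CrawlLimits,
): boolean {
  // Assumption: the start URL is depth 0 and depth === maxDepth is allowed.
  const depthOk = limits.maxDepth === 0 || depth <= limits.maxDepth;
  const countOk = limits.maxUrls === 0 || urlsCrawled < limits.maxUrls;
  return depthOk && countOk;
}
```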

AutoThrottle

  /** Enable adaptive throttling based on response times (default: true) */
  autoThrottle?: boolean;

  /** Target delay in ms for auto-throttle (default: 1000) */
  autoThrottleTargetDelay?: number;

  /** Minimum delay between requests in ms (default: 100) */
  autoThrottleMinDelay?: number;

  /** Maximum delay between requests in ms (default: 60000) */
  autoThrottleMaxDelay?: number;
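
How the delay is adapted is not specified here. As an illustration only, the sketch below shows one simple way an adaptive delay could use these three values: it treats the target as the desired response latency, scales the current delay by how far the observed latency is from it, and clamps the result. `ThrottleSettings` and `nextThrottleDelay` are hypothetical names and the formula is an assumption, not rezo's algorithm.

```typescript
// Hypothetical stand-in for the autoThrottle* fields of ICrawlerOptions.
interface ThrottleSettings {
  targetDelay: number;
  minDelay: number;
  maxDelay: number;
}

// Defaults copied from the interface comments above.
const defaultThrottle: ThrottleSettings = {
  targetDelay: 1000,
  minDelay: 100,
  maxDelay: 60000,
};

function nextThrottleDelay(
  currentDelayMs: number,
  latencyMs: number,
  settings: ThrottleSettings = defaultThrottle,
): number {
  // Slower-than-target responses grow the delay, faster ones shrink it.
  const scaled = currentDelayMs * (latencyMs / settings.targetDelay);
  // Average with the current delay to damp oscillation.
  const smoothed = (currentDelayMs + scaled) / 2;
  // Never leave the configured [minDelay, maxDelay] range.
  return Math.min(settings.maxDelay, Math.max(settings.minDelay, smoothed));
}
```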

429 Rate Limit Handling

  /** Max time to wait on 429 response in ms (default: 1800000 = 30 min) */
  maxWaitOn429?: number;

  /** Always wait on 429 regardless of duration, shows warning (default: false) */
  alwaysWaitOn429?: boolean;
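
The two options combine into a simple rule: wait if the requested pause fits within maxWaitOn429, or if alwaysWaitOn429 is set. The sketch below illustrates that rule together with standard Retry-After parsing (delta-seconds or an HTTP date). `parseRetryAfterMs` and `resolve429Wait` are hypothetical names, and returning null to mean "do not wait" is an assumption about what happens when the limit is exceeded.

```typescript
// Hypothetical stand-in for the 429-related fields of ICrawlerOptions.
interface RateLimitWaitSettings {
  maxWaitOn429: number;
  alwaysWaitOn429: boolean;
}

// Defaults copied from the interface comments above (1800000 ms = 30 min).
const defaultWaitSettings: RateLimitWaitSettings = {
  maxWaitOn429: 1800000,
  alwaysWaitOn429: false,
};

// Parses a Retry-After header value into milliseconds, or null if unparseable.
function parseRetryAfterMs(headerValue: string, nowMs: number): number | null {
  const trimmed = headerValue.trim();
  if (/^\d+$/.test(trimmed)) {
    return Number(trimmed) * 1000; // delta-seconds form
  }
  const dateMs = Date.parse(trimmed); // HTTP-date form
  if (isNaN(dateMs)) {
    return null;
  }
  return Math.max(0, dateMs - nowMs);
}

// Returns how long to wait in ms, or null when the wait exceeds the limit.
function resolve429Wait(
  retryAfterMs: number,
  settings: RateLimitWaitSettings = defaultWaitSettings,
): number | null {
  if (retryAfterMs <= settings.maxWaitOn429) {
    return retryAfterMs;
  }
  return settings.alwaysWaitOn429 ? retryAfterMs : null;
}
```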

Session and Resumability

  /** Enable navigation history for resumable crawling (default: false) */
  enableNavigationHistory?: boolean;

  /** Session ID for navigation history resume */
  sessionId?: string;

  /** Enable SIGINT/SIGTERM graceful shutdown handlers (default: false) */
  enableSignalHandlers?: boolean;

Stealth

  /** Global browser fingerprint stealth. Overrides useRndUserAgent. */
  stealth?: RezoStealth;
}
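
To see the documented defaults in one place, the sketch below merges a partial options object over them. It uses a local stand-in type covering only a subset of the fields, so nothing is imported from rezo; `CrawlerOptionsSubset`, `documentedDefaults`, and `withDocumentedDefaults` are hypothetical names, and the library may resolve defaults differently.

```typescript
// Local stand-in for a subset of ICrawlerOptions.
interface CrawlerOptionsSubset {
  baseUrl: string;
  adapter?: 'http' | 'http2' | 'curl' | 'fetch';
  timeout?: number;
  maxRedirects?: number;
  maxRetryAttempts?: number;
  enableCache?: boolean;
  cacheTTL?: number;
  concurrency?: number;
  scraperConcurrency?: number;
  maxDepth?: number;
  maxUrls?: number;
}

// Defaults copied from the interface comments above.
const documentedDefaults = {
  adapter: 'http' as const,
  timeout: 30000,
  maxRedirects: 10,
  maxRetryAttempts: 3,
  enableCache: true,
  cacheTTL: 604800000,
  concurrency: 100,
  maxDepth: 0,
  maxUrls: 0,
};

function withDocumentedDefaults(
  options: CrawlerOptionsSubset,
): Required<CrawlerOptionsSubset> {
  const merged = { ...documentedDefaults, ...options };
  return {
    ...merged,
    // scraperConcurrency falls back to the resolved concurrency.
    scraperConcurrency: options.scraperConcurrency ?? merged.concurrency,
  };
}
```

For example, passing only baseUrl, concurrency: 20, and timeout: 15000 leaves maxRedirects at 10 and resolves scraperConcurrency to 20.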

CrawlerOptions Builder

The CrawlerOptions class wraps ICrawlerOptions and provides builder methods for domain-specific and global configurations:

const options = new CrawlerOptions({
  baseUrl: 'https://example.com',
  concurrency: 20,
  timeout: 15000
});

addProxy()

Add a proxy configuration for specific domains or globally:

// Domain-specific proxy
options.addProxy({
  domain: 'api.example.com',
  proxy: { host: 'proxy.example.com', port: 8080 }
});

// Global proxy (applies to all domains)
options.addProxy({
  isGlobal: true,
  proxy: { host: 'proxy.example.com', port: 8080, username: 'user', password: 'pass' },
  rotating: true  // Proxy rotates IPs between requests
});

addLimiter()

Add rate limiting for specific domains or globally:

// Domain-specific rate limit
options.addLimiter({
  domain: 'example.com',
  options: {
    concurrency: 2,        // Max 2 concurrent requests to this domain
    interval: 1000,        // Per 1 second window
    intervalCap: 2,        // Max 2 requests per window
    randomDelay: 150       // Random 0-150ms jitter
  },
  retry: {
    enable: true,
    max429Retries: 3,      // Retry up to 3 times on 429
    retryDelay: 1000,      // 1 second base delay
    backoff: true          // Exponential backoff
  }
});

// Global rate limit
options.addLimiter({
  isGlobal: true,
  options: { concurrency: 10 }
});
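
To make interval and intervalCap concrete, the sketch below computes when a burst of queued requests would start if at most intervalCap requests may begin in each interval window. It is an illustration of the windowing idea only: `windowedStartTimes` is a hypothetical helper, the concurrency limit is ignored, and the way randomDelay is applied (a uniform offset added to each start) is an assumption.

```typescript
// Returns start offsets in ms for requestCount requests queued at time 0.
function windowedStartTimes(
  requestCount: number,
  intervalMs: number,
  intervalCap: number,
  randomDelayMs: number = 0,
  random: () => number = Math.random, // injectable for deterministic runs
): number[] {
  const starts: number[] = [];
  for (let i = 0; i < requestCount; i++) {
    // Each window admits intervalCap requests; the rest spill into later windows.
    const windowIndex = Math.floor(i / intervalCap);
    // Jitter in the range [0, randomDelayMs).
    const jitter = Math.floor(random() * randomDelayMs);
    starts.push(windowIndex * intervalMs + jitter);
  }
  return starts;
}
```

With interval: 1000 and intervalCap: 2 and no jitter, five queued requests start at 0, 0, 1000, 1000, and 2000 ms.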

addHeader()

Add custom headers for specific domains or globally:

options.addHeader({
  domain: 'api.example.com',
  headers: {
    'Authorization': 'Bearer token123',
    'X-Custom-Header': 'value'
  }
});

addOxylabs()

Configure the Oxylabs proxy service for specific domains:

options.addOxylabs({
  domain: 'example.com',
  options: {
    username: 'customer-user',
    password: 'pass123',
    browserType: 'desktop_chrome',
    geoLocation: 'United States',
    locale: 'en-us'
  },
  queueOptions: { concurrency: 5 }
});

addDecodo()

Configure the Decodo proxy service for specific domains:

options.addDecodo({
  domain: 'example.com',
  options: {
    username: 'user',
    password: 'pass',
    deviceType: 'desktop',
    country: 'Germany',
    headless: 'html'
  },
  queueOptions: { concurrency: 3 }
});

addStealth()

Add domain-specific stealth profiles. Each entry creates a dedicated Rezo instance with its own stealth configuration:

import { RezoStealth } from 'rezo/stealth';

options.addStealth({
  domain: 'protected-site.com',
  stealth: RezoStealth.chrome()
});

options.addStealth({
  domain: 'another-site.com',
  stealth: new RezoStealth({ family: 'firefox', rotate: true })
});

createStableThroughputOptions()

A static factory method that generates a preset configuration for crawling a single domain at stable throughput. It sets up global and domain-specific rate limiters with retry policies:

import { CrawlerOptions } from 'rezo/crawler';

const options = CrawlerOptions.createStableThroughputOptions({
  baseUrl: 'https://example.com',
  concurrency: 40,
  scraperConcurrency: 10,
  retryDelay: 1000,
  maxRetryAttempts: 2,
  retryOnStatusCode: [408, 500, 502, 503, 504],
  maxWaitOn429: 15000,
  alwaysWaitOn429: false,
  globalLimiter: { concurrency: 8 },
  domainLimiter: {
    concurrency: 2,
    interval: 1000,
    intervalCap: 2,
    randomDelay: 150
  },
  domainRetry: {
    enable: true,
    max429Retries: 2,
    retryDelay: 1000,
    maxRetryAttempts: 2,
    backoff: true
  },
  extraLimiters: [],
  overrides: {}
});

Preset Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| concurrency | 40 | Main crawler concurrency |
| scraperConcurrency | 10 | Scraper queue concurrency |
| retryDelay | 1000 | Global retry delay in ms |
| maxRetryAttempts | 2 | Global retry attempts |
| retryOnStatusCode | [408, 500, 502, 503, 504] | Retryable status codes |
| maxWaitOn429 | 15000 | Max wait on 429 in ms |
| globalLimiter | { concurrency: 8 } | Global rate limiter |
| domainLimiter | { concurrency: 2, interval: 1000, intervalCap: 2, randomDelay: 150 } | Primary domain limiter |
| domainRetry | { enable: true, max429Retries: 2, backoff: true } | Domain retry policy |
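
With backoff enabled, the waits between 429 retries grow with each attempt. The sketch below shows what the schedule could look like for a retryDelay of 1000 ms; `backoffSchedule` is a hypothetical helper, and doubling the delay on every attempt is an assumption, since the exact growth factor is not stated here.

```typescript
// Returns the wait in ms before each retry, up to maxRetries entries.
function backoffSchedule(
  baseDelayMs: number,
  maxRetries: number,
  backoff: boolean,
): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    // Assumption: the delay doubles per attempt when backoff is on.
    delays.push(backoff ? baseDelayMs * 2 ** attempt : baseDelayMs);
  }
  return delays;
}
```

Under that assumption, max429Retries: 2 gives waits of 1000 and 2000 ms, and max429Retries: 3 adds a third wait of 4000 ms.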

Domain Matching

The domain field in proxy, limiter, and header configs supports multiple formats:

// Exact domain string
domain: 'api.example.com'

// Array of domains
domain: ['api.example.com', 'cdn.example.com']

// Wildcard pattern
domain: '*.example.com'

// RegExp pattern
domain: /^(api|cdn)\.example\.com$/
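
The four formats can be handled by a single matcher. The following is a minimal sketch of how such matching might work, not rezo's implementation: `DomainPattern` and `matchesDomain` are hypothetical names, and the wildcard rule used here (matches any subdomain depth but not the bare domain) is an assumption.

```typescript
// Hypothetical union of the documented domain formats.
type DomainPattern = string | RegExp | Array<string | RegExp>;

function matchesDomain(hostname: string, pattern: DomainPattern): boolean {
  // Array: match if any entry matches.
  if (Array.isArray(pattern)) {
    return pattern.some((entry) => matchesDomain(hostname, entry));
  }
  // RegExp: tested against the hostname as-is. Dots must be escaped
  // in the pattern, otherwise "." matches any character.
  if (pattern instanceof RegExp) {
    return pattern.test(hostname);
  }
  const host = hostname.toLowerCase();
  const wanted = pattern.toLowerCase();
  // Wildcard: "*.example.com" matches "a.example.com" and "a.b.example.com".
  if (wanted.slice(0, 2) === '*.') {
    const suffix = wanted.slice(1); // ".example.com"
    return host.length > suffix.length && host.slice(-suffix.length) === suffix;
  }
  // Exact string.
  return host === wanted;
}
```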

Complete Configuration Example

import { CrawlerOptions } from 'rezo/crawler';
import { RezoStealth } from 'rezo/stealth';

const options = new CrawlerOptions({
  baseUrl: 'https://shop.example.com',
  adapter: 'http',
  concurrency: 30,
  scraperConcurrency: 10,
  maxDepth: 5,
  maxUrls: 10000,
  timeout: 20000,
  maxRedirects: 5,
  enableCache: true,
  cacheTTL: 86400000,
  respectRobotsTxt: true,
  autoThrottle: true,
  autoThrottleTargetDelay: 500,
  maxWaitOn429: 30000,
  enableNavigationHistory: true,
  sessionId: 'shop-crawl-v1',
  enableSignalHandlers: true,
  stealth: RezoStealth.chrome()
});

// Domain rate limiting
options.addLimiter({
  domain: 'shop.example.com',
  options: { concurrency: 3, interval: 1000, intervalCap: 3 },
  retry: { enable: true, max429Retries: 3, backoff: true, retryDelay: 2000 }
});

// Global throughput cap
options.addLimiter({
  isGlobal: true,
  options: { concurrency: 15 }
});

// Proxy for the target domain
options.addProxy({
  domain: 'shop.example.com',
  proxy: { host: 'proxy.service.com', port: 8080 },
  rotating: true
});