# Crawler Configuration

The crawler is configured through the `ICrawlerOptions` interface or the `CrawlerOptions` builder class. The builder provides fluent methods for adding proxies, rate limiters, headers, and proxy service integrations.
```typescript
import { CrawlerOptions, type ICrawlerOptions } from 'rezo/crawler';
```

## ICrawlerOptions Interface
### Request Settings

```typescript
interface ICrawlerOptions {
  /** Starting URL for crawl operations */
  baseUrl: string;
  /** HTTP adapter: 'http' | 'http2' | 'curl' | 'fetch' (default: 'http') */
  adapter?: CrawlerAdapterType;
  /** Reject unauthorized SSL certificates (default: true) */
  rejectUnauthorized?: boolean;
  /** Custom User-Agent string */
  userAgent?: string;
  /** Randomize User-Agent per request (default: false) */
  useRndUserAgent?: boolean;
  /** Request timeout in ms (default: 30000) */
  timeout?: number;
  /** Maximum redirects to follow (default: 10) */
  maxRedirects?: number;
```

### Retry Logic

```typescript
  /** Max retry attempts for failed requests (default: 3) */
  maxRetryAttempts?: number;
  /** Delay between retries in ms (default: 0) */
  retryDelay?: number;
  /** Status codes that trigger retry (default: [408, 429, 500, 502, 503, 504]) */
  retryOnStatusCode?: number[];
  /** Status codes that trigger retry without proxy (default: [407, 403]) */
  retryWithoutProxyOnStatusCode?: number[];
  /** Retry on proxy errors (default: true) */
  retryOnProxyError?: boolean;
  /** Max retries for proxy errors (default: 3) */
  maxRetryOnProxyError?: number;
```

### Caching

```typescript
  /** Enable response caching (default: true) */
  enableCache?: boolean;
  /** Cache TTL in ms (default: 604800000 = 7 days) */
  cacheTTL?: number;
  /** Cache storage directory (default: './cache') */
  cacheDir?: string;
```

### Crawl Limits

```typescript
  /** Max concurrent crawler requests (default: 100) */
  concurrency?: number;
  /** Max concurrent scraper requests, separate queue (default: same as concurrency) */
  scraperConcurrency?: number;
  /** Max crawl depth from start URL, 0 = unlimited (default: 0) */
  maxDepth?: number;
  /** Max total URLs to crawl, 0 = unlimited (default: 0) */
  maxUrls?: number;
  /** Max response body size in bytes, 0 = unlimited (default: 0) */
  maxResponseSize?: number;
```

### robots.txt and Link Handling

```typescript
  /** Respect robots.txt rules (default: false) */
  respectRobotsTxt?: boolean;
  /** Follow rel="nofollow" links (default: false) */
  followNofollow?: boolean;
```

### AutoThrottle

```typescript
  /** Enable adaptive throttling based on response times (default: true) */
  autoThrottle?: boolean;
  /** Target delay in ms for auto-throttle (default: 1000) */
  autoThrottleTargetDelay?: number;
  /** Minimum delay between requests in ms (default: 100) */
  autoThrottleMinDelay?: number;
  /** Maximum delay between requests in ms (default: 60000) */
  autoThrottleMaxDelay?: number;
```

### 429 Rate Limit Handling

```typescript
  /** Max time to wait on 429 response in ms (default: 1800000 = 30 min) */
  maxWaitOn429?: number;
  /** Always wait on 429 regardless of duration, shows warning (default: false) */
  alwaysWaitOn429?: boolean;
```

### Session and Resumability

```typescript
  /** Enable navigation history for resumable crawling (default: false) */
  enableNavigationHistory?: boolean;
  /** Session ID for navigation history resume */
  sessionId?: string;
  /** Enable SIGINT/SIGTERM graceful shutdown handlers (default: false) */
  enableSignalHandlers?: boolean;
```

### Stealth

```typescript
  /** Global browser fingerprint stealth. Overrides useRndUserAgent. */
  stealth?: RezoStealth;
}
```
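As a quick illustration of how these fields combine, here is a sketch of a cache- and retry-focused configuration. It uses only fields documented in the interface above; the specific values are illustrative, not recommendations.

```typescript
import { type ICrawlerOptions } from 'rezo/crawler';

// Sketch: a polite, retry-focused configuration (illustrative values).
const opts: ICrawlerOptions = {
  baseUrl: 'https://example.com',
  timeout: 15000,                  // fail faster than the 30 s default
  maxRetryAttempts: 5,
  retryDelay: 2000,
  retryOnStatusCode: [429, 500, 502, 503, 504],
  enableCache: true,
  cacheTTL: 24 * 60 * 60 * 1000,   // 1 day instead of the 7-day default
  autoThrottle: true,
  autoThrottleTargetDelay: 750,
  maxDepth: 3,
  maxUrls: 5000
};
```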
## CrawlerOptions Builder

The `CrawlerOptions` class wraps `ICrawlerOptions` and provides builder methods for domain-specific configurations:
```typescript
const options = new CrawlerOptions({
  baseUrl: 'https://example.com',
  concurrency: 20,
  timeout: 15000
});
```
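Since the builder's methods are described as fluent, configuration calls can presumably be chained; a sketch assuming each method returns the builder instance:

```typescript
import { CrawlerOptions } from 'rezo/crawler';

// Sketch: chained builder calls (assumes fluent methods return `this`).
const chained = new CrawlerOptions({ baseUrl: 'https://example.com' })
  .addLimiter({ isGlobal: true, options: { concurrency: 10 } })
  .addHeader({ domain: 'example.com', headers: { 'Accept-Language': 'en-US' } });
```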
### addProxy()

Add a proxy configuration for specific domains or globally:
```typescript
// Domain-specific proxy
options.addProxy({
  domain: 'api.example.com',
  proxy: { host: 'proxy.example.com', port: 8080 }
});

// Global proxy (applies to all domains)
options.addProxy({
  isGlobal: true,
  proxy: { host: 'proxy.example.com', port: 8080, username: 'user', password: 'pass' },
  rotating: true // Proxy rotates IPs between requests
});
```
### addLimiter()

Add rate limiting for specific domains or globally:
```typescript
// Domain-specific rate limit
options.addLimiter({
  domain: 'example.com',
  options: {
    concurrency: 2,   // Max 2 concurrent requests to this domain
    interval: 1000,   // Per 1-second window
    intervalCap: 2,   // Max 2 requests per window
    randomDelay: 150  // Random 0-150 ms jitter
  },
  retry: {
    enable: true,
    max429Retries: 3, // Retry up to 3 times on 429
    retryDelay: 1000, // 1 second base delay
    backoff: true     // Exponential backoff
  }
});

// Global rate limit
options.addLimiter({
  isGlobal: true,
  options: { concurrency: 10 }
});
```
### addHeader()

Add custom headers for specific domains or globally:
```typescript
options.addHeader({
  domain: 'api.example.com',
  headers: {
    'Authorization': 'Bearer token123',
    'X-Custom-Header': 'value'
  }
});
```
### addOxylabs()

Configure the Oxylabs proxy service for specific domains:
```typescript
options.addOxylabs({
  domain: 'example.com',
  options: {
    username: 'customer-user',
    password: 'pass123',
    browserType: 'desktop_chrome',
    geoLocation: 'United States',
    locale: 'en-us'
  },
  queueOptions: { concurrency: 5 }
});
```
### addDecodo()

Configure the Decodo proxy service for specific domains:
```typescript
options.addDecodo({
  domain: 'example.com',
  options: {
    username: 'user',
    password: 'pass',
    deviceType: 'desktop',
    country: 'Germany',
    headless: 'html'
  },
  queueOptions: { concurrency: 3 }
});
```
### addStealth()

Add domain-specific stealth profiles. Each entry creates a dedicated Rezo instance with its own stealth configuration:
```typescript
import { RezoStealth } from 'rezo/stealth';

options.addStealth({
  domain: 'protected-site.com',
  stealth: RezoStealth.chrome()
});

options.addStealth({
  domain: 'another-site.com',
  stealth: new RezoStealth({ family: 'firefox', rotate: true })
});
```
### createStableThroughputOptions()

A static factory method that generates a battle-tested configuration for crawling a single domain at stable throughput. It sets up global and domain-specific rate limiters with retry policies:
```typescript
import { CrawlerOptions } from 'rezo/crawler';

const options = CrawlerOptions.createStableThroughputOptions({
  baseUrl: 'https://example.com',
  concurrency: 40,
  scraperConcurrency: 10,
  retryDelay: 1000,
  maxRetryAttempts: 2,
  retryOnStatusCode: [408, 500, 502, 503, 504],
  maxWaitOn429: 15000,
  alwaysWaitOn429: false,
  globalLimiter: { concurrency: 8 },
  domainLimiter: {
    concurrency: 2,
    interval: 1000,
    intervalCap: 2,
    randomDelay: 150
  },
  domainRetry: {
    enable: true,
    max429Retries: 2,
    retryDelay: 1000,
    maxRetryAttempts: 2,
    backoff: true
  },
  extraLimiters: [],
  overrides: {}
});
```
#### Preset Parameters

| Parameter | Default | Description |
|---|---|---|
| `concurrency` | 40 | Main crawler concurrency |
| `scraperConcurrency` | 10 | Scraper queue concurrency |
| `retryDelay` | 1000 | Global retry delay in ms |
| `maxRetryAttempts` | 2 | Global retry attempts |
| `retryOnStatusCode` | [408, 500, 502, 503, 504] | Retryable status codes |
| `maxWaitOn429` | 15000 | Max wait on 429 in ms |
| `globalLimiter` | { concurrency: 8 } | Global rate limiter |
| `domainLimiter` | { concurrency: 2, interval: 1000, intervalCap: 2, randomDelay: 150 } | Primary domain limiter |
| `domainRetry` | { enable: true, max429Retries: 2, backoff: true } | Domain retry policy |
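The preset's domain retry policy enables exponential backoff. The exact curve is not specified in this document; a common doubling schedule, sketched here for intuition only, would produce the following delays from the preset's `retryDelay: 1000` and `max429Retries: 2`:

```typescript
// Sketch: doubling backoff delays for 429 retries (assumed curve).
// baseDelayMs mirrors retryDelay; attempts mirrors max429Retries.
function backoffDelays(baseDelayMs: number, attempts: number): number[] {
  return Array.from({ length: attempts }, (_, i) => baseDelayMs * 2 ** i);
}

console.log(backoffDelays(1000, 2)); // [1000, 2000]
```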
## Domain Matching

The `domain` field in proxy, limiter, and header configs supports multiple formats:
```typescript
// Exact domain string
domain: 'api.example.com'

// Array of domains
domain: ['api.example.com', 'cdn.example.com']

// Wildcard pattern
domain: '*.example.com'

// RegExp pattern
domain: /^(api|cdn)\.example\.com$/
```
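For intuition, here is one plausible way a wildcard pattern such as `'*.example.com'` could be compiled to a RegExp. This is a sketch; the library's actual matcher may differ, for example in whether `*` spans multiple subdomain labels:

```typescript
// Sketch: convert a '*.domain' wildcard into an anchored RegExp.
function wildcardToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
    .replace(/\*/g, '[^.]+');              // '*' matches a single label here
  return new RegExp(`^${escaped}$`);
}

const re = wildcardToRegExp('*.example.com');
console.log(re.test('api.example.com')); // true
console.log(re.test('example.com'));     // false (no subdomain label)
```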
## Complete Configuration Example

```typescript
import { CrawlerOptions } from 'rezo/crawler';
import { RezoStealth } from 'rezo/stealth';

const options = new CrawlerOptions({
  baseUrl: 'https://shop.example.com',
  adapter: 'http',
  concurrency: 30,
  scraperConcurrency: 10,
  maxDepth: 5,
  maxUrls: 10000,
  timeout: 20000,
  maxRedirects: 5,
  enableCache: true,
  cacheTTL: 86400000,
  respectRobotsTxt: true,
  autoThrottle: true,
  autoThrottleTargetDelay: 500,
  maxWaitOn429: 30000,
  enableNavigationHistory: true,
  sessionId: 'shop-crawl-v1',
  enableSignalHandlers: true,
  stealth: RezoStealth.chrome()
});

// Domain rate limiting
options.addLimiter({
  domain: 'shop.example.com',
  options: { concurrency: 3, interval: 1000, intervalCap: 3 },
  retry: { enable: true, max429Retries: 3, backoff: true, retryDelay: 2000 }
});

// Global throughput cap
options.addLimiter({
  isGlobal: true,
  options: { concurrency: 15 }
});

// Proxy for the target domain
options.addProxy({
  domain: 'shop.example.com',
  proxy: { host: 'proxy.service.com', port: 8080 },
  rotating: true
});
```