Crawler

Memory Management

Long-running crawls accumulate memory through DOM parsing, response caching, and result collection. The MemoryMonitor class tracks V8 heap usage in real-time and provides status levels, reports, and automatic callbacks to prevent out-of-memory crashes.

import { MemoryMonitor } from 'rezo/crawler';

Creating a MemoryMonitor

const monitor = new MemoryMonitor({
  warningRatio: 0.7,      // Warning at 70% heap usage (default: 0.7)
  criticalRatio: 0.85,    // Critical at 85% heap usage (default: 0.85)
  checkInterval: 10000    // Check every 10 seconds (default: 10000ms)
});

The monitor reads the V8 heap size limit at construction time via v8.getHeapStatistics() and uses it as the baseline for threshold calculations.

Memory Status Levels

The monitor reports three status levels:

StatusConditionMeaning
okUsage below 70% of max heapNormal operation
warningUsage between 70% and 85%Memory pressure detected, consider reducing concurrency
criticalUsage above 85% of max heapHigh risk of OOM, pause work and force GC

Manual Check

const status = monitor.check();
// 'ok' | 'warning' | 'critical'

if (status === 'critical') {
  queue.pause();
  monitor.forceGC();
  // Wait and resume
}

Memory Reports

getReport()

Get a detailed memory report:

const report = monitor.getReport();

The MemoryReport interface:

interface MemoryReport {
  status: 'ok' | 'warning' | 'critical';
  heapUsedMB: number;     // Current heap usage in MB
  heapTotalMB: number;    // Total heap allocated in MB
  heapLimitMB: number;    // V8 max heap limit in MB
  usagePercent: number;   // Heap used as percentage of limit
  externalMB: number;     // External memory (Buffers, etc.) in MB
  rssMB: number;          // Resident Set Size in MB
}

Example output:

const report = monitor.getReport();
console.log(report);
// {
//   status: 'warning',
//   heapUsedMB: 1450,
//   heapTotalMB: 1800,
//   heapLimitMB: 2048,
//   usagePercent: 70.8,
//   externalMB: 45,
//   rssMB: 1920
// }

getUsagePercent()

Quick check of heap usage as a percentage:

const percent = monitor.getUsagePercent();
console.log(`Heap usage: ${percent.toFixed(1)}%`);

Forced Garbage Collection

forceGC()

Triggers a full garbage collection cycle. Requires Node.js to be started with --expose-gc:

const success = monitor.forceGC();
if (success) {
  console.log('GC triggered');
} else {
  console.log('GC not available (start node with --expose-gc)');
}
# Start Node.js with GC exposed
node --expose-gc crawl.ts

Automatic Monitoring

startAutoMonitor()

Start a background interval that checks memory and triggers callbacks on status transitions:

monitor.startAutoMonitor({
  onWarning: (report) => {
    console.warn(`Memory warning: ${report.usagePercent}%`);
    // Reduce concurrency to slow memory growth
    queue.concurrency = 5;
  },
  onCritical: (report) => {
    console.error(`Memory critical: ${report.usagePercent}%`);
    // Pause all work and force GC
    queue.pause();
    monitor.forceGC();
  },
  onRecovered: (report) => {
    console.log(`Memory recovered: ${report.usagePercent}%`);
    // Resume normal operation
    queue.resume();
    queue.concurrency = 20;
  }
});

Callback behavior:

  • onWarning fires when status transitions to warning from any other state
  • onCritical fires when status transitions to critical from any other state
  • onRecovered fires when status transitions to ok from warning or critical
  • Callbacks only fire on transitions, not on every check interval

The monitor interval uses unref() so it does not keep the process alive.

stopAutoMonitor()

Stop the background monitoring interval:

monitor.stopAutoMonitor();

Concurrency Recommendations

getRecommendedConcurrency()

Get a recommended concurrency based on current memory pressure:

const recommended = monitor.getRecommendedConcurrency(
  currentConcurrency,  // Your current concurrency setting
  5                    // Minimum concurrency (default: 5)
);

queue.concurrency = recommended;
StatusRecommendation
okReturns currentConcurrency unchanged
warningReturns max(minConcurrency, currentConcurrency * 0.5)
criticalReturns minConcurrency

Cleanup

monitor.destroy();
// Stops auto-monitoring and releases resources

Integration with Crawler

A complete example of memory-aware crawling:

import { Crawler, MemoryMonitor } from 'rezo/crawler';

const monitor = new MemoryMonitor({
  warningRatio: 0.7,
  criticalRatio: 0.85,
  checkInterval: 5000
});

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  concurrency: 30,
  maxUrls: 100000
});

// Adaptive concurrency based on memory pressure
monitor.startAutoMonitor({
  onWarning: (report) => {
    console.warn(`[memory] Warning at ${report.usagePercent}% - reducing concurrency`);
    // Access the crawler's internal queue to adjust concurrency
  },
  onCritical: (report) => {
    console.error(`[memory] Critical at ${report.usagePercent}% - pausing`);
    monitor.forceGC();
  },
  onRecovered: (report) => {
    console.log(`[memory] Recovered at ${report.usagePercent}% - resuming`);
  }
});

// Periodic reporting
setInterval(() => {
  const report = monitor.getReport();
  console.log(`[memory] ${report.status} - ${report.heapUsedMB}MB / ${report.heapLimitMB}MB (${report.usagePercent}%)`);
}, 30000);

await crawler.visit('https://example.com');
await crawler.done();
monitor.destroy();