Crawler

Result Streaming

The ResultStream class writes crawler results directly to disk instead of accumulating them in memory. This prevents out-of-memory crashes on long-running crawls that produce millions of results. It supports JSONL and CSV output formats, periodic flushing for durability, and automatic file rotation when files exceed a size limit.

import { ResultStream } from 'rezo/crawler';

Basic Usage

const stream = new ResultStream({
  outputPath: './results.jsonl'
});

// Write results one at a time
stream.write({ url: 'https://example.com', title: 'Example', status: 200 });
stream.write({ url: 'https://example.com/about', title: 'About', status: 200 });

// Close when done
await stream.close();

Configuration

interface ResultStreamOptions {
  /** Output file path (required) */
  outputPath: string;

  /** Output format: 'jsonl' or 'csv' (default: 'jsonl') */
  format?: 'jsonl' | 'csv';

  /** Flush to disk after this many writes (default: 100) */
  flushInterval?: number;

  /** Max file size in bytes before rotation, 0 = no rotation (default: 0) */
  maxFileSize?: number;

  /** CSV column headers (auto-detected from the first written object if omitted) */
  csvHeaders?: string[];
}

JSONL Output

JSONL (JSON Lines) writes one JSON object per line, making it easy to process with streaming tools:

const stream = new ResultStream({
  outputPath: './output/results.jsonl',
  flushInterval: 50  // Flush every 50 writes
});

stream.write({ url: 'https://example.com', title: 'Home', price: null });
stream.write({ url: 'https://example.com/product/1', title: 'Widget', price: 29.99 });

File contents:

{"url":"https://example.com","title":"Home","price":null}
{"url":"https://example.com/product/1","title":"Widget","price":29.99}
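Because each line is an independent JSON document, output like the above can be consumed record-by-record without loading the whole file. A minimal sketch using Node's built-in readline module (the `readJsonl` helper is illustrative, not part of the library):

```typescript
import { createInterface } from 'readline';
import { createReadStream } from 'fs';

// Yield one parsed record per non-empty line, streaming from disk.
async function* readJsonl(filePath: string): AsyncGenerator<unknown> {
  const rl = createInterface({
    input: createReadStream(filePath),
    crlfDelay: Infinity  // treat \r\n as a single line break
  });
  for await (const line of rl) {
    if (line.trim()) yield JSON.parse(line);
  }
}

// Usage:
// for await (const record of readJsonl('./output/results.jsonl')) { ... }
```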

CSV Output

CSV output with configurable headers:

const stream = new ResultStream({
  outputPath: './output/results.csv',
  format: 'csv',
  csvHeaders: ['url', 'title', 'price', 'category']
});

stream.write({ url: 'https://example.com/product/1', title: 'Widget', price: 29.99, category: 'tools' });
stream.write({ url: 'https://example.com/product/2', title: 'Gadget, Deluxe', price: 49.99, category: 'electronics' });

File contents:

url,title,price,category
https://example.com/product/1,Widget,29.99,tools
https://example.com/product/2,"Gadget, Deluxe",49.99,electronics

CSV values containing commas, quotes, or newlines are automatically escaped with RFC 4180 quoting rules. If csvHeaders is omitted, the headers are auto-detected from the keys of the first written object.
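The quoting rule can be sketched as follows; this is an illustrative implementation of RFC 4180 escaping, not the library's internal code:

```typescript
// Quote a field if it contains a comma, double quote, or line break,
// doubling any embedded quotes per RFC 4180. null/undefined become empty.
function csvEscape(value: unknown): string {
  const s = value === null || value === undefined ? '' : String(value);
  if (/[",\n\r]/.test(s)) {
    return '"' + s.replace(/"/g, '""') + '"';
  }
  return s;
}

// Render one record as a CSV row in header order.
function csvRow(record: Record<string, unknown>, headers: string[]): string {
  return headers.map((h) => csvEscape(record[h])).join(',');
}
```

Applied to the second record above, `csvRow` produces `https://example.com/product/2,"Gadget, Deluxe",49.99,electronics`, matching the file contents shown.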

Batch Writing

Write multiple results at once:

const results = [
  { url: 'https://example.com/a', title: 'Page A' },
  { url: 'https://example.com/b', title: 'Page B' },
  { url: 'https://example.com/c', title: 'Page C' }
];

stream.writeMany(results);

File Rotation

When maxFileSize is set, the stream automatically rotates to a new file when the current file exceeds the limit:

const stream = new ResultStream({
  outputPath: './output/results.jsonl',
  maxFileSize: 100 * 1024 * 1024  // 100MB per file
});

// Files created:
// ./output/results.jsonl       (first 100MB)
// ./output/results.1.jsonl     (next 100MB)
// ./output/results.2.jsonl     (next 100MB)
// ...

The rotation preserves the file extension and inserts an incrementing index before it.
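The naming scheme above could be produced by a helper like this (a sketch of the convention, not the library's internals):

```typescript
import * as path from 'path';

// Insert an incrementing index before the extension:
// results.jsonl -> results.1.jsonl -> results.2.jsonl ...
function rotatedPath(basePath: string, index: number): string {
  if (index === 0) return basePath;            // first file keeps the original name
  const ext = path.extname(basePath);          // e.g. '.jsonl'
  const stem = ext ? basePath.slice(0, -ext.length) : basePath;
  return `${stem}.${index}${ext}`;
}
```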

Flush Behavior

Results are buffered by the Node.js WriteStream and flushed to disk periodically:

  • Every flushInterval writes (default: 100), the stream is corked and uncorked on the next tick
  • On close(), all buffered data is flushed before the stream ends
  • The file is opened in append mode ('a'), so a restarted crawl appends to the existing file rather than overwriting it
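The cork/uncork batching described above can be sketched with Node's standard Writable API. This is an assumed illustration of the mechanism (the `batchedWriter` helper and flush interval of 3 are hypothetical), not the library's actual code:

```typescript
import { Writable } from 'stream';

// Returns a write function that corks the destination after every
// `flushInterval` writes and uncorks it on the next tick, so writes
// arriving within the same tick are handed to the OS together.
function batchedWriter(dest: Writable, flushInterval: number) {
  let writes = 0;
  return (line: string): void => {
    dest.write(line + '\n');
    writes++;
    if (writes % flushInterval === 0) {
      dest.cork();
      process.nextTick(() => dest.uncork());
    }
  };
}
```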

Manual Flush

stream.flush();
// Forces buffered data to be written to disk

Stream Properties

console.log(stream.recordCount);  // Number of records written
console.log(stream.totalBytes);   // Total bytes written
console.log(stream.isClosed);     // Whether the stream is closed
console.log(stream.outputPath);   // Current output file path (accounts for rotation)

Integration with Crawler

import { Crawler, ResultStream } from 'rezo/crawler';

const stream = new ResultStream({
  outputPath: './crawl-results.jsonl',
  flushInterval: 100,
  maxFileSize: 50 * 1024 * 1024  // 50MB per file
});

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  concurrency: 20,
  maxUrls: 100000
});

crawler.onDocument(async function (document, response) {
  const title = document.querySelector('title')?.textContent || '';
  const h1 = document.querySelector('h1')?.textContent || '';

  stream.write({
    url: response.url,
    finalUrl: response.finalUrl,
    status: response.status,
    title,
    h1,
    crawledAt: new Date().toISOString()
  });
});

await crawler.visit('https://example.com');
await crawler.done();
await stream.close();

console.log(`Wrote ${stream.recordCount} records (${(stream.totalBytes / 1024 / 1024).toFixed(1)}MB)`);

CappedArray + ResultStream Pattern

Combine CappedArray with ResultStream for memory-bounded collection that automatically streams overflow to disk:

import { CappedArray, ResultStream } from 'rezo/crawler';

const stream = new ResultStream({ outputPath: './results.jsonl' });

const results = new CappedArray({
  maxSize: 50000,
  onEviction: (evicted) => {
    // When in-memory capacity is exceeded, evicted items go to disk
    stream.writeMany(evicted);
  }
});

// Collect results -- memory stays bounded
crawler.onDocument(async function (doc, res) {
  results.push({ url: res.url, title: doc.querySelector('title')?.textContent });
});

await crawler.done();

// Flush remaining in-memory items to disk
stream.writeMany(results.toArray());
await stream.close();