Crawler

Result Streaming

The ResultStream class writes crawler results directly to disk instead of accumulating them in memory. This prevents out-of-memory crashes on long-running crawls that produce millions of results. It supports JSONL and CSV output formats, periodic flushing for durability, and automatic file rotation when files exceed a size limit.

import { ResultStream } from 'rezo/crawler';

Basic Usage

const stream = new ResultStream({
  outputPath: './results.jsonl'
});

// Write results one at a time
stream.write({ url: 'https://example.com', title: 'Example', status: 200 });
stream.write({ url: 'https://example.com/about', title: 'About', status: 200 });

// Close when done
await stream.close();

Configuration

interface ResultStreamOptions {
  /** Output file path (required) */
  outputPath: string;

  /** Output format: 'jsonl' or 'csv' (default: 'jsonl') */
  format?: 'jsonl' | 'csv';

  /** Flush to disk after this many writes (default: 100) */
  flushInterval?: number;

  /** Max file size in bytes before rotation, 0 = no rotation (default: 0) */
  maxFileSize?: number;

  /** CSV column headers (auto-detected from the first written object if omitted) */
  csvHeaders?: string[];
}

JSONL Output

JSONL (JSON Lines) writes one JSON object per line, making it easy to process with streaming tools:

const stream = new ResultStream({
  outputPath: './output/results.jsonl',
  flushInterval: 50  // Flush every 50 writes
});

stream.write({ url: 'https://example.com', title: 'Home', price: null });
stream.write({ url: 'https://example.com/product/1', title: 'Widget', price: 29.99 });

File contents:

{"url":"https://example.com","title":"Home","price":null}
{"url":"https://example.com/product/1","title":"Widget","price":29.99}
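Because each line is an independent JSON document, output like the above can be consumed record-by-record without loading the whole file. A minimal sketch using Node's built-in readline module (the `readJsonl` helper is illustrative, not part of the library):

```typescript
import { createInterface } from 'readline';
import { createReadStream } from 'fs';

// Yield one parsed record per non-empty line, streaming from disk.
async function* readJsonl(filePath: string): AsyncGenerator<unknown> {
  const rl = createInterface({
    input: createReadStream(filePath),
    crlfDelay: Infinity  // treat \r\n as a single line break
  });
  for await (const line of rl) {
    if (line.trim()) yield JSON.parse(line);
  }
}

// Usage:
// for await (const record of readJsonl('./output/results.jsonl')) { ... }
```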

CSV Output

CSV output with configurable headers:

const stream = new ResultStream({
  outputPath: './output/results.csv',
  format: 'csv',
  csvHeaders: ['url', 'title', 'price', 'category']
});

stream.write({ url: 'https://example.com/product/1', title: 'Widget', price: 29.99, category: 'tools' });
stream.write({ url: 'https://example.com/product/2', title: 'Gadget, Deluxe', price: 49.99, category: 'electronics' });

File contents:

url,title,price,category
https://example.com/product/1,Widget,29.99,tools
https://example.com/product/2,"Gadget, Deluxe",49.99,electronics

CSV values containing commas, quotes, or newlines are automatically escaped with RFC 4180 quoting rules. If csvHeaders is omitted, the headers are auto-detected from the keys of the first written object.
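The quoting rule can be sketched as follows; this is an illustrative implementation of RFC 4180 escaping, not the library's internal code:

```typescript
// Quote a field if it contains a comma, double quote, or line break,
// doubling any embedded quotes per RFC 4180. null/undefined become empty.
function csvEscape(value: unknown): string {
  const s = value === null || value === undefined ? '' : String(value);
  if (/[",\n\r]/.test(s)) {
    return '"' + s.replace(/"/g, '""') + '"';
  }
  return s;
}

// Render one record as a CSV row in header order.
function csvRow(record: Record<string, unknown>, headers: string[]): string {
  return headers.map((h) => csvEscape(record[h])).join(',');
}
```

Applied to the second record above, `csvRow` produces `https://example.com/product/2,"Gadget, Deluxe",49.99,electronics`, matching the file contents shown.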

Batch Writing

Write multiple results at once:

const results = [
  { url: 'https://example.com/a', title: 'Page A' },
  { url: 'https://example.com/b', title: 'Page B' },
  { url: 'https://example.com/c', title: 'Page C' }
];

stream.writeMany(results);

File Rotation

When maxFileSize is set, the stream automatically rotates to a new file when the current file exceeds the limit:

const stream = new ResultStream({
  outputPath: './output/results.jsonl',
  maxFileSize: 100 * 1024 * 1024  // 100MB per file
});

// Files created:
// ./output/results.jsonl       (first 100MB)
// ./output/results.1.jsonl     (next 100MB)
// ./output/results.2.jsonl     (next 100MB)
// ...

The rotation preserves the file extension and inserts an incrementing index before it.
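The naming scheme above could be produced by a helper like this (a sketch of the convention, not the library's internals):

```typescript
import * as path from 'path';

// Insert an incrementing index before the extension:
// results.jsonl -> results.1.jsonl -> results.2.jsonl ...
function rotatedPath(basePath: string, index: number): string {
  if (index === 0) return basePath;            // first file keeps the original name
  const ext = path.extname(basePath);          // e.g. '.jsonl'
  const stem = ext ? basePath.slice(0, -ext.length) : basePath;
  return `${stem}.${index}${ext}`;
}
```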

Flush Behavior

Results are buffered by the Node.js WriteStream and flushed to disk periodically:

  • Every flushInterval writes (default: 100), the stream is corked and uncorked on the next tick
  • On close(), all buffered data is flushed before the stream ends
  • The file is opened in append mode ('a'), so a restarted crawl appends to the existing file rather than overwriting it
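The cork/uncork batching described above can be sketched with Node's standard Writable API. This is an assumed illustration of the mechanism (the `batchedWriter` helper and flush interval of 3 are hypothetical), not the library's actual code:

```typescript
import { Writable } from 'stream';

// Returns a write function that corks the destination after every
// `flushInterval` writes and uncorks it on the next tick, so writes
// arriving within the same tick are handed to the OS together.
function batchedWriter(dest: Writable, flushInterval: number) {
  let writes = 0;
  return (line: string): void => {
    dest.write(line + '\n');
    writes++;
    if (writes % flushInterval === 0) {
      dest.cork();
      process.nextTick(() => dest.uncork());
    }
  };
}
```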

Manual Flush

stream.flush();
// Forces buffered data to be written to disk

Stream Properties

console.log(stream.recordCount);  // Number of records written
console.log(stream.totalBytes);   // Total bytes written
console.log(stream.isClosed);     // Whether the stream is closed
console.log(stream.outputPath);   // Current output file path (accounts for rotation)

Integration with Crawler

import { Crawler, ResultStream } from 'rezo/crawler';

const stream = new ResultStream({
  outputPath: './crawl-results.jsonl',
  flushInterval: 100,
  maxFileSize: 50 * 1024 * 1024  // 50MB per file
});

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  concurrency: 20,
  maxUrls: 100000
});

crawler.onDocument(async function (document, response) {
  const title = document.querySelector('title')?.textContent || '';
  const h1 = document.querySelector('h1')?.textContent || '';

  stream.write({
    url: response.url,
    finalUrl: response.finalUrl,
    status: response.status,
    title,
    h1,
    crawledAt: new Date().toISOString()
  });
});

await crawler.visit('https://example.com');
await crawler.done();
await stream.close();

console.log(`Wrote ${stream.recordCount} records (${(stream.totalBytes / 1024 / 1024).toFixed(1)}MB)`);

CappedArray + ResultStream Pattern

Combine CappedArray with ResultStream for memory-bounded collection that automatically streams overflow to disk:

import { CappedArray, ResultStream } from 'rezo/crawler';

const stream = new ResultStream({ outputPath: './results.jsonl' });

const results = new CappedArray({
  maxSize: 50000,
  onEviction: (evicted) => {
    // When in-memory capacity is exceeded, evicted items go to disk
    stream.writeMany(evicted);
  }
});

// Collect results -- memory stays bounded
crawler.onDocument(async function (doc, res) {
  results.push({ url: res.url, title: doc.querySelector('title')?.textContent });
});

await crawler.done();

// Flush remaining in-memory items to disk
stream.writeMany(results.toArray());
await stream.close();