# Result Streaming
The ResultStream class writes crawler results directly to disk instead of accumulating them in memory. This prevents out-of-memory crashes on long-running crawls that produce millions of results. It supports JSONL and CSV output formats, periodic flushing for durability, and automatic file rotation when files exceed a size limit.
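The core idea — append each result to disk as it arrives rather than holding all of them in an array — can be sketched in plain Node. The `JsonlSink` name and shape here are illustrative, not the library's internals:

```typescript
import { createWriteStream, type WriteStream } from 'node:fs';

// Illustrative sketch, not the rezo implementation: append each result to
// disk as one JSON line, so memory use stays constant regardless of crawl size.
class JsonlSink {
  private out: WriteStream;

  constructor(path: string) {
    // 'a' opens in append mode, so an interrupted crawl can resume safely
    this.out = createWriteStream(path, { flags: 'a' });
  }

  write(record: object): void {
    this.out.write(JSON.stringify(record) + '\n');
  }

  close(): Promise<void> {
    return new Promise((resolve, reject) => {
      this.out.on('error', reject);
      this.out.end(() => resolve());
    });
  }
}
```

`ResultStream` layers format selection, flushing, and rotation on top of this basic append loop.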
## Basic Usage

```typescript
import { ResultStream } from 'rezo/crawler';

const stream = new ResultStream({
  outputPath: './results.jsonl'
});

// Write results one at a time
stream.write({ url: 'https://example.com', title: 'Example', status: 200 });
stream.write({ url: 'https://example.com/about', title: 'About', status: 200 });

// Close when done
await stream.close();
```

## Configuration
```typescript
interface ResultStreamOptions {
  /** Output file path (required) */
  outputPath: string;

  /** Output format: 'jsonl' or 'csv' (default: 'jsonl') */
  format?: 'jsonl' | 'csv';

  /** Flush to disk after this many writes (default: 100) */
  flushInterval?: number;

  /** Max file size in bytes before rotation, 0 = no rotation (default: 0) */
  maxFileSize?: number;

  /** CSV column headers; if omitted, auto-detected from the first written object */
  csvHeaders?: string[];
}
```

## JSONL Output
JSONL (JSON Lines) writes one JSON object per line, making it easy to process with streaming tools:
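For example, a crawl's output can be consumed record by record with Node's own `readline`, without ever loading the whole file. The `readJsonl` helper below is an illustrative sketch, not part of the library:

```typescript
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

// Stream a JSONL file one record at a time; memory use is independent of
// file size because each line is parsed and yielded individually.
async function* readJsonl(path: string): AsyncGenerator<unknown> {
  const rl = createInterface({
    input: createReadStream(path),
    crlfDelay: Infinity // treat \r\n as a single line break
  });
  for await (const line of rl) {
    if (line.trim()) yield JSON.parse(line);
  }
}
```

Usage: `for await (const record of readJsonl('./results.jsonl')) { ... }`.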
```typescript
const stream = new ResultStream({
  outputPath: './output/results.jsonl',
  flushInterval: 50 // Flush every 50 writes
});

stream.write({ url: 'https://example.com', title: 'Home', price: null });
stream.write({ url: 'https://example.com/product/1', title: 'Widget', price: 29.99 });
```

File contents:

```
{"url":"https://example.com","title":"Home","price":null}
{"url":"https://example.com/product/1","title":"Widget","price":29.99}
```

## CSV Output
CSV output with configurable headers:

```typescript
const stream = new ResultStream({
  outputPath: './output/results.csv',
  format: 'csv',
  csvHeaders: ['url', 'title', 'price', 'category']
});

stream.write({ url: 'https://example.com/product/1', title: 'Widget', price: 29.99, category: 'tools' });
stream.write({ url: 'https://example.com/product/2', title: 'Gadget, Deluxe', price: 49.99, category: 'electronics' });
```

File contents:

```
url,title,price,category
https://example.com/product/1,Widget,29.99,tools
https://example.com/product/2,"Gadget, Deluxe",49.99,electronics
```

CSV values containing commas, quotes, or newlines are automatically escaped with RFC 4180 quoting rules. If `csvHeaders` is omitted, the headers are auto-detected from the keys of the first written object.
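The RFC 4180 rule is simple enough to sketch directly. The functions below are illustrative, not the library's internal code: a field is quoted if it contains a comma, a double quote, or a newline, and any embedded quotes are doubled.

```typescript
// Quote a single CSV field per RFC 4180 (illustrative implementation).
function escapeCsvField(value: unknown): string {
  const s = value == null ? '' : String(value);
  if (/[",\r\n]/.test(s)) {
    // Wrap in quotes and double any embedded quotes
    return '"' + s.replace(/"/g, '""') + '"';
  }
  return s;
}

// Serialize one record into a CSV row following the given header order.
function toCsvRow(record: Record<string, unknown>, headers: string[]): string {
  return headers.map((h) => escapeCsvField(record[h])).join(',');
}
```

Missing keys serialize as empty fields, which keeps every row aligned with the header line.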
## Batch Writing

Write multiple results at once:

```typescript
const results = [
  { url: 'https://example.com/a', title: 'Page A' },
  { url: 'https://example.com/b', title: 'Page B' },
  { url: 'https://example.com/c', title: 'Page C' }
];

stream.writeMany(results);
```

## File Rotation
When `maxFileSize` is set, the stream automatically rotates to a new file when the current file exceeds the limit:

```typescript
const stream = new ResultStream({
  outputPath: './output/results.jsonl',
  maxFileSize: 100 * 1024 * 1024 // 100MB per file
});

// Files created:
//   ./output/results.jsonl    (first 100MB)
//   ./output/results.1.jsonl  (next 100MB)
//   ./output/results.2.jsonl  (next 100MB)
//   ...
```

Rotation preserves the file extension and inserts an incrementing index before it.
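The naming rule can be sketched with Node's `path` module. `rotatedPath` is an illustrative helper, not part of the library's API:

```typescript
import { extname, basename, dirname, join } from 'node:path';

// Build the rotated filename: keep the extension, insert an incrementing
// index before it. Index 0 is the original, un-rotated path.
function rotatedPath(original: string, index: number): string {
  if (index === 0) return original;
  const ext = extname(original);        // '.jsonl'
  const stem = basename(original, ext); // 'results'
  return join(dirname(original), `${stem}.${index}${ext}`);
}
```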
## Flush Behavior
Results are buffered by the Node.js `WriteStream` and flushed to disk periodically:

- Every `flushInterval` writes (default: 100), the stream is corked and uncorked on the next tick
- On `close()`, all buffered data is flushed before the stream ends
- The append flag (`'a'`) ensures data is never lost on restart
### Manual Flush

```typescript
// Force buffered data to be written to disk
stream.flush();
```

## Stream Properties
```typescript
console.log(stream.recordCount); // Number of records written
console.log(stream.totalBytes);  // Total bytes written
console.log(stream.isClosed);    // Whether the stream is closed
console.log(stream.outputPath);  // Current output file path (accounts for rotation)
```

## Integration with Crawler
```typescript
import { Crawler, ResultStream } from 'rezo/crawler';

const stream = new ResultStream({
  outputPath: './crawl-results.jsonl',
  flushInterval: 100,
  maxFileSize: 50 * 1024 * 1024 // 50MB per file
});

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  concurrency: 20,
  maxUrls: 100000
});

crawler.onDocument(async function (document, response) {
  const title = document.querySelector('title')?.textContent || '';
  const h1 = document.querySelector('h1')?.textContent || '';
  stream.write({
    url: response.url,
    finalUrl: response.finalUrl,
    status: response.status,
    title,
    h1,
    crawledAt: new Date().toISOString()
  });
});

await crawler.visit('https://example.com');
await crawler.done();
await stream.close();

console.log(`Wrote ${stream.recordCount} records (${(stream.totalBytes / 1024 / 1024).toFixed(1)}MB)`);
```

## CappedArray + ResultStream Pattern
Combine `CappedArray` with `ResultStream` for memory-bounded collection that automatically streams overflow to disk:

```typescript
import { CappedArray, ResultStream } from 'rezo/crawler';

const stream = new ResultStream({ outputPath: './results.jsonl' });

const results = new CappedArray({
  maxSize: 50000,
  onEviction: (evicted) => {
    // When in-memory capacity is exceeded, evicted items go to disk
    stream.writeMany(evicted);
  }
});

// Collect results -- memory stays bounded
crawler.onDocument(async function (doc, res) {
  results.push({ url: res.url, title: doc.querySelector('title')?.textContent });
});

await crawler.done();

// Flush remaining in-memory items to disk
stream.writeMany(results.toArray());
await stream.close();
```
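The eviction contract this pattern relies on can be sketched as below. `CappedBuffer` is a hypothetical stand-in; the real `CappedArray` may differ in its eviction policy, though `maxSize`, `onEviction`, `push`, and `toArray` follow the usage above:

```typescript
// Hypothetical sketch of a capped buffer with an eviction hook: once maxSize
// is exceeded, the oldest items are handed to onEviction and dropped from memory.
class CappedBuffer<T> {
  private items: T[] = [];

  constructor(
    private maxSize: number,
    private onEviction: (evicted: T[]) => void
  ) {}

  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.maxSize) {
      // Evict the oldest half in one batch so pushes stay amortized O(1)
      const evicted = this.items.splice(0, Math.ceil(this.maxSize / 2));
      this.onEviction(evicted);
    }
  }

  toArray(): T[] {
    return [...this.items];
  }
}
```

Evicting in batches rather than one item at a time keeps the disk writes large and sequential, which is exactly what `writeMany` is designed for.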