Crawler

Caching

The FileCacher provides SQLite-based response caching optimized for crawler workloads. It stores HTTP responses as BLOBs in a single database with namespace-based separation, optional zstd compression (Node.js 22.15+), and LRU eviction to cap disk and memory usage.

import { FileCacher } from 'rezo/crawler';

Architecture

  • Single SQLite database — All domains share one database with namespace as a column, preventing file descriptor exhaustion when crawling many domains
  • WAL journal mode — Write-Ahead Logging enables non-blocking reads during writes
  • BLOB storage — Responses stored as raw binary, avoiding the 33% overhead of base64 encoding
  • WITHOUT ROWID table — Primary key is (namespace, key) for fast composite lookups
  • Memory-mapped I/O — 128MB mmap for fast reads on supported systems
  • LRU eviction — Oldest entries evicted when maxEntries is exceeded
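
The bullets above imply a table layout along these lines. The actual DDL is internal to FileCacher; this is a hedged sketch of a plausible schema, not the library's exact statements:

```sql
-- Illustrative schema implied by the architecture notes above.
-- Column names are assumptions, not the library's actual definitions.
PRAGMA journal_mode = WAL;        -- non-blocking reads during writes
PRAGMA mmap_size = 134217728;     -- 128MB memory-mapped I/O

CREATE TABLE IF NOT EXISTS cache (
  namespace  TEXT    NOT NULL,    -- e.g. the domain being crawled
  key        TEXT    NOT NULL,    -- e.g. the URL
  value      BLOB    NOT NULL,    -- raw (optionally zstd-compressed) bytes
  created_at INTEGER NOT NULL,    -- drives LRU eviction order
  expires_at INTEGER,             -- NULL means no TTL
  PRIMARY KEY (namespace, key)
) WITHOUT ROWID;                  -- composite-key lookups without a rowid hop
```

Keeping namespace as a column rather than one database per domain is what avoids opening a file descriptor per crawled host.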

Creating a FileCacher

const cache = await FileCacher.create({
  cacheDir: '/tmp/my-crawler/cache',  // Database directory (default: '/tmp/rezo-crawler/cache')
  dbFileName: 'cache.db',             // Database filename (default: 'cache.db')
  ttl: 86400000,                      // Default TTL in ms (default: 7 days)
  compression: true,                  // Enable zstd compression (default: false)
  maxEntries: 100000                  // Max entries before LRU eviction (default: 100,000)
});

Compression

When compression: true is set, responses are compressed with zstd before storage. This requires Node.js 22.15+, which ships native zlib.zstdCompressSync() and zlib.zstdDecompressSync(). On older Node.js versions, compression is silently disabled and data is stored uncompressed.
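
The runtime check behind this fallback can be sketched as follows; pack and unpack are hypothetical helpers illustrating the behavior, not part of the FileCacher API:

```typescript
import * as zlib from 'node:zlib';

// The zstd functions are simply absent from node:zlib on runtimes
// older than 22.15, so presence-checking them is the feature test.
const zstdAvailable = typeof (zlib as any).zstdCompressSync === 'function';

// Compress when zstd exists, otherwise pass bytes through untouched --
// mirroring the silent fallback described above.
function pack(data: Buffer): Buffer {
  return zstdAvailable ? (zlib as any).zstdCompressSync(data) : data;
}

function unpack(data: Buffer): Buffer {
  return zstdAvailable ? (zlib as any).zstdDecompressSync(data) : data;
}

const page = Buffer.from('<html>'.repeat(200));
console.log(unpack(pack(page)).equals(page)); // true on any Node.js version
```

Because the fallback is silent, data written on an old runtime is readable on a new one (it was never compressed), but not vice versa, so avoid sharing a compressed cache directory across mixed Node.js versions.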

Storing Responses

set()

Store a single value with optional TTL and namespace:

// Store a response
await cache.set('https://example.com/page1', {
  html: '<html>...</html>',
  status: 200,
  headers: { 'content-type': 'text/html' }
});

// Store with custom TTL (1 hour)
await cache.set('https://example.com/page1', responseData, 3600000);

// Store in a domain namespace
await cache.set('https://example.com/page1', responseData, undefined, 'example.com');

Values are JSON-serialized before storage, so any JSON-compatible type works.
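
One consequence worth noting: anything without a JSON representation is transformed on the way in. A quick illustration of the round trip (plain JSON.stringify/JSON.parse, no cacher involved):

```typescript
// Values take the same JSON round trip inside the cacher.
const stored = {
  html: '<html>...</html>',
  fetchedAt: new Date(0),   // Date has no JSON type...
  etag: undefined           // ...and undefined fields are dropped entirely
};

const retrieved = JSON.parse(JSON.stringify(stored));

console.log(typeof retrieved.fetchedAt); // 'string' (an ISO date string)
console.log('etag' in retrieved);        // false
```

If you need Dates, Maps, or binary payloads back in their original form, convert them explicitly before calling set() and after get().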

setMany()

Store multiple entries in a single transaction for better throughput:

await cache.setMany([
  { key: 'https://example.com/page1', value: response1 },
  { key: 'https://example.com/page2', value: response2, ttl: 3600000 },
  { key: 'https://example.com/page3', value: response3 }
], 'example.com');

Retrieving Cached Data

get()

Retrieve a cached value. Returns null if the key does not exist or has expired:

const data = await cache.get<{ html: string; status: number }>(
  'https://example.com/page1',
  'example.com'
);

if (data) {
  console.log(data.html);
  console.log(data.status);
} else {
  // Not cached or expired
}

Expired entries are automatically deleted on access.
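
A common way to use get() is a cache-aside wrapper: consult the cache, fall back to the network on a miss, and populate on the way out. A sketch under stated assumptions: cachedFetch and fetchPage are hypothetical, and only the get()/set() signatures shown above are assumed:

```typescript
// Minimal structural type covering the two FileCacher methods used here.
interface ReadWriteCache {
  get<T>(key: string, namespace?: string): Promise<T | null>;
  set(key: string, value: unknown, ttl?: number, namespace?: string): Promise<void>;
}

// Hypothetical cache-aside wrapper: null from get() means "miss or expired".
async function cachedFetch<T>(
  cache: ReadWriteCache,
  url: string,
  namespace: string,
  fetchPage: (url: string) => Promise<T>
): Promise<T> {
  const hit = await cache.get<T>(url, namespace);
  if (hit !== null) return hit;           // served from cache
  const fresh = await fetchPage(url);     // miss: go to the network
  await cache.set(url, fresh, undefined, namespace);
  return fresh;
}
```

Because expired entries are deleted on access, the wrapper never has to distinguish "missing" from "stale"; both surface as null.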

has()

Check if a key exists and is not expired without retrieving the full value:

const exists = await cache.has('https://example.com/page1', 'example.com');

hasMany()

Batch check multiple keys:

const cachedUrls = await cache.hasMany(
  ['https://example.com/page1', 'https://example.com/page2', 'https://example.com/page3'],
  'example.com'
);
// Returns Set<string> of keys that exist and are not expired
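
In a crawler loop this is typically used to prune the frontier before fetching. The Set below stands in for hasMany()'s return value:

```typescript
const frontier = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];

// Stand-in for: const cached = await cache.hasMany(frontier, 'example.com');
const cached = new Set(['https://example.com/page1']);

// Only fetch URLs that are missing or expired.
const toFetch = frontier.filter(url => !cached.has(url));
console.log(toFetch.length); // 2
```

One hasMany() call replaces N has() calls, which matters when the frontier holds thousands of URLs.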

Deletion and Cleanup

delete()

Remove a single entry:

await cache.delete('https://example.com/page1', 'example.com');

clear()

Clear entries by namespace or all entries:

// Clear one namespace
await cache.clear('example.com');

// Clear everything
await cache.clear();

cleanup()

Remove all expired entries across all namespaces:

const removed = await cache.cleanup();
console.log(`Removed ${removed} expired entries`);
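
If expired entries accumulate faster than eviction purges them, a long-running crawl can schedule cleanup() itself. A minimal sketch, assuming only the cleanup() signature above; the 15-minute interval is an arbitrary choice:

```typescript
// Run cleanup() periodically; unref() lets the process exit even if the
// timer is still scheduled when the crawl finishes.
function scheduleCleanup(
  cache: { cleanup(): Promise<number> },
  intervalMs: number = 15 * 60 * 1000
): NodeJS.Timeout {
  const timer = setInterval(() => {
    cache.cleanup().catch(() => {
      // Ignore: the cache may have been closed mid-crawl.
    });
  }, intervalMs);
  timer.unref();
  return timer;
}
```

Remember to clearInterval() the returned timer before calling cache.close() if you want deterministic shutdown.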

LRU Eviction

When maxEntries is configured and the entry count exceeds the limit, the cacher performs LRU eviction:

  1. First, all expired entries are deleted
  2. If still over the limit, the oldest 10% of entries (by createdAt) are removed
  3. Eviction runs asynchronously on the next event loop tick to avoid blocking writes

const cache = await FileCacher.create({
  maxEntries: 50000  // Keep at most 50,000 entries
});

// After 50,001 inserts:
// - Expired entries are purged
// - If still over limit, oldest 5,000 entries are evicted

Set maxEntries: 0 to disable eviction (unlimited growth — use with caution).

Statistics

const stats = await cache.stats();
// { count: 45000, expired: 1200, namespaces: 5 }

// Stats for a specific namespace
const nsStats = await cache.stats('example.com');
// { count: 8000, expired: 300, namespaces: 5 }

Closing

Always close the cacher when done to checkpoint the WAL and release the database connection:

await cache.close();

console.log(cache.isClosed);     // true
console.log(cache.directory);    // '/tmp/my-crawler/cache'
console.log(cache.databasePath); // '/tmp/my-crawler/cache/cache.db'

Integration with Crawler

The crawler automatically creates and manages a FileCacher when enableCache: true is set:

import { Crawler } from 'rezo/crawler';

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  enableCache: true,
  cacheTTL: 86400000,    // 24 hours
  cacheDir: './my-cache'
});

// First visit fetches from network and caches
await crawler.visit('https://example.com/page1');

// Second visit serves from cache (if within TTL)
await crawler.visit('https://example.com/page1');