Crawler

URL Management

The crawler needs to track which URLs have been visited to avoid redundant requests. Rezo provides three data structures for this: UrlStore for persistent URL tracking, CappedMap for bounded key-value storage, and CappedArray for bounded result collection.

UrlStore

UrlStore is a high-performance SQLite-backed URL tracker optimized for crawler workloads. It combines an in-memory LRU cache (100,000 entries by default) with persistent SQLite storage, providing O(1) lookups for hot URLs and durable storage for the full history.

import { UrlStore } from 'rezo/crawler';

Architecture

  • SQLite WAL mode — Write-Ahead Logging for non-blocking concurrent reads/writes
  • SHA-256 hashing — URLs longer than 200 characters are hashed to fixed 64-char keys
  • In-memory LRUCappedMap with configurable capacity (default: 100,000) for fast lookups
  • Batch writes — Writes are buffered and flushed in transactions for throughput
  • WITHOUT ROWID — SQLite table optimized for primary key lookups
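The long-URL hashing rule from the list above can be sketched in a few lines using Node's built-in crypto module. The 200-character threshold and the 64-character SHA-256 hex digest come from the text; the `urlToKey` helper name is illustrative, not part of rezo's API:

```typescript
import { createHash } from 'node:crypto';

// Sketch of the key scheme described above: URLs at or under the
// threshold are stored verbatim; longer URLs collapse to a fixed
// 64-character SHA-256 hex digest, keeping index keys small.
const MAX_VERBATIM_LENGTH = 200;

function urlToKey(url: string): string {
  if (url.length <= MAX_VERBATIM_LENGTH) return url;
  return createHash('sha256').update(url).digest('hex'); // always 64 chars
}

const short = urlToKey('https://example.com/page1');
const long = urlToKey('https://example.com/?q=' + 'x'.repeat(500));
console.log(short.length, long.length); // 25 64
```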

Creating a UrlStore

const store = await UrlStore.create({
  storeDir: '/tmp/my-crawler/urls',  // Database directory
  dbFileName: 'urls.db',             // Database filename
  ttl: 86400000,                     // 24 hour TTL (default: 7 days)
  maxUrls: 500000,                   // Max stored URLs, 0 = unlimited
  hashUrls: true,                    // Hash long URLs (default: true)
  inMemoryMaxUrls: 100000            // In-memory LRU capacity (default: 100,000)
});

Marking URLs as Visited

// Mark a single URL
await store.set('https://example.com/page1');

// Mark with a namespace
await store.set('https://example.com/page1', 'example.com');

// Mark with custom TTL (ms)
await store.set('https://example.com/page1', 'default', 3600000);

Batch Operations

// Mark many URLs at once (single transaction)
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];
await store.setMany(urls, 'example.com');

Checking Visited URLs

// Check single URL (uses in-memory fast path first)
const visited = await store.has('https://example.com/page1');
// true or false

// Check single URL in a specific namespace
const visitedInNs = await store.has('https://example.com/page1', 'example.com');

Filtering Unvisited URLs

The most common crawler operation — given a list of discovered URLs, return only the ones not yet visited:

const discoveredUrls = [
  'https://example.com/page1',  // already visited
  'https://example.com/page2',  // already visited
  'https://example.com/page3',  // new
  'https://example.com/page4'   // new
];

const unvisited = await store.filterUnvisited(discoveredUrls);
// ['https://example.com/page3', 'https://example.com/page4']
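The contract of filterUnvisited can be mirrored with a plain Set. This is a semantic sketch only: the real UrlStore answers membership checks from its in-memory LRU cache first and falls back to SQLite, while this stand-in keeps everything in memory:

```typescript
// Semantic sketch of filterUnvisited: given a visited set, keep only
// the URLs not yet seen, preserving input order.
function filterUnvisited(visited: Set<string>, urls: string[]): string[] {
  return urls.filter((url) => !visited.has(url));
}

const visited = new Set([
  'https://example.com/page1',
  'https://example.com/page2'
]);

const fresh = filterUnvisited(visited, [
  'https://example.com/page1',
  'https://example.com/page3'
]);
console.log(fresh); // ['https://example.com/page3']
```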

Statistics and Cleanup

// Get URL count
const count = await store.count();
const nsCount = await store.count('example.com');

// Get detailed stats
const stats = await store.stats();
// { total: 15000, expired: 200, namespaces: 3 }

// Remove expired entries
const removed = await store.cleanup();
console.log(`Cleaned up ${removed} expired URLs`);

// Clear all URLs
await store.clear();

// Clear a specific namespace
await store.clear('example.com');
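Cleanup removes entries whose TTL has elapsed. Assuming each entry records the time it was marked visited (the `storedAt` field name below is illustrative, not rezo's actual schema), the expiry rule can be sketched as:

```typescript
// Expiry predicate implied by the TTL options above: an entry is
// expired once more than `ttl` milliseconds have passed since it
// was stored.
interface StoredUrl {
  url: string;
  storedAt: number; // epoch ms when the URL was marked visited
  ttl: number;      // lifetime in ms (e.g. 86400000 for 24 hours)
}

function isExpired(entry: StoredUrl, now: number = Date.now()): boolean {
  return now - entry.storedAt > entry.ttl;
}

const entry: StoredUrl = { url: 'https://example.com', storedAt: 0, ttl: 3600000 };
console.log(isExpired(entry, 3600001)); // true: one ms past the TTL
```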

Closing

Always close the store when done to flush pending writes and checkpoint the WAL:

await store.close();

CappedMap

A Map with automatic LRU eviction when the size limit is reached. Used internally by UrlStore for its in-memory cache and available for general use.

import { CappedMap } from 'rezo/crawler';

Usage

const map = new CappedMap<string, number>({
  maxSize: 10000,      // Max entries (default: 10,000)
  evictionRatio: 0.1   // Evict 10% when full (default: 0.1)
});

map.set('key1', 100);
map.set('key2', 200);

const peeked = map.get('key1');          // 100 (does not affect LRU order)
const touched = map.getAndTouch('key1'); // 100 (moves to most recent)

map.has('key1');  // true
map.delete('key1');
map.clear();

console.log(map.size); // Current entry count

LRU Behavior

When maxSize is reached, the oldest evictionRatio * maxSize entries are removed in a single batch. The set() method moves existing keys to the most recent position. Use getAndTouch() instead of get() if you want reads to refresh the LRU position.

const map = new CappedMap<string, any>({ maxSize: 1000, evictionRatio: 0.1 });

// When the 1001st entry is added:
// - The oldest 100 entries (10%) are evicted
// - The new entry is added
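The eviction semantics described above can be sketched with a plain Map, which iterates in insertion order. This is a minimal stand-in to illustrate the batch-eviction and touch-on-set behavior, not rezo's internals:

```typescript
// Minimal sketch of batch LRU eviction using Map insertion order.
// When capacity is exceeded, the oldest `evictionRatio * maxSize`
// entries are dropped in one pass, as described above.
class MiniCappedMap<K, V> {
  private map = new Map<K, V>();
  constructor(private maxSize: number, private evictionRatio = 0.1) {}

  set(key: K, value: V): void {
    this.map.delete(key); // re-inserting moves an existing key to most recent
    this.map.set(key, value);
    if (this.map.size > this.maxSize) {
      const toEvict = Math.max(1, Math.floor(this.maxSize * this.evictionRatio));
      const keys = this.map.keys();
      for (let i = 0; i < toEvict; i++) {
        this.map.delete(keys.next().value as K);
      }
    }
  }

  getAndTouch(key: K): V | undefined {
    const value = this.map.get(key);
    if (value !== undefined) this.set(key, value); // refresh LRU position
    return value;
  }

  get size(): number { return this.map.size; }
  has(key: K): boolean { return this.map.has(key); }
}

const m = new MiniCappedMap<number, string>(1000, 0.1);
for (let i = 0; i < 1001; i++) m.set(i, `v${i}`);
console.log(m.size);   // 901: adding the 1001st entry evicted the oldest 100
console.log(m.has(0)); // false: key 0 was in the evicted batch
```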

Iteration

for (const [key, value] of map.entries()) {
  console.log(key, value);
}

map.forEach((value, key) => {
  console.log(key, value);
});

// Export to a regular Map
const regularMap = map.toMap();

CappedArray

A bounded array with automatic LRU eviction and optional eviction callbacks. Useful for collecting crawler results without unbounded memory growth.

import { CappedArray } from 'rezo/crawler';

Usage

const results = new CappedArray<any>({
  maxSize: 100000,      // Max items (default: 100,000)
  evictionRatio: 0.1,   // Evict 10% when full (default: 0.1)
  onEviction: (evicted, remaining) => {
    // Write evicted items to disk before they're lost
    console.log(`Evicted ${evicted.length} items, ${remaining} remaining`);
  }
});

// Add items
results.push({ url: 'https://example.com', title: 'Example' });

// Add multiple items
results.pushMany([
  { url: 'https://example.com/a', title: 'Page A' },
  { url: 'https://example.com/b', title: 'Page B' }
]);

// Access items
const first = results.get(0);
console.log(results.length);

// Check capacity
console.log(results.isAtCapacity);  // true when length >= maxSize
console.log(results.maxCapacity);   // The configured maxSize

// Iterate
for (const item of results) {
  console.log(item.url);
}

// Export to regular array
const array = results.toArray();

// Array-like methods
const filtered = results.filter(item => item.title.includes('Example'));
const mapped = results.map(item => item.url);
results.forEach((item, index) => console.log(index, item));

Eviction Callback Pattern

Use the eviction callback to stream results to disk before they are removed from memory:

import { CappedArray } from 'rezo/crawler';
import { ResultStream } from 'rezo/crawler';

const stream = new ResultStream({ outputPath: './results.jsonl' });

const results = new CappedArray({
  maxSize: 50000,
  onEviction: (evicted) => {
    // Automatically flush evicted items to disk
    stream.writeMany(evicted);
  }
});

// Collect results normally -- oldest items auto-flush to disk
crawler.onDocument(async function (doc, res) {
  results.push({ url: res.url, title: doc.querySelector('title')?.textContent });
});