URL Management
The crawler needs to track which URLs have been visited to avoid redundant requests. Rezo provides three data structures for this: UrlStore for persistent URL tracking, CappedMap for bounded key-value storage, and CappedArray for bounded result collection.
UrlStore
UrlStore is a high-performance SQLite-backed URL tracker optimized for crawler workloads. It combines an in-memory LRU cache (100,000 entries by default) with persistent SQLite storage, providing O(1) lookups for hot URLs and durable storage for the full history.
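Conceptually, a lookup consults the in-memory LRU first and only falls back to SQLite on a miss. A minimal synchronous sketch of that two-tier fast path (the `TwoTierTracker` class and its fallback are illustrative assumptions, not Rezo's implementation; the real `store.has()` is async and queries SQLite):

```typescript
// Two-tier lookup sketch: an in-memory Set as the hot cache, with a callback
// standing in for the SQLite query (illustrative only, not Rezo's code).
class TwoTierTracker {
  private hot = new Set<string>();
  constructor(private slowLookup: (url: string) => boolean) {}

  has(url: string): boolean {
    if (this.hot.has(url)) return true; // O(1) fast path: no disk access
    const found = this.slowLookup(url); // stand-in for the SQLite query
    if (found) this.hot.add(url);       // promote hot URLs into the cache
    return found;
  }
}

// The durable tier is faked with a Set standing in for SQLite
const disk = new Set(['https://example.com/a']);
const tracker = new TwoTierTracker((u) => disk.has(u));
console.log(tracker.has('https://example.com/a')); // true (now promoted to the hot tier)
console.log(tracker.has('https://example.com/b')); // false
```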
```typescript
import { UrlStore } from 'rezo/crawler';
```
Architecture
- SQLite WAL mode — Write-Ahead Logging for non-blocking concurrent reads/writes
- SHA-256 hashing — URLs longer than 200 characters are hashed to fixed 64-char keys
- In-memory LRU — CappedMap with configurable capacity (default: 100,000) for fast lookups
- Batch writes — Writes are buffered and flushed in transactions for throughput
- WITHOUT ROWID — SQLite table optimized for primary key lookups
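The hashing rule can be illustrated with a short self-contained sketch; the `urlKey` helper and its threshold parameter are hypothetical stand-ins for Rezo's internal logic:

```typescript
import { createHash } from 'node:crypto';

// Hypothetical sketch of the keying scheme described above, not Rezo's code:
// URLs up to 200 characters are stored verbatim; longer URLs are replaced by
// their SHA-256 hex digest, which is always a fixed 64-character key.
function urlKey(url: string, maxLen = 200): string {
  if (url.length <= maxLen) return url;
  return createHash('sha256').update(url).digest('hex');
}

const short = urlKey('https://example.com/page1');
const long = urlKey('https://example.com/?q=' + 'x'.repeat(500));
console.log(short.length, long.length); // 25 64 (the long URL collapses to a fixed-size key)
```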
Creating a UrlStore
```typescript
const store = await UrlStore.create({
  storeDir: '/tmp/my-crawler/urls', // Database directory
  dbFileName: 'urls.db',            // Database filename
  ttl: 86400000,                    // 24 hour TTL (default: 7 days)
  maxUrls: 500000,                  // Max stored URLs, 0 = unlimited
  hashUrls: true,                   // Hash long URLs (default: true)
  inMemoryMaxUrls: 100000           // In-memory LRU capacity (default: 100,000)
});
```
Marking URLs as Visited
```typescript
// Mark a single URL
await store.set('https://example.com/page1');

// Mark with a namespace
await store.set('https://example.com/page1', 'example.com');

// Mark with custom TTL (ms)
await store.set('https://example.com/page1', 'default', 3600000);
```
Batch Operations
```typescript
// Mark many URLs at once (single transaction)
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];
await store.setMany(urls, 'example.com');
```
Checking Visited URLs
```typescript
// Check single URL (uses in-memory fast path first)
const visited = await store.has('https://example.com/page1');
// true or false

// Check single URL in a specific namespace
const visitedInNs = await store.has('https://example.com/page1', 'example.com');
```
Filtering Unvisited URLs
The most common crawler operation — given a list of discovered URLs, return only the ones not yet visited:
```typescript
const discoveredUrls = [
  'https://example.com/page1', // already visited
  'https://example.com/page2', // already visited
  'https://example.com/page3', // new
  'https://example.com/page4'  // new
];
const unvisited = await store.filterUnvisited(discoveredUrls);
// ['https://example.com/page3', 'https://example.com/page4']
```
Statistics and Cleanup
```typescript
// Get URL count
const count = await store.count();
const nsCount = await store.count('example.com');

// Get detailed stats
const stats = await store.stats();
// { total: 15000, expired: 200, namespaces: 3 }

// Remove expired entries
const removed = await store.cleanup();
console.log(`Cleaned up ${removed} expired URLs`);

// Clear all URLs
await store.clear();

// Clear a specific namespace
await store.clear('example.com');
```
Closing
Always close the store when done to flush pending writes and checkpoint the WAL:
```typescript
await store.close();
```
CappedMap
A Map with automatic LRU eviction when the size limit is reached. Used internally by UrlStore for its in-memory cache and available for general use.
```typescript
import { CappedMap } from 'rezo/crawler';
```
Usage
```typescript
const map = new CappedMap<string, number>({
  maxSize: 10000,     // Max entries (default: 10,000)
  evictionRatio: 0.1  // Evict 10% when full (default: 0.1)
});

map.set('key1', 100);
map.set('key2', 200);

const value = map.get('key1');           // 100 (does not affect LRU order)
const touched = map.getAndTouch('key1'); // 100 (moves to most recent)

map.has('key1');    // true
map.delete('key1');
map.clear();
console.log(map.size); // Current entry count
```
LRU Behavior
When maxSize is reached, the oldest evictionRatio * maxSize entries are removed in a single batch. The set() method moves existing keys to the most recent position. Use getAndTouch() instead of get() if you want reads to refresh the LRU position.
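Under the hood, batch eviction can be modeled with a plain JavaScript Map, whose iteration follows insertion order. The `TinyCappedMap` below is a simplified, hypothetical stand-in, not Rezo's implementation:

```typescript
// Simplified stand-in for CappedMap's batch eviction (illustrative only).
// A JS Map iterates in insertion order, so its first keys are least recently set.
class TinyCappedMap<K, V> {
  private map = new Map<K, V>();
  constructor(private maxSize: number, private evictionRatio = 0.1) {}

  set(key: K, value: V): void {
    this.map.delete(key); // re-inserting moves an existing key to most recent
    if (this.map.size >= this.maxSize) {
      const batch = Math.max(1, Math.floor(this.maxSize * this.evictionRatio));
      const oldest = [...this.map.keys()].slice(0, batch);
      for (const k of oldest) this.map.delete(k); // evict the oldest batch at once
    }
    this.map.set(key, value);
  }

  get(key: K): V | undefined { return this.map.get(key); }
  get size(): number { return this.map.size; }
}

const m = new TinyCappedMap<number, number>(10); // 10% of 10 = 1 entry per eviction
for (let i = 0; i < 11; i++) m.set(i, i);
console.log(m.size, m.get(0)); // 10 undefined (key 0 was evicted when key 10 arrived)
```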
```typescript
const map = new CappedMap<string, any>({ maxSize: 1000, evictionRatio: 0.1 });
// When the 1001st entry is added:
// - The oldest 100 entries (10%) are evicted
// - The new entry is added
```
Iteration
```typescript
for (const [key, value] of map.entries()) {
  console.log(key, value);
}

map.forEach((value, key) => {
  console.log(key, value);
});

// Export to a regular Map
const regularMap = map.toMap();
```
CappedArray
A bounded array with automatic LRU eviction and optional eviction callbacks. Useful for collecting crawler results without unbounded memory growth.
```typescript
import { CappedArray } from 'rezo/crawler';
```
Usage
```typescript
const results = new CappedArray<any>({
  maxSize: 100000,     // Max items (default: 100,000)
  evictionRatio: 0.1,  // Evict 10% when full (default: 0.1)
  onEviction: (evicted, remaining) => {
    // Write evicted items to disk before they're lost
    console.log(`Evicted ${evicted.length} items, ${remaining} remaining`);
  }
});

// Add items
results.push({ url: 'https://example.com', title: 'Example' });

// Add multiple items
results.pushMany([
  { url: 'https://example.com/a', title: 'Page A' },
  { url: 'https://example.com/b', title: 'Page B' }
]);

// Access items
const first = results.get(0);
console.log(results.length);

// Check capacity
console.log(results.isAtCapacity); // true when length >= maxSize
console.log(results.maxCapacity);  // The configured maxSize

// Iterate
for (const item of results) {
  console.log(item.url);
}

// Export to regular array
const array = results.toArray();

// Array-like methods
const filtered = results.filter(item => item.title.includes('Example'));
const mapped = results.map(item => item.url);
results.forEach((item, index) => console.log(index, item));
```
Eviction Callback Pattern
Use the eviction callback to stream results to disk before they are removed from memory:
```typescript
import { CappedArray } from 'rezo/crawler';
import { ResultStream } from 'rezo/crawler';

const stream = new ResultStream({ outputPath: './results.jsonl' });

const results = new CappedArray({
  maxSize: 50000,
  onEviction: (evicted) => {
    // Automatically flush evicted items to disk
    stream.writeMany(evicted);
  }
});

// Collect results normally -- oldest items auto-flush to disk
crawler.onDocument(async function (doc, res) {
  results.push({ url: res.url, title: doc.querySelector('title')?.textContent });
});
```
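The same flush-on-eviction idea can be demonstrated without Rezo. The `BoundedBuffer` below is a hypothetical, self-contained stand-in for CappedArray plus ResultStream, appending evicted items to a JSON Lines file via node:fs:

```typescript
import { appendFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Self-contained illustration of the flush-on-eviction pattern above, using a
// hypothetical BoundedBuffer in place of Rezo's CappedArray/ResultStream.
class BoundedBuffer<T> {
  private items: T[] = [];
  constructor(
    private maxSize: number,
    private onEviction: (evicted: T[], remaining: number) => void
  ) {}

  push(item: T): void {
    if (this.items.length >= this.maxSize) {
      const evicted = this.items.splice(0, Math.ceil(this.maxSize * 0.1)); // oldest 10%
      this.onEviction(evicted, this.items.length);
    }
    this.items.push(item);
  }

  get length(): number { return this.items.length; }
}

const outPath = join(tmpdir(), 'evicted-results.jsonl');
let flushedCount = 0;

const buffer = new BoundedBuffer<{ url: string }>(100, (evicted) => {
  // Stream evicted results to disk as JSON Lines before they leave memory
  appendFileSync(outPath, evicted.map((r) => JSON.stringify(r)).join('\n') + '\n');
  flushedCount += evicted.length;
});

for (let i = 0; i < 150; i++) buffer.push({ url: `https://example.com/${i}` });
console.log(buffer.length, flushedCount); // 100 50
```

With maxSize 100, every tenth push past capacity evicts a batch of 10, so 150 pushes trigger five flushes while memory stays bounded.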