Resumable Crawling
The NavigationHistory class persists crawl state to SQLite, enabling sessions to be paused, interrupted, and resumed without losing progress. It tracks three things: sessions (metadata), the URL queue (pending work), and visited URLs (completed work).
```typescript
import { NavigationHistory } from 'rezo/crawler';
```

Architecture
NavigationHistory uses a single SQLite database with three tables:
```
navigation.db
├── sessions   -- Session metadata (ID, status, counters)
├── queue      -- Pending URLs with priority and metadata
└── visited    -- Completed URLs with status codes and content types
```

All tables use WAL journal mode and are indexed for efficient lookups by session ID.
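The exact DDL lives inside NavigationHistory, but the three tables described above could plausibly look like the following sketch (column names are illustrative, derived from the interfaces documented below, not the library's actual schema):

```typescript
// Illustrative DDL only -- the real column names are internal to the library.
const schema = `
CREATE TABLE IF NOT EXISTS sessions (
  session_id       TEXT PRIMARY KEY,
  base_url         TEXT NOT NULL,
  started_at       INTEGER NOT NULL,
  last_activity_at INTEGER NOT NULL,
  status           TEXT NOT NULL,          -- running | paused | completed | failed
  urls_visited     INTEGER DEFAULT 0,
  urls_queued      INTEGER DEFAULT 0,
  urls_failed      INTEGER DEFAULT 0,
  metadata         TEXT                    -- JSON-serialized
);

CREATE TABLE IF NOT EXISTS queue (
  session_id TEXT NOT NULL,
  url        TEXT NOT NULL,
  method     TEXT NOT NULL DEFAULT 'GET',
  priority   INTEGER NOT NULL DEFAULT 0,
  body       TEXT,
  headers    TEXT,                         -- JSON-serialized
  metadata   TEXT,                         -- JSON-serialized
  added_at   INTEGER NOT NULL,
  PRIMARY KEY (session_id, url)
);

CREATE TABLE IF NOT EXISTS visited (
  session_id    TEXT NOT NULL,
  url           TEXT NOT NULL,
  status        INTEGER NOT NULL,
  visited_at    INTEGER NOT NULL,
  final_url     TEXT,
  content_type  TEXT,
  error_message TEXT,
  PRIMARY KEY (session_id, url)
);

-- Per-session lookups are indexed, as noted above:
CREATE INDEX IF NOT EXISTS idx_queue_session   ON queue (session_id, priority);
CREATE INDEX IF NOT EXISTS idx_visited_session ON visited (session_id);
`;
```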
Creating a NavigationHistory
```typescript
const history = await NavigationHistory.create({
  storeDir: '/tmp/my-crawler/navigation', // Database directory
  dbFileName: 'navigation.db',            // Database filename
  hashUrls: false                         // Hash URLs with SHA-256 (default: false)
});
```

When hashUrls: true, all URLs are hashed before storage for privacy and fixed-size keys.
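The effect of hashing can be sketched with Node's built-in crypto module (the library's exact key format may differ; this just shows the fixed-size property):

```typescript
import { createHash } from "node:crypto";

// SHA-256 maps any URL, however long, to a 64-character hex digest,
// so stored keys are uniform in size and raw URLs never hit disk.
function hashUrl(url: string): string {
  return createHash("sha256").update(url).digest("hex");
}

const key = hashUrl("https://example.com/page1");
console.log(key.length); // 64 hex characters, regardless of URL length
```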
Session Management
Creating a Session
```typescript
const session = await history.createSession(
  'my-crawl-2024',                     // Unique session ID
  'https://example.com',               // Base URL
  { version: 2, tags: ['production'] } // Optional metadata
);
```

The CrawlSession interface:
```typescript
interface CrawlSession {
  sessionId: string;
  baseUrl: string;
  startedAt: number;
  lastActivityAt: number;
  status: 'running' | 'paused' | 'completed' | 'failed';
  urlsVisited: number;
  urlsQueued: number;
  urlsFailed: number;
  metadata?: string; // JSON-serialized metadata
}
```

Retrieving a Session
```typescript
const session = await history.getSession('my-crawl-2024');
if (session) {
  console.log(`Status: ${session.status}`);
  console.log(`Visited: ${session.urlsVisited}`);
  console.log(`Queued: ${session.urlsQueued}`);
}
```

Updating Session Status
```typescript
await history.updateSessionStatus('my-crawl-2024', 'paused');
await history.updateSessionStatus('my-crawl-2024', 'completed');
await history.updateSessionStatus('my-crawl-2024', 'failed');
```

Updating Session Statistics
```typescript
await history.updateSessionStats('my-crawl-2024', {
  urlsVisited: 1500,
  urlsQueued: 350,
  urlsFailed: 12
});
```

Finding Resumable Sessions
Get all sessions that can be resumed (status is running or paused):
```typescript
const resumable = await history.getResumableSessions();
for (const session of resumable) {
  console.log(`${session.sessionId}: ${session.status} (${session.urlsVisited} visited, ${session.urlsQueued} queued)`);
}
```

Queue Management
The queue stores pending URLs with priority, HTTP method, headers, body, and metadata.
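The queue's ordering semantics (higher priority first, insertion order among equal priorities) can be modeled in memory like this. The `Item` type and `nextFromQueue` helper are illustrations, not part of the library's API:

```typescript
// In-memory model of the queue's dequeue order: highest priority wins,
// and the earliest-added entry wins among equal priorities (FIFO).
interface Item { url: string; priority: number; addedAt: number; }

function nextFromQueue(items: Item[]): Item | undefined {
  return [...items].sort(
    (a, b) => b.priority - a.priority || a.addedAt - b.addedAt
  )[0];
}

const items: Item[] = [
  { url: "/low",    priority: 0,  addedAt: 1 },
  { url: "/high-b", priority: 10, addedAt: 3 },
  { url: "/high-a", priority: 10, addedAt: 2 },
];
console.log(nextFromQueue(items)?.url); // "/high-a"
```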
Adding URLs to the Queue
```typescript
const added = await history.addToQueue('my-crawl-2024', 'https://example.com/page1', {
  method: 'GET',   // HTTP method (default: 'GET')
  priority: 10,    // Higher = processed first (default: 0)
  body: null,      // Request body (for POST/PUT)
  headers: { 'Accept': 'text/html' },
  metadata: { depth: 2, source: 'https://example.com' }
});
// Returns false if the URL is already queued or already visited
```

The addToQueue() method deduplicates automatically: it checks both the queue and visited tables before adding. It returns true if the URL was added and false if it was already known.
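The deduplication rule, together with markVisited() (covered below) removing entries from the queue, can be modeled with two in-memory sets. `SessionState` is a sketch for illustration, not the library's implementation:

```typescript
// Model of addToQueue() dedup + markVisited(): a URL is added only if it is
// in neither the queue nor the visited set; marking it visited also removes
// it from the queue, so it can never be re-queued afterwards.
class SessionState {
  private queue = new Set<string>();
  private visited = new Set<string>();

  addToQueue(url: string): boolean {
    if (this.queue.has(url) || this.visited.has(url)) return false;
    this.queue.add(url);
    return true;
  }

  markVisited(url: string): void {
    this.queue.delete(url);
    this.visited.add(url);
  }
}

const s = new SessionState();
console.log(s.addToQueue("https://example.com/a")); // true
console.log(s.addToQueue("https://example.com/a")); // false (already queued)
s.markVisited("https://example.com/a");
console.log(s.addToQueue("https://example.com/a")); // false (already visited)
```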
Getting the Next URL
Retrieves the highest-priority URL (FIFO within same priority):
```typescript
const next = await history.getNextFromQueue('my-crawl-2024');
if (next) {
  console.log(next.url);      // 'https://example.com/page1'
  console.log(next.method);   // 'GET'
  console.log(next.priority); // 10
  console.log(next.addedAt);  // Timestamp
}
```

The QueuedUrl interface:
```typescript
interface QueuedUrl {
  url: string;
  method: string;
  priority: number;
  body?: string;
  headers?: string;  // JSON-serialized
  metadata?: string; // JSON-serialized
  addedAt: number;
}
```

Removing from Queue
```typescript
await history.removeFromQueue('my-crawl-2024', 'https://example.com/page1');
```

Queue Inspection
```typescript
const size = await history.getQueueSize('my-crawl-2024');
console.log(`${size} URLs remaining in queue`);

const allQueued = await history.getAllQueuedUrls('my-crawl-2024');
for (const item of allQueued) {
  console.log(`${item.priority}: ${item.url}`);
}
```

Visited URL Tracking
Marking URLs as Visited
When a URL is processed, mark it as visited. This also removes it from the queue:
```typescript
await history.markVisited('my-crawl-2024', 'https://example.com/page1', {
  status: 200,
  finalUrl: 'https://example.com/page1/', // After redirects
  contentType: 'text/html',
  errorMessage: undefined
});
```

The VisitedUrl interface:
```typescript
interface VisitedUrl {
  url: string;
  status: number;
  visitedAt: number;
  finalUrl?: string;
  contentType?: string;
  errorMessage?: string;
}
```

Checking if Visited
```typescript
const visited = await history.isVisited('my-crawl-2024', 'https://example.com/page1');
```

Counting and Inspecting
```typescript
const visitedCount = await history.getVisitedCount('my-crawl-2024');

// Get all URLs that failed (status >= 400 or with error messages)
const failed = await history.getFailedUrls('my-crawl-2024');
for (const url of failed) {
  console.log(`${url.url}: ${url.status} ${url.errorMessage}`);
}
```

Session Cleanup
```typescript
// Clear just the queue
await history.clearQueue('my-crawl-2024');

// Clear just the visited list
await history.clearVisited('my-crawl-2024');

// Delete the entire session (queue + visited + session record)
await history.deleteSession('my-crawl-2024');
```

Closing
```typescript
await history.close();
```

Resume Pattern
The typical resume workflow:
```typescript
import { Crawler, NavigationHistory } from 'rezo/crawler';

const SESSION_ID = 'product-crawl-v3';

const history = await NavigationHistory.create({
  storeDir: './crawl-state'
});

// Check for existing session
const existing = await history.getSession(SESSION_ID);

if (existing && (existing.status === 'running' || existing.status === 'paused')) {
  console.log(`Resuming session: ${existing.urlsVisited} visited, ${existing.urlsQueued} queued`);

  // Get remaining URLs from the queue
  const queued = await history.getAllQueuedUrls(SESSION_ID);
  await history.updateSessionStatus(SESSION_ID, 'running');

  const crawler = new Crawler({
    baseUrl: existing.baseUrl,
    enableNavigationHistory: true,
    sessionId: SESSION_ID,
    enableSignalHandlers: true
  });

  // Re-queue all pending URLs
  for (const item of queued) {
    await crawler.visit(item.url);
  }

  await crawler.done();
  await history.updateSessionStatus(SESSION_ID, 'completed');
} else {
  console.log('Starting fresh crawl');
  await history.createSession(SESSION_ID, 'https://example.com');

  const crawler = new Crawler({
    baseUrl: 'https://example.com',
    enableNavigationHistory: true,
    sessionId: SESSION_ID,
    enableSignalHandlers: true
  });

  await crawler.visit('https://example.com');
  await crawler.done();
  await history.updateSessionStatus(SESSION_ID, 'completed');
}

await history.close();
```

When enableSignalHandlers: true, the crawler automatically saves session state on SIGINT / SIGTERM, sets the session status to paused, and exits cleanly.
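If you prefer to manage shutdown yourself instead of relying on enableSignalHandlers, an equivalent hand-rolled pattern might look like the sketch below. The `SessionStore` interface, `shutdown`, and `installSignalHandlers` are hypothetical names for illustration; only updateSessionStatus() and close() correspond to the documented NavigationHistory API:

```typescript
// Hand-rolled equivalent of enableSignalHandlers: on SIGINT/SIGTERM, mark
// the session paused and close the store before exiting. The pause logic
// lives in a plain async function so it is testable without a real signal.
type Status = 'running' | 'paused' | 'completed' | 'failed';

interface SessionStore {
  updateSessionStatus(id: string, status: Status): Promise<void>;
  close(): Promise<void>;
}

async function shutdown(store: SessionStore, sessionId: string): Promise<void> {
  await store.updateSessionStatus(sessionId, 'paused');
  await store.close();
}

function installSignalHandlers(store: SessionStore, sessionId: string): void {
  for (const sig of ['SIGINT', 'SIGTERM'] as const) {
    process.once(sig, () => {
      void shutdown(store, sessionId).finally(() => process.exit(0));
    });
  }
}
```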