Crawler

Resumable Crawling

The NavigationHistory class persists crawl state to SQLite, enabling sessions to be paused, interrupted, and resumed without losing progress. It tracks three things: sessions (metadata), the URL queue (pending work), and visited URLs (completed work).

import { NavigationHistory } from 'rezo/crawler';

Architecture

NavigationHistory uses a single SQLite database with three tables:

navigation.db
  ├── sessions   -- Session metadata (ID, status, counters)
  ├── queue      -- Pending URLs with priority and metadata
  └── visited    -- Completed URLs with status codes and content types

The database uses WAL journal mode, and all tables are indexed for efficient lookups by session ID.

Creating a NavigationHistory

const history = await NavigationHistory.create({
  storeDir: '/tmp/my-crawler/navigation',  // Database directory
  dbFileName: 'navigation.db',             // Database filename
  hashUrls: false                          // Hash URLs with SHA-256 (default: false)
});

When hashUrls: true, every URL is hashed with SHA-256 before storage, which keeps raw URLs out of the database and gives every key a fixed size.
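
For reference, SHA-256 hashing like this can be reproduced with Node's built-in crypto module. A minimal sketch; the hex encoding of the digest is an assumption about the internal key format, which this section does not document:

```typescript
import { createHash } from 'node:crypto';

// Hash a URL to a fixed-size key, as hashUrls: true might do internally.
// Hex encoding is assumed here purely for illustration.
function hashUrl(url: string): string {
  return createHash('sha256').update(url).digest('hex');
}

const key = hashUrl('https://example.com/page1');
// key is always 64 hex characters, regardless of URL length
```

Whatever the exact encoding, every stored key has the same size, and the original URL cannot be recovered from the database.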

Session Management

Creating a Session

const session = await history.createSession(
  'my-crawl-2024',           // Unique session ID
  'https://example.com',     // Base URL
  { version: 2, tags: ['production'] }  // Optional metadata
);

The CrawlSession interface:

interface CrawlSession {
  sessionId: string;
  baseUrl: string;
  startedAt: number;
  lastActivityAt: number;
  status: 'running' | 'paused' | 'completed' | 'failed';
  urlsVisited: number;
  urlsQueued: number;
  urlsFailed: number;
  metadata?: string;   // JSON-serialized metadata
}

Retrieving a Session

const session = await history.getSession('my-crawl-2024');
if (session) {
  console.log(`Status: ${session.status}`);
  console.log(`Visited: ${session.urlsVisited}`);
  console.log(`Queued: ${session.urlsQueued}`);
}

Updating Session Status

await history.updateSessionStatus('my-crawl-2024', 'paused');
await history.updateSessionStatus('my-crawl-2024', 'completed');
await history.updateSessionStatus('my-crawl-2024', 'failed');

Updating Session Statistics

await history.updateSessionStats('my-crawl-2024', {
  urlsVisited: 1500,
  urlsQueued: 350,
  urlsFailed: 12
});

Finding Resumable Sessions

Get all sessions that can be resumed (status is running or paused):

const resumable = await history.getResumableSessions();
for (const session of resumable) {
  console.log(`${session.sessionId}: ${session.status} (${session.urlsVisited} visited, ${session.urlsQueued} queued)`);
}

Queue Management

The queue stores pending URLs with priority, HTTP method, headers, body, and metadata.

Adding URLs to the Queue

const added = await history.addToQueue('my-crawl-2024', 'https://example.com/page1', {
  method: 'GET',            // HTTP method (default: 'GET')
  priority: 10,             // Higher = processed first (default: 0)
  body: null,               // Request body (for POST/PUT)
  headers: { 'Accept': 'text/html' },
  metadata: { depth: 2, source: 'https://example.com' }
});
// Returns false if URL is already queued or already visited

The addToQueue() method deduplicates automatically: it checks both the queue and visited tables before adding. It returns true if the URL was added, and false if the URL was already known.
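
The same rule is easy to model in memory. A simplified sketch of the dedup check (a hypothetical stand-in, not the library's SQLite-backed implementation):

```typescript
// Models addToQueue's dedup rule: a URL is accepted only if it is in
// neither the pending queue nor the visited set.
class DedupQueue {
  private queued = new Set<string>();
  private visited = new Set<string>();

  add(url: string): boolean {
    if (this.queued.has(url) || this.visited.has(url)) return false;
    this.queued.add(url);
    return true;
  }

  markVisited(url: string): void {
    this.queued.delete(url); // visiting a URL also removes it from the queue
    this.visited.add(url);
  }
}
```

Note that marking a URL visited also removes it from the queue, which matches the markVisited() behavior described later in this section.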

Getting the Next URL

Retrieves the highest-priority URL (FIFO within same priority):

const next = await history.getNextFromQueue('my-crawl-2024');
if (next) {
  console.log(next.url);       // 'https://example.com/page1'
  console.log(next.method);    // 'GET'
  console.log(next.priority);  // 10
  console.log(next.addedAt);   // Timestamp
}
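
The ordering rule corresponds to sorting by priority descending, then by addedAt ascending. A sketch of the comparator (a hypothetical helper, shown only to make the ordering concrete):

```typescript
interface PendingUrl {
  url: string;
  priority: number;
  addedAt: number;
}

// Returns the entry getNextFromQueue would pick: highest priority first,
// and FIFO (earliest addedAt) among entries with equal priority.
function nextUp(queue: PendingUrl[]): PendingUrl | undefined {
  return [...queue].sort(
    (a, b) => b.priority - a.priority || a.addedAt - b.addedAt
  )[0];
}
```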

The QueuedUrl interface:

interface QueuedUrl {
  url: string;
  method: string;
  priority: number;
  body?: string;
  headers?: string;    // JSON-serialized
  metadata?: string;   // JSON-serialized
  addedAt: number;
}
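
Since headers and metadata come back as JSON strings, callers must parse them before use. A small helper sketch; the parsed shapes are whatever was passed to addToQueue, so the types below are assumptions:

```typescript
// Parse the JSON-serialized fields of a QueuedUrl-shaped record.
function parseQueuedFields(item: { headers?: string; metadata?: string }): {
  headers?: Record<string, string>;
  metadata?: unknown;
} {
  return {
    headers: item.headers ? JSON.parse(item.headers) : undefined,
    metadata: item.metadata ? JSON.parse(item.metadata) : undefined,
  };
}
```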

Removing from Queue

await history.removeFromQueue('my-crawl-2024', 'https://example.com/page1');

Queue Inspection

const size = await history.getQueueSize('my-crawl-2024');
console.log(`${size} URLs remaining in queue`);

const allQueued = await history.getAllQueuedUrls('my-crawl-2024');
for (const item of allQueued) {
  console.log(`${item.priority}: ${item.url}`);
}

Visited URL Tracking

Marking URLs as Visited

When a URL is processed, mark it as visited. This also removes it from the queue:

await history.markVisited('my-crawl-2024', 'https://example.com/page1', {
  status: 200,
  finalUrl: 'https://example.com/page1/',  // After redirects
  contentType: 'text/html',
  errorMessage: undefined
});

The VisitedUrl interface:

interface VisitedUrl {
  url: string;
  status: number;
  visitedAt: number;
  finalUrl?: string;
  contentType?: string;
  errorMessage?: string;
}

Checking if Visited

const visited = await history.isVisited('my-crawl-2024', 'https://example.com/page1');

Counting and Inspecting

const visitedCount = await history.getVisitedCount('my-crawl-2024');

// Get all URLs that failed (status >= 400 or with error messages)
const failed = await history.getFailedUrls('my-crawl-2024');
for (const url of failed) {
  console.log(`${url.url}: ${url.status} ${url.errorMessage}`);
}
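
The failure criterion can be written as a one-line predicate. A sketch mirroring the documented rule (status >= 400 or a recorded error message), not the library's actual query:

```typescript
// Mirrors the documented getFailedUrls criterion.
function isFailed(v: { status: number; errorMessage?: string }): boolean {
  return v.status >= 400 || v.errorMessage !== undefined;
}
```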

Session Cleanup

// Clear just the queue
await history.clearQueue('my-crawl-2024');

// Clear just the visited list
await history.clearVisited('my-crawl-2024');

// Delete entire session (queue + visited + session record)
await history.deleteSession('my-crawl-2024');

Closing

await history.close();

Resume Pattern

The typical resume workflow:

import { Crawler, NavigationHistory } from 'rezo/crawler';

const SESSION_ID = 'product-crawl-v3';

const history = await NavigationHistory.create({
  storeDir: './crawl-state'
});

// Check for existing session
const existing = await history.getSession(SESSION_ID);

if (existing && (existing.status === 'running' || existing.status === 'paused')) {
  console.log(`Resuming session: ${existing.urlsVisited} visited, ${existing.urlsQueued} queued`);

  // Get remaining URLs from the queue
  const queued = await history.getAllQueuedUrls(SESSION_ID);

  await history.updateSessionStatus(SESSION_ID, 'running');

  const crawler = new Crawler({
    baseUrl: existing.baseUrl,
    enableNavigationHistory: true,
    sessionId: SESSION_ID,
    enableSignalHandlers: true
  });

  // Re-queue all pending URLs
  for (const item of queued) {
    await crawler.visit(item.url);
  }

  await crawler.done();
  await history.updateSessionStatus(SESSION_ID, 'completed');

} else {
  console.log('Starting fresh crawl');

  await history.createSession(SESSION_ID, 'https://example.com');

  const crawler = new Crawler({
    baseUrl: 'https://example.com',
    enableNavigationHistory: true,
    sessionId: SESSION_ID,
    enableSignalHandlers: true
  });

  await crawler.visit('https://example.com');
  await crawler.done();
  await history.updateSessionStatus(SESSION_ID, 'completed');
}

await history.close();

When enableSignalHandlers: true, the crawler automatically saves session state on SIGINT / SIGTERM, sets the session status to paused, and exits cleanly.
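
If you manage signals yourself rather than enabling the built-in handlers, the pause-on-interrupt behavior can be approximated as follows. The wiring below is illustrative; only updateSessionStatus is the real API from this section:

```typescript
// Illustrative manual equivalent of enableSignalHandlers: true.
// On SIGINT or SIGTERM, mark the session paused (so it appears in
// getResumableSessions()) and exit cleanly.
function registerPauseOnExit(
  pause: () => Promise<void>, // e.g. () => history.updateSessionStatus(ID, 'paused')
  exit: (code: number) => void = (code) => process.exit(code)
): void {
  const handler = async () => {
    await pause();
    exit(0);
  };
  process.once('SIGINT', handler);
  process.once('SIGTERM', handler);
}
```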