Crawler

Email Discovery

The Scraper class provides automated email extraction from web pages. It parses HTML content for email addresses, follows links to contact-related pages, handles Linktree profile pages, and supports recursive crawling with depth control. Discovered emails are emitted through event handlers with metadata about where they were found.

import { Crawler } from 'rezo/crawler';

Basic Usage

Email discovery is configured through the crawler’s event handlers:

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  concurrency: 20,
  scraperConcurrency: 10
});

// Called for each email address found
crawler.onEmailDiscovered(async function (event) {
  console.log(`Found: ${event.email} at ${event.discoveredAt}`);
});

// Called with all emails found on a single page (batched)
crawler.onEmails(async function (leads) {
  console.log(`Found ${leads.length} emails on this page`);
  for (const lead of leads) {
    await saveToDatabase(lead);
  }
});

await crawler.visit('https://example.com');
await crawler.done();

EmailDiscoveryEvent

Each discovered email is wrapped in an event object:

interface EmailDiscoveryEvent<T = Record<string, any>> {
  /** The email address found */
  email: string;

  /** URL where the email was discovered */
  discoveredAt: string;

  /** Timestamp of discovery */
  timestamp: Date;

  /** Custom metadata (passed via crawler configuration) */
  metadata: T;
}

Email Extraction

How It Works

The scraper automatically extracts and validates email addresses from page content:

  1. Strips HTML tags and non-text content
  2. Detects and resolves mailto: links
  3. Extracts email candidates from the cleaned text
  4. Validates each candidate against standard email format rules
  5. Deduplicates results to avoid reporting the same address twice
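The pipeline above can be sketched roughly as follows. This is an illustrative implementation, not the library's actual code; `extractEmails` and the regex are assumptions for demonstration:

```typescript
// Illustrative sketch of the extraction pipeline described above.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function extractEmails(html: string): string[] {
  // 1. Strip HTML tags, leaving only text content
  const text = html.replace(/<[^>]*>/g, ' ');
  // 2. Detect and resolve mailto: links from the raw markup
  const mailtos = [...html.matchAll(/mailto:([^"'?\s>]+)/g)].map(m => m[1]);
  // 3-4. Extract candidates from the cleaned text and validate format
  const candidates = [...(text.match(EMAIL_RE) ?? []), ...mailtos];
  // 5. Deduplicate (case-insensitive)
  return [...new Set(candidates.map(e => e.toLowerCase()))];
}
```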

parseExternalWebsite()

The core method for recursive email discovery from external sites. It follows links, parses pages, and extracts emails with configurable depth control:

const scraper = new Scraper(http, crawlerOptions, onEmailLeads, onEmailDiscovered);

const emails = await scraper.parseExternalWebsite(
  'https://example.com',  // Target URL
  'GET',                  // HTTP method
  null,                   // Request body
  {
    getCache: async (key) => cache.get(key),
    saveCache: async (key, value) => cache.set(key, value),
    hasUrlInCache: async (url) => store.has(url),
    saveUrl: async (url) => store.set(url),
    onEmailDiscovered: [handler1, handler2],
    onEmails: [batchHandler],
    queue: scraperQueue,
    depth: 2,                      // Max link-following depth
    allowCrossDomainTravel: false   // Stay on the same domain
  }
);

Depth Control

The scraper limits how many link hops it follows from the starting page:

  • Depth 0 — Only extract emails from the starting URL
  • Depth 1 — Follow links on the starting page, extract from those pages
  • Depth 2 (typical) — Follow links two levels deep

Each level processes links found on the current page and recursively visits them at depth - 1.
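The recursion can be sketched as below. This is a simplified model of the behavior described above, not the library's internals; `fetchLinks` and `extract` are hypothetical callbacks standing in for the real page-fetching logic:

```typescript
// Simplified sketch of depth-limited recursive email discovery.
async function crawlForEmails(
  url: string,
  depth: number,
  fetchLinks: (u: string) => Promise<string[]>,
  extract: (u: string) => Promise<string[]>,
  seen = new Set<string>()
): Promise<string[]> {
  if (seen.has(url)) return [];
  seen.add(url);
  const emails = await extract(url);   // always extract from the current page
  if (depth <= 0) return emails;       // depth 0: stop after the starting URL
  for (const link of await fetchLinks(url)) {
    // each link is visited at depth - 1
    emails.push(...await crawlForEmails(link, depth - 1, fetchLinks, extract, seen));
  }
  return emails;
}
```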

Linktree Profile Parsing

The scraper includes specialized handling for Linktree pages. When it encounters a Linktree URL, it parses the profile structure to extract linked pages and recursively discovers emails from each linked site:

// The scraper detects Linktree URLs automatically
// https://linktr.ee/username -> parses all links on the profile
// Each linked page is crawled for email addresses
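Detection of a Linktree profile amounts to a hostname check, which might look like this (an illustrative helper, not the library's API):

```typescript
// Hypothetical detection helper for Linktree profile URLs.
function isLinktreeProfile(url: string): boolean {
  const { hostname } = new URL(url);
  return hostname === 'linktr.ee' || hostname === 'www.linktr.ee';
}
```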

Restricted Domains

The scraper maintains a list of domains that should not be crawled for emails. These include social media platforms, search engines, and other non-relevant sites:

// Built-in restricted domains include:
// facebook.com, twitter.com, instagram.com, linkedin.com,
// google.com, youtube.com, github.com, wikipedia.org,
// apple.com, microsoft.com, amazon.com, etc.

Links to restricted domains are skipped during recursive crawling to avoid wasting resources on sites unlikely to contain contact emails.
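A domain check of this kind typically matches both the domain itself and its subdomains. A minimal sketch, assuming a small sample of the built-in list (the real list is maintained internally and is longer):

```typescript
// Illustrative restricted-domain check; the sample set is an assumption.
const RESTRICTED = new Set(['facebook.com', 'twitter.com', 'google.com']);

function isRestricted(url: string): boolean {
  const host = new URL(url).hostname.replace(/^www\./, '');
  // match the domain exactly or any subdomain of a restricted entry
  return [...RESTRICTED].some(d => host === d || host.endsWith('.' + d));
}
```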

Forbidden Protocols

The scraper skips URLs with non-HTTP protocols:

// Skipped protocols:
// mailto:, tel:, javascript:, data:, sms:, ftp:,
// file:, irc:, blob:, chrome:, about:, intent:
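Filtering by protocol can be done by resolving each href against the page URL and checking the resulting scheme. A sketch (the helper name `isCrawlable` is illustrative):

```typescript
// Illustrative protocol filter using the protocols listed above.
const FORBIDDEN_PROTOCOLS = new Set([
  'mailto:', 'tel:', 'javascript:', 'data:', 'sms:', 'ftp:',
  'file:', 'irc:', 'blob:', 'chrome:', 'about:', 'intent:'
]);

function isCrawlable(href: string, baseUrl: string): boolean {
  try {
    // resolve relative hrefs against the current page URL
    const { protocol } = new URL(href, baseUrl);
    return !FORBIDDEN_PROTOCOLS.has(protocol);
  } catch {
    return false; // unparseable hrefs are skipped
  }
}
```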

Contact Page Prioritization

When following links for email discovery, the scraper prioritizes pages likely to contain contact information. Links with contact-related keywords in their URL or anchor text are processed first:

  • /contact
  • /about
  • /team
  • /staff
  • /people
  • /support
  • /help
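A keyword-based priority sort over discovered links might look like the following sketch (the function name is illustrative; keywords are taken from the list above):

```typescript
// Illustrative priority sort: contact-related links first, others keep order.
const CONTACT_KEYWORDS = ['contact', 'about', 'team', 'staff', 'people', 'support', 'help'];

function prioritizeLinks(links: string[]): string[] {
  const score = (u: string) =>
    CONTACT_KEYWORDS.some(k => u.toLowerCase().includes(k)) ? 0 : 1;
  // Array.prototype.sort is stable, so non-contact links keep discovery order
  return [...links].sort((a, b) => score(a) - score(b));
}
```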

Integration with Crawler

The scraper runs on a separate queue from the main crawler to prevent email extraction from blocking page crawling:

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  concurrency: 30,           // Main crawler concurrency
  scraperConcurrency: 10     // Separate scraper queue concurrency
});

// Email handlers are invoked from the scraper queue
crawler.onEmailDiscovered(async function (event) {
  // This runs on the scraper queue, not the main queue
  await db.insert('emails', {
    email: event.email,
    source: event.discoveredAt,
    foundAt: event.timestamp
  });
});

crawler.onEmails(async function (leads) {
  // Batch handler - receives all emails from a single page
  if (leads.length > 0) {
    await db.insertMany('emails', leads.map(l => ({
      email: l.email,
      source: l.discoveredAt
    })));
  }
});

await crawler.visit('https://example.com');
await crawler.done();

Deduplication

The scraper uses a CappedSet with a capacity of 10,000 to track discovered emails. Each email is checked against this set before being emitted, so the same address is not reported twice while it remains within the set's capacity:

// Internal deduplication:
// 1. Email "user@example.com" found on page A -> emitted
// 2. Same email found on page B -> skipped (already in CappedSet)
// 3. After 10,000 unique emails, the oldest entries are evicted
//    in batches of 10% of capacity
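A data structure with that behavior can be approximated as follows. This is a sketch of the described semantics, not the library's actual CappedSet implementation; it relies on JavaScript's `Set` preserving insertion order:

```typescript
// Illustrative CappedSet: insertion-ordered set that evicts the oldest
// 10% of entries in one batch when capacity is reached.
class CappedSet<T> {
  private items = new Set<T>();
  constructor(private capacity = 10_000) {}

  /** Returns true if the value was newly added, false if already present. */
  add(value: T): boolean {
    if (this.items.has(value)) return false;
    if (this.items.size >= this.capacity) {
      // evict the oldest 10% in a single batch (Set iterates in insertion order)
      const evictCount = Math.ceil(this.capacity * 0.1);
      const it = this.items.values();
      for (let i = 0; i < evictCount; i++) this.items.delete(it.next().value as T);
    }
    this.items.add(value);
    return true;
  }

  has(value: T): boolean { return this.items.has(value); }
  get size(): number { return this.items.size; }
}
```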

Complete Example

import { Crawler } from 'rezo/crawler';
import { RezoStealth } from 'rezo/stealth';

const crawler = new Crawler({
  baseUrl: 'https://directory.example.com',
  concurrency: 20,
  scraperConcurrency: 5,
  maxDepth: 3,
  timeout: 15000,
  stealth: RezoStealth.chrome(),
  enableCache: true,
  respectRobotsTxt: true
});

const allEmails = new Map<string, { email: string; source: string; foundAt: Date }>();

crawler.onEmailDiscovered(async function (event) {
  if (!allEmails.has(event.email)) {
    allEmails.set(event.email, {
      email: event.email,
      source: event.discoveredAt,
      foundAt: event.timestamp
    });
  }
});

crawler.onDocument(async function (document, response) {
  // The crawler automatically extracts emails from document content
  console.log(`Processed: ${response.url}`);
});

await crawler.visit('https://directory.example.com');
await crawler.done();

console.log(`Discovered ${allEmails.size} unique email addresses`);