Email Discovery
The Scraper class provides automated email extraction from web pages. It parses HTML content for email addresses, follows links to contact-related pages, handles Linktree profile pages, and supports recursive crawling with depth control. Discovered emails are emitted through event handlers with metadata about where they were found.
import { Crawler } from 'rezo/crawler';
Basic Usage
Email discovery is configured through the crawler’s event handlers:
const crawler = new Crawler({
  baseUrl: 'https://example.com',
  concurrency: 20,
  scraperConcurrency: 10
});
// Called for each email address found
crawler.onEmailDiscovered(async function (event) {
  console.log(`Found: ${event.email} at ${event.discoveredAt}`);
});
// Called with all emails found on a single page (batched)
crawler.onEmailLeads(async function (leads) {
  console.log(`Found ${leads.length} emails on this page`);
  for (const lead of leads) {
    await saveToDatabase(lead);
  }
});
await crawler.visit('https://example.com');
await crawler.done();
EmailDiscoveryEvent
Each discovered email is wrapped in an event object:
interface EmailDiscoveryEvent<T = Record<string, any>> {
  /** The email address found */
  email: string;
  /** URL where the email was discovered */
  discoveredAt: string;
  /** Timestamp of discovery */
  timestamp: Date;
  /** Custom metadata (passed via crawler configuration) */
  metadata: T;
}
Email Extraction
How It Works
The scraper automatically extracts and validates email addresses from page content:
- Strips HTML tags and non-text content
- Detects and resolves mailto: links
- Extracts email candidates from the cleaned text
- Validates each candidate against standard email format rules
- Deduplicates results to avoid reporting the same address twice
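The steps above can be sketched roughly as follows. This is a minimal illustration, not the library's internal implementation: the names extractEmails and isValidEmail, the regular expressions, and the validation rules are all assumptions chosen for the example.

```typescript
// Hypothetical sketch of the extraction steps; not the library's actual code.
const EMAIL_CANDIDATE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Assumed validation rules: common practical length limits and dot placement.
function isValidEmail(candidate: string): boolean {
  const [local, domain] = candidate.split('@');
  if (!local || !domain) return false;
  if (local.length > 64 || candidate.length > 254) return false;
  if (local.startsWith('.') || local.endsWith('.')) return false;
  if (candidate.includes('..')) return false;
  return domain
    .split('.')
    .every(label => label.length > 0 && !label.startsWith('-') && !label.endsWith('-'));
}

function extractEmails(html: string): string[] {
  const sources: string[] = [];

  // Collect mailto: targets first, because they live inside tag attributes
  // and would be lost once the tags are stripped.
  const mailto = /href\s*=\s*["']mailto:([^"'?]+)/gi;
  let match: RegExpExecArray | null;
  while ((match = mailto.exec(html)) !== null) {
    sources.push(match[1]);
  }

  // Drop non-text content (scripts, styles), then strip the remaining tags.
  const text = html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    .replace(/<[^>]+>/g, ' ');
  sources.push(text);

  // Extract candidates, validate each one, and deduplicate case-insensitively.
  const unique = new Set<string>();
  for (const source of sources) {
    for (const candidate of source.match(EMAIL_CANDIDATE) ?? []) {
      const email = candidate.toLowerCase();
      if (isValidEmail(email)) unique.add(email);
    }
  }
  return Array.from(unique);
}
```

Regex-based tag stripping is only adequate for a sketch; a production extractor would more likely work from a parsed DOM.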
Recursive Discovery
When the crawler visits a page, the email-discovery layer also follows links a few hops deep so it can find emails on contact pages and related profiles. Link following is bounded by the crawler’s maxDepth and proceeds one level at a time:
- Depth 0 — Only extract emails from the starting URL
- Depth 1 — Follow links on the starting page, extract from those pages
- Depth 2 (typical) — Follow links two levels deep
At each level, the scraper processes the links found on the current page and visits each of them with the remaining depth reduced by one (depth - 1).
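To make the depth semantics concrete, here is a small sketch of depth-bounded discovery. It is an illustration under stated assumptions, not the library's code: PageStub and discoverEmails are hypothetical names, pages come from an in-memory map instead of HTTP requests, and the traversal is written as a level-by-level loop rather than recursion so that each page is reached at its shallowest depth.

```typescript
// Hypothetical stand-in for a fetched page; the real crawler fetches over HTTP.
interface PageStub {
  emails: string[];
  links: string[];
}

// Returns a map of email -> URL where it was first seen.
function discoverEmails(
  startUrl: string,
  maxDepth: number,
  site: Map<string, PageStub>
): Map<string, string> {
  const found = new Map<string, string>();
  const visited = new Set<string>([startUrl]);
  let frontier: string[] = [startUrl];

  // Depth 0 processes only the start page; each further level follows
  // the links collected from the previous level.
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      const page = site.get(url);
      if (!page) continue; // unknown page: nothing to extract
      for (const email of page.emails) {
        if (!found.has(email)) found.set(email, url);
      }
      for (const link of page.links) {
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return found;
}
```

The visited set keeps pages that link back to each other from being processed twice.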
Linktree Profile Parsing
The scraper includes specialized handling for Linktree pages. When it encounters a Linktree URL, it parses the profile structure to extract linked pages and recursively discovers emails from each linked site:
// The scraper detects Linktree URLs automatically
// https://linktr.ee/username -> parses all links on the profile
// Each linked page is crawled for email addresses
Restricted Domains
The scraper maintains a list of domains that should not be crawled for emails. These include social media platforms, search engines, and other non-relevant sites:
// Built-in restricted domains include:
// facebook.com, twitter.com, instagram.com, linkedin.com,
// google.com, youtube.com, github.com, wikipedia.org,
// apple.com, microsoft.com, amazon.com, etc.
Links to restricted domains are skipped during recursive crawling to avoid wasting resources on sites unlikely to contain contact emails.
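A restricted-domain check of this kind might look like the sketch below. The names RESTRICTED_DOMAINS and isRestricted are hypothetical, and the list is only the subset quoted above. Matching subdomains (for example www.facebook.com) and skipping unparseable URLs are assumptions of this example, not documented behavior.

```typescript
// Illustrative subset only; the library's full built-in list is not reproduced here.
const RESTRICTED_DOMAINS: string[] = [
  'facebook.com', 'twitter.com', 'instagram.com', 'linkedin.com',
  'google.com', 'youtube.com', 'github.com', 'wikipedia.org',
  'apple.com', 'microsoft.com', 'amazon.com'
];

function isRestricted(url: string): boolean {
  let hostname: string;
  try {
    hostname = new URL(url).hostname.toLowerCase();
  } catch {
    // Assumption: anything that is not a valid absolute URL is skipped.
    return true;
  }
  // Match the domain itself or any subdomain of it, but not lookalikes
  // such as notfacebook.com.
  return RESTRICTED_DOMAINS.some(
    domain => hostname === domain || hostname.endsWith('.' + domain)
  );
}
```

Comparing against '.' + domain rather than a bare suffix is what prevents an unrelated host that merely ends in the same characters from being skipped.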
Forbidden Protocols
The scraper skips URLs with non-HTTP protocols:
// Skipped protocols:
// mailto:, tel:, javascript:, data:, sms:, ftp:,
// file:, irc:, blob:, chrome:, about:, intent:
Contact-Related Keyword Prioritization
When following links for email discovery, the scraper prioritizes pages likely to contain contact information. Links with contact-related keywords in their URL or anchor text are processed first:
/contact, /about, /team, /staff, /people, /support, /help
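One way to picture the prioritization is as a stable partition of the link list, sketched below. LinkCandidate, CONTACT_KEYWORDS, isContactRelated, and prioritizeLinks are hypothetical names, and plain substring matching is an assumption made for brevity; the library's actual matching rules may differ.

```typescript
// Hypothetical shape for a link found on a page.
interface LinkCandidate {
  url: string;
  text: string; // anchor text
}

// Keywords taken from the list above, without the leading slash so they
// can also match anchor text.
const CONTACT_KEYWORDS: string[] = [
  'contact', 'about', 'team', 'staff', 'people', 'support', 'help'
];

function isContactRelated(link: LinkCandidate): boolean {
  const haystack = (link.url + ' ' + link.text).toLowerCase();
  return CONTACT_KEYWORDS.some(keyword => haystack.includes(keyword));
}

// Contact-related links come first; original order is kept within each group.
function prioritizeLinks(links: LinkCandidate[]): LinkCandidate[] {
  const priority = links.filter(isContactRelated);
  const rest = links.filter(link => !isContactRelated(link));
  return [...priority, ...rest];
}
```

Substring matching is deliberately loose: a word like "helpful" or a hostname containing "support" would also count as a match, so a stricter implementation might compare path segments or whole words instead.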
Integration with Crawler
The scraper runs on a separate queue from the main crawler to prevent email extraction from blocking page crawling:
const crawler = new Crawler({
  baseUrl: 'https://example.com',
  concurrency: 30,        // Main crawler concurrency
  scraperConcurrency: 10  // Separate scraper queue concurrency
});
// Email handlers are invoked from the scraper queue
crawler.onEmailDiscovered(async function (event) {
  // This runs on the scraper queue, not the main queue
  await db.insert('emails', {
    email: event.email,
    source: event.discoveredAt,
    foundAt: event.timestamp
  });
});
crawler.onEmailLeads(async function (leads) {
  // Batch handler - receives all emails from a single page
  if (leads.length > 0) {
    await db.insertMany('emails', leads.map(l => ({
      email: l.email,
      source: l.discoveredAt
    })));
  }
});
await crawler.visit('https://example.com');
await crawler.done();
Deduplication
The scraper uses a CappedSet with a capacity of 10,000 to track discovered emails. Each email is checked against this set before being emitted, so an address is not reported twice while it remains in the set:
// Internal deduplication:
// 1. Email "user@example.com" found on page A -> emitted
// 2. Same email found on page B -> skipped (already in CappedSet)
// 3. After 10,000 unique emails, oldest entries are evicted
// (LRU eviction at 10% batches)
Complete Example
import { Crawler } from 'rezo/crawler';
import { RezoStealth } from 'rezo';
const crawler = new Crawler({
  baseUrl: 'https://directory.example.com',
  concurrency: 20,
  scraperConcurrency: 5,
  maxDepth: 3,
  timeout: 15000,
  stealth: RezoStealth.chrome(),
  enableCache: true,
  respectRobotsTxt: true
});
const allEmails = new Map<string, { email: string; source: string; foundAt: Date }>();
crawler.onEmailDiscovered(async function (event) {
  if (!allEmails.has(event.email)) {
    allEmails.set(event.email, {
      email: event.email,
      source: event.discoveredAt,
      foundAt: event.timestamp
    });
  }
});
crawler.onDocument(async function (document, response) {
  // The crawler automatically extracts emails from document content
  console.log(`Processed: ${response.url}`);
});
await crawler.visit('https://directory.example.com');
await crawler.done();
console.log(`Discovered ${allEmails.size} unique email addresses`);