Email Discovery
The Scraper class provides automated email extraction from web pages. It parses HTML content for email addresses, follows links to contact-related pages, handles Linktree profile pages, and supports recursive crawling with depth control. Discovered emails are emitted through event handlers with metadata about where they were found.
```ts
import { Crawler } from 'rezo/crawler';
```

Basic Usage
Email discovery is configured through the crawler’s event handlers:
```ts
const crawler = new Crawler({
  baseUrl: 'https://example.com',
  concurrency: 20,
  scraperConcurrency: 10
});

// Called for each email address found
crawler.onEmailDiscovered(async function (event) {
  console.log(`Found: ${event.email} at ${event.discoveredAt}`);
});

// Called with all emails found on a single page (batched)
crawler.onEmails(async function (leads) {
  console.log(`Found ${leads.length} emails on this page`);
  for (const lead of leads) {
    await saveToDatabase(lead);
  }
});

await crawler.visit('https://example.com');
await crawler.done();
```

EmailDiscoveryEvent
Each discovered email is wrapped in an event object:
```ts
interface EmailDiscoveryEvent<T = Record<string, any>> {
  /** The email address found */
  email: string;
  /** URL where the email was discovered */
  discoveredAt: string;
  /** Timestamp of discovery */
  timestamp: Date;
  /** Custom metadata (passed via crawler configuration) */
  metadata: T;
}
```

Email Extraction
How It Works
The scraper automatically extracts and validates email addresses from page content:
- Strips HTML tags and non-text content
- Detects and resolves `mailto:` links
- Extracts email candidates from the cleaned text
- Validates each candidate against standard email format rules
- Deduplicates results to avoid reporting the same address twice
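The steps above can be sketched roughly as follows. This is an illustrative reimplementation under stated assumptions, not the library's actual internals; the `extractEmails` name and the regexes are placeholders:

```ts
// Sketch of the extraction pipeline described above (illustrative only).
function extractEmails(html: string): string[] {
  // Resolve mailto: links first, since tag stripping would discard them
  const mailtos = [...html.matchAll(/href=["']mailto:([^"'?]+)/gi)].map(m => m[1]);
  // Strip script/style bodies and HTML tags, leaving plain text
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
    .replace(/<[^>]+>/g, ' ');
  // Extract candidates from the cleaned text
  const candidates = text.match(/[\w.+-]+@[\w-]+\.[\w.-]+/g) ?? [];
  // Validate against a standard email shape, then deduplicate with a Set
  const valid = /^[\w.+-]+@[\w-]+(\.[\w-]+)+$/;
  return [...new Set([...mailtos, ...candidates].filter(e => valid.test(e)))];
}
```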
parseExternalWebsite()
The core method for recursive email discovery from external sites. It follows links, parses pages, and extracts emails with intelligent depth control:
```ts
const scraper = new Scraper(http, crawlerOptions, onEmailLeads, onEmailDiscovered);

const emails = await scraper.parseExternalWebsite(
  'https://example.com',   // Target URL
  'GET',                   // HTTP method
  null,                    // Request body
  {
    getCache: async (key) => cache.get(key),
    saveCache: async (key, value) => cache.set(key, value),
    hasUrlInCache: async (url) => store.has(url),
    saveUrl: async (url) => store.set(url),
    onEmailDiscovered: [handler1, handler2],
    onEmails: [batchHandler],
    queue: scraperQueue,
    depth: 2,                       // Max link-following depth
    allowCrossDomainTravel: false   // Stay on the same domain
  }
);
```

Depth Control
The scraper limits how many link hops it follows from the starting page:
- Depth 0 — Only extract emails from the starting URL
- Depth 1 — Follow links on the starting page, extract from those pages
- Depth 2 (typical) — Follow links two levels deep
Each level processes links found on the current page and recursively visits them at depth - 1.
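The depth rule can be illustrated with a small self-contained sketch. The in-memory `site` map and the `crawlForEmails` helper are hypothetical stand-ins for real fetching and link extraction:

```ts
// Tiny in-memory "site" so the sketch runs without a network.
const site: Record<string, { emails: string[]; links: string[] }> = {
  '/':        { emails: [],               links: ['/contact'] },
  '/contact': { emails: ['hi@acme.io'],   links: ['/team'] },
  '/team':    { emails: ['team@acme.io'], links: [] },
};

function crawlForEmails(url: string, depth: number, seen = new Set<string>()): string[] {
  if (seen.has(url) || !(url in site)) return [];
  seen.add(url);
  const page = site[url];
  const emails = [...page.emails];   // depth 0: extract from this page only
  if (depth > 0) {
    for (const link of page.links) {
      // Each hop consumes one level of depth
      emails.push(...crawlForEmails(link, depth - 1, seen));
    }
  }
  return [...new Set(emails)];
}
```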
Linktree Profile Parsing
The scraper includes specialized handling for Linktree pages. When it encounters a Linktree URL, it parses the profile structure to extract linked pages and recursively discovers emails from each linked site:
```ts
// The scraper detects Linktree URLs automatically
// https://linktr.ee/username -> parses all links on the profile
// Each linked page is crawled for email addresses
```

Restricted Domains
The scraper maintains a list of domains that should not be crawled for emails. These include social media platforms, search engines, and other non-relevant sites:
```ts
// Built-in restricted domains include:
// facebook.com, twitter.com, instagram.com, linkedin.com,
// google.com, youtube.com, github.com, wikipedia.org,
// apple.com, microsoft.com, amazon.com, etc.
```

Links to restricted domains are skipped during recursive crawling to avoid wasting resources on sites unlikely to contain contact emails.
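A minimal sketch of such a check, assuming a hostname-suffix match so subdomains are caught too. The `isRestricted` helper and the partial domain list are illustrative, not the library's actual API:

```ts
// Partial list for illustration; the built-in set is larger.
const RESTRICTED = new Set([
  'facebook.com', 'twitter.com', 'instagram.com', 'linkedin.com',
  'google.com', 'youtube.com', 'github.com', 'wikipedia.org',
]);

function isRestricted(url: string): boolean {
  const host = new URL(url).hostname;
  // Exact match or subdomain match (e.g. www.facebook.com)
  return [...RESTRICTED].some(d => host === d || host.endsWith('.' + d));
}
```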
Forbidden Protocols
The scraper skips URLs with non-HTTP protocols:
```ts
// Skipped protocols:
// mailto:, tel:, javascript:, data:, sms:, ftp:,
// file:, irc:, blob:, chrome:, about:, intent:
```

Contact-Related Keyword Prioritization
When following links for email discovery, the scraper prioritizes pages likely to contain contact information. Links with contact-related keywords in their URL or anchor text are processed first:
`/contact`, `/about`, `/team`, `/staff`, `/people`, `/support`, `/help`
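One plausible way to implement this ordering is a stable sort on a keyword score; `prioritizeLinks` and `CONTACT_KEYWORDS` here are illustrative assumptions, not the scraper's actual internals:

```ts
const CONTACT_KEYWORDS = ['contact', 'about', 'team', 'staff', 'people', 'support', 'help'];

interface Link { href: string; text: string }

function prioritizeLinks(links: Link[]): Link[] {
  // Score 0 for contact-related links, 1 otherwise
  const score = (l: Link) =>
    CONTACT_KEYWORDS.some(k =>
      l.href.toLowerCase().includes(k) || l.text.toLowerCase().includes(k)) ? 0 : 1;
  // Stable sort: contact-related links first, original order preserved otherwise
  return [...links].sort((a, b) => score(a) - score(b));
}
```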
Integration with Crawler
The scraper runs on a separate queue from the main crawler to prevent email extraction from blocking page crawling:
```ts
const crawler = new Crawler({
  baseUrl: 'https://example.com',
  concurrency: 30,        // Main crawler concurrency
  scraperConcurrency: 10  // Separate scraper queue concurrency
});

// Email handlers are invoked from the scraper queue
crawler.onEmailDiscovered(async function (event) {
  // This runs on the scraper queue, not the main queue
  await db.insert('emails', {
    email: event.email,
    source: event.discoveredAt,
    foundAt: event.timestamp
  });
});

crawler.onEmails(async function (leads) {
  // Batch handler - receives all emails from a single page
  if (leads.length > 0) {
    await db.insertMany('emails', leads.map(l => ({
      email: l.email,
      source: l.discoveredAt
    })));
  }
});

await crawler.visit('https://example.com');
await crawler.done();
```

Deduplication
The scraper uses a CappedSet with a capacity of 10,000 to track discovered emails. Each email is checked against this set before being emitted, ensuring no duplicate events:
```ts
// Internal deduplication:
// 1. Email "user@example.com" found on page A -> emitted
// 2. Same email found on page B -> skipped (already in CappedSet)
// 3. After 10,000 unique emails, the oldest entries are evicted
//    (LRU-style eviction in batches of 10% of capacity)
```

Complete Example
```ts
import { Crawler } from 'rezo/crawler';
import { RezoStealth } from 'rezo/stealth';

const crawler = new Crawler({
  baseUrl: 'https://directory.example.com',
  concurrency: 20,
  scraperConcurrency: 5,
  maxDepth: 3,
  timeout: 15000,
  stealth: RezoStealth.chrome(),
  enableCache: true,
  respectRobotsTxt: true
});

const allEmails = new Map<string, { email: string; source: string; foundAt: Date }>();

crawler.onEmailDiscovered(async function (event) {
  if (!allEmails.has(event.email)) {
    allEmails.set(event.email, {
      email: event.email,
      source: event.discoveredAt,
      foundAt: event.timestamp
    });
  }
});

crawler.onDocument(async function (document, response) {
  // The crawler automatically extracts emails from document content
  console.log(`Processed: ${response.url}`);
});

await crawler.visit('https://directory.example.com');
await crawler.done();

console.log(`Discovered ${allEmails.size} unique email addresses`);
```
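The capped deduplication behaviour described under Deduplication could be sketched as below. `CappedSetSketch` is a hypothetical stand-in for the library's `CappedSet`, assuming insertion-order batch eviction as described:

```ts
class CappedSetSketch {
  private items = new Set<string>();
  constructor(private capacity: number) {}

  /** Returns true if the value was new (and should be emitted). */
  addIfNew(value: string): boolean {
    if (this.items.has(value)) return false;
    if (this.items.size >= this.capacity) {
      // Evict the oldest ~10% in one batch (a Set preserves insertion order)
      const evict = Math.max(1, Math.floor(this.capacity / 10));
      for (const old of [...this.items].slice(0, evict)) this.items.delete(old);
    }
    this.items.add(value);
    return true;
  }
}
```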