Crawler

Robots.txt

The RobotsTxt class parses robots.txt files and validates URLs against their rules. It supports User-agent, Allow, Disallow, Crawl-delay, and Sitemap directives with wildcard pattern matching and a 24-hour cache.

import { RobotsTxt } from 'rezo/crawler';

Basic Usage

const robots = new RobotsTxt({
  userAgent: 'RezoBot',     // Bot name to match against rules (default: 'RezoBot')
  cacheTTL: 86400000         // Cache TTL in ms (default: 24 hours)
});

// Fetch and parse robots.txt for a domain
await robots.fetch('https://example.com', async (url) => {
  const res = await rezo.get(url);
  return { status: res.status, data: res.data };
});

// Check if a URL is allowed
const allowed = robots.isAllowed('https://example.com/admin/config');
// false (if robots.txt disallows /admin/)

Parsing Rules

parse()

Parses robots.txt content directly and returns a RobotsTxtDirectives object:

const content = `
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /admin/public/
Crawl-delay: 2

User-agent: RezoBot
Disallow: /api/internal/
Allow: /api/public/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-products.xml
`;

const directives = robots.parse(content);

console.log(directives.rules);
// Only the RezoBot-specific block applies (it beats the * block):
// [
//   { path: '/api/internal/', allow: false },
//   { path: '/api/public/', allow: true }
// ]

console.log(directives.crawlDelay);
// undefined (the matched RezoBot block has no Crawl-delay directive)

console.log(directives.sitemaps);
// ['https://example.com/sitemap.xml', 'https://example.com/sitemap-products.xml']

Rule Resolution

The parser follows standard precedence:

  1. Specific user-agent match — If a User-agent block matches the bot name (case-insensitive partial match), those rules are used
  2. Wildcard fallback — If no specific match exists, User-agent: * rules apply
  3. Path specificity — Longer path patterns take priority over shorter ones
  4. Allow over Disallow — When two rules have the same path length, Allow takes precedence

// Given these rules:
// Disallow: /api/
// Allow: /api/public/

robots.isAllowed('https://example.com/api/users');    // false
robots.isAllowed('https://example.com/api/public/v1'); // true (longer Allow wins)
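
The precedence steps above can be sketched as a longest-match scan over the rules. This is an illustrative TypeScript sketch, not the rezo implementation: the `resolve` helper and `Rule` shape are assumptions, and wildcard handling is omitted so the precedence logic stands out.

```typescript
interface Rule {
  path: string;    // literal path prefix (wildcards omitted in this sketch)
  allow: boolean;  // true = Allow, false = Disallow
}

// Pick the most specific matching rule: longest path wins;
// on a tie, Allow beats Disallow; no match means allowed.
function resolve(rules: Rule[], urlPath: string): boolean {
  let best: Rule | undefined;
  for (const rule of rules) {
    if (!urlPath.startsWith(rule.path)) continue;
    if (!best || rule.path.length > best.path.length) {
      best = rule;
    } else if (rule.path.length === best.path.length && rule.allow) {
      best = rule; // Allow wins when path lengths are equal
    }
  }
  return best ? best.allow : true;
}

const rules: Rule[] = [
  { path: '/api/', allow: false },
  { path: '/api/public/', allow: true },
];

resolve(rules, '/api/users');     // false (only /api/ matches)
resolve(rules, '/api/public/v1'); // true  (longer Allow wins)
resolve(rules, '/about');         // true  (no matching rule)
```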

Wildcard Matching

Robots.txt patterns support * (match anything) and $ (end of URL):

const content = `
User-agent: *
Disallow: /*.pdf$
Disallow: /search?*q=
Allow: /
`;

const directives = robots.parse(content);

robots.isAllowed('https://example.com/docs/report.pdf', directives);
// false (matches /*.pdf$)

robots.isAllowed('https://example.com/docs/report.pdf?v=2', directives);
// true ($ means exact end, and this URL has query params)

robots.isAllowed('https://example.com/search?q=test', directives);
// false (matches /search?*q=)

robots.isAllowed('https://example.com/about', directives);
// true (matches Allow: /)
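
One way to implement these semantics is to translate each pattern into a regular expression: `*` becomes `.*`, a trailing `$` becomes an end anchor, and every other regex metacharacter is escaped. The `patternToRegExp` helper below is a hypothetical illustration of the matching rules, not the library's internals.

```typescript
// Convert a robots.txt path pattern into an anchored RegExp.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith('$');
  const body = (anchored ? pattern.slice(0, -1) : pattern)
    .split('*')                                              // wildcard segments
    .map(part => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&')) // escape regex chars
    .join('.*');                                             // '*' matches anything
  return new RegExp('^' + body + (anchored ? '$' : ''));
}

patternToRegExp('/*.pdf$').test('/docs/report.pdf');     // true
patternToRegExp('/*.pdf$').test('/docs/report.pdf?v=2'); // false ($ anchors the end)
patternToRegExp('/search?*q=').test('/search?q=test');   // true
```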

Fetching and Caching

fetch()

Fetches robots.txt from a domain, parses it, and caches the result:

const directives = await robots.fetch('https://example.com', async (url) => {
  const res = await rezo.get(url);
  return { status: res.status, data: res.data };
});

The fetcher function must return { status: number; data?: string }. If the fetch fails or returns a non-200 status, all URLs are allowed by default.

Caching Behavior

  • Results are cached per domain origin (e.g., https://example.com)
  • Default cache TTL is 24 hours (configurable via cacheTTL)
  • Subsequent calls to fetch() or isAllowed() for the same domain use the cached result
  • Failed fetches still cache an “allow all” result to avoid re-fetching

// Check cache status
const isCached = robots.isCached('https://example.com');

// Clear cache for one domain
robots.clearCache('https://example.com');

// Clear all cached robots.txt data
robots.clearCache();
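
The caching behavior above amounts to a per-origin map with expiry timestamps. The `OriginCache` class below is a hypothetical sketch of that idea, not rezo's internal data structure.

```typescript
interface CacheEntry<T> {
  value: T;
  expires: number; // epoch ms after which the entry is stale
}

// Cache keyed by URL origin, with entries past the TTL treated as absent.
class OriginCache<T> {
  private store = new Map<string, CacheEntry<T>>();
  constructor(private ttl = 86_400_000) {} // default: 24 hours

  set(url: string, value: T): void {
    const origin = new URL(url).origin;
    this.store.set(origin, { value, expires: Date.now() + this.ttl });
  }

  get(url: string): T | undefined {
    const origin = new URL(url).origin;
    const entry = this.store.get(origin);
    if (!entry || entry.expires < Date.now()) return undefined;
    return entry.value;
  }
}

const cache = new OriginCache<string>();
cache.set('https://example.com/robots.txt', 'User-agent: *');
cache.get('https://example.com/any/path'); // hits: same origin
cache.get('https://other.com/');           // misses: different origin
```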

Crawl Delay

Retrieve the Crawl-delay value for a domain (in milliseconds):

const delay = robots.getCrawlDelay('https://example.com');
// e.g., 2000 (2 seconds, parsed from "Crawl-delay: 2")

if (delay) {
  await new Promise(resolve => setTimeout(resolve, delay));
}

The Crawl-delay directive value is parsed as seconds and converted to milliseconds.
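
As a sketch of that conversion (the `parseCrawlDelay` helper is hypothetical, shown only to make the seconds-to-milliseconds rule concrete):

```typescript
// Parse a "Crawl-delay: N" line; N is in seconds, the result in ms.
function parseCrawlDelay(line: string): number | undefined {
  const match = /^crawl-delay:\s*([\d.]+)/i.exec(line.trim());
  if (!match) return undefined;
  return Number(match[1]) * 1000;
}

parseCrawlDelay('Crawl-delay: 2');   // 2000
parseCrawlDelay('Crawl-delay: 0.5'); // 500
parseCrawlDelay('Disallow: /x');     // undefined (not a crawl-delay line)
```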

Sitemap Discovery

Retrieve sitemap URLs declared in robots.txt:

const sitemaps = robots.getSitemaps('https://example.com');
// ['https://example.com/sitemap.xml', 'https://example.com/sitemap-products.xml']

Integration with Crawler

When respectRobotsTxt: true is set in crawler options, the crawler automatically fetches and checks robots.txt before visiting each URL:

import { Crawler } from 'rezo/crawler';

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  respectRobotsTxt: true
});

// The crawler will:
// 1. Fetch https://example.com/robots.txt on first request
// 2. Cache the parsed rules for 24 hours
// 3. Check every URL against the rules before visiting
// 4. Skip disallowed URLs silently
// 5. Respect Crawl-delay if specified

await crawler.visit('https://example.com');
await crawler.done();

RobotsTxtDirectives Interface

interface RobotsTxtDirectives {
  rules: RobotsTxtRule[];     // Sorted by path length (longest first)
  crawlDelay?: number;        // In milliseconds
  sitemaps: string[];         // Sitemap URLs
}

interface RobotsTxtRule {
  path: string;               // URL path pattern (may contain * and $)
  allow: boolean;             // true = Allow, false = Disallow
}