Crawler

Robots.txt

The RobotsTxt class parses robots.txt files and validates URLs against their rules. It supports User-agent, Allow, Disallow, Crawl-delay, and Sitemap directives with wildcard pattern matching and a 24-hour cache.

import { RobotsTxt } from 'rezo/crawler';

Basic Usage

const robots = new RobotsTxt({
  userAgent: 'RezoBot',     // Bot name to match against rules (default: 'RezoBot')
  cacheTTL: 86400000         // Cache TTL in ms (default: 24 hours)
});

// Fetch and parse robots.txt for a domain
await robots.fetch('https://example.com', async (url) => {
  const res = await rezo.get(url);
  return { status: res.status, data: res.data };
});

// Check if a URL is allowed
const allowed = robots.isAllowed('https://example.com/admin/config');
// false (if robots.txt disallows /admin/)

Parsing Rules

parse()

Parses robots.txt content directly and returns a RobotsTxtDirectives object:

const content = `
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /admin/public/
Crawl-delay: 2

User-agent: RezoBot
Disallow: /api/internal/
Allow: /api/public/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-products.xml
`;

const directives = robots.parse(content);

console.log(directives.rules);
// Only the RezoBot-specific block applies (it beats the * block):
// [
//   { path: '/api/internal/', allow: false },
//   { path: '/api/public/', allow: true }
// ]

console.log(directives.crawlDelay);
// undefined (the matched RezoBot block has no Crawl-delay directive)

console.log(directives.sitemaps);
// ['https://example.com/sitemap.xml', 'https://example.com/sitemap-products.xml']

Rule Resolution

The parser follows standard precedence:

  1. Specific user-agent match — If a User-agent block matches the bot name (case-insensitive partial match), those rules are used
  2. Wildcard fallback — If no specific match exists, User-agent: * rules apply
  3. Path specificity — Longer path patterns take priority over shorter ones
  4. Allow over Disallow — When two rules have the same path length, Allow takes precedence

// Given these rules:
// Disallow: /api/
// Allow: /api/public/

robots.isAllowed('https://example.com/api/users');    // false
robots.isAllowed('https://example.com/api/public/v1'); // true (longer Allow wins)
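
The precedence steps above can be sketched as a longest-match scan over the rules. This is an illustrative TypeScript sketch, not the rezo implementation: the `resolve` helper and `Rule` shape are assumptions, and wildcard handling is omitted so the precedence logic stands out.

```typescript
interface Rule {
  path: string;    // literal path prefix (wildcards omitted in this sketch)
  allow: boolean;  // true = Allow, false = Disallow
}

// Pick the most specific matching rule: longest path wins;
// on a tie, Allow beats Disallow; no match means allowed.
function resolve(rules: Rule[], urlPath: string): boolean {
  let best: Rule | undefined;
  for (const rule of rules) {
    if (!urlPath.startsWith(rule.path)) continue;
    if (!best || rule.path.length > best.path.length) {
      best = rule;
    } else if (rule.path.length === best.path.length && rule.allow) {
      best = rule; // Allow wins when path lengths are equal
    }
  }
  return best ? best.allow : true;
}

const rules: Rule[] = [
  { path: '/api/', allow: false },
  { path: '/api/public/', allow: true },
];

resolve(rules, '/api/users');     // false (only /api/ matches)
resolve(rules, '/api/public/v1'); // true  (longer Allow wins)
resolve(rules, '/about');         // true  (no matching rule)
```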

Wildcard Matching

Robots.txt patterns support * (match anything) and $ (end of URL):

const content = `
User-agent: *
Disallow: /*.pdf$
Disallow: /search?*q=
Allow: /
`;

const directives = robots.parse(content);

robots.isAllowed('https://example.com/docs/report.pdf', directives);
// false (matches /*.pdf$)

robots.isAllowed('https://example.com/docs/report.pdf?v=2', directives);
// true ($ means exact end, and this URL has query params)

robots.isAllowed('https://example.com/search?q=test', directives);
// false (matches /search?*q=)

robots.isAllowed('https://example.com/about', directives);
// true (matches Allow: /)
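
One way to implement these semantics is to translate each pattern into a regular expression: `*` becomes `.*`, a trailing `$` becomes an end anchor, and every other regex metacharacter is escaped. The `patternToRegExp` helper below is a hypothetical illustration of the matching rules, not the library's internals.

```typescript
// Convert a robots.txt path pattern into an anchored RegExp.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith('$');
  const body = (anchored ? pattern.slice(0, -1) : pattern)
    .split('*')                                              // wildcard segments
    .map(part => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&')) // escape regex chars
    .join('.*');                                             // '*' matches anything
  return new RegExp('^' + body + (anchored ? '$' : ''));
}

patternToRegExp('/*.pdf$').test('/docs/report.pdf');     // true
patternToRegExp('/*.pdf$').test('/docs/report.pdf?v=2'); // false ($ anchors the end)
patternToRegExp('/search?*q=').test('/search?q=test');   // true
```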

Fetching and Caching

fetch()

Fetches robots.txt from a domain, parses it, and caches the result:

const directives = await robots.fetch('https://example.com', async (url) => {
  const res = await rezo.get(url);
  return { status: res.status, data: res.data };
});

The fetcher function must return { status: number; data?: string }. If the fetch fails or returns a non-200 status, all URLs are allowed by default.

Caching Behavior

  • Results are cached per domain origin (e.g., https://example.com)
  • Default cache TTL is 24 hours (configurable via cacheTTL)
  • Subsequent calls to fetch() or isAllowed() for the same domain use the cached result
  • Failed fetches still cache an “allow all” result to avoid re-fetching

// Check cache status
const isCached = robots.isCached('https://example.com');

// Clear cache for one domain
robots.clearCache('https://example.com');

// Clear all cached robots.txt data
robots.clearCache();
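
The caching behavior above amounts to a per-origin map with expiry timestamps. The `OriginCache` class below is a hypothetical sketch of that idea, not rezo's internal data structure.

```typescript
interface CacheEntry<T> {
  value: T;
  expires: number; // epoch ms after which the entry is stale
}

// Cache keyed by URL origin, with entries past the TTL treated as absent.
class OriginCache<T> {
  private store = new Map<string, CacheEntry<T>>();
  constructor(private ttl = 86_400_000) {} // default: 24 hours

  set(url: string, value: T): void {
    const origin = new URL(url).origin;
    this.store.set(origin, { value, expires: Date.now() + this.ttl });
  }

  get(url: string): T | undefined {
    const origin = new URL(url).origin;
    const entry = this.store.get(origin);
    if (!entry || entry.expires < Date.now()) return undefined;
    return entry.value;
  }
}

const cache = new OriginCache<string>();
cache.set('https://example.com/robots.txt', 'User-agent: *');
cache.get('https://example.com/any/path'); // hits: same origin
cache.get('https://other.com/');           // misses: different origin
```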

Crawl Delay

Retrieve the Crawl-delay value for a domain (in milliseconds):

const delay = robots.getCrawlDelay('https://example.com');
// e.g., 2000 (2 seconds, parsed from "Crawl-delay: 2")

if (delay) {
  await new Promise(resolve => setTimeout(resolve, delay));
}

The Crawl-delay directive value is parsed as seconds and converted to milliseconds.
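
As a sketch of that conversion (the `parseCrawlDelay` helper is hypothetical, shown only to make the seconds-to-milliseconds rule concrete):

```typescript
// Parse a "Crawl-delay: N" line; N is in seconds, the result in ms.
function parseCrawlDelay(line: string): number | undefined {
  const match = /^crawl-delay:\s*([\d.]+)/i.exec(line.trim());
  if (!match) return undefined;
  return Number(match[1]) * 1000;
}

parseCrawlDelay('Crawl-delay: 2');   // 2000
parseCrawlDelay('Crawl-delay: 0.5'); // 500
parseCrawlDelay('Disallow: /x');     // undefined (not a crawl-delay line)
```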

Sitemap Discovery

Retrieve sitemap URLs declared in robots.txt:

const sitemaps = robots.getSitemaps('https://example.com');
// ['https://example.com/sitemap.xml', 'https://example.com/sitemap-products.xml']

Integration with Crawler

When respectRobotsTxt: true is set in crawler options, the crawler automatically fetches and checks robots.txt before visiting each URL:

import { Crawler } from 'rezo/crawler';

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  respectRobotsTxt: true
});

// The crawler will:
// 1. Fetch https://example.com/robots.txt on first request
// 2. Cache the parsed rules for 24 hours
// 3. Check every URL against the rules before visiting
// 4. Skip disallowed URLs silently
// 5. Respect Crawl-delay if specified

await crawler.visit('https://example.com');
await crawler.done();

RobotsTxtDirectives Interface

interface RobotsTxtDirectives {
  rules: RobotsTxtRule[];     // Sorted by path length (longest first)
  crawlDelay?: number;        // In milliseconds
  sitemaps: string[];         // Sitemap URLs
}

interface RobotsTxtRule {
  path: string;               // URL path pattern (may contain * and $)
  allow: boolean;             // true = Allow, false = Disallow
}