# Robots.txt

The `RobotsTxt` class parses robots.txt files and validates URLs against their rules. It supports `User-agent`, `Allow`, `Disallow`, `Crawl-delay`, and `Sitemap` directives, with wildcard pattern matching and a 24-hour cache.
```typescript
import { RobotsTxt } from 'rezo/crawler';
```

## Basic Usage

```typescript
const robots = new RobotsTxt({
  userAgent: 'RezoBot', // Bot name to match against rules (default: 'RezoBot')
  cacheTTL: 86400000    // Cache TTL in ms (default: 24 hours)
});

// Fetch and parse robots.txt for a domain
await robots.fetch('https://example.com', async (url) => {
  const res = await rezo.get(url);
  return { status: res.status, data: res.data };
});

// Check if a URL is allowed
const allowed = robots.isAllowed('https://example.com/admin/config');
// false (if robots.txt disallows /admin/)
```

## Parsing Rules
### parse()

Parses robots.txt content directly and returns a `RobotsTxtDirectives` object:
```typescript
const content = `
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /admin/public/
Crawl-delay: 2

User-agent: RezoBot
Disallow: /api/internal/
Allow: /api/public/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-products.xml
`;

const directives = robots.parse(content);

console.log(directives.rules);
// [
//   { path: '/api/internal/', allow: false },
//   { path: '/api/public/', allow: true }
// ]

console.log(directives.crawlDelay);
// undefined (the RezoBot-specific rules don't set a crawl-delay)

console.log(directives.sitemaps);
// ['https://example.com/sitemap.xml', 'https://example.com/sitemap-products.xml']
```

## Rule Resolution
The parser follows standard precedence:
- **Specific user-agent match**: if a `User-agent` block matches the bot name (case-insensitive partial match), those rules are used
- **Wildcard fallback**: if no specific match exists, `User-agent: *` rules apply
- **Path specificity**: longer path patterns take priority over shorter ones
- **Allow over Disallow**: when two rules have the same path length, `Allow` takes precedence
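The precedence rules above can be modeled as a small resolver. This is a simplified sketch, not the library's internal code: it assumes plain prefix paths (no `*` or `$`), and the `Rule` shape mirrors the `RobotsTxtRule` interface.

```typescript
// Simplified model of robots.txt rule precedence (illustration only,
// using plain prefix paths; real patterns may also contain * and $).
interface Rule {
  path: string;
  allow: boolean;
}

function resolve(urlPath: string, rules: Rule[]): boolean {
  // Keep only the rules whose path prefix matches the URL path.
  const matches = rules.filter(r => urlPath.startsWith(r.path));
  if (matches.length === 0) return true; // no matching rule => allowed

  // Longest path wins; on a tie, Allow beats Disallow.
  matches.sort(
    (a, b) => b.path.length - a.path.length || Number(b.allow) - Number(a.allow)
  );
  return matches[0].allow;
}
```

With `Disallow: /api/` and `Allow: /api/public/`, a URL path under `/api/public/` matches both rules, and the longer `Allow` rule decides.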
```typescript
// Given these rules:
// Disallow: /api/
// Allow: /api/public/

robots.isAllowed('https://example.com/api/users');     // false
robots.isAllowed('https://example.com/api/public/v1'); // true (longer Allow wins)
```

## Wildcard Matching
Robots.txt patterns support `*` (match any sequence of characters) and `$` (anchor to the end of the URL):
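A common way to implement this matching is to compile each pattern to a regular expression. The helper below is an illustrative sketch of that idea, not necessarily how `RobotsTxt` does it internally:

```typescript
// Compile a robots.txt path pattern into a RegExp (illustrative sketch).
// '*' matches any run of characters; a trailing '$' anchors the end.
function compilePattern(pattern: string): RegExp {
  const anchored = pattern.endsWith('$');
  const body = (anchored ? pattern.slice(0, -1) : pattern)
    // Escape regex metacharacters, then turn '*' into '.*'.
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&')
    .replace(/\*/g, '.*');
  return new RegExp('^' + body + (anchored ? '$' : ''));
}
```

Under this model, `/*.pdf$` becomes `^/.*\.pdf$`, which explains why a trailing query string defeats the match in the examples below.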
```typescript
const content = `
User-agent: *
Disallow: /*.pdf$
Disallow: /search?*q=
Allow: /
`;

const directives = robots.parse(content);

robots.isAllowed('https://example.com/docs/report.pdf', directives);
// false (matches /*.pdf$)

robots.isAllowed('https://example.com/docs/report.pdf?v=2', directives);
// true ($ means exact end, and this URL has query params)

robots.isAllowed('https://example.com/search?q=test', directives);
// false (matches /search?*q=)

robots.isAllowed('https://example.com/about', directives);
// true (matches Allow: /)
```

## Fetching and Caching
### fetch()

Fetches robots.txt from a domain, parses it, and caches the result:
```typescript
const directives = await robots.fetch('https://example.com', async (url) => {
  const res = await rezo.get(url);
  return { status: res.status, data: res.data };
});
```

The fetcher function must return `{ status: number; data?: string }`. If the fetch fails or returns a non-200 status, all URLs are allowed by default.
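Because the fetcher is just a callback returning that shape, any HTTP client can back it. For example, a sketch built on the standard global `fetch` API (the `599` failure code is an arbitrary choice for this illustration):

```typescript
// A fetcher compatible with RobotsTxt.fetch(): returns { status, data? }.
// Network errors surface as a synthetic non-200 status so the caller
// falls back to its "allow all" behavior.
type FetchResult = { status: number; data?: string };

async function robotsFetcher(url: string): Promise<FetchResult> {
  try {
    const res = await fetch(url);
    return { status: res.status, data: res.ok ? await res.text() : undefined };
  } catch {
    return { status: 599 }; // DNS/connection failure => treated as fetch failure
  }
}
```

It would be passed as `robots.fetch('https://example.com', robotsFetcher)`.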
### Caching Behavior
- Results are cached per domain origin (e.g., `https://example.com`)
- Default cache TTL is 24 hours (configurable via `cacheTTL`)
- Subsequent calls to `fetch()` or `isAllowed()` for the same domain use the cached result
- Failed fetches still cache an "allow all" result to avoid re-fetching
```typescript
// Check cache status
const isCached = robots.isCached('https://example.com');

// Clear cache for one domain
robots.clearCache('https://example.com');

// Clear all cached robots.txt data
robots.clearCache();
```

## Crawl Delay
Retrieve the `Crawl-delay` value for a domain (in milliseconds):
```typescript
const delay = robots.getCrawlDelay('https://example.com');
// e.g., 2000 (2 seconds, parsed from "Crawl-delay: 2")

if (delay) {
  await new Promise(resolve => setTimeout(resolve, delay));
}
```

The `Crawl-delay` directive value is parsed as seconds and converted to milliseconds.
## Sitemap Discovery
Retrieve sitemap URLs declared in robots.txt:
```typescript
const sitemaps = robots.getSitemaps('https://example.com');
// ['https://example.com/sitemap.xml', 'https://example.com/sitemap-products.xml']
```

## Integration with Crawler
When `respectRobotsTxt: true` is set in crawler options, the crawler automatically fetches and checks robots.txt before visiting each URL:
```typescript
import { Crawler } from 'rezo/crawler';

const crawler = new Crawler({
  baseUrl: 'https://example.com',
  respectRobotsTxt: true
});

// The crawler will:
// 1. Fetch https://example.com/robots.txt on first request
// 2. Cache the parsed rules for 24 hours
// 3. Check every URL against the rules before visiting
// 4. Skip disallowed URLs silently
// 5. Respect Crawl-delay if specified
await crawler.visit('https://example.com');
await crawler.done();
```

## RobotsTxtDirectives Interface
```typescript
interface RobotsTxtDirectives {
  rules: RobotsTxtRule[];  // Sorted by path length (longest first)
  crawlDelay?: number;     // In milliseconds
  sitemaps: string[];      // Sitemap URLs
}

interface RobotsTxtRule {
  path: string;    // URL path pattern (may contain * and $)
  allow: boolean;  // true = Allow, false = Disallow
}
```
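Because `rules` is guaranteed to be sorted longest-first, code consuming these shapes can take the first matching rule as authoritative. A small sketch with hypothetical hand-built data (types redeclared so the snippet is self-contained):

```typescript
interface RobotsTxtRule { path: string; allow: boolean; }
interface RobotsTxtDirectives {
  rules: RobotsTxtRule[];
  crawlDelay?: number;
  sitemaps: string[];
}

// Hand-built example data, ordered longest path first per the contract.
const directives: RobotsTxtDirectives = {
  rules: [
    { path: '/api/public/', allow: true },
    { path: '/api/', allow: false },
  ],
  crawlDelay: 2000,
  sitemaps: ['https://example.com/sitemap.xml'],
};

// With longest-first ordering, the first matching rule decides.
function firstMatchAllows(urlPath: string): boolean {
  const hit = directives.rules.find(r => urlPath.startsWith(r.path));
  return hit ? hit.allow : true; // no matching rule => allowed
}
```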