Crawler

The multi-page breadth-first search (BFS) crawler discovers and fetches pages across a site, respecting robots.txt and staying within the start URL's domain.

Usage

python scripts/crawl.py --url https://example.com --depth 2 --max-pages 10 --output-dir /tmp/crawl

Options

| Flag | Default | Description |
|------|---------|-------------|
| `--url` | Required | Starting URL |
| `--depth` | `2` | Maximum crawl depth from the start URL |
| `--max-pages` | `50` | Maximum number of pages to fetch |
| `--output-dir` | `/tmp/crawl` | Directory for saved HTML files |
| `--respect-robots` | `true` | Honour robots.txt disallow rules |
| `--delay` | `1.0` | Delay in seconds between requests |
| `--user-agent` | `FATAgent/2.0` | User-Agent string |
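
For readers who want to see how these flags and defaults fit together, here is a minimal `argparse` sketch. It is not taken from `scripts/crawl.py`. The `build_parser` and `str_to_bool` helpers are hypothetical, and treating `--respect-robots` as a flag that takes a `true`/`false` value is an assumption based only on the default shown in the table.

```python
import argparse


def str_to_bool(value):
    """Parse true/false strings (an assumed convention for --respect-robots)."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser():
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Multi-page BFS crawler")
    parser.add_argument("--url", required=True, help="Starting URL")
    parser.add_argument("--depth", type=int, default=2,
                        help="Maximum crawl depth from the start URL")
    parser.add_argument("--max-pages", type=int, default=50,
                        help="Maximum number of pages to fetch")
    parser.add_argument("--output-dir", default="/tmp/crawl",
                        help="Directory for saved HTML files")
    parser.add_argument("--respect-robots", type=str_to_bool, default=True,
                        help="Honour robots.txt disallow rules")
    parser.add_argument("--delay", type=float, default=1.0,
                        help="Delay in seconds between requests")
    parser.add_argument("--user-agent", default="FATAgent/2.0",
                        help="User-Agent string")
    return parser


# Unspecified flags fall back to the documented defaults.
args = build_parser().parse_args(["--url", "https://example.com", "--max-pages", "10"])
print(args.depth, args.max_pages, args.respect_robots)  # prints: 2 10 True
```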

How It Works

1. Seed: starts from the provided URL
2. Parse: extracts all <a href="..."> links from each page
3. Filter: keeps only same-domain links and respects robots.txt
4. Queue: adds new URLs to the BFS queue up to the maximum depth
5. Save: writes each page's HTML to the output directory
6. Report: outputs a crawl manifest JSON listing all fetched URLs
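
The steps above can be sketched as a small breadth-first loop. This is an illustrative outline under stated assumptions, not the code in `scripts/crawl.py`. The names `bfs_crawl`, `extract_links`, and `LinkParser` are hypothetical, and `fetch` and `allowed` are injected hooks so that the example runs without network access. Saving files, the manifest, and the request delay are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute, fragment-free URLs for all <a href> links in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def bfs_crawl(start_url, fetch, allowed=lambda url: True, max_depth=2, max_pages=50):
    """Breadth-first crawl; returns {url: html} in fetch order.

    fetch(url) -> str or None and allowed(url) -> bool are hypothetical hooks
    standing in for the HTTP client and the robots.txt check.
    """
    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])   # Seed
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html             # stand-in for the Save step
        if depth >= max_depth:
            continue                  # do not expand links beyond max depth
        for link in extract_links(html, url):             # Parse
            same_domain = urlparse(link).netloc == domain
            if same_domain and link not in seen and allowed(link):  # Filter
                seen.add(link)
                queue.append((link, depth + 1))           # Queue
    return pages


# A tiny in-memory "site" used in place of real HTTP requests.
site = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.org/">X</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/c">C</a>',
    "https://example.com/c": "",
}
print(list(bfs_crawl("https://example.com/", site.get, max_depth=2)))
# prints: ['https://example.com/', 'https://example.com/a', 'https://example.com/b']
```

In a real crawler, `allowed` could be backed by the standard library's `urllib.robotparser.RobotFileParser.can_fetch(useragent, url)`, and `fetch` by any HTTP client. Page `/c` is not fetched in the demo because it sits at depth 3, beyond `max_depth=2`.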

Output

The output directory contains:

/tmp/crawl/
  manifest.json          # List of crawled URLs with status codes
  page_001.html          # First page HTML
  page_002.html          # Second page HTML
  ...

The manifest JSON can be fed directly into bulk_audit.py for batch analysis.
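
As an illustration of consuming the manifest from Python, the sketch below filters it down to successfully fetched URLs. The manifest schema is not documented here, so the layout used (a JSON list of objects with `url` and `status` keys) is purely an assumption, and `load_manifest` and `successful_urls` are hypothetical helper names. Adjust the keys to match the real file.

```python
import json
from pathlib import Path


def load_manifest(manifest_path):
    """Read the manifest file and return its parsed JSON content."""
    return json.loads(Path(manifest_path).read_text(encoding="utf-8"))


def successful_urls(entries):
    """Return URLs whose recorded status code is 200.

    Assumes each entry is a dict with "url" and "status" keys
    (a hypothetical schema, not confirmed by the crawler docs).
    """
    return [entry["url"] for entry in entries if entry.get("status") == 200]


# Example with in-memory entries shaped like the assumed schema.
sample = [
    {"url": "https://example.com/", "status": 200},
    {"url": "https://example.com/missing", "status": 404},
]
print(successful_urls(sample))  # prints: ['https://example.com/']
```

With a real crawl, `successful_urls(load_manifest("/tmp/crawl/manifest.json"))` would apply the same filter to the file on disk, provided the schema assumption holds.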

Integration with Auditing

Crawl a site and then audit every page:

python scripts/crawl.py --url https://example.com --output-dir /tmp/crawl
for f in /tmp/crawl/page_*.html; do
  python scripts/analyse-html.py "$f" | python scripts/calculate-score.py
done

Alternatively, feed the manifest to bulk_audit.py for a more structured approach.