Crawler

The multi-page BFS crawler discovers and fetches pages across a site, respecting robots.txt and staying within the same domain.


Usage

python scripts/crawl.py --url https://example.com --depth 2 --max-pages 10 --output-dir /tmp/crawl

Options

Flag              Default        Description
--url             Required       Starting URL
--depth           2              Maximum crawl depth from the start URL
--max-pages       50             Maximum number of pages to fetch
--output-dir      /tmp/crawl     Directory for saved HTML files
--respect-robots  true           Honour robots.txt disallow rules
--delay           1.0            Delay in seconds between requests
--user-agent      FATAgent/2.0   User-Agent string
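
For readers who want to see how these flags and defaults fit together, here is a minimal sketch of an argparse declaration that mirrors the table. It is not taken from crawl.py: the helper names (str_to_bool, build_parser) are hypothetical, and the way --respect-robots parses its true/false value is an assumption.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse a true/false style value (assumed format for --respect-robots)."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags from the options table with the documented defaults."""
    parser = argparse.ArgumentParser(description="Multi-page BFS crawler (sketch)")
    parser.add_argument("--url", required=True, help="Starting URL")
    parser.add_argument("--depth", type=int, default=2,
                        help="Maximum crawl depth from the start URL")
    parser.add_argument("--max-pages", type=int, default=50,
                        help="Maximum number of pages to fetch")
    parser.add_argument("--output-dir", default="/tmp/crawl",
                        help="Directory for saved HTML files")
    parser.add_argument("--respect-robots", type=str_to_bool, default=True,
                        help="Honour robots.txt disallow rules")
    parser.add_argument("--delay", type=float, default=1.0,
                        help="Delay in seconds between requests")
    parser.add_argument("--user-agent", default="FATAgent/2.0",
                        help="User-Agent string")
    return parser


# Parse the same kind of command line shown under Usage.
args = build_parser().parse_args(
    ["--url", "https://example.com", "--max-pages", "10"]
)
print(args.url, args.depth, args.max_pages, args.delay)
```

Flags that are omitted fall back to the defaults in the table, so only --url has to be supplied.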

How It Works

  1. Seed -- starts from the provided URL
  2. Parse -- extracts all <a href="..."> links from each page
  3. Filter -- keeps only same-domain links, respects robots.txt
  4. Queue -- adds new URLs to the BFS queue up to max depth
  5. Save -- writes each page's HTML to the output directory
  6. Report -- outputs a crawl manifest JSON listing all fetched URLs
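
The steps above can be sketched in a few dozen lines of standard-library Python. This is an illustration of the general technique, not the code in crawl.py: the names crawl, LinkParser, and robots_checker are hypothetical, the fetch function is injected so the example runs against an in-memory site instead of the network, and the request delay, file saving, and manifest writing are left out.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def robots_checker(robots_txt: str, user_agent: str) -> Callable[[str], bool]:
    """Build an allow/deny function from the text of a robots.txt file."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return lambda url: rules.can_fetch(user_agent, url)


def crawl(start_url: str,
          fetch: Callable[[str], Optional[str]],
          is_allowed: Callable[[str], bool] = lambda url: True,
          max_depth: int = 2,
          max_pages: int = 50) -> Dict[str, str]:
    """Breadth-first crawl; returns {url: html} in the order pages were fetched."""
    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])          # Seed
    seen = {start_url}
    pages: Dict[str, str] = {}
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url) if is_allowed(url) else None
        if html is None:
            continue
        pages[url] = html                    # Save (kept in memory here)
        if depth >= max_depth:
            continue                         # do not expand links past max depth
        parser = LinkParser()
        parser.feed(html)                    # Parse
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc != domain:
                continue                     # Filter: same domain only
            if link in seen or not is_allowed(link):
                continue                     # Filter: duplicates and robots.txt
            seen.add(link)
            queue.append((link, depth + 1))  # Queue
    return pages


# A tiny in-memory site standing in for real HTTP responses.
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a> '
                            '<a href="https://other.org/">external</a>',
    "https://example.com/a": '<a href="/c">C</a>',
    "https://example.com/b": '<a href="/a#top">A again</a>',
    "https://example.com/c": "leaf page",
}

pages = crawl("https://example.com/", fetch=SITE.get, max_depth=1)
print(list(pages))

allowed = robots_checker("User-agent: *\nDisallow: /b", "FATAgent/2.0")
polite_pages = crawl("https://example.com/", fetch=SITE.get, is_allowed=allowed)
print(list(polite_pages))
```

Using a FIFO queue is what makes the traversal breadth-first: every page at depth 1 is fetched before any page at depth 2, so a small --max-pages budget is spent on the pages closest to the start URL.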

Output

The output directory contains:

/tmp/crawl/
  manifest.json          # List of crawled URLs with status codes
  page_001.html          # First page HTML
  page_002.html          # Second page HTML
  ...

The manifest JSON can be fed directly into bulk_audit.py for batch analysis.
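
As a rough illustration of consuming the manifest from Python, the sketch below filters it for pages that returned HTTP 200. The manifest schema is not documented here, so the layout used (a JSON list of objects with "url", "status", and "file" keys) is purely an assumption, as is the successful_pages helper; check a real manifest.json before relying on these field names.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical manifest layout; the real field names may differ.
sample_manifest = [
    {"url": "https://example.com/", "status": 200, "file": "page_001.html"},
    {"url": "https://example.com/missing", "status": 404, "file": None},
]


def successful_pages(manifest_path):
    """Return the URLs whose recorded status code is 200."""
    entries = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    return [entry["url"] for entry in entries if entry.get("status") == 200]


# Write the sample to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "manifest.json"
    manifest.write_text(json.dumps(sample_manifest), encoding="utf-8")
    ok_urls = successful_pages(manifest)

print(ok_urls)
```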

Integration with Auditing

Crawl a site and then audit every page:

python scripts/crawl.py --url https://example.com --output-dir /tmp/crawl
for f in /tmp/crawl/page_*.html; do
  python scripts/analyse-html.py "$f" | python scripts/calculate-score.py
done
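
The same loop can be driven from Python when you want the scores collected in one place and a failure in either script to stop the run. This is a sketch only: the function names are hypothetical, and it assumes, as the shell loop does, that analyse-html.py takes a file path and writes to stdout and that calculate-score.py reads from stdin.

```python
import subprocess
import sys
from pathlib import Path


def saved_pages(output_dir):
    """Return the crawler's page_*.html files in name order."""
    return sorted(Path(output_dir).glob("page_*.html"))


def audit_page(page):
    """Pipe one page through the analyser and the scorer; return the score text."""
    analysis = subprocess.run(
        [sys.executable, "scripts/analyse-html.py", str(page)],
        capture_output=True, text=True, check=True,  # raise if the script fails
    )
    score = subprocess.run(
        [sys.executable, "scripts/calculate-score.py"],
        input=analysis.stdout,  # equivalent of the shell pipe
        capture_output=True, text=True, check=True,
    )
    return score.stdout


def audit_all(output_dir="/tmp/crawl"):
    """Map each saved page's file name to the scorer's output."""
    return {page.name: audit_page(page) for page in saved_pages(output_dir)}
```

Run audit_all("/tmp/crawl") from the repository root after a crawl so that the relative scripts/ paths resolve.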

Alternatively, feed the manifest into bulk_audit.py for a more structured approach.