Crawler
The multi-page BFS crawler discovers and fetches pages across a site, respecting robots.txt and staying within the same domain.
Usage
Options
| Flag | Default | Description |
|---|---|---|
| `--url` | Required | Starting URL |
| `--depth` | 2 | Maximum crawl depth from the start URL |
| `--max-pages` | 50 | Maximum number of pages to fetch |
| `--output-dir` | /tmp/crawl | Directory for saved HTML files |
| `--respect-robots` | true | Honour robots.txt disallow rules |
| `--delay` | 1.0 | Delay in seconds between requests |
| `--user-agent` | FATAgent/2.0 | User-Agent string |
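As a rough illustration, the options and defaults in the table could be mirrored in a small configuration object. This is only a sketch: the `CrawlOptions` class and its validation are hypothetical and are not taken from `scripts/crawl.py`, which may parse and store its flags differently.

```python
from dataclasses import dataclass


@dataclass
class CrawlOptions:
    """Hypothetical container mirroring the documented flags and defaults."""

    url: str                          # --url (required, so no default)
    depth: int = 2                    # --depth
    max_pages: int = 50               # --max-pages
    output_dir: str = "/tmp/crawl"    # --output-dir
    respect_robots: bool = True       # --respect-robots
    delay: float = 1.0                # --delay, in seconds
    user_agent: str = "FATAgent/2.0"  # --user-agent

    def __post_init__(self):
        # Assumed sanity checks; the real tool may validate differently.
        if self.depth < 0 or self.delay < 0:
            raise ValueError("depth and delay must be >= 0")
        if self.max_pages < 1:
            raise ValueError("max_pages must be >= 1")


options = CrawlOptions(url="https://example.com", depth=3)
```

Only `url` has to be supplied; every other field falls back to the default listed in the table.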
How It Works
- Seed -- starts from the provided URL
- Parse -- extracts all `<a href="...">` links from each page
- Filter -- keeps only same-domain links, respects robots.txt
- Queue -- adds new URLs to the BFS queue up to max depth
- Save -- writes each page's HTML to the output directory
- Report -- outputs a crawl manifest JSON listing all fetched URLs
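The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the code in `scripts/crawl.py`: the names `crawl`, `LinkExtractor`, and the `fetch` callable are hypothetical, pages are kept in memory instead of being written to disk, and the request delay and manifest are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, robots=None, max_depth=2, max_pages=50,
          user_agent="FATAgent/2.0"):
    """Breadth-first crawl; `fetch(url)` must return the page HTML as a string."""
    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])              # Seed
    seen = {start_url}
    pages = {}

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        pages[url] = html = fetch(url)           # Save (in memory here)
        if depth >= max_depth:
            continue                             # do not expand past max depth
        parser = LinkExtractor()
        parser.feed(html)                        # Parse
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc != domain:  # Filter: same domain only
                continue
            if robots is not None and not robots.can_fetch(user_agent, link):
                continue                         # Filter: robots.txt disallow
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))  # Queue
    return pages


# Usage with an in-memory "site" so the sketch runs without network access.
site = {
    "https://example.com/": '<a href="/a">A</a><a href="https://other.com/x">X</a>'
                            '<a href="/private/p">P</a>',
    "https://example.com/a": '<a href="/b">B</a><a href="/">Home</a>',
    "https://example.com/b": '<a href="/c">C</a>',
    "https://example.com/c": "",
}
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private"])
pages = crawl("https://example.com/", site.__getitem__, robots=robots)
```

Using a queue (first in, first out) is what makes the traversal breadth-first: every page at depth 1 is fetched before any page at depth 2. Passing `fetch` in as a parameter keeps the traversal logic separate from networking, which also makes it easy to test.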
Output
The output directory contains:
/tmp/crawl/
manifest.json # List of crawled URLs with status codes
page_001.html # First page HTML
page_002.html # Second page HTML
...
The manifest JSON can be fed directly into bulk_audit.py for batch analysis.
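For quick inspection outside of bulk_audit.py, the manifest can also be read with a few lines of Python. The exact manifest schema is not documented here, so the sketch below assumes a JSON list of objects with hypothetical `url` and `status` keys; adjust the key names to whatever the real file contains.

```python
import json
from pathlib import Path


def load_manifest(path):
    """Parse the manifest file; the entry shape is an assumption, see above."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def successful_urls(entries):
    """Return URLs whose (assumed) status field is a 2xx code."""
    return [entry["url"] for entry in entries if 200 <= entry["status"] < 300]


# Example entries in the assumed shape.
example_entries = [
    {"url": "https://example.com/", "status": 200},
    {"url": "https://example.com/missing", "status": 404},
]
ok = successful_urls(example_entries)
```

With a real crawl, `successful_urls(load_manifest("/tmp/crawl/manifest.json"))` would give the pages worth auditing, provided the assumed keys match.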
Integration with Auditing
Crawl a site and then audit every page:
python scripts/crawl.py --url https://example.com --output-dir /tmp/crawl
for f in /tmp/crawl/page_*.html; do
python scripts/analyse-html.py "$f" | python scripts/calculate-score.py
done
Or use the bulk audit tool for a more structured approach.