  ___                  ____                      _
 / _ \  ___  _ __  ___/ ___|_ __ __ ___      _| | ___ _ __ 
| | | |/ _ \| '_ \/ __| |   | '__/ _` \ \ /\ / / |/ _ \ '__|
| |_| | (_) | |_) \__ \ |___| | | (_| |\ V  V /| |  __/ |   
 \___/ \___/| .__/|___/\____|_|  \__,_| \_/\_/ |_|\___|_|   
            |_|                                            

made by lazy_sharaf

Python 3.7+ · BeautifulSoup4 · Requests · Recursive Crawling
soraf@kali:~/tools/OopsCrawler$ python3 oopscrawler.py target.com
[+] Initializing OopsCrawler...
[+] Starting recursive crawl on: https://target.com

[~] ⚙️ Crawling in progress... Found 42 internal links.

[❌ BROKEN] https://target.com/legacy/old-doc.pdf
Status: 404 Not Found
Source: https://target.com/resources

[🚫 BLOCKED] https://target.com/admin/login
Status: 403 Forbidden
Source: https://target.com/footer

[✓] Crawl complete! 2 problematic links found.
Combined report saved to: oopscrawler_report.csv

01. The Objective

Scaling web applications inevitably leads to "link rot" — dead ends, broken assets, and blocked paths that frustrate users and hurt SEO. OopsCrawler was born from a need to automate site health checks. Instead of clicking every link manually, OopsCrawler dives deep into a domain, validating every single anchor tag to find those "Oops" moments before a user does.

02. Technical Architecture

OopsCrawler uses a recursive depth-first search (DFS) to navigate internal links, leveraging BeautifulSoup4 for DOM parsing and Requests for HTTP validation. To enhance UX, it displays an animated terminal spinner during long crawls; it also throttles requests to respect rate limits and whitelists well-known external domains, which keeps it from being flagged as a malicious bot and cuts unnecessary network noise.

Core Engine

Recursive DFS crawling with BeautifulSoup4 and Python Requests.
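A minimal sketch of how such a recursive crawler might look. The helper names (`internal_links`, `crawl`) and the fragment-stripping detail are illustrative, not the project's actual code:

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def internal_links(html, base_url, domain):
    """Parse anchors out of a page and keep only same-domain links."""
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):
        link = urljoin(base_url, anchor["href"])
        if urlparse(link).netloc == domain:
            links.add(link.split("#")[0])  # drop fragments so /page#a and /page#b count once
    return links


def crawl(url, domain, visited=None):
    """Depth-first recursive crawl; the `visited` set breaks cycles."""
    if visited is None:
        visited = set()
    if url in visited:
        return visited
    visited.add(url)
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return visited  # unreachable pages would be reported, not crawled
    for link in internal_links(resp.text, url, domain):
        crawl(link, domain, visited)
    return visited
```

Resolving every `href` against the current page with `urljoin` is what lets relative links like `/resources` be followed correctly.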

Validation Logic

Smart HTTP status code analysis combined with phrase detection for custom error pages.
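The status-plus-phrase check could be sketched like this; the `classify` helper and the `ERROR_PHRASES` list are hypothetical stand-ins for the project's actual heuristics:

```python
# Hypothetical soft-404 phrases; real custom error pages vary per site.
ERROR_PHRASES = ("page not found", "404 error", "does not exist")


def classify(status_code, body_text):
    """Return 'BLOCKED', 'BROKEN', or 'OK' for a fetched link."""
    if status_code in (401, 403):
        return "BLOCKED"
    if status_code >= 400:
        return "BROKEN"
    lowered = body_text.lower()
    if any(phrase in lowered for phrase in ERROR_PHRASES):
        return "BROKEN"  # soft 404: HTTP 200 but an error-page body
    return "OK"
```

The phrase scan matters because many sites return HTTP 200 for their custom "not found" pages, which a pure status-code check would miss.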

Reporting

Aggregated CSV reports detailing the source site, original URL, and failure type.
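A sketch of the report writer using Python's standard `csv` module; the column names and the `write_report` helper are assumptions based on the fields described above:

```python
import csv


def write_report(rows, path="oopscrawler_report.csv"):
    """Write one CSV row per problematic link: source page, URL, failure."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["source", "url", "failure"])  # header row
        writer.writerows(rows)
```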

03. Challenges & Solutions

Handling massive single-page applications (SPAs) and sites with infinite loops was the primary challenge. I implemented a robust "visited" URL set to prevent cyclical crawling and added whitelisting for major domains (like GitHub or LinkedIn) to reduce unnecessary network noise while focusing strictly on the target domain's internal health.
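One way the whitelist could combine with the domain focus; the `should_check` helper and the exact domain list are illustrative, not the project's code:

```python
from urllib.parse import urlparse

# Illustrative whitelist; the actual domain list is the author's choice.
WHITELIST = {"github.com", "linkedin.com"}


def should_check(url, target_domain):
    """Always check links on the target domain; skip whitelisted hosts."""
    host = urlparse(url).netloc.lower()
    if host == target_domain:
        return True
    return not any(host == d or host.endswith("." + d) for d in WHITELIST)
```

The `endswith` check also catches subdomains like `www.linkedin.com`, so whole organizations can be excluded with one entry.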

View Repository