About the FirstPass Crawler (FPC)
This is a simple website crawler that I vibe-coded (mostly with ChatGPT, plus two other AIs), mainly for starting SEO analysis of non-WordPress websites.
Why? Because the first thing you need when checking out a website is a fast overview of its contents. Usually I fire up two or three website crawlers and compile the results. But then I also have to go through the long URL lists and sieve, sort, discard, and filter before I get the first clean list of URLs I really need. Much of this is overkill, with excessive detail that wastes time and effort.
This FirstPass Crawler (FPC) is a straightforward tool that does just one thing: get a clean list of all page URLs on a website.
Without metadata, status codes, and all the other extra data, so you can start your work ASAP.
This crawler is especially useful when:
- The website does not have an existing sitemap, e.g. a non-WordPress website
- You want to audit internal linking
- You need to migrate a website
- You’re building redirects
- You want to run second-level tools, once you can easily and quickly group sets of URLs from the first list
So I built a simple Node.js crawler that does exactly that: no frameworks, no dependencies, and just one single file.
Download it here
You may [ download the CrawlerFPC here ]
and save the file crawlerfpc.js to the folder C:\crawlerfpc
What This FPC Crawler Does – the Process
This crawler:
- Crawls a website starting from the homepage
- Discovers internal links
- Filters out junk and non-content URLs
- Removes query parameters
- Excludes file downloads (PDFs, images, etc.)
- Avoids infinite crawl traps
- Outputs a clean list of page URLs
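The link-discovery step above can be sketched with nothing but a regex and the built-in URL class, in keeping with the no-dependencies approach. This is an illustrative sketch, not the actual code in crawlerfpc.js, and `extractLinks` is a hypothetical name:

```javascript
// Pull href values out of raw HTML and resolve them against the page URL.
// A regex is crude but dependency-free, which fits a single-file crawler.
function extractLinks(html, pageUrl) {
  const links = [];
  const hrefRe = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
  let m;
  while ((m = hrefRe.exec(html)) !== null) {
    try {
      const u = new URL(m[1], pageUrl); // resolves relative links too
      u.hash = '';                      // drop #fragments
      links.push(u.href);
    } catch {
      // skip hrefs that cannot be parsed as URLs
    }
  }
  return links;
}
```

Each discovered link would then go through the filter rules described below before being queued.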
What It Does Not Do. It doesn't:
- Check status codes
- Extract metadata
- Parse SEO tags
- Follow external links
- Crawl subdomains
It's intentionally simple. When you can extract a first list of URLs without the clutter, you can start work ASAP, beginning broadly and then digging deeper in the next pass.
FirstPass Crawl Rules
These are the rules I have defined to simplify the process:
1) Same-site only
- Treats example.com and www.example.com as the same site
- Excludes all other subdomains
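The same-site rule amounts to comparing hostnames with the `www.` prefix stripped. A minimal sketch, with a hypothetical function name:

```javascript
// Treat "www.example.com" and "example.com" as the same site;
// any other hostname (e.g. blog.example.com) is rejected.
function isSameSite(url, rootHost) {
  const host = new URL(url).hostname.replace(/^www\./, '');
  return host === rootHost.replace(/^www\./, '');
}
```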
2) Removes query parameters
Example: https://example.com/page?a=1#top
Becomes: https://example.com/page?
(The ? is retained to indicate the page was discovered with parameters.)
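The normalization described above (drop query string and fragment, but keep a bare `?` as a marker that parameters were present) can be sketched with the built-in URL class. This is an assumption about one reasonable implementation, not the crawler's exact code:

```javascript
// Strip query parameters and fragments; append "?" when the
// original URL had parameters, so the discovery is still visible.
function normalizeUrl(raw) {
  const u = new URL(raw);
  const hadParams = u.search.length > 0;
  u.search = '';
  u.hash = '';
  return u.href + (hadParams ? '?' : '');
}
```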
3) Excludes file-like URLs
This excludes links to images, documents, and other assets that are usually not needed for the first basic list.
.pdf, .jpg, .png, .svg, .zip, .mp4, .css, .js, .json, and other non-HTML assets
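One way to implement this rule is a single extension regex tested against the URL path. The extension list below mirrors the one above; the function name is illustrative:

```javascript
// Reject URLs whose path ends in a known non-HTML extension.
const SKIP_EXT = /\.(pdf|jpe?g|png|svg|zip|mp4|css|js|json)$/i;

function isFileLike(url) {
  return SKIP_EXT.test(new URL(url).pathname);
}
```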
4) Avoids crawl traps
Some other non-essential sections, and circular references, are also best avoided. So it excludes:
- /admin, /login, /account, /cart, /checkout, /search
- Large pagination loops like /page/999
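The trap rules above boil down to a path-prefix blocklist plus a cap on pagination depth. A sketch, where the path list and the default page cap are illustrative:

```javascript
// Skip utility sections and runaway pagination (e.g. /page/999).
const SKIP_PATHS = ['/admin', '/login', '/account', '/cart', '/checkout', '/search'];

function isTrap(url, maxPageNum = 50) {
  const { pathname } = new URL(url);
  if (SKIP_PATHS.some(p => pathname.startsWith(p))) return true;
  const m = pathname.match(/\/page\/(\d+)/);
  return m !== null && Number(m[1]) > maxPageNum;
}
```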
5) Applies crawl limits
Defaults have been set as:
- Maximum 5000 URLs
- Maximum depth of 15
- Optional delay between requests
These can, however, be overridden by passing parameters to the crawler (see Optional Settings below).
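Reading `--name=value` overrides from the command line needs only a few lines of `process.argv` handling. The option names match the documented flags; the parsing shown here is a generic sketch, not necessarily what crawlerfpc.js does:

```javascript
// Merge --max_urls=N style flags over the built-in defaults.
// Unknown flags and the start URL are simply ignored here.
function parseOptions(argv, defaults = { max_urls: 5000, max_depth: 15, delay_ms: 0, max_page_num: 50 }) {
  const opts = { ...defaults };
  for (const arg of argv) {
    const m = arg.match(/^--([a-z_]+)=(\d+)$/);
    if (m && m[1] in opts) opts[m[1]] = Number(m[2]);
  }
  return opts;
}
```

In the real script this would be called with `process.argv.slice(2)`.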
About Using the FPC Crawler
System Requirements to run the Crawler
- Node.js (LTS version recommended)
- Works on Windows, macOS, and Linux
How to Install (Windows Example)
- Download Node.js LTS from: https://nodejs.org
- Install with default settings
- Verify installation:
node -v
npm -v
How to Run the FPC Crawler
- Save the crawler code as: crawlerfpc.js
- Open Command Prompt and navigate to the folder:
cd C:\crawlerfpc
- Run:
node crawlerfpc.js https://example.com
As it runs, it shows progress: the URLs discovered, the count, and the queue size.
When finished, you’ll see something like:
Done. Found XXX URLs.
Saved to urls.txt
Your URLs will be saved in: urls.txt
Optional Settings
You can override defaults:
node crawlerfpc.js https://example.com --max_urls=5000 --max_depth=15 --delay_ms=250 --max_page_num=50
Available options:
--max_urls, --max_depth, --delay_ms, --max_page_num
A delay of 200 to 250 ms helps avoid errors and server-side rejection.
What This FirstPass Crawler is Useful For
This crawler is useful when you need to work on the following:
- A URL inventory before site migration
- A list of pages to check meta tags separately
- A redirect planning sheet
- A quick internal content overview
- A non-WordPress site audit
- A base input for another analysis module
Limitations
Like any crawler:
- It cannot find orphan pages (pages not linked anywhere)
- It cannot access password-protected areas
- It does not execute JavaScript-heavy SPAs
- It only finds what is reachable via HTML links
For most traditional websites, that is enough.
Welcoming Your Thoughts on This
A note to other SEO tinkerers and similar users.
This is a simple, basic tool. Instead of relying on heavyweight SEO software or browser extensions, this crawler gives you control, transparency, predictable output, and zero dependencies.
If you find it useful, feel free to modify and improve it for your own workflow.
And if you build enhancements, consider sharing them.