About the FirstPass Crawler (FPC)
This is a simple website crawler that I vibe-coded (mostly with ChatGPT, plus two other AIs), mainly for starting SEO analysis of non-WordPress websites.
Why? Because the first thing you need when checking out a website is a fast overview of its contents. Usually I fire up two or three website crawlers and compile the results. But then I also have to go through the long URL lists and sieve, sort, discard, and filter before I get the first clean list of URLs I really need. Much of this is overkill, with excessive detail that wastes time and effort.
This FirstPass Crawler (FPC) is a straightforward tool that does just one thing: get a clean list of all page URLs on a website.
Without metadata, status codes, and all the other extra data, so you can start your work ASAP.
This crawler is especially useful when:
- The website does not have an existing sitemap, e.g. a non-WordPress website
- You want to audit internal linking
- You need to migrate a website
- You’re building redirects
- You want to run second-level tools, once you can easily and quickly group sets of URLs from the first list
So I built a simple Node.js crawler that does exactly that: no frameworks, no dependencies, and just one single file.
Download it here
You may [ download the CrawlerFPC here ]
and save the file crawlerfpc.js to the folder C:\crawlerfpc
What This FPC Crawler Does – the Process
This crawler:
- Crawls a website starting from the homepage
- Discovers internal links
- Filters out junk and non-content URLs
- Removes query parameters
- Excludes file downloads (PDFs, images, etc.)
- Avoids infinite crawl traps
- Outputs a clean list of page URLs
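The link-discovery step above can be sketched with nothing but a regex and the built-in URL class, in keeping with the no-dependencies approach. This is an illustrative sketch, not the actual code in crawlerfpc.js, and `extractLinks` is a hypothetical name:

```javascript
// Pull href values out of raw HTML and resolve them against the page URL.
// A regex is crude but dependency-free, which fits a single-file crawler.
function extractLinks(html, pageUrl) {
  const links = [];
  const hrefRe = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
  let m;
  while ((m = hrefRe.exec(html)) !== null) {
    try {
      const u = new URL(m[1], pageUrl); // resolves relative links too
      u.hash = '';                      // drop #fragments
      links.push(u.href);
    } catch {
      // skip hrefs that cannot be parsed as URLs
    }
  }
  return links;
}
```

Each discovered link would then go through the filter rules described below before being queued.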
What It Does Not Do. It doesn't:
- Check status codes
- Extract metadata
- Parse SEO tags
- Follow external links
- Crawl subdomains
It's intentionally simple. When you can extract a first list of URLs without the clutter, you can start work ASAP, beginning broadly and then digging deeper in the next pass.
FirstPass Crawl Rules
These are the rules I have defined to simplify the process:
1) Same-site only
- Treats example.com and www.example.com as the same site
- Excludes all other subdomains
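The same-site rule amounts to comparing hostnames with the `www.` prefix stripped. A minimal sketch, with a hypothetical function name:

```javascript
// Treat "www.example.com" and "example.com" as the same site;
// any other hostname (e.g. blog.example.com) is rejected.
function isSameSite(url, rootHost) {
  const host = new URL(url).hostname.replace(/^www\./, '');
  return host === rootHost.replace(/^www\./, '');
}
```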
2) Removes query parameters
Example: https://example.com/page?a=1#top
Becomes: https://example.com/page?
(The ? is retained to indicate the page was discovered with parameters.)
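The normalization described above (drop query string and fragment, but keep a bare `?` as a marker that parameters were present) can be sketched with the built-in URL class. This is an assumption about one reasonable implementation, not the crawler's exact code:

```javascript
// Strip query parameters and fragments; append "?" when the
// original URL had parameters, so the discovery is still visible.
function normalizeUrl(raw) {
  const u = new URL(raw);
  const hadParams = u.search.length > 0;
  u.search = '';
  u.hash = '';
  return u.href + (hadParams ? '?' : '');
}
```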
3) Excludes file-like URLs
This excludes links to images, documents, and other assets that are usually not needed for the first basic list.
.pdf, .jpg, .png, .svg, .zip, .mp4, .css, .js, .json, and other non-HTML assets
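One way to implement this rule is a single extension regex tested against the URL path. The extension list below mirrors the one above; the function name is illustrative:

```javascript
// Reject URLs whose path ends in a known non-HTML extension.
const SKIP_EXT = /\.(pdf|jpe?g|png|svg|zip|mp4|css|js|json)$/i;

function isFileLike(url) {
  return SKIP_EXT.test(new URL(url).pathname);
}
```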
4) Avoids crawl traps
Some other non-essential sections, and circular references, are also best avoided. So it excludes:
- /admin, /login, /account, /cart, /checkout, /search
- Large pagination loops like /page/999
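The trap rules above boil down to a path-prefix blocklist plus a cap on pagination depth. A sketch, where the path list and the default page cap are illustrative:

```javascript
// Skip utility sections and runaway pagination (e.g. /page/999).
const SKIP_PATHS = ['/admin', '/login', '/account', '/cart', '/checkout', '/search'];

function isTrap(url, maxPageNum = 50) {
  const { pathname } = new URL(url);
  if (SKIP_PATHS.some(p => pathname.startsWith(p))) return true;
  const m = pathname.match(/\/page\/(\d+)/);
  return m !== null && Number(m[1]) > maxPageNum;
}
```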
5) Applies crawl limits
Defaults have been set as:
- Maximum 5000 URLs
- Maximum depth of 15
- Optional delay between requests
These can, however, be overridden by passing parameters to the crawler (see Optional Settings below).
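Reading `--name=value` overrides from the command line needs only a few lines of `process.argv` handling. The option names match the documented flags; the parsing shown here is a generic sketch, not necessarily what crawlerfpc.js does:

```javascript
// Merge --max_urls=N style flags over the built-in defaults.
// Unknown flags and the start URL are simply ignored here.
function parseOptions(argv, defaults = { max_urls: 5000, max_depth: 15, delay_ms: 0, max_page_num: 50 }) {
  const opts = { ...defaults };
  for (const arg of argv) {
    const m = arg.match(/^--([a-z_]+)=(\d+)$/);
    if (m && m[1] in opts) opts[m[1]] = Number(m[2]);
  }
  return opts;
}
```

In the real script this would be called with `process.argv.slice(2)`.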
About Using the FPC Crawler
System Requirements to run the Crawler
- Node.js (LTS version recommended)
- Works on Windows, macOS, and Linux
How to Install (Windows Example)
- Download Node.js LTS from: https://nodejs.org
- Install with default settings
- Verify installation:
node -v
npm -v
How to Run the FPC Crawler
- Save the crawler code as: crawlerfpc.js
- Open Command Prompt and navigate to the folder:
cd C:\crawlerfpc
- Run:
node crawlerfpc.js https://example.com
As it runs, it shows progress: the URLs discovered, the count, and the queue size.
When finished, you’ll see something like:
Done. Found XXX URLs.
Saved to urls.txt
Your URLs will be saved in: urls.txt
Optional Settings
You can override defaults:
node crawlerfpc.js https://example.com --max_urls=5000 --max_depth=15 --delay_ms=250 --max_page_num=50
Available options:
--max_urls, --max_depth, --delay_ms, --max_page_num
A delay of 200 to 250 ms helps avoid errors and server-side rejection.
What This FirstPass Crawler is Useful For
This crawler is useful when you need to work on the following:
- A URL inventory before site migration
- A list of pages to check meta tags separately
- A redirect planning sheet
- A quick internal content overview
- A non-WordPress site audit
- A base input for another analysis module
Limitations
Like any crawler:
- It cannot find orphan pages (pages not linked anywhere)
- It cannot access password-protected areas
- It does not execute JavaScript-heavy SPAs
- It only finds what is reachable via HTML links
For most traditional websites, that is enough.
Welcoming Your Thoughts on This
A note to other SEO tinkerers and similar users.
This is a simple, basic tool. Instead of relying on heavyweight SEO software or browser extensions, this crawler gives you control, transparency, predictable output, and zero dependencies.
If you find it useful, feel free to modify and improve it for your own workflow.
And if you build enhancements, consider sharing them.