
A FirstPass Crawler (in node.js) to List URLs of any website

Posted on February 21, 2026 by Editor

Table of Contents

  • About the FirstPass Crawler (FPC)
    • Download it here
    • What This FPC Crawler Does – the Process
    • FirstPass Crawl Rules
      • 1) Same-site only
      • 2) Removes query parameters
      • 3) Excludes file-like URLs
      • 4) Avoids crawl traps
      • 5) The Crawling Limits are:
  • About Using the FPC Crawler
    • System Requirements to run the Crawler
    • How to Install (Windows Example)
    • How to Run the FPC Crawler
    • Optional Settings
  • What This FirstPass Crawler is Useful For
    • Limitations
  • Welcoming Your Thoughts on this

About the FirstPass Crawler (FPC)

This is a simple website crawler that I vibe-coded (mostly with ChatGPT, plus two more AIs), mainly for starting SEO analysis of non-WordPress websites.
Why? Because the first thing you need when checking out a website is a fast overview of its contents. Usually I fire up two or three website crawlers and compile the results. But then I also have to go through the long URL lists, sieving, sorting, discarding, and filtering to get the first list of clean URLs that I really need. Much of this is overkill, with excessive detail that wastes time and effort.

This FirstPass Crawler (FPC) is a straightforward tool that does just one thing – get a clean list of all page URLs on a website.
No metadata, no status codes, and none of the other extra data, letting you start your work asap.

This crawler is especially useful when:

  • The website does not have an existing sitemap, like a non-WordPress website
  • You want to audit internal linking
  • You need to migrate a website
  • You’re building redirects
  • You want to run second-level tools, once you can quickly and easily group sets of URLs from the first list.

So I built a simple Node.js crawler that does exactly that: no frameworks, no dependencies, and just one single file.

Download it here

You may >> [ download the CrawlerFPC here ]

and save the file crawlerfpc.js to the folder C:\crawlerfpc

What This FPC Crawler Does – the Process

This crawler:

  • Crawls a website starting from the homepage
  • Discovers internal links
  • Filters out junk and non-content URLs
  • Removes query parameters
  • Excludes file downloads (PDFs, images, etc.)
  • Avoids infinite crawl traps
  • Outputs a clean list of page URLs

What it does not do. It doesn't:

  • Check status codes
  • Extract metadata
  • Parse SEO tags
  • Follow external links
  • Crawl subdomains

It’s intentionally simple. When you can extract a first list of URLs without the clutter, you can start work asap, beginning broadly and digging deeper on the next pass.
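The process above can be sketched as a small breadth-first loop. This is an illustrative, self-contained sketch run against a tiny in-memory "site"; the `pages` map, `firstPass`, and its parameters are stand-ins of my own, not the actual crawlerfpc.js code, which fetches pages over HTTP and extracts `<a href>` links:

```javascript
// Tiny in-memory "site": each URL maps to the links found on that page.
const pages = {
  "https://example.com/": ["https://example.com/a", "https://example.com/b"],
  "https://example.com/a": ["https://example.com/b"],
  "https://example.com/b": [],
};

// Breadth-first crawl: a queue of [url, depth] pairs and a visited set,
// stopping at the URL cap or the depth cap.
function firstPass(startUrl, maxUrls = 5000, maxDepth = 15) {
  const seen = new Set([startUrl]);
  const queue = [[startUrl, 0]];
  while (queue.length && seen.size < maxUrls) {
    const [url, depth] = queue.shift();
    if (depth >= maxDepth) continue;
    for (const link of pages[url] || []) { // stand-in for fetch + link extraction
      if (!seen.has(link)) {               // filters (same-site, files, traps) go here
        seen.add(link);
        queue.push([link, depth + 1]);
      }
    }
  }
  return [...seen];
}

console.log(firstPass("https://example.com/").length); // 3
```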

FirstPass Crawl Rules

These are the rules I have defined to simplify the process.

1) Same-site only

  • Treats example.com and www.example.com as the same site
  • Excludes all other subdomains
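One way to implement this rule is to strip a leading "www." from both hostnames and compare; any other subdomain then fails the check. This helper is my own illustrative sketch, not the exact code in crawlerfpc.js:

```javascript
// Same-site check: example.com and www.example.com compare equal;
// every other subdomain (blog., shop., …) is rejected.
function isSameSite(candidateUrl, rootUrl) {
  const host = (u) => new URL(u).hostname.replace(/^www\./, "");
  return host(candidateUrl) === host(rootUrl);
}

console.log(isSameSite("https://www.example.com/about", "https://example.com")); // true
console.log(isSameSite("https://blog.example.com/post", "https://example.com")); // false
```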

2) Removes query parameters

Example: https://example.com/page?a=1#top
Becomes: https://example.com/page?

(The ? is retained to indicate the page was discovered with parameters.)
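The normalization rule can be sketched with Node's built-in WHATWG URL API: drop the fragment and the query string, but keep a trailing "?" when parameters were present. This is my reading of the rule, not the literal crawlerfpc.js function:

```javascript
// Strip query string and fragment; keep a bare "?" as a marker
// that the URL was originally discovered with parameters.
function normalizeUrl(raw) {
  const u = new URL(raw);
  const hadParams = u.search.length > 0;
  return u.origin + u.pathname + (hadParams ? "?" : "");
}

console.log(normalizeUrl("https://example.com/page?a=1#top")); // https://example.com/page?
console.log(normalizeUrl("https://example.com/page#top"));     // https://example.com/page
```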

3) Excludes file-like URLs

This excludes links to images, documents, etc., which are usually not needed for the first basic list.

  • .pdf
  • .jpg, .png, .svg
  • .zip
  • .mp4
  • .css, .js, .json
  • and other non-HTML assets
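A simple extension filter covers this. The extension list below is my reading of the rules above, not the exact list inside crawlerfpc.js:

```javascript
// Reject URLs whose path ends in a known non-HTML asset extension.
const SKIP_EXT = /\.(pdf|jpe?g|png|gif|svg|webp|ico|zip|gz|mp3|mp4|css|js|json|xml|woff2?)$/i;

function isFileLike(url) {
  return SKIP_EXT.test(new URL(url).pathname);
}

console.log(isFileLike("https://example.com/brochure.pdf")); // true
console.log(isFileLike("https://example.com/about"));        // false
```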

4) Avoids crawl traps

Some non-essential sections and circular references are also best avoided, so it excludes:

  • /admin
  • /login
  • /account
  • /cart
  • /checkout
  • /search
  • large pagination loops like /page/999
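The trap filter can be sketched as a path prefix check plus a pagination cap (the max_page_num setting). The path list here comes from the bullets above; the helper itself is an illustrative stand-in:

```javascript
// Paths that lead to login walls, carts, or circular search/filter pages.
const TRAP_PATHS = ["/admin", "/login", "/account", "/cart", "/checkout", "/search"];

function isTrap(url, maxPageNum = 50) {
  const path = new URL(url).pathname.toLowerCase();
  if (TRAP_PATHS.some((p) => path === p || path.startsWith(p + "/"))) return true;
  // Cap pagination loops: /page/999 is rejected, /page/3 is allowed.
  const m = path.match(/\/page\/(\d+)/);
  return m !== null && Number(m[1]) > maxPageNum;
}

console.log(isTrap("https://example.com/page/999"));   // true
console.log(isTrap("https://example.com/blog/page/3")); // false
```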

5) The Crawling Limits are:

Defaults have been set as:

  • Maximum 5000 URLs
  • Maximum depth of 15
  • Optional delay between requests

These can, however, be overridden by passing parameters to the crawler (see Optional Settings below).

About Using the FPC Crawler

System Requirements to run the Crawler

  • Node.js (LTS version recommended)
  • Works on Windows, macOS, and Linux

How to Install (Windows Example)

  1. Download Node.js LTS from: https://nodejs.org
  2. Install with default settings
  3. Verify installation:
node -v
npm -v

How to Run the FPC Crawler

  1. Save the crawler code as: crawlerfpc.js
  2. Open Command Prompt and navigate to the folder:
cd C:\crawlerfpc
  3. Run:
node crawlerfpc.js https://example.com

As it runs, it shows progress: the URLs discovered, the running count, and the queue size.
When finished, you’ll see something like:

Done. Found XXX URLs.
Saved to urls.txt

Your URLs will be saved in: urls.txt

Optional Settings

You can override defaults:

node crawlerfpc.js https://example.com --max_urls=5000 --max_depth=15 --delay_ms=250 --max_page_num=50

Available options:

  • --max_urls
  • --max_depth
  • --delay_ms
  • --max_page_num

A delay of 200 or 250 ms helps avoid server errors and request rejections.
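Flags in the `--name=value` style above can be folded into the defaults with a few lines of argument parsing. This is a minimal sketch of that idea; the actual parsing inside crawlerfpc.js may differ:

```javascript
// Merge --max_urls=…, --max_depth=…, --delay_ms=…, --max_page_num=…
// from the command line into the default settings; unknown flags are ignored.
function parseOptions(argv) {
  const opts = { max_urls: 5000, max_depth: 15, delay_ms: 0, max_page_num: 50 };
  for (const arg of argv) {
    const m = arg.match(/^--([a-z_]+)=(\d+)$/);
    if (m && m[1] in opts) opts[m[1]] = Number(m[2]);
  }
  return opts;
}

console.log(parseOptions(["https://example.com", "--delay_ms=250"]));
// { max_urls: 5000, max_depth: 15, delay_ms: 250, max_page_num: 50 }
```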


What This FirstPass Crawler is Useful For

This crawler is useful when you need any of the following:

  • A URL inventory before site migration
  • A list of pages to check meta tags separately
  • A redirect planning sheet
  • A quick internal content overview
  • A non-WordPress site audit
  • A base input for another analysis module

Limitations

Like any crawler:

  • It cannot find orphan pages (pages not linked anywhere)
  • It cannot access password-protected areas
  • It does not execute JavaScript-heavy SPAs
  • It only finds what is reachable via HTML links

For most traditional websites, that is enough.

Welcoming Your Thoughts on this

A note to other SEO tinkerers and similar users.

This is a basic, simple tool. Instead of relying on heavyweight SEO software or browser extensions, this crawler gives you
control, transparency, predictable output, and zero dependencies.

If you find it useful, feel free to modify and improve it for your own workflow.
And if you build enhancements, consider sharing them.
