Substack Broken Link Checker

A fast, async Python tool to find broken links in your Substack newsletter archive.

Why This Tool?

Checking for broken links in your newsletter archive shouldn't require a $100+/month subscription to a tool like Semrush or Ahrefs. This free, open-source tool:

Works with Substack's bot protection - Uses your session cookie to authenticate as a logged-in user
Handles large archives efficiently - Async concurrent checking is 10-20x faster than sequential
Tracks what you've already checked - Incremental scanning means you only check new posts

Features

Fast: Async concurrent checking (10-20x faster than sequential)
Smart caching: Same link across multiple posts? Checked once
Retry logic: Exponential backoff for transient failures
Incremental scanning: Track checked posts, only scan new ones
Domain filtering: Skip bot-blocking sites (Wikipedia), auto-flag known broken domains
Multiple error types: HTTP 404, soft 404s, SSL errors, DNS failures, timeouts
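
As a rough illustration of how concurrent checking and link caching can fit together, here is a minimal sketch using only the standard library. The names `check_all` and `fake_fetch` are hypothetical and not taken from the package; `fake_fetch` stands in for a real HTTP request so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def check_all(
    urls: List[str],
    fetch_status: Callable[[str], Awaitable[int]],
    concurrency: int = 10,
) -> Dict[str, int]:
    """Check each distinct URL once, with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    cache: Dict[str, int] = {}

    async def check_one(url: str) -> None:
        async with semaphore:  # caps the number of simultaneous checks
            cache[url] = await fetch_status(url)

    # dict.fromkeys de-duplicates while keeping first-seen order, so a link
    # that appears in several posts is only fetched once.
    unique_urls = list(dict.fromkeys(urls))
    await asyncio.gather(*(check_one(url) for url in unique_urls))
    return cache


async def fake_fetch(url: str) -> int:
    """Hypothetical stand-in for an HTTP request (no network involved)."""
    await asyncio.sleep(0)
    return 404 if "missing" in url else 200


if __name__ == "__main__":
    links = [
        "https://a.example/ok",
        "https://a.example/missing",
        "https://a.example/ok",  # duplicate: served from the cache
    ]
    print(asyncio.run(check_all(links, fake_fetch, concurrency=2)))
```

The semaphore plays the role that `--concurrency` plays on the command line: a lower value means fewer simultaneous requests.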

Quick Start

# Install (provides the `substack-link-checker` CLI)
pip install git+https://github.com/jcddc83/substack-broken-link-checker.git

# Check all posts from 2024
substack-link-checker check --base-url https://YOUR.substack.com --year 2024

# Check posts from a file
substack-link-checker check --base-url https://YOUR.substack.com --url-file posts.txt

Installation

git clone https://github.com/jcddc83/substack-broken-link-checker.git
cd substack-broken-link-checker
pip install -e .

Or directly from GitHub:

pip install git+https://github.com/jcddc83/substack-broken-link-checker.git

This installs the substack-link-checker console command. Equivalent invocations:

substack-link-checker check ...
python -m substack_link_checker check ...

Requirements: Python 3.8+

Migrating from v1.0.0

v1.0.0 shipped flat scripts at the repo root. They have been reorganised into a substack_link_checker package with a subcommand CLI.

Old (v1.0.0)	New
`python substack_link_checker.py ...`	`substack-link-checker check ...`
`python compare_posts.py ...`	`substack-link-checker compare ...`
`python import_checked_posts.py ...`	`substack-link-checker import ...`
`python fetch_archive_urls.py ...`	`substack-link-checker fetch-archive ...`
`python demo_link_checker.py`	`substack-link-checker demo`

The four helper scripts (compare_posts.py, import_checked_posts.py, fetch_archive_urls.py, demo_link_checker.py) are kept at the root as thin back-compat shims, so existing python compare_posts.py ... invocations and the bundled PowerShell scheduled task continue to work. The main substack_link_checker.py script could not be kept as a shim because its name collides with the new package — use substack-link-checker check ... or python -m substack_link_checker check ... instead.

Authentication (Optional)

If Substack blocks your requests or you need to check paywalled content, use your session cookie:

1. Log into your Substack in a browser
2. Open Developer Tools (F12) → Application → Cookies
3. Find the substack.sid cookie and copy its value
4. Provide it via the SUBSTACK_COOKIE environment variable (recommended) or the --cookie flag:

# Recommended: env var (keeps cookie out of shell history / ps aux)
export SUBSTACK_COOKIE="your-substack-sid-cookie-value"
substack-link-checker check --base-url https://YOUR.substack.com --year 2024

# Alternative: --cookie flag (visible in process listings)
substack-link-checker check --base-url https://YOUR.substack.com --year 2024 \
    --cookie "your-substack-sid-cookie-value"

Security: Treat the session cookie like a password. Prefer the env var so it does not end up in your shell history or in ps aux. See SECURITY.md for full guidance.

Note: Your session cookie expires after a few weeks. If you start getting 403 errors, get a fresh cookie from your browser.
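
If you are scripting around the tool, the two ways of supplying the cookie could be resolved along these lines. This is a sketch under assumptions: the function names are hypothetical, and letting the flag take precedence over the environment variable is a guess, not documented behaviour. Only the `SUBSTACK_COOKIE` variable and the `substack.sid` cookie name come from this README.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_cookie(
    cli_value: Optional[str],
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the session cookie value, or None if none was supplied.

    Assumption: an explicit --cookie value wins over SUBSTACK_COOKIE.
    """
    value = cli_value or environ.get("SUBSTACK_COOKIE")
    if not value:
        return None
    return value.strip() or None  # treat whitespace-only values as unset


def cookie_header(value: str) -> Dict[str, str]:
    """Build a Cookie request header carrying the substack.sid cookie."""
    return {"Cookie": f"substack.sid={value}"}


if __name__ == "__main__":
    cookie = resolve_cookie(None)
    # Never print the cookie itself; report only whether one was found.
    print("cookie configured" if cookie else "no cookie configured")
```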

Troubleshooting

Common failure modes and how to fix them:

`HTTP 403 Forbidden` when fetching the sitemap or post pages

Substack's bot protection is rejecting the request. Fixes, in order of likelihood:

1. If no cookie is set: set SUBSTACK_COOKIE (see Authentication above) so you're requesting as a logged-in user.
2. If you had a cookie set: it has probably expired (Substack rotates session cookies every few weeks). Grab a fresh one from DevTools.
3. If the cookie is current: lower --concurrency (try --concurrency 3) so you look less bot-like.

`Sitemap returns no posts for --year YYYY`

The year-specific sitemap (e.g. /sitemap-2024.xml) doesn't exist for your Substack — some accounts only expose a single combined sitemap. Fall back to scraping the archive page:

substack-link-checker fetch-archive https://YOUR.substack.com 2024
# Produces archive_urls_2024.txt
substack-link-checker check --base-url https://YOUR.substack.com \
    --url-file archive_urls_2024.txt

`DNS Failure` or `Timeout` for links that work in your browser

The target site is probably rate-limiting or geo-blocking the checker rather than actually being broken. Add it to --skip-domains so it's assumed OK:

substack-link-checker check ... --skip-domains rate-limited.example.com

For a recurring list, put one domain per line in a file and pass --skip-domains-file path/to/file.txt.
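
A domain file of this kind could be read and matched roughly as follows. The function names are hypothetical, and matching subdomains (so that `wikipedia.org` also covers `en.wikipedia.org`) is an assumption about how such a filter might behave, not a statement about the tool's implementation.

```python
from typing import Iterable, Set
from urllib.parse import urlsplit


def parse_domain_lines(lines: Iterable[str]) -> Set[str]:
    """Collect one domain per line, ignoring blank lines and letter case."""
    domains = set()
    for line in lines:
        entry = line.strip().lower()
        if entry:
            domains.add(entry)
    return domains


def is_listed(url: str, domains: Set[str]) -> bool:
    """True if the URL's host equals a listed domain or is a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


if __name__ == "__main__":
    skip = parse_domain_lines(["wikipedia.org\n", "\n", "rate-limited.example.com\n"])
    for link in ("https://en.wikipedia.org/wiki/Substack", "https://example.org/page"):
        print(link, "skipped" if is_listed(link, skip) else "checked")
```

Comparing against the parsed hostname, rather than searching the whole URL for the domain string, avoids accidentally skipping a link such as `https://notwikipedia.org/`.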

`Connection Error: ...ssl:default` / `SSL Error`

The target host is typically using an old TLS version that Python's ssl module no longer accepts by default, or it presents a certificate that fails verification. Usually the right call is to flag the domain as broken (it really is unreachable from a modern client):

substack-link-checker check ... --broken-domains old-tls.example.com

Many `Soft 404 (page title indicates error)` results that look fine

The detector matches phrases like "page not found" in the page <title>. If a legitimate post happens to have one of those phrases in its title, it'll be misflagged. Open the report, eyeball the URL, and if it's genuinely live, ignore those rows.
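
To see why such false positives happen, here is a minimal sketch of title-based soft 404 detection. The phrase list and function names are illustrative assumptions; only "page not found" is mentioned in this README, and the tool's real list may differ.

```python
import re
from typing import Optional

# Illustrative phrases only; not the tool's actual list.
ERROR_PHRASES = ("page not found", "404 not found")

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str) -> Optional[str]:
    """Return the text of the first <title> element, whitespace-normalised."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    return " ".join(match.group(1).split())


def looks_like_soft_404(html: str) -> bool:
    """True if the page title contains one of the error phrases."""
    title = extract_title(html)
    if title is None:
        return False
    lowered = title.lower()
    return any(phrase in lowered for phrase in ERROR_PHRASES)


if __name__ == "__main__":
    error_page = "<html><head><title>Page Not Found</title></head></html>"
    real_post = "<title>Why 'Page Not Found' errors hurt your newsletter</title>"
    print(looks_like_soft_404(error_page))  # a genuine soft 404
    print(looks_like_soft_404(real_post))   # a live post that is misflagged
```

Both pages are flagged, because a substring match on the title cannot tell an error page from a post that discusses error pages.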

The CSV report file is empty / has only a header

Either no broken links were found (look for "No broken links found!" in the summary) or the run was interrupted before report generation. The tool only writes the CSV after all posts have been processed successfully.

`--only-new` is not skipping anything

Make sure --history-file points at the same JSON file you used on the previous run. The history file is the source of truth for which posts have already been checked; without it --only-new has nothing to compare against.
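
The idea behind `--only-new` can be sketched as a set difference between the posts found and the posts recorded in the history file. The on-disk format is not described in this README, so the flat JSON list of post URLs used here is purely an assumption, as are the function names.

```python
import json
from pathlib import Path
from typing import Iterable, List, Set


def load_history(path: Path) -> Set[str]:
    """Load checked post URLs; a missing file means nothing was checked yet.

    Assumption: the file holds a flat JSON list of URLs.
    """
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_history(path: Path, checked: Iterable[str]) -> None:
    """Write the checked URLs back as a sorted JSON list."""
    path.write_text(json.dumps(sorted(checked), indent=2), encoding="utf-8")


def only_new(post_urls: Iterable[str], checked: Set[str]) -> List[str]:
    """Keep posts that are not in the history, preserving their order."""
    return [url for url in post_urls if url not in checked]


if __name__ == "__main__":
    history = {"https://example.substack.com/p/my-first-post"}
    posts = [
        "https://example.substack.com/p/my-first-post",
        "https://example.substack.com/p/another-post",
    ]
    print(only_new(posts, history))
```

With a wrong or missing history path, `load_history` returns an empty set and every post counts as new, which matches the symptom described above.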

Usage

Basic Usage

# Check posts from a specific year (uses sitemap)
substack-link-checker check --base-url https://example.substack.com --year 2024

# Check posts from a URL file
substack-link-checker check --base-url https://example.substack.com --url-file posts.txt

# Verbose output with custom report name
substack-link-checker check --base-url https://example.substack.com --year 2024 \
    --verbose --output december_report.csv

Incremental Scanning (Recommended)

Track which posts you've already checked to avoid re-scanning:

# First run: checks all posts, saves history
substack-link-checker check --base-url https://example.substack.com --year 2024 \
    --history-file checked_posts.json

# Subsequent runs: only check new posts
substack-link-checker check --base-url https://example.substack.com --year 2024 \
    --history-file checked_posts.json --only-new

Domain Filtering

# Skip domains that block bots (assumed OK)
substack-link-checker check ... --skip-domains wikipedia.org

# Auto-flag domains as broken without checking
substack-link-checker check ... --broken-domains old.defunct-site.com

Finding Unchecked Posts

# Compare your sitemap against history to find unchecked posts
substack-link-checker compare https://example.substack.com checked_posts.json
# Outputs: unchecked_posts.txt

# Then check just those posts
substack-link-checker check --base-url https://example.substack.com \
    --url-file unchecked_posts.txt --history-file checked_posts.json

Example Output

$ substack-link-checker check --base-url https://example.substack.com --year 2024

Substack Broken Link Checker
==================================================
Base URL: https://example.substack.com
Concurrency: 10
Max retries: 3
Input: Sitemap
Year: 2024
==================================================

Found 45 posts from 2024
[1/45] Processing: https://example.substack.com/p/my-first-post
  Checking 12 links (10 new, 2 cached)...
  Found 1 broken links in this post

[2/45] Processing: https://example.substack.com/p/another-post
  Checking 8 links (6 new, 2 cached)...
  Found 0 broken links in this post
...

Completed in 34.2 seconds

==================================================
SUMMARY
==================================================
Total links checked: 234
Links skipped (assumed OK): 8
Links auto-flagged broken: 0
Cache hits: 45
Retries performed: 3
Broken links found: 5

Generating report: broken_links_report.csv
Report generated with 5 broken links

CLI Options

Option	Short	Description
`--base-url`	`-b`	Your Substack URL (required)
`--year`	`-y`	Year to check (uses sitemap)
`--url-file`	`-f`	File with post URLs (one per line)
`--output`	`-o`	Output CSV filename (default: broken_links_report.csv)
`--concurrency`	`-c`	Parallel requests (default: 10)
`--timeout`	`-t`	Request timeout in seconds (default: 10)
`--max-retries`	`-r`	Retry attempts for failures (default: 3)
`--history-file`	`-H`	JSON file for tracking checked posts
`--only-new`		Only check posts not in history
`--skip-domains`	`-S`	Domains to skip (assumed OK)
`--skip-domains-file`		File with domains to skip (one per line)
`--broken-domains`	`-B`	Domains to auto-flag as broken
`--broken-domains-file`		File with domains to auto-flag (one per line)
`--cookie`	`-C`	Substack session cookie for authentication
`--verbose`	`-v`	Show detailed progress
`--limit`	`-l`	Max posts to check
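
To illustrate how `--max-retries` and exponential backoff might interact, here is a minimal sketch. The one-second base delay, the doubling factor, the retried exception types and the function names are all assumptions for illustration, not the tool's actual values.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base_delay: float = 1.0, factor: float = 2.0) -> List[float]:
    """Delay before each retry: base_delay, base_delay * factor, and so on."""
    return [base_delay * factor ** attempt for attempt in range(max_retries)]


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient failures with growing delays."""
    delays = backoff_delays(max_retries, base_delay)
    for attempt in range(max_retries + 1):  # one initial try plus the retries
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
    raise AssertionError("unreachable")


if __name__ == "__main__":
    print(backoff_delays(3))  # [1.0, 2.0, 4.0]
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.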

Subcommands

Command	Purpose
`substack-link-checker check`	Main link checker (the command shown throughout this README)
`substack-link-checker compare`	Find posts not yet checked (sitemap vs history)
`substack-link-checker import`	Import previous results from Excel/CSV into history
`substack-link-checker fetch-archive`	Extract URLs from the `/archive` page (fallback when the sitemap doesn't work)
`substack-link-checker demo`	Self-contained demo against a handful of known-good/bad URLs
`run_link_checker.ps1`	Windows Task Scheduler automation (a PowerShell script at the repo root, not a subcommand)

Output

The tool generates a CSV report with columns:

Post Title: Title of the post containing the broken link
Post URL: URL of the post
Broken Link: The broken URL
Error Type: What went wrong (HTTP 404, DNS Failure, SSL Error, etc.)
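
Because the report is plain CSV with the four columns listed above, it is easy to post-process. The sketch below writes a small report in that layout and tallies rows by error type. The function names and sample rows are made up for illustration, and the exact header spelling in the real report is assumed to match the column names above.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable, TextIO

COLUMNS = ["Post Title", "Post URL", "Broken Link", "Error Type"]


def write_report(rows: Iterable[Dict[str, str]], out: TextIO) -> int:
    """Write a header plus one line per broken link; return the row count."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


def count_by_error_type(report: TextIO) -> Counter:
    """Tally the rows of a report by their 'Error Type' column."""
    return Counter(row["Error Type"] for row in csv.DictReader(report))


if __name__ == "__main__":
    sample = [
        {"Post Title": "My First Post",
         "Post URL": "https://example.substack.com/p/my-first-post",
         "Broken Link": "https://gone.example/page",
         "Error Type": "HTTP 404"},
        {"Post Title": "Another Post",
         "Post URL": "https://example.substack.com/p/another-post",
         "Broken Link": "https://slow.example/",
         "Error Type": "Timeout"},
    ]
    buffer = io.StringIO(newline="")
    write_report(sample, buffer)
    buffer.seek(0)
    print(count_by_error_type(buffer))
```

To analyse a real report, open it with `open("broken_links_report.csv", newline="", encoding="utf-8")` and pass the file object to `count_by_error_type`.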

Error Types Detected

HTTP 404 - Page not found
HTTP 4xx/5xx - Other HTTP errors
Soft 404 - Page loads but title indicates error
DNS Failure - Domain doesn't exist
SSL Error - Certificate problems
Timeout - Server didn't respond
Connection Error - Network issues
Known broken domain - Auto-flagged via --broken-domains
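
As a rough picture of how failures could be sorted into the labels above, the following sketch maps standard-library exceptions and HTTP status codes to error type strings. It is an assumption-laden illustration: the tool's own HTTP client may raise different exception classes, and the function names are hypothetical.

```python
import asyncio
import socket
import ssl
from typing import Optional


def classify_exception(exc: BaseException) -> str:
    """Map a low-level failure to one of the report's error type labels.

    Order matters: SSL, DNS and timeout errors are all OSError subclasses
    on recent Python versions, so the generic check must come last.
    """
    if isinstance(exc, ssl.SSLError):
        return "SSL Error"
    if isinstance(exc, socket.gaierror):  # name resolution failed
        return "DNS Failure"
    if isinstance(exc, (asyncio.TimeoutError, TimeoutError, socket.timeout)):
        return "Timeout"
    if isinstance(exc, OSError):  # refused, reset, unreachable, ...
        return "Connection Error"
    return f"Unexpected error: {type(exc).__name__}"


def classify_status(status: int) -> Optional[str]:
    """Return an error label for HTTP 4xx/5xx responses, or None if OK."""
    if status == 404:
        return "HTTP 404"
    if status >= 400:
        return f"HTTP {status}"
    return None


if __name__ == "__main__":
    print(classify_exception(ConnectionRefusedError()))
    print(classify_status(503))
```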

License

MIT License - see LICENSE file.

Contributing

Issues and pull requests welcome at github.com/jcddc83/substack-broken-link-checker. See CONTRIBUTING.md for guidelines and SECURITY.md for reporting security issues.
