check-published-links
Validates links on published HTML documentation pages using linkchecker.
Overview
The check-published-links tool uses linkchecker to validate all links on published HTML documentation. Unlike validate-links which checks AsciiDoc source files, this tool checks the actual rendered HTML output—catching issues that only appear after publication.
Key Features
- Published HTML validation - Checks actual rendered documentation, not source files
- URL rewriting - Corrects misresolved paths for platforms with URL routing issues
- Single and bulk modes - Check one URL or a list of documentation pages
- False positive handling - Automatically filters known issues (host:port placeholders, Maven Central 403s)
- Detailed reporting - Generates comprehensive reports with error categorization
- Timeout detection - Identifies slow responses separately from broken links
When to Use
Use this tool when you need to:
- Validate after publishing - Verify links work on the live site
- Check rendered output - Catch issues in generated HTML (relative paths, image references)
- Validate documentation platforms - Handle platforms with URL routing that causes path resolution issues
- Bulk check documentation - Validate multiple guides in a single run
- Generate audit reports - Document link health for compliance or quality assurance
URL Rewriting
Some documentation platforms use URL routing that causes linkchecker to resolve relative paths incorrectly. The --rewrite-pattern and --rewrite-replacement options let you correct these paths automatically.
How It Works
- Linkchecker reports an error for a misresolved URL
- The tool checks if the URL matches a rewrite pattern
- If matched, it verifies the resource exists at the corrected path
- Verified paths are counted as valid (not errors)
Example
# Your docs serve from /docs/en/product/
# But images resolve to /docs/en/PRODUCT_CODE_1.0/images/
check-published-links https://docs.example.com/product/guide \
--rewrite-pattern "/docs/en/product/" \
--rewrite-replacement "/docs/en/PRODUCT_CODE_1.0/"
Basic Usage
Single URL
# Check a single documentation page
check-published-links https://docs.example.com/guide/index.html
# With increased timeout for slow sites
check-published-links https://docs.example.com/guide/ --timeout 60
Bulk Validation
Create a file with URLs to check (one per line):
# urls-to-check.txt
https://docs.example.com/guide1/
https://docs.example.com/guide2/
https://docs.example.com/guide3/
check-published-links --file urls-to-check.txt
With URL Rewriting
# Single rewrite rule
check-published-links https://docs.example.com/product/guide \
--rewrite-pattern "/docs/en/product/" \
--rewrite-replacement "/docs/en/PROD_V1.0/"
# Multiple rewrite rules
check-published-links https://docs.example.com/product/guide \
--rewrite-pattern "/docs/en/product/images/" \
--rewrite-replacement "/docs/en/PROD_V1.0/images/" \
--rewrite-pattern "/docs/en/product/resources/" \
--rewrite-replacement "/docs/en/PROD_V1.0/resources/"
With Custom Ignore Patterns
# Ignore URLs matching a regex pattern
check-published-links https://docs.example.com/guide/ \
--ignore-pattern "^https?://internal\.example\.com" \
--ignore-pattern "^https?://staging\."
Using a Configuration File
# Use explicit config file
check-published-links https://docs.example.com/guide/ --config linkcheck.conf
# Or place .check-published-links.conf in the current directory (auto-loaded)
Command Options
| Option | Description | Default |
|---|---|---|
URL | Single URL to check | - |
--file, -f | File containing URLs to check (one per line) | - |
--timeout, -t | Timeout for each link check in seconds | 15 |
--rewrite-pattern | URL pattern to match for rewriting (can be used multiple times) | None |
--rewrite-replacement | Replacement for matched pattern (must match –rewrite-pattern count) | None |
--ignore-pattern | Regex pattern for URLs to ignore (can be used multiple times) | None |
--config, -c | Configuration file path | .check-published-links.conf |
--reports-dir | Directory for report files | reports/ |
Output Examples
Successful Check
═══════════════════════════════════════════════════════════════
Documentation Link Checker
═══════════════════════════════════════════════════════════════
ℹ URL: https://docs.example.com/guide/
ℹ Guide: guide
ℹ Timeout: 15 seconds
ℹ Report: reports/link-check-report_2024-01-15_10-30-45.txt
✓ PASS - 127 links checked
✓ All documentation links are valid!
Check with False Positives
✓ PASS - All errors were false positives
Note: 3 path(s) verified at corrected location
Note: 2 known issues (host:port) - safe to ignore
Failed Check
✗ FAIL - 5 errors found
Note: 2 path(s) verified at corrected location (not counted)
Error: 1 path(s) NOT FOUND at corrected location
Note: 1 known issues (host:port) - safe to ignore
TIMEOUT: 1 links exceeded timeout limit
Link check found issues. Review the report for details.
Bulk Mode Output
═══════════════════════════════════════════════════════════════
Documentation Link Checker - Bulk Mode
═══════════════════════════════════════════════════════════════
ℹ URL List: urls-to-check.txt
ℹ Total URLs: 15
ℹ Timeout: 15 seconds
ℹ Report: reports/link-check-report_2024-01-15_10-30-45.txt
[1/15] Checking: Getting Started...
✓ PASS - 89 links
[2/15] Checking: Configuration Guide...
✓ PASS - 156 links [2 rewritten OK]
[3/15] Checking: Security Guide...
✗ FAIL - 3 errors found
═══════════════════════════════════════════════════════════════
Results
═══════════════════════════════════════════════════════════════
Total: 15 guides
Pass: 14
Fail: 1
Rate: 93.3%
Known Issues Handled
The tool automatically handles these common false positives:
1. Host:Port Placeholders
URLs like https://host:port/auth are documentation placeholders and are safe to ignore.
2. Comma-Separated URL Lists
Examples like http://www.example.com,http://localhost:3000 showing configuration format are filtered out.
3. Maven Central 403 Errors
Maven Central blocks automated access with 403 Forbidden. These links work in browsers.
4. Misresolved Paths (with rewrite rules)
When rewrite rules are configured, the tool verifies resources at corrected paths and doesn’t count verified paths as errors.
Reports
Reports are saved to the reports/ directory with timestamps:
reports/link-check-report_2024-01-15_10-30-45.txt
Report Contents
- Summary of all guides checked
- Statistics (total errors, known issues, timeouts)
- Detailed error information for failed guides
- URL, parent URL, and result for each error
Configuration File
You can save ignore patterns and rewrite rules in a configuration file to avoid repeating them on the command line.
File Location
The tool automatically loads .check-published-links.conf from the current directory if it exists. You can also specify a custom path with --config.
File Format
# General settings
[settings]
timeout = 30
reports-dir = ./build/link-reports/
# Ignore patterns (one regex per line)
[ignore-patterns]
^https?://internal\.example\.com
^https?://staging\.
^https?://private-api\.
# Rewrite rules (pattern = replacement)
[rewrite-rules]
/docs/en/product/ = /docs/en/PRODUCT_V1.0/
/docs/en/product/images/ = /docs/en/PRODUCT_V1.0/images/
Configuration Sections
| Section | Description |
|---|---|
[settings] | General settings like timeout and reports-dir |
[ignore-patterns] | Regex patterns for URLs to skip (one per line) |
[rewrite-rules] | URL path corrections (pattern = replacement) |
Settings Options
| Setting | Description | Default |
|---|---|---|
timeout | Timeout for each link check in seconds | 15 |
reports-dir | Directory for report files | reports/ |
Precedence
Options are resolved in this order (CLI overrides config file):
For timeout and reports-dir:
- Command-line option (if specified)
- Configuration file value (if present)
- Default value
For ignore patterns and rewrite rules:
- Default ignore patterns (localhost, example.com, etc.)
- Configuration file patterns (added to defaults)
- Command-line options (added to above)
This allows you to extend the defaults without replacing them.
Prerequisites
This tool requires linkchecker to be installed:
# Install with pipx (recommended)
pipx install linkchecker
# Or with pip
pip install linkchecker
For more information about linkchecker, see the linkchecker documentation.
The check-published-links tool wraps linkchecker to add URL rewriting for misresolved paths, automatic filtering of known false positives, and structured reporting with error categorization.
Performance Considerations
- Timeout: Default is 15 seconds per link. Increase for slow servers.
- External links: All external links are checked, which can be slow for large documentation sets.
- Rate limiting: Some servers may rate-limit automated checks. Consider running during off-peak hours.
Troubleshooting
Timeout Issues
If many links are timing out:
check-published-links https://docs.example.com/guide/ --timeout 60
SSL Certificate Issues
Linkchecker uses system certificates. Ensure your system’s CA certificates are up to date.
Too Many False Positives
Add custom ignore patterns via CLI or configuration file:
# Via CLI
check-published-links https://docs.example.com/guide/ \
--ignore-pattern "^https?://known-flaky-server\.com"
# Via config file (.check-published-links.conf)
[ignore-patterns]
^https?://known-flaky-server\.com
^https?://internal\.example\.com
Related Tools
- validate-links - Validates links in AsciiDoc source files
- extract-link-attributes - Extract links into reusable attributes
Comparison: validate-links vs check-published-links
| Feature | validate-links | check-published-links |
|---|---|---|
| Input | AsciiDoc source files | Published HTML URLs |
| When to use | Before publishing | After publishing |
| Attribute resolution | Yes | N/A |
| Cross-reference checking | Yes (xref:) | N/A |
| Rendered output checking | No | Yes |
| Image path verification | Source paths | Rendered paths |
| URL transposition | Yes | Via rewrite rules |
| Requires | Python only | linkchecker |
Use both tools together:
validate-linksduring development to catch issues earlycheck-published-linksafter publishing to verify the rendered output