Find Duplicate Content
Review Before Refactoring
This tool identifies duplicate and similar content that may be candidates for refactoring into reusable components. Before making changes:
- Review each duplicate carefully — some repetition is intentional
- Consider context — similar content might need different wording for different audiences
- Plan your refactoring strategy — decide whether to use includes, attributes, or shared modules
- Test your documentation build after consolidating content
What This Tool Detects
- Admonition blocks: NOTE:, TIP:, WARNING:, IMPORTANT:, CAUTION: (both inline and delimited)
- Tables: Content between |=== delimiters
- Step sequences: Ordered lists (numbered procedures)
- Code blocks: Content between ---- or .... delimiters
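To make the block types above concrete, here is a minimal Python sketch of how inline admonitions and delimited blocks could be located in AsciiDoc text. It is not the tool's actual parser: the function name `find_blocks`, the regular expression, and the simplification that an inline admonition occupies a single line are all assumptions for illustration.

```python
import re

# Hypothetical pattern: an inline admonition is a line starting with one of the five labels.
INLINE_ADMONITION = re.compile(r"^(NOTE|TIP|WARNING|IMPORTANT|CAUTION):\s+(.*)$")
# Delimiters named in the list above, mapped to the block type they open.
DELIMITERS = {"----": "code", "....": "code", "|===": "table"}

def find_blocks(text):
    """Return (block_type, start_line, content) tuples for a few simple block kinds."""
    blocks = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i].rstrip()
        match = INLINE_ADMONITION.match(line)
        if match:
            blocks.append((match.group(1).lower(), i + 1, match.group(2)))
        elif line in DELIMITERS:
            # Scan forward to the matching closing delimiter.
            end = i + 1
            while end < len(lines) and lines[end].rstrip() != line:
                end += 1
            if end < len(lines):  # record only blocks that are actually closed
                blocks.append((DELIMITERS[line], i + 1, "\n".join(lines[i + 1:end])))
                i = end
        i += 1
    return blocks
```

Line numbers are 1-based to match the `file:line` locations shown in the report. Multi-line admonition paragraphs and ordered lists are left out to keep the sketch short.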
Installation
After installing the package, run the tool from anywhere:
find-duplicate-content [directory] [options]
Or run from source:
python3 find_duplicate_content.py [directory] [options]
Options
| Option | Description |
|---|---|
| directory | Directory to scan (default: current directory) |
| -t, --type TYPE | Block types to search for (can be repeated). Choices: note, tip, warning, important, caution, table, steps, code |
| -s, --similarity THRESHOLD | Minimum similarity threshold (0.0-1.0). Default: 0.8 |
| -m, --min-length CHARS | Minimum content length to consider. Default: 50 |
| --exact-only | Only find exact duplicates |
| -e, --exclude-dir DIR | Directory to exclude (can be repeated) |
| --no-content | Hide content preview in output |
| --no-output | Do not write report file |
| --format FORMAT | Output format: txt (default), csv, json, md |
Examples
# Scan current directory for all types of duplicates
find-duplicate-content
# Scan a specific directory
find-duplicate-content ./docs/modules
# Find only notes, tables, and tips
find-duplicate-content -t note -t table -t tip
# Find exact duplicates only
find-duplicate-content --exact-only
# Find content with 70% or greater similarity
find-duplicate-content -s 0.7
# Exclude specific directories
find-duplicate-content -e archive -e drafts
# Write CSV report for spreadsheet analysis
find-duplicate-content --format csv
# Display results only, no report file
find-duplicate-content --no-output
Output
By default, the tool writes a report to ./reports/duplicate-content_YYYY-MM-DD_HH-MM-SS.txt. Use --format to change the output format or --no-output to skip file generation.
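Because the timestamp in the report name is zero-padded and ordered from year down to second, sorting report names alphabetically also sorts them chronologically. The following Python sketch uses that to pick the most recent report. The helper name `latest_report` is hypothetical, and the glob pattern assumes that reports in other formats keep the same `duplicate-content_` prefix with a different extension.

```python
from pathlib import Path

def latest_report(reports_dir="reports"):
    """Return the newest report path, or None if no report exists."""
    directory = Path(reports_dir)
    if not directory.is_dir():
        return None
    # Assumption: every report name starts with "duplicate-content_" followed by the timestamp.
    candidates = sorted(directory.glob("duplicate-content_*"), key=lambda p: p.name)
    return candidates[-1] if candidates else None
```

Sorting by name rather than by modification time keeps the result stable even if reports are copied or restored from a backup.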
Sample Text Report
✓ Found 3 groups of duplicate content
Command: find-duplicate-content -t note -t table
Directory: /home/user/my-docs
Found 3 groups of duplicate/similar content
======================================================================
[1] NOTE (EXACT) - 4 occurrences
--------------------------------------------------
Content preview:
For more information about configuring SSL/TLS, see the
Security Guide.
Locations:
- modules/getting-started/pages/configuration.adoc:45
- modules/security/pages/overview.adoc:23
- modules/deployment/pages/kubernetes.adoc:112
- modules/reference/pages/troubleshooting.adoc:78
[2] TABLE (85% similar) - 2 occurrences
--------------------------------------------------
Content preview:
|===
| Property | Default | Description
| quarkus.http.port | 8080 | The HTTP port
...
Locations:
- modules/configuration/pages/http.adoc:34
- modules/reference/pages/properties.adoc:156
CSV Format
Block Type,Similarity,Occurrences,File Path,Line Number,Content Preview
"note","exact",4,"modules/getting-started/pages/configuration.adoc",45,"For more information about..."
"table","0.85",2,"modules/configuration/pages/http.adoc",34,"|=== | Property | Default..."
Understanding Similarity
The tool uses word-based Jaccard similarity to compare content blocks:
| Similarity | Meaning |
|---|---|
| 1.0 (100%) | Exact duplicates (identical content after normalizing whitespace) |
| 0.8-0.99 | Very similar content with minor differences |
| 0.7-0.79 | Similar structure with some different wording (not reported at the default 0.8 threshold) |
| Below 0.7 | Generally different content |
Adjust the -s threshold based on your needs:
- --exact-only for identical content only
- -s 0.9 for very similar content
- -s 0.7 for a broader search including paraphrased content
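To show what a word-based Jaccard score measures, here is a minimal Python sketch. It illustrates the general technique, not the tool's own code: the function names are hypothetical, and lowercasing the words before comparison is an assumption.

```python
def normalize(text):
    """Collapse runs of whitespace so line breaks and indentation do not matter."""
    return " ".join(text.split())

def jaccard_similarity(a, b):
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    words_a = set(normalize(a).lower().split())  # lowercasing is an assumption
    words_b = set(normalize(b).lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty blocks are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)

first = "Restart the server after changing the port"
second = "Restart the server after changing the hostname"
score = jaccard_similarity(first, second)  # 5 shared words out of 7 distinct words
```

Here the score is 5/7, roughly 0.71, so this pair would be skipped at the default threshold of 0.8 and reported with -s 0.7. Because a set of words ignores order and repetition, a score of 1.0 on its own does not prove two blocks are identical; comparing the normalized text directly is one way to confirm an exact duplicate.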
Refactoring Strategies
After identifying duplicates, consider these approaches:
Create Reusable Includes
For content that should be identical everywhere:
// In _partials/ssl-note.adoc
NOTE: For more information about configuring SSL/TLS, see the Security Guide.
// In your documentation files
include::_partials/ssl-note.adoc[]
Use Attributes for Variable Content
For content that differs slightly:
// In attributes.adoc
:ssl-note: For more information about configuring SSL/TLS, see the Security Guide.
// In your documentation files
NOTE: {ssl-note}
Create Shared Modules
For larger blocks of content (procedures, tables):
// In shared/procedures/terminal-setup.adoc
. Open your terminal.
. Navigate to your project directory.
. Run the following command:
// In your documentation files
include::shared::procedures/terminal-setup.adoc[]
Limitations
- Context sensitivity: The tool cannot determine if similar content serves different purposes in different contexts
- Semantic similarity: Word-based comparison may miss semantically similar content with different wording
- Large files: Very large files may take longer to process
- Nested structures: Complex nested structures may not be parsed perfectly
Notes
- Only .adoc files are scanned
- Symlinks are ignored
- Default excluded directories: .git, .archive, target, build, node_modules
- The tool does not modify any files
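The scanning rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the name `iter_adoc_files` is hypothetical, and excluded directory names are assumed to match at any depth of the tree.

```python
import os

# Default excluded directory names, as listed above.
DEFAULT_EXCLUDES = frozenset({".git", ".archive", "target", "build", "node_modules"})

def iter_adoc_files(root, excludes=DEFAULT_EXCLUDES):
    """Yield paths of .adoc files under root, skipping excluded directories and symlinks."""
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Prune in place so os.walk never descends into excluded or symlinked directories.
        dirnames[:] = sorted(
            d for d in dirnames
            if d not in excludes and not os.path.islink(os.path.join(dirpath, d))
        )
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if name.endswith(".adoc") and not os.path.islink(path):
                yield path
```

The function only reads directory listings, which matches the note that no files are modified.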
See the main README.md for installation and usage details.