Classify Guides by Diataxis Type
This tool classifies AsciiDoc documentation guides by Diataxis content type (tutorial, how-to, concept, reference) using structural heuristics and optional LLM-assisted classification.
Installation
This is a standalone script (not a registered CLI entry point). Run it directly from the doc-utils directory:
python3 classify-guides.py [options]
How It Works
Classification uses three sources, checked in order:
- Metadata — If a file declares
:diataxis-type:in its AsciiDoc header, that value is used immediately with HIGH confidence. - Structural heuristics — The file is scanned for structural signals (headings, code blocks, prose patterns, config tables, etc.) and each Diataxis type receives a score. The highest-scoring type wins.
- LLM fallback (optional) — For guides that heuristics cannot confidently classify, an LLM analyzes a structural summary of the file and provides a classification.
Heuristic Signals
| Type | Signals detected |
|---|---|
| Tutorial | “Prerequisites”, “Creating the Maven project”, “Running the application” headings; include::{includes}/devtools/ includes; “step by step” language |
| How-to | “Procedure”, “Setting up”, “Configuring” headings; ordered list steps (. Step); imperative sentence openers (“Create”, “Configure”, “Enable”); -howto.adoc filename suffix |
| Reference | “Configuration Reference” headings; include::{generated-dir}/config/ includes; [cols= config tables; “reference” or “configuration” in title |
| Concept | “Overview”, “What is”, “How X works”, “Architecture” headings; explanatory prose patterns; image:: diagrams; high xref density with low code density |
Mixed-Type Detection
When two types both score above a threshold, the guide is classified as mixed (e.g., mixed:tutorial+reference). For tutorial+reference combinations, the output includes approximate split-point line numbers.
Usage
# Scan current directory for .adoc files
python3 classify-guides.py
# Scan a specific directory
python3 classify-guides.py --adoc-dir /path/to/docs
# Use a quarkus.yaml metadata file (classifies all guides)
python3 classify-guides.py --yaml-file /path/to/quarkus.yaml --all
# Write results to a specific file
python3 classify-guides.py --output my-results.yaml
Options
| Option | Description |
|---|---|
--adoc-dir DIR | Directory containing .adoc guide files (default: current directory) |
--yaml-file FILE | Path to quarkus.yaml metadata (optional; without it, scans --adoc-dir directly) |
--output FILE | Output YAML file (default: guide-classifications.yaml) |
--all | Classify all guides, not just type:guide entries (only relevant with --yaml-file) |
--llm | Use LLM classification for low-confidence or unclassified guides |
--llm-all | Use LLM classification for all guides |
--llm-provider PROVIDER | LLM provider: auto, gemini, anthropic, ollama (default: auto) |
--llm-api-key KEY | API key for the LLM provider (default: read from environment) |
LLM Configuration
The --llm flag enables LLM-assisted classification for guides that heuristics classify with LOW or no confidence. The --llm-all flag runs LLM classification on every guide.
Provider Auto-Detection
With --llm-provider auto (the default), the tool checks for available providers in this order:
- Google Gemini — if
GEMINI_API_KEYorGOOGLE_API_KEYis set - Anthropic Claude — if
ANTHROPIC_API_KEYis set - Ollama (local) — if an Ollama server is running at
http://localhost:11434
If no provider is available, a warning is printed and classification continues with heuristics only.
Google Gemini (free tier)
Gemini offers a free tier with 60 requests per minute, sufficient for classifying large documentation sets.
- Get an API key at https://aistudio.google.com/apikey.
- Set the environment variable:
export GEMINI_API_KEY="your-api-key-here"
- Run with LLM enabled:
python3 classify-guides.py --all --llm
The tool uses the gemini-2.0-flash model by default.
Anthropic Claude
- Get an API key at https://console.anthropic.com/.
- Set the environment variable:
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
- Run with LLM enabled:
python3 classify-guides.py --all --llm
# or explicitly:
python3 classify-guides.py --all --llm --llm-provider anthropic
The tool uses claude-haiku-4-5 for fast, low-cost classification.
Ollama (local, fully offline)
Ollama runs models locally with no API key or internet connection required.
- Install Ollama: https://ollama.com/download
- Pull a model:
ollama pull llama3.2
- Start the Ollama server (if not already running):
ollama serve
- Run with LLM enabled:
python3 classify-guides.py --all --llm --llm-provider ollama
Passing an API Key Directly
Instead of using environment variables, you can pass the API key on the command line:
python3 classify-guides.py --llm --llm-provider gemini --llm-api-key "your-key"
LLM Caching
LLM results are cached in ~/.cache/doc-utils/llm-classifications/ so that repeated runs do not re-classify the same guides. To clear the cache:
rm -rf ~/.cache/doc-utils/llm-classifications/
How LLM Results Are Merged
The LLM does not override heuristic results. Instead, it acts as a weighted tiebreaker:
| Scenario | Result |
|---|---|
| Heuristic is HIGH confidence | Heuristic wins. LLM result recorded for comparison. |
| Both agree | Confidence boosted to HIGH. |
| Heuristic is LOW/NONE, LLM is confident | LLM result used, capped at MEDIUM confidence. |
| Disagreement at MEDIUM confidence | Heuristic kept. Both results recorded. |
The output YAML includes llm_type and llm_agrees fields when LLM classification is used.
Output Format
The tool writes a YAML file with two sections:
classified — Per-guide results
classified:
- url: /guides/security-architecture
filename: security-architecture.adoc
title: Quarkus Security architecture
current_type: guide
suggested_type: concept
confidence: high
reason: "explicit :diataxis-type: concept attribute in file header"
source: metadata
lines: 112
code_blocks: 0
sections: 8
The source field indicates where the classification came from:
| Value | Meaning |
|---|---|
metadata | From :diataxis-type: attribute in the file header |
heuristic | From structural pattern analysis |
llm | From LLM classification (heuristic was low confidence) |
heuristic+llm | Both heuristic and LLM agreed |
error | File could not be read |
summary — Aggregate counts
summary:
total_analyzed: 268
by_type:
concept: 26
howto: 40
tutorial: 15
reference: 30
mixed:tutorial+reference: 22
guide: 18
by_confidence:
high: 214
medium: 26
low: 9
none: 19
by_source:
metadata: 57
heuristic: 210
error: 1
Examples
Classify all guides and review the summary:
python3 classify-guides.py --all
Find guides that need manual review (low confidence or unclassified):
python3 classify-guides.py --all --output results.yaml
python3 -c "
import yaml
with open('results.yaml') as f:
data = yaml.safe_load(f)
for g in data['classified']:
if g['confidence'] in ('low', 'none'):
print(f\"{g['confidence']:6s} {g['suggested_type']:20s} {g['filename']}\")
"
Use LLM to resolve low-confidence classifications:
export GEMINI_API_KEY="your-key"
python3 classify-guides.py --all --llm --output results-with-llm.yaml
Compare heuristic vs. LLM classifications:
python3 -c "
import yaml
with open('results-with-llm.yaml') as f:
data = yaml.safe_load(f)
for g in data['classified']:
if 'llm_type' in g and not g.get('llm_agrees', True):
print(f\"DISAGREE: {g['filename']}\")
print(f\" Heuristic: {g['suggested_type']} ({g['confidence']})\")
print(f\" LLM: {g['llm_type']}\")
"
See the main README.md for installation and general usage.