Normalize-First Canonicalization¶
Put Unicode normalization at the front of every text pipeline, run the remaining steps in a fixed, grapheme-correct order, and decide up front whether your output needs to be reversible or script-pure. disarm turns these from tribal knowledge into guarantees: pipeline step order is single-source and invariant-checked, and normalization provably never splits a grapheme cluster.
This page is a set of recipes built from existing functions — it introduces no new API.
Why normalize first¶
The same visible text can be encoded many ways (see Normalization). Preprocessing that runs before normalization — stripping, folding, transliterating, matching — sees those inconsistent encodings and produces inconsistent results. Worse, naive preprocessing can split an Indic conjunct or a combining-mark sequence, or mix scripts, corrupting both security checks and downstream models.
Normalizing first collapses the representations to one canonical form, so every later step operates on stable input.
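As a minimal illustration of that inconsistency, the sketch below uses Python's standard `unicodedata` module as a stand-in for `disarm.normalize` (an assumption, made only so the example has no dependencies). The two strings render identically, compare unequal as raw code points, and compare equal once both are normalized.

```python
import unicodedata

# Two encodings of the same visible text "café".
composed = "caf\u00e9"      # é as one precomposed code point (U+00E9)
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

# Before normalization, code-point comparison sees two different strings.
raw_equal = composed == decomposed

# After normalizing both sides to NFC, they collapse to one representation.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(raw_equal, nfc_equal)  # False True
```

Any step that matches, strips, or hashes the raw strings inherits the first result; a step that runs after normalization inherits the second.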
Guarantee 1 — the step order can't drift¶
TextPipeline always runs its steps in one fixed order, with normalization
first and the final whitespace cleanup last, regardless of the order in
which you pass the arguments:
from disarm import TextPipeline
pipe = TextPipeline(
    fold_case=True,            # passed first…
    normalize="NFKC",          # …but normalize always runs first
    confusables=True,
    collapse_whitespace=True,
)
assert [name for name, _param in pipe.steps] == [
    'normalize', 'confusables', 'fold_case',
    'strip_control', 'strip_zero_width', 'collapse_whitespace',
]
The order a pipeline reports (pipe.steps) is, by construction, the order it
executes — both read from one shared list inside the engine. A step cannot be
reported at one position and run at another (the class of bug that #141 was). If
you are introspecting a pipeline to audit it, what you see is what runs.
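One way to obtain this property is sketched below. The names `CANONICAL_ORDER`, `STEP_FUNCS`, and `MiniPipeline` are hypothetical and the code is not taken from disarm's source; it only illustrates the idea of deriving both the reported steps and the executed steps from a single ordered list.

```python
import unicodedata

# Hypothetical single source of truth for step order (illustrative only).
CANONICAL_ORDER = ["normalize", "fold_case", "collapse_whitespace"]

# Hypothetical step implementations built on the standard library.
STEP_FUNCS = {
    "normalize": lambda s: unicodedata.normalize("NFKC", s),
    "fold_case": str.casefold,
    "collapse_whitespace": lambda s: " ".join(s.split()),
}


class MiniPipeline:
    def __init__(self, **enabled):
        # Keyword order is ignored: the enabled steps are filtered
        # through CANONICAL_ORDER, which fixes their position.
        self.steps = [name for name in CANONICAL_ORDER if enabled.get(name)]

    def run(self, text):
        # Execution iterates the same list that is reported.
        for name in self.steps:
            text = STEP_FUNCS[name](text)
        return text


pipe = MiniPipeline(collapse_whitespace=True, fold_case=True, normalize=True)
print(pipe.steps)  # ['normalize', 'fold_case', 'collapse_whitespace']
# Fullwidth "Hello" plus irregular spacing.
print(pipe.run("  \uff28\uff45\uff4c\uff4c\uff4f   World "))  # hello world
```

Because `run` never consults anything other than `self.steps`, there is no second ordering that could disagree with the one an auditor reads.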
Guarantee 2 — normalization is grapheme-correct¶
Normalization respects grapheme-cluster boundaries. For every form (NFC/NFD/NFKC/NFKD), normalizing the whole string equals normalizing each grapheme cluster independently and rejoining:
import disarm
normalize_whole = lambda s, f: disarm.normalize(s, form=f)
normalize_parts = lambda s, f: "".join(
    disarm.normalize(g, form=f) for g in disarm.grapheme_split(s)
)
s = "क्ष" # Devanagari conjunct: KA + virama + SSA
assert normalize_whole(s, "NFC") == normalize_parts(s, "NFC")
use disarm::api::{self, NormalizationForm};
let s = "क्ष"; // Devanagari conjunct: KA + virama + SSA
let whole = api::normalize(s, NormalizationForm::Nfc);
let parts: String = api::grapheme_split(s)
    .iter()
    .map(|g| api::normalize(g, NormalizationForm::Nfc))
    .collect();
assert_eq!(whole, parts); // => true
In plain terms: normalization never orphans a combining mark, never splits an Indic conjunct, and never merges across cluster boundaries. This is verified exhaustively over every Hangul syllable, every Devanagari conjunct, the full combining-diacriticals block, and the whole BMP.
One intended exception to watch for: NFKC and NFKD can change the grapheme
count by expanding compatibility characters (the ligature ﬁ, U+FB01, becomes
the two letters fi, two clusters). That is normalization working as designed,
not a boundary violation, but it is one more reason to choose your form
deliberately (below).
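The expansion is easy to observe with the standard library's `unicodedata` module, used here as a stand-in for `disarm.normalize` (an assumption for the sake of a dependency-free sketch). The code counts code points, which for these particular strings matches the cluster count.

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI, a single code point

nfc = unicodedata.normalize("NFC", ligature)
nfkc = unicodedata.normalize("NFKC", ligature)

# NFC leaves the compatibility character alone.
print(len(nfc), nfc == ligature)   # 1 True
# NFKC expands it to the two letters "f" and "i".
print(len(nfkc), nfkc == "fi")     # 2 True
```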
If you need to shorten text without cutting a cluster in half, use
grapheme_truncate, which only cuts on boundaries.
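To see why boundary-aware truncation matters, here is a deliberately simplified sketch using only the standard library. The helper `truncate_keep_marks` is hypothetical and is not `grapheme_truncate`: it only avoids separating a base character from its trailing combining marks, and it does not implement full grapheme-cluster segmentation (it would still cut inside an Indic conjunct or an emoji sequence).

```python
import unicodedata


def truncate_keep_marks(text: str, max_code_points: int) -> str:
    """Cut text to at most max_code_points without orphaning combining marks.

    Simplification: a "unit" here is a base character plus any following
    characters with a nonzero canonical combining class.
    """
    if len(text) <= max_code_points:
        return text
    cut = max_code_points
    # If the first dropped character is a combining mark, back up to the
    # start of its base-plus-marks unit so the unit is dropped whole.
    while cut > 0 and unicodedata.combining(text[cut]) != 0:
        cut -= 1
    return text[:cut]


s = "cafe\u0301s"  # "cafés" with a decomposed é
print(repr(s[:4]))                      # 'cafe' (naive slice loses the accent)
print(repr(truncate_keep_marks(s, 4)))  # 'caf'  (the whole é unit is dropped)
```

The naive slice silently turns é into e; the boundary-aware version returns a shorter string whose every character is intact.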
Recipe — script purity (one script in, one script out)¶
Mixed-script text is a classic spoofing vector (pаypаl with Cyrillic а).
Detect it with is_mixed_script, and fold it to a single script with
normalize_confusables:
import disarm
raw = "pаypаl" # contains Cyrillic а (U+0430)
# Normalize first — NFKC folds compatibility variants (fullwidth, ligatures)
# so the script check sees canonical input, never a disguised bypass.
s = disarm.normalize(raw, form="NFKC")
assert disarm.is_mixed_script(s) == True
pure = disarm.normalize_confusables(s, target_script="latin")
assert pure == 'paypal'
assert disarm.is_mixed_script(pure) == False
use disarm::api::{self, NormalizationForm, TargetScript};
let raw = "pаypаl"; // contains Cyrillic а (U+0430)
// Normalize first — NFKC folds compatibility variants (fullwidth, ligatures)
// so the script check sees canonical input, never a disguised bypass.
let s = api::normalize(raw, NormalizationForm::Nfkc);
assert!(api::is_mixed_script(&s)); // => true
let pure = api::normalize_confusables(&s, TargetScript::Latin);
assert_eq!(pure, "paypal"); // => "paypal"
assert!(!api::is_mixed_script(&pure)); // => false
- Flag with is_mixed_script when you only need to reject suspicious input (e.g. before storing a username). For hostnames, is_suspicious_hostname returns per-label mixed-script and confusable details.
- Fold with normalize_confusables(target_script=...) when you want to coerce input to a canonical script for comparison.
Normalize first, then check or fold — confusable detection is most reliable on canonical input.
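The effect of ordering can be sketched without disarm. The helper `rough_scripts` below is hypothetical and uses a crude heuristic (the first word of each letter's Unicode name); it is not how `is_mixed_script` works, and real script detection relies on the Unicode Script property. It is enough to show that a check run before normalization sees a different picture than one run after it.

```python
import unicodedata


def rough_scripts(text: str) -> set:
    """Return a rough set of script labels for the letters in text.

    Heuristic: the first word of a letter's Unicode name is usually its
    script ("LATIN", "CYRILLIC", "GREEK", ...).
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


spoofed = "p\u0430yp\u0430l"   # Latin letters with Cyrillic а (U+0430)
fullwidth = "\uff50aypal"      # fullwidth p (U+FF50) followed by ASCII

# Normalization does not hide a real script mix...
print(sorted(rough_scripts(unicodedata.normalize("NFKC", spoofed))))
# ['CYRILLIC', 'LATIN']

# ...but it removes a spurious one caused by compatibility variants.
print(sorted(rough_scripts(fullwidth)))
# ['FULLWIDTH', 'LATIN']
print(sorted(rough_scripts(unicodedata.normalize("NFKC", fullwidth))))
# ['LATIN']
```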
Recipe — reversibility-preserving canonicalization (use NFC, not NFKC)¶
If you may need to convert text back to its native script later — disarm
supports reverse transliteration for Greek, Russian, and Ukrainian via
transliterate(text, target=lang) — canonicalize with NFC, never NFKC.
NFKC's compatibility folding is lossy and destroys the information a reversal would need:
import disarm
assert disarm.normalize("⁵", form="NFC") == '⁵' # superscript five — preserved
assert disarm.normalize("⁵", form="NFKC") == '5' # folded to ASCII — unrecoverable
use disarm::api::{self, NormalizationForm};
assert_eq!(api::normalize("⁵", NormalizationForm::Nfc), "⁵"); // => "⁵" superscript five — preserved
assert_eq!(api::normalize("⁵", NormalizationForm::Nfkc), "5"); // => "5" folded to ASCII — unrecoverable
An NFC-first canonicalization keeps the door open to a clean round-trip:
native = "Москва"
canonical = disarm.normalize(native, form="NFC") # canonical, lossless
romanized = disarm.transliterate(canonical, lang="ru")
assert romanized == 'Moskva'
back = disarm.transliterate(romanized, target="ru")
assert back == 'Москва' # round-trips
use disarm::api::{self, NormalizationForm, ReverseLang, Transliterate};
let native = "Москва";
let canonical = api::normalize(native, NormalizationForm::Nfc); // canonical, lossless
let romanized = Transliterate::new().lang("ru").run(&canonical);
assert_eq!(romanized, "Moskva"); // => "Moskva"
let back = api::reverse_transliterate(&romanized, ReverseLang::Russian);
assert_eq!(back, "Москва"); // => "Москва" round-trips
For the reversible direction, also avoid the steps that erase recoverable
information — strip_accents, fold_case, and transliteration to ASCII — unless
you keep the original alongside the canonical key.
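Keeping the original alongside the key can be as simple as storing both. The sketch below is an assumption-laden illustration: `KeyedText` and `make_keyed` are hypothetical names, and `unicodedata` plus `str.casefold` stand in for whichever disarm steps you would actually use to build the key.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyedText:
    original: str  # NFC only: safe to display or reverse later
    key: str       # aggressively folded: for comparison and lookup only


def make_keyed(text: str) -> KeyedText:
    original = unicodedata.normalize("NFC", text)
    # Lossy folding is confined to the key; the original is untouched.
    key = unicodedata.normalize("NFKC", original).casefold()
    return KeyedText(original=original, key=key)


a = make_keyed("\u041c\u043e\u0441\u043a\u0432\u0430")  # "Москва"
b = make_keyed("\u043c\u043e\u0441\u043a\u0432\u0430")  # "москва"

print(a.key == b.key)            # True: both fold to the same key
print(a.original == b.original)  # False: the stored originals still differ
```

Lookups and deduplication use `key`; anything that must round-trip reads `original`.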
This is the deliberate counterpart to the security/search canonicalization
recipes (security_clean, catalog_key, search_key), which use NFKC on
purpose: they want the lossy folding so that ⁵, ﬁ, and fullwidth variants
all collapse to one comparison key. Reversibility and aggressive folding are
opposite goals — choose per use case.
Choosing a normalization form¶
| Goal | Form | Why |
|---|---|---|
| Storage, comparison, reversible canonicalization | NFC | Canonical and lossless; preserves the round-trip to native script. |
| Security keys, search keys, dedup | NFKC | Folds compatibility variants (⁵→5, ﬁ→fi, fullwidth→ASCII) into one key; lossy by design. |
| Accent stripping (as an intermediate) | NFD / NFKD | Decomposes so combining marks can be removed; see strip_accents. |
When unsure, normalize with NFC first; reach for NFKC only when you explicitly want compatibility folding and do not need the original back.
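To compare the four forms side by side, the sketch below prints the code points each one produces for a single sample, again using the standard `unicodedata` module as a stand-in for `disarm.normalize`. The `strip_marks` helper is hypothetical; it shows the decompose-then-drop idea behind accent stripping and is not the implementation of `strip_accents`.

```python
import unicodedata

sample = "\u00e9\ufb01\u2075"  # é, the ligature ﬁ, superscript five

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    out = unicodedata.normalize(form, sample)
    # Show the code points so composition and folding are visible.
    print(form, [f"U+{ord(ch):04X}" for ch in out])


def strip_marks(text: str) -> str:
    """Hypothetical accent stripper: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_marks("r\u00e9sum\u00e9"))  # resume
```

NFC and NFD leave U+FB01 and U+2075 in place and differ only in whether é is composed; NFKC and NFKD replace the ligature with f and i and the superscript with 5.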
See also¶
- Normalization: the forms in depth
- Text Pipeline: composing the steps
- Precompiled Pipelines: security_clean, catalog_key, search_key, and the policy profiles
- Confusable Detection: script and homoglyph analysis