Normalize-First Canonicalization

Put Unicode normalization at the front of every text pipeline, run the remaining steps in a fixed, grapheme-correct order, and decide up front whether your output needs to be reversible or script-pure. disarm turns these from tribal knowledge into guarantees: pipeline step order is single-source and invariant-checked, and normalization provably never splits a grapheme cluster.

This page is a set of recipes built from existing functions — it introduces no new API.

Why normalize first

The same visible text can be encoded many ways (see Normalization). Preprocessing that runs before normalization — stripping, folding, transliterating, matching — sees those inconsistent encodings and produces inconsistent results. Worse, naive preprocessing can split an Indic conjunct or a combining-mark sequence, or mix scripts, corrupting both security checks and downstream models.

Normalizing first collapses the representations to one canonical form, so every later step operates on stable input.
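The effect is easy to reproduce outside disarm. The following is a minimal sketch that uses only Python's standard-library `unicodedata` module (not disarm's API) to show two encodings of the same visible word disagreeing on comparison and slicing until both are normalized; the variable and helper names are illustrative.

```python
import unicodedata

composed = "caf\u00e9"      # "café" with é as a single code point (U+00E9)
decomposed = "cafe\u0301"   # "café" with e followed by combining acute (U+0301)

# Same visible text, different code points: comparison and slicing disagree.
raw_equal = composed == decomposed
raw_prefixes = (composed[:4], decomposed[:4])  # the second slice loses the accent


def nfc(text: str) -> str:
    """Return the NFC (canonical composed) form of text."""
    return unicodedata.normalize("NFC", text)


# After normalizing first, both inputs are the same string, so every later
# step (comparison, slicing, lookup) behaves identically for either input.
normalized_equal = nfc(composed) == nfc(decomposed)
normalized_prefixes = (nfc(composed)[:4], nfc(decomposed)[:4])
```

Here `raw_equal` is `False` and the two raw prefixes differ, while the normalized values agree.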

Guarantee 1 — the step order can't drift

TextPipeline always runs its steps in a fixed, optimal order regardless of the order you pass the arguments — normalization first, the final whitespace cleanup last:

from disarm import TextPipeline

pipe = TextPipeline(
    fold_case=True,           # passed first…
    normalize="NFKC",         # …but normalize always runs first
    confusables=True,
    collapse_whitespace=True,
)

assert [name for name, _param in pipe.steps] == [
    'normalize',
    'confusables',
    'fold_case',
    'strip_control',
    'strip_zero_width',
    'collapse_whitespace',
]

The order a pipeline reports (pipe.steps) is, by construction, the order it executes — both read from one shared list inside the engine. A step cannot be reported at one position and run at another (the class of bug that #141 was). If you are introspecting a pipeline to audit it, what you see is what runs.
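To make the "one shared list" idea concrete, here is a toy sketch of the design, not disarm's implementation. `ToyPipeline`, `_CANONICAL_ORDER`, and `_STEP_FUNCS` are hypothetical names, and the three steps are standard-library stand-ins. The point is only that introspection and execution both read the same list, so they cannot disagree.

```python
import unicodedata

# Hypothetical canonical order: one table consulted by both steps and run().
_CANONICAL_ORDER = ("normalize", "fold_case", "collapse_whitespace")

# Standard-library stand-ins for the real steps; each takes (text, param).
_STEP_FUNCS = {
    "normalize": lambda text, form: unicodedata.normalize(form, text),
    "fold_case": lambda text, _flag: text.casefold(),
    "collapse_whitespace": lambda text, _flag: " ".join(text.split()),
}


class ToyPipeline:
    def __init__(self, **options):
        unknown = set(options) - set(_CANONICAL_ORDER)
        if unknown:
            raise ValueError(f"unknown steps: {sorted(unknown)}")
        # Keyword order is ignored: steps are selected by walking the
        # canonical order and keeping the ones that were enabled.
        self._steps = [
            (name, options[name])
            for name in _CANONICAL_ORDER
            if options.get(name)
        ]

    @property
    def steps(self):
        """Report the steps; a copy of the list that run() iterates."""
        return list(self._steps)

    def run(self, text: str) -> str:
        for name, param in self._steps:
            text = _STEP_FUNCS[name](text, param)
        return text


toy = ToyPipeline(collapse_whitespace=True, fold_case=True, normalize="NFKC")
step_names = [name for name, _param in toy.steps]
result = toy.run("  \uff28\uff25\uff2c\uff2c\uff2f   World ")  # fullwidth HELLO
```

Because `steps` and `run` share `self._steps`, a step cannot be listed at one position and executed at another in this sketch.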

Guarantee 2 — normalization is grapheme-correct

Normalization respects grapheme-cluster boundaries. For every form (NFC/NFD/NFKC/NFKD), normalizing the whole string equals normalizing each grapheme cluster independently and rejoining:

Python, followed by the Rust equivalent:

import disarm

normalize_whole = lambda s, f: disarm.normalize(s, form=f)
normalize_parts = lambda s, f: "".join(
    disarm.normalize(g, form=f) for g in disarm.grapheme_split(s)
)

s = "क्ष"  # Devanagari conjunct: KA + virama + SSA
assert normalize_whole(s, "NFC") == normalize_parts(s, "NFC")

use disarm::api::{self, NormalizationForm};

let s = "क्ष"; // Devanagari conjunct: KA + virama + SSA
let whole = api::normalize(s, NormalizationForm::Nfc);
let parts: String = api::grapheme_split(s)
    .iter()
    .map(|g| api::normalize(g, NormalizationForm::Nfc))
    .collect();
assert_eq!(whole, parts); // => true

In plain terms: normalization never orphans a combining mark, never splits an Indic conjunct, and never merges across cluster boundaries. This is verified exhaustively over every Hangul syllable, every Devanagari conjunct, the full combining-diacriticals block, and the whole BMP.

One intended exception to watch for: NFKC and NFKD can change the grapheme count by expanding compatibility characters (the ligature ﬁ becomes fi, two clusters). That is normalization working as designed, not a boundary violation, but it is one more reason to choose your form deliberately (see "Choosing a normalization form" below).
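The difference between canonical and compatibility forms can be checked with the standard library alone. This sketch uses `unicodedata` rather than disarm, and the variable names are illustrative. It contrasts a Hangul syllable, whose canonical decomposition changes the code point count but recomposes exactly, with the ﬁ ligature, which a compatibility form rewrites into two separate letters.

```python
import unicodedata

# Canonical forms: the code point count changes, but the text round-trips.
syllable = "\ud55c"  # Hangul syllable HAN, one code point
jamo = unicodedata.normalize("NFD", syllable)        # three conjoining jamo
recomposed = unicodedata.normalize("NFC", jamo)      # back to one code point

# Compatibility forms: the ligature is replaced by two ordinary letters.
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI, one code point
nfc_form = unicodedata.normalize("NFC", ligature)    # unchanged
nfkc_form = unicodedata.normalize("NFKC", ligature)  # "fi"
```

The jamo sequence is still a single visible syllable and recomposes to the original, whereas nothing in `"fi"` records that it was once a ligature.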

If you need to shorten text without cutting a cluster in half, use grapheme_truncate, which only cuts on boundaries.
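For intuition, the sketch below shows what "only cuts on boundaries" means in the simplest case. It is a standard-library approximation, not disarm's `grapheme_truncate`: the hypothetical helper `truncate_keep_marks` treats a cluster as a base character plus any following combining marks, which ignores the other cluster rules of full grapheme segmentation (Hangul jamo, emoji sequences, Indic conjuncts).

```python
import unicodedata


def truncate_keep_marks(text: str, max_chars: int) -> str:
    """Cut text to at most max_chars code points without separating a
    base character from the combining marks that follow it."""
    if max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    if len(text) <= max_chars:
        return text
    cut = max_chars
    # If the character just after the cut is a combining mark (category
    # Mn, Mc, or Me), the cut would split a cluster: back up to its start.
    while cut > 0 and unicodedata.category(text[cut]).startswith("M"):
        cut -= 1
    return text[:cut]


word = "cafe\u0301s"                   # c, a, f, e, combining acute, s
naive = word[:4]                       # "cafe": the accent is orphaned
safe = truncate_keep_marks(word, 4)    # "caf": the whole cluster is dropped
whole = truncate_keep_marks(word, 5)   # "cafe" + U+0301: the cluster fits
```

Dropping the whole cluster when it does not fit is one reasonable policy; the alternative of keeping a bare base letter is what silently changes the text.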

Recipe — script purity (one script in, one script out)

Mixed-script text is a classic spoofing vector (pаypаl with Cyrillic а). Detect it with is_mixed_script, and fold it to a single script with normalize_confusables:

Python, followed by the Rust equivalent:

import disarm

raw = "pаypаl"                     # contains Cyrillic а (U+0430)

# Normalize first — NFKC folds compatibility variants (fullwidth, ligatures)
# so the script check sees canonical input, never a disguised bypass.
s = disarm.normalize(raw, form="NFKC")

assert disarm.is_mixed_script(s)

pure = disarm.normalize_confusables(s, target_script="latin")
assert pure == 'paypal'
assert not disarm.is_mixed_script(pure)

use disarm::api::{self, NormalizationForm, TargetScript};

let raw = "pаypаl"; // contains Cyrillic а (U+0430)

// Normalize first — NFKC folds compatibility variants (fullwidth, ligatures)
// so the script check sees canonical input, never a disguised bypass.
let s = api::normalize(raw, NormalizationForm::Nfkc);

assert!(api::is_mixed_script(&s)); // => true

let pure = api::normalize_confusables(&s, TargetScript::Latin);
assert_eq!(pure, "paypal");        // => "paypal"
assert!(!api::is_mixed_script(&pure)); // => false

Flag with is_mixed_script when you only need to reject suspicious input (e.g. before storing a username). For hostnames, is_suspicious_hostname returns per-label mixed-script and confusable details.
Fold with normalize_confusables(target_script=...) when you want to coerce input to a canonical script for comparison.

Normalize first, then check or fold — confusable detection is most reliable on canonical input.
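The ordering matters even for a crude detector. The sketch below is a standard-library heuristic, not how `is_mixed_script` works: the hypothetical helper `script_guesses` takes the first word of each letter's Unicode character name as a stand-in for its script, which is only an approximation of the real Unicode Script property. It is enough to show a fullwidth letter confusing the check until NFKC has run.

```python
import unicodedata


def script_guesses(text: str) -> set:
    """Heuristic only: use the first word of each letter's Unicode name
    (LATIN, CYRILLIC, ...) as a rough proxy for its script."""
    return {
        unicodedata.name(ch, "UNKNOWN").split(" ", 1)[0]
        for ch in text
        if ch.isalpha()
    }


def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)


spoof = "p\u0430yp\u0430l"   # Cyrillic а (U+0430) in place of Latin a
fullwidth = "\uff50aypal"    # fullwidth p (U+FF50), otherwise plain Latin

spoof_scripts = script_guesses(nfkc(spoof))      # genuinely mixed
before = script_guesses(fullwidth)               # looks mixed: false alarm
after = script_guesses(nfkc(fullwidth))          # one script after NFKC
```

Normalization removes the false alarm on the fullwidth input while leaving the genuine Latin and Cyrillic mix detectable.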

Recipe — reversibility-preserving canonicalization (use NFC, not NFKC)

If you may need to convert text back to its native script later — disarm supports reverse transliteration for Greek, Russian, and Ukrainian via transliterate(text, target=lang) — canonicalize with NFC, never NFKC.

NFKC's compatibility folding is lossy and destroys the information a reversal would need:

Python, followed by the Rust equivalent:

import disarm

assert disarm.normalize("⁵", form="NFC") == '⁵'    # superscript five — preserved
assert disarm.normalize("⁵", form="NFKC") == '5'   # folded to ASCII — unrecoverable

use disarm::api::{self, NormalizationForm};

assert_eq!(api::normalize("⁵", NormalizationForm::Nfc), "⁵");   // => "⁵"  superscript five — preserved
assert_eq!(api::normalize("⁵", NormalizationForm::Nfkc), "5");  // => "5"  folded to ASCII — unrecoverable
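"Unrecoverable" follows from NFKC being many-to-one: several distinct inputs share one output, so no inverse function can exist. A small standard-library check (using `unicodedata`, not disarm; the list of variants is illustrative) makes this visible.

```python
import unicodedata

# Three distinct characters that all read as "five".
variants = [
    "5",        # DIGIT FIVE
    "\u2075",   # SUPERSCRIPT FIVE
    "\uff15",   # FULLWIDTH DIGIT FIVE
]

# NFC keeps the three apart; NFKC maps all of them to the same string.
nfc_forms = {unicodedata.normalize("NFC", v) for v in variants}
nfkc_forms = {unicodedata.normalize("NFKC", v) for v in variants}
```

`nfc_forms` still holds three distinct strings, while `nfkc_forms` holds only `"5"`; given that `"5"` alone, there is no way to tell which variant it came from.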

An NFC-first canonicalization keeps the door open to a clean round-trip:

Python, followed by the Rust equivalent:

import disarm

native = "Москва"
canonical = disarm.normalize(native, form="NFC")        # canonical, lossless
romanized = disarm.transliterate(canonical, lang="ru")
assert romanized == 'Moskva'
back = disarm.transliterate(romanized, target="ru")
assert back == 'Москва'                                   # round-trips

use disarm::api::{self, NormalizationForm, ReverseLang, Transliterate};

let native = "Москва";
let canonical = api::normalize(native, NormalizationForm::Nfc); // canonical, lossless
let romanized = Transliterate::new().lang("ru").run(&canonical);
assert_eq!(romanized, "Moskva");                                // => "Moskva"
let back = api::reverse_transliterate(&romanized, ReverseLang::Russian);
assert_eq!(back, "Москва");                                     // => "Москва"  round-trips

For the reversible direction, also avoid the steps that erase recoverable information — strip_accents, fold_case, and transliteration to ASCII — unless you keep the original alongside the canonical key.
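One way to "keep the original alongside the canonical key" is to store both in a single record. This is a sketch under assumptions: `KeyedText` and `make_record` are hypothetical names, and the key uses standard-library NFKC plus `casefold()` as a stand-in for whichever lossy recipe you actually choose.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyedText:
    original: str  # NFC only: lossless, safe to display or reverse later
    key: str       # NFKC + casefold: lossy, used for comparison only


def make_record(text: str) -> KeyedText:
    """Build a record holding the lossless original and a lossy key."""
    original = unicodedata.normalize("NFC", text)
    key = unicodedata.normalize("NFKC", original).casefold()
    return KeyedText(original=original, key=key)


fancy = make_record("\uff2doskva\u2075")   # fullwidth M, superscript five
plain = make_record("MOSKVA5")
native = make_record("\u041c\u043e\u0441\u043a\u0432\u0430")  # Москва
```

The two Latin records compare equal on `key` while each `original` keeps exactly what was typed, so the lossy step never overwrites the reversible one.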

This is the deliberate counterpart to the security/search canonicalization recipes (canonicalize, catalog_key, search_key), which use NFKC on purpose: they want the lossy folding so that ⁵, ﬁ, and fullwidth variants all collapse to one comparison key. Reversibility and aggressive folding are opposite goals — choose per use case.

Choosing a normalization form

| Goal | Form | Why |
| --- | --- | --- |
| Storage, comparison, reversible canonicalization | NFC | Canonical and lossless; preserves the round-trip to native script. |
| Security keys, search keys, dedup | NFKC | Folds compatibility variants (`⁵→5`, `ﬁ→fi`, fullwidth→ASCII) into one key; lossy by design. |
| Accent stripping (as an intermediate) | NFD / NFKD | Decomposes so combining marks can be removed; see `strip_accents`. |

When unsure, normalize with NFC first; reach for NFKC only when you explicitly want compatibility folding and do not need the original back.
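The "intermediate" role of NFD in the table can be illustrated with the standard library. This sketch is not disarm's `strip_accents`; the hypothetical helper `strip_marks` decomposes with NFD, drops nonspacing marks (category `Mn`), and recomposes with NFC so the result is back in the storage form.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Remove nonspacing combining marks via an NFD round trip."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose so the output is in NFC, the form recommended for storage.
    return unicodedata.normalize("NFC", kept)


stripped = strip_marks("Caf\u00e9 Zo\u00eb")   # "Café Zoë"
```

Like NFKC folding, this step is lossy, so it belongs in key-building paths rather than in anything that must round-trip.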
