HAI-SDLC Hardening Review (Deep, Full-Tree) — disarm¶
- Date: 2026-06-18
- Commit:
fb5524f(main, active worktree; tree clean apart from CHANGELOG) - Scope: Full tree, no baseline — Rust core (
src/, ~23.3k LOC) + PyO3 (src/py/), Node, Ruby bindings + Python wrapper (python/disarm/, ~5.4k LOC). Analyzed as-is, not as a diff. - Budget profile: Deep — five analysts (Correctness, Security, Performance, Polish, + a dedicated fast-path/idempotency deep-dive), exhaustive module coverage.
- Mode: Analysis only — no source changed. This file is the deliverable.
Method¶
Pass 0 surveyed the current tree, then five analysts ran in parallel: the four lenses across the full module set, plus a dedicated deep-dive on the reworked preset layer's two crown-jewel guarantees (fast-path equivalence, idempotency). Every High/Critical and the two crown-jewel guarantees were then independently verified against source — confusables TSV entries, the fixed-point loop body, the proptest generators, and the guard predicate. Where no Rust toolchain was available, idempotency/equivalence claims were modeled semantically against the real Unicode tables and stated as such.
Verdict¶
The strongest result of the three reviews. No Critical or High security or correctness defect; the highest-severity finding is Medium. Most importantly, the guarantees that were broken in the prior delta review are now verified to hold:
canonicalize/canonicalize_strict/strip_format/strip_obfuscation/sort_keyare raw-idempotent — the #434ConfusablesNfcFixedPointloop iterates confusables→NFC to a fixed point, resolving the"с\u{0327}"→"ç"→"c" cascade within a single call. Verified: the loop body breaks onnxt == cur; thecanonicalize_idempotentproptest now asserts raw equality with\u{0327}/\u{0308}added to its generator (closing the exact "vacuous pass" false-assurance gap I flagged last time), and"c\u{0327}\u{0327}"is pinned deterministically. Thedisplay_clean/strip_formatVS-carve-out (D-2) andsort_keytransliterate-order (D-5) cases are likewise fixed, with regression tests.- The #458–#463 fast-path guard is sound — no input was found that the guard classifies "benign" (returning
Cow::Borrowed, untouched) while the pipeline would actually modify it. The guard's predicates reuse the same classifier functions the strip stages call; the ASCII-confusable set isbuild.rs-generated from the same TSV as the PHF (can't rot out of lockstep); and a base+combining-mark that would compose into a confusable is caught by themarkspredicate. Two independent analysts reached this conclusion. - The H-P4 allocation debt is resolved — all presets route through a shared two-buffer ping-pong runner (#453/#454); no-op stages skip the swap; benign ASCII is genuinely zero-alloc (
Cow::Borrowed).
The actionable findings are perf-margin refinements and documentation polish.
Architecture (Pass 0)¶
Unchanged in shape from prior reviews (layered Layer-1 algorithms → Layer-2 api/ → PyO3/Node/Ruby shims → Python wrapper; build.rs PHF codegen; unsafe forbidden; limits in limits.rs). The preset layer was substantially reworked: presets renamed to mechanism names (canonicalize, canonicalize_strict, strip_format, catalog_key, search_key, sort_key, ml_normalize, strip_obfuscation) with deprecation aliases (#430/#450); all return Cow<str> and run through a Step-enum ping-pong runner behind a byte-mask fast-path guard; confusables iterate to a bounded fixed point.
Findings (all independently verified against source)¶
Severity reflects real-world impact. No Critical or High. Highest is Medium.
Performance¶
P-1 · Fast-path guard orders the expensive normalization predicates before the cheap range checks · Medium (Performance analyst rated High) [VERIFIED]¶
- Location:
src/presets.rsacts_on_nonascii - What: the per-non-ASCII-char predicate chain is
m.transliterate || (m.nfkc && nfkc_changes(ch)) || (m.marks && is_combining_mark(ch)) || (m.strip_accents && decomposes_to_mark(ch)) || (zalgo nfd_mark_run_exceeds) || … || (m.bidi && is_bidi_or_format(ch)) || (m.zero_width && …) || (m.invisible && …). The two costliest predicates —nfkc_changes(ch)anddecomposes_to_mark(ch), which spin up single-scalar NFKC/NFD normalization iterators — run before the cheap pure-rangematches!checks (bidi/zero-width/invisible). For a transliterating presetm.transliterateshort-circuits the whole scan to O(1)/char (great). For the non-transliterating presets (canonicalize,canonicalize_strict,strip_obfuscation,ml_normalizewithout lang),m.transliterateis false, so the first non-ASCII char pays an NFKC-iterator probe. - Why it matters: on benign non-ASCII input (clean CJK/Hangul/accented-Latin through
canonicalize), the guard's per-char cost approaches the first pipeline stage it's trying to skip, so the margin is thin — it still wins for a multi-stage preset, but not by much. (Benign ASCII — the deployment norm — is unaffected: the byte-loop never reachesacts_on_nonascii.) - Severity note: the analyst rated this High; recalibrated to Medium — it's a constant-factor optimization opportunity on one input class, not a regression, correctness, or DoS issue, and the guard is still a net win.
- Fix direction: reorder predicates cheapest-first (pure range checks before the NFKC/NFD iterators); optionally gate the normalization probes behind a coarse BMP-block bitmap (cf. the
transliterate.rsBLOCK_CLASSprecedent) so inert blocks (CJK, Hangul) skip the iterator. - Confidence: High the ordering is suboptimal; magnitude unmeasured (no profiler). Verify with a Criterion bench of
canonicalizeon ASCII vs clean-CJK vs clean-accented-Latin, guarded vswithout_fastpath.
P-2 · Confusables fixed-point loop re-NFCs and re-folds the whole string each iteration (bounded DoS amplifier) · Medium [VERIFIED]¶
- Location:
src/presets.rsStep::ConfusablesNfcFixedPoint(the #434 loop) - What: each iteration runs
normalize_confusables_intothen a full-stringNFCover the result; up toCONFUSABLE_FIXED_POINT_ITERS = 8passes. Buffers are reused across iterations (good — no per-pass alloc), but every pass re-scans and re-normalizes the entire string even though only changed regions can produce new foldable forms. - Why it matters: in the two hottest defense presets. Benign input converges in 1–2 passes (verified: the deep-dive modeled max 5 iterations even for triple-stacked marks, and the loop's
breakfires early). The DoS angle is bounded: an adversarial duplicate-mark string can force more passes, but the cap is a hard 8× linear constant, not unbounded — hence Medium, and the Zalgo(2) cap before the loop limits realistic depth to ~4. - Fix direction: skip the NFC pass when the confusables pass changed nothing; or fold-aware incremental NFC over only the touched region. Confirm the worst-case iteration count with an instrumented bench.
- Confidence: High on the mechanism; the 8× worst case requires a deliberately near-non-converging input.
P-3 · sort_key's transliterate_preserving_latin re-pays per-run transliterate setup + a per-char script binary search · Medium [plausible, not independently re-verified]¶
- Location:
src/presets.rstransliterate_preserving_latin_into - What: walks char-by-char calling
scripts::detect_char_script(O(log n) binary search per char) and invokestransliterate_implonce per non-Latin run, each call re-resolving the lang map and re-running the 256-char capacity estimator, plus a per-runStringthe ping-pong can't absorb. Bites mixed-script titles — exactlysort_key's use case. - Fix direction: replace the per-char full script lookup with a cheap 3-way Latin/Common/Inherited range check; use a lighter capacity hint for short runs.
- Confidence: Medium (well-tested for correctness; cost only on mixed-script input).
(Lows: run always to_owned()s past the guard even for inputs the pipeline leaves unchanged; pad_emoji_replacement/VS-keep call chars().next_back() per item; is_windows_reserved allocates an uppercase String up to 3×/call. All negligible.)
Correctness¶
C-1 · Binary-searched metadata.rs arrays have no sortedness guard · Medium [VERIFIED]¶
- Location:
src/metadata.rs(lang():1339,script():1347,binary_search_by_key);SCRIPTSinapi/metadata.rs - What:
LANGS,SCRIPTS_META,SCRIPTSare binary-searched butmetadata.rshas no test module and no sortedness assertion — unlike the analogousBUILTIN_LANGS, which is guarded bybuiltin_langs_is_sorted(tables/mod.rs:859, verified). The arrays are currently sorted, so this is latent risk, not an active miscompute: ifgen_metadata.pyever emits an out-of-order row,lang_info/script_infowould silently return wrong/missing metadata with no test catching it. - Fix direction: add a
#[test]orbuild.rsconst-assertion that each array is strictly sorted by key, mirroringbuiltin_langs_is_sorted. - Confidence: High the guard is absent and the convention is established elsewhere.
(Lows: reverse.rs uses a byte-width window for a documented char-width key bound — safe today because all reverse keys are ASCII, fragile if a non-ASCII key is ever added; emoji.rs is_emoji_codepoint silently skips the U+1FB00–1FBFF Legacy Computing block — correct but undocumented; strip_zalgo "kept marks" count is pre-NFC — intended, a doc-clarity note.)
Security¶
No High/Critical, in scope or otherwise. The panic surface is disciplined (untrusted .bin dict parsing in context.rs is fully fallible with bounds-checked reads; slicing is char-boundary-clamped; FFI lengths validated via checked_size; unsafe forbidden), the neutralization controls cover their claimed inputs (log-injection C0/C1/NEL/LS/PS; filename double dot-collapse + pre/post-truncation reserved-name handling; the invisibles emoji-flag carve-out validates against an exact RGI subdivision allowlist ["gbeng","gbsct","gbwls"], not just shape — stronger than the prior partial check, closing the old D-6), and the fast-path guard was independently confirmed not to skip any class needing cleaning.
- S-1 (Low):
compile_regex(src/slugify.rs:33) sets.size_limit(MAX_REGEX_DFA_BYTES)but not.dfa_size_limit(...), so the match-time lazy-DFA cache uses the regex-crate default rather than the project constant. Pattern length is pre-capped and compile errors are typed (no panic), so this is symmetry hardening, not a hole. - S-2 (Low/info):
THREAT_MODEL.mdsays slugify is "bounded by an input-size cap" — the bound is on pattern size, not input. Reconcile wording. - S-3 (Low/info): stale secondary lockfile
bindings/ruby/ext/disarm/Cargo.lockpinsmagnus 0.7.1while the active build uses 0.8.2 — cosmetic drift; regenerate to avoid a misleading future audit. - S-4 (Low): no binding-scoped no-panic clippy gate (
-D clippy::unwrap_used,indexing_slicing) to structurally lock in the "shims never panic across FFI" property the code currently upholds by hand. - Suggested test hardening: add a base+combining-mark dimension to the fast-path equivalence proptest generator — the soundness for that class currently rests on the
marksargument, not on a test that exercises a composing pair.
Polish¶
Pol-1 · Node collapseWhitespace doc invents a non-existent options param and states the opposite of the real behavior · Medium (security-relevant) [VERIFIED]¶
- Location:
docs/node/api.md:132-136vsbindings/node/index.ts:238 - What: the doc signature is
collapseWhitespace(text, options?)and claims it "By default also strips control and zero-width characters (options.stripControl/options.stripZeroWidth, bothtrue)." The real export takes onlytext— no options — and its own JSDoc says it "does NOT delete control or zero-width characters — usestripControlChars/stripZeroWidthCharsfor that." The doc describes the exact opposite of reality. - Why it matters: a Node user trusting the doc believes control/zero-width chars are stripped when they are not — a security-relevant false assurance on a public npm surface, and passing
{stripControl:false}is silently ignored. Highest-value doc defect found. - Fix direction: rewrite to
collapseWhitespace(text); drop the options claim; point tostripControlChars/stripZeroWidthChars, matching the JSDoc. - Confidence: High (both files read).
Pol-2 · strip_obfuscation has no section in the pipelines reference · Medium [analyst-verified]¶
- Location:
docs/api/pipelines.md(jumps fromcanonicalize_stricttoPRESETS) - What: every other preset gets a
## <name>section with steps + examples;strip_obfuscation— a root-exported, README-featured preset — has none. Add a section mirroring the others.
Pol-3 · Python Text fluent builder omits canonicalize_strict / strip_obfuscation (+ the normalize_user_input alias) · Medium [analyst-verified]¶
- Location:
python/disarm/_text.py:258-305,_text.pyi - What:
Textexposescanonicalize/strip_format(and their aliases) but notcanonicalize_strictorstrip_obfuscation, both of which exist at module level — so the arguably-primary "normalize untrusted input" preset isn't chainable while less safety-critical siblings are. Add them (with the deprecatednormalize_user_input) + parity tests, or document the omission deliberately.
(Lows: two _presets.py docstrings still steer users to the deprecated display_clean instead of strip_format; the Rust sort_key doc-summary omits the second FoldCase + terminal NFC that the STEPS array two lines below includes; deprecated aliases lack a warning-emission test; cross-binding alias asymmetry — Node/Ruby carry only security_clean, defensible since they never exposed the other two names but worth a one-line note.)
Fast-path / idempotency deep-dive¶
Verdict: fast-path equivalence HOLDS; idempotency HOLDS for all five presets. One latent-fragility note worth recording:
- FP-1 (Low, latent — not a current bypass): the fold_case predicate gates on
ch.is_alphabetic()(std's Unicode-Alphabetic), but the actual fold uses thecase_folding.tsvtable, which can fold non-alphabetic chars (circled capitals, Roman numerals) and may track a different Unicode version. Not exploitable today because every preset that setsfold_casealso setstransliterate, which short-circuitsacts_on_nonasciifirst, sois_alphabetic()is never the sole gate for any non-ASCII char in any shipped preset (and circled letters are independently backstopped bynfkc_changes). Fix by matching the predicate to the fold table directly (case_folding_data::lookup(ch).is_some()) or abuild.rsassertion, to remove the coupling before some future fold-only preset relies on it. - Test-assurance notes:
strip_obfuscation_idempotentasserts onlynfc(once)==nfc(twice)where the code actually provides raw idempotency — weaker than the raw-equality assertions its four peers use; tighten for parity. The tier-3 exhaustive fast-path audit (#[ignore]d by design) doesn't cover the plane-1 foldable-letter blocks (U+10570 Vithkuqi, U+10D50 Garay) — irrelevant today given the transliterate backstop, but a blind spot if that coupling changes.
Previously-broken guarantees now verified fixed¶
For continuity (the prior two reviews are in this folder): D-1 (canonicalize/normalize_user_input non-idempotent) → fixed by #434's fixed-point loop, verified; D-2 (strip_format VS carve-out) → fixed, regression-tested; D-5 (sort_key transliterate-order, incl. transliteration that emits uppercase) → fixed with a second fold_case + test; D-6 (emoji-flag smuggling bypass) → fixed with an exact RGI allowlist; H-P4 (preset per-stage allocation) → fixed via the ping-pong runner; H-D1/H-D2/H-D3/M-D3 and the first review's error.rs items → all fixed (the slugify-lang docs now state best-effort honestly; Node getPipeline is surfaced with a Policy-pipelines section; strip_log_injection absence is noted in both bindings; Error::code/kind carry #[must_use]). The proptest generator that previously passed D-1 vacuously now includes the triggering combining marks.
Recommended priority order¶
- Pol-1 — the Node
collapseWhitespacedoc inversion; cheap and the only finding with a (false-assurance) security flavor. - C-1 — add the
metadata.rssortedness guard; one test, closes a silent-failure class the codebase already guards elsewhere. - P-1 / P-2 — reorder the guard predicates cheapest-first and short-circuit the fixed-point loop's NFC when nothing changed; the two real perf refinements (constant-factor, no correctness risk).
- Pol-2 / Pol-3 — pipelines-doc section for
strip_obfuscation; complete theTextbuilder. - FP-1 / S-1 / S-4 and the Lows — decouple the fold_case predicate from
is_alphabetic;.dfa_size_limit(); binding no-panic gate; per the Boy-Scout rule as the areas are touched.
Coverage¶
All of src/ was read across the five analysts (every Layer-1 module read in full by at least one lens; presets.rs read in full by three including the dedicated deep-dive), plus build.rs codegen, the binding shims (Node/Ruby lib.rs + index.ts/disarm.rb), the Python wrapper preset/slug surface, docs/{node,ruby,api}/, and the proptests/regressions. Not exhaustively read (data, not logic): the generated PHF/trie table modules and the per-language transliteration TSV rows. No build or test run was performed — analysis only; idempotency/equivalence were verified by source inspection + semantic modeling against the real Unicode tables.
Posture¶
This is a mature, unusually well-hardened library that has visibly absorbed two prior hardening passes: the idempotency and fast-path-bypass risks that were open or broken before are now closed and, crucially, backed by raw-equality proptests whose generators actually reach the trigger classes. The fast-path guard — the highest-risk recent optimization, since a misclassification would silently skip the security cleaner — is built on conservative, lockstep-generated predicates and exhaustive equivalence tests, and survived a dedicated adversarial deep-dive. The residual work is genuine but minor: a per-char predicate-ordering refinement and a bounded fixed-point-loop cost on the new fast path (both constant-factor, no correctness risk), a missing sortedness invariant on three metadata arrays, and a handful of documentation-sync items led by one Node doc that inverts a function's real behavior. Nothing here blocks a release.
Generated by the HAI-SDLC hardening sequence — deep full-tree run, analysis only. No source files were modified. Findings cite path:line at fb5524f. Every High/Critical (none found) and both crown-jewel guarantees were independently verified against source during this run.