HAI-SDLC Hardening Review — disarm¶
- Date: 2026-06-17 (17:52 UTC)
- Commit:
fd7a54f(main, clean working tree) - Scope: Rust core (
src/) + all bindings (PyO3src/py/, Nodebindings/node/, Rubybindings/ruby/, Python wrapperpython/disarm/) - Budget profile: Standard
- Mode: Analysis only — no code was changed. This file is the sole deliverable.
Resolution status (2026-06-18 follow-up)¶
A post-0.11 hardening pass actioned this report.
Fixed: H-P1 (
fold_case_cowsingle pass), H-P2 (is_zalgoearly-exit), H-P3 (strip_zalgofast path, with an NFC property test), H-P5 (grapheme_widthASCII fast path), H-D1 (sluglangdoc), H-D2 (Python slugRaises), H-D3 (NodegetPipelinedoc + a new Policy-pipelines section), M-C1 (slug trailing partial multi-char separator, + proptest), M-P3 (detect_scriptslinear dedup), M-P4 (truncate_to_graphemesresult-sized reserve), M-P5 (strip_control/strip_zero_widthreserve), M-D1 (staleallow(dead_code)), M-D2 (Error::code/kind#[must_use]), M-D3 (Ruby/Node "not surfaced" notes incl.strip_log_injection), L-D1 (deprecation docstrings "when explicitly passed"), L-P1 (slug regex no-matchCow).Stale: L-C1 — the
collapse_whitespacestrip_control=falsemode it cites was removed by the #433 whitespace split.Deferred (not a re-architecture, but low-value or not a clear win): H-P4 (preset ping-pong reroute — structural; see the delta report), M-P1 (
confusablesmaps ASCII sources by design, so no no-op fast path is correct), M-P2 (Pipeline::processreturns an ownedString, so the up-frontto_ownedis the required result, not waste), L-P2/L-P3/L-P5 (per-call caching — adds state for Low value), L-P4 (the 10% fold over-reserve is a deliberate, documented anti-realloc tradeoff), L-C2 (CLDR uses": "as its base/qualifier delimiter, so the first-match cut is intended; a change needs a table audit), L-S1/L-S2 (informational — cannot flip the fail-closed hostname verdict).
How this run was conducted¶
Pass 0 surveyed the architecture, then four analysts ran in parallel — Correctness (passes 1–3), Harden/Security (4–5), Optimise/Performance (6–7), Polish (8–10). Every High/Critical finding below was then independently re-checked against the source by reading the cited lines; the verification result is recorded inline. One severity was recalibrated during verification (the performance "Critical" is a linear 2× constant factor, not a super-linear blow-up, so it is reported as High).
Architecture (Pass 0)¶
disarm is a layered Unicode text-security / canonicalization library:
- Layer 1 — algorithm modules in
src/*.rs(transliterate, scripts, confusables, normalize, slugify, filename, hostname, log_injection, zalgo, anomalies, emoji, width, grapheme, whitespace, encoding, reverse, case_fold, context, pipeline, presets). - Layer 2 — the idiomatic, pyo3-free public Rust API in
src/api/plus theDisarmStrextension trait. This is the one place public Rust behaviour is defined. - Binding shims — PyO3 (
src/py/), Node/napi (bindings/node/src/lib.rs), Ruby/magnus (bindings/ruby/ext/disarm/src/lib.rs), and the Python wrapper (python/disarm/). All consume the same Layer-1 core. - Codegen —
build.rs(~900 lines) builds PHF tables from TSV insrc/tables/data/, with compile-time ASCII-only assertions on table values.
Security posture is strong by construction: unsafe_code = "forbid" (zero
unsafe in the core), centralized resource limits in src/limits.rs, a documented
THREAT_MODEL.md, and atomic "has any registration?" gates that keep locks off the
common path. The trust boundary is untrusted text input; the headline risk class for
a sanitization library is panic-on-adversarial-input (DoS) and neutralization bypass.
Overall verdict¶
No blocking issues. No Critical or High security or correctness defects. The
codebase is unusually well-hardened: the ~202 .unwrap() / 25 .expect() / 7
panic!-family sites are overwhelmingly test-only, and the handful in production
paths are guarded or bounds-checked to provable infallibility. Byte-slicing is
uniformly ASCII-anchored or char_boundary-guarded; injection-neutralization
controls (log-injection, bidi stripping, filename sanitization) are complete against
the character-level vectors they claim, with matching doc/impl/regression-test triples.
The real, actionable findings cluster in two areas:
- Performance headroom in the preset pipelines and a few secondary modules —
per-stage allocation in the presets, and missing early-exit / fast-path in
fold_case,zalgo, andwidth. Two of these are DoS-adjacent on long input. - Documentation-vs-code drift — three public-doc claims that describe behaviour the code does not implement (the highest-value polish items, because users trust them).
Recommended priority order is at the end.
Findings¶
Severity reflects real-world impact, not theoretical reachability. [VERIFIED]
means I re-read the cited source and confirmed the finding during this run.
Performance (Optimise)¶
H-P1 · fold_case_cow double-scans and double-probes the PHF on the owned path · High [VERIFIED]¶
- Location:
src/case_fold.rs:31-41(fold_case_cowdetect scan) +:54-69(fold_case_intofold scan) - What:
fold_case_cowrunschars().any(|c| … lookup(c).is_some())over the whole string (one PHF probe per non-ASCII char), and on a hit callsfold_case_impl→fold_case_into, which scans again with the same probe. Every foldable char is looked up twice. The borrowed (zero-alloc) fast path is fine. - Why it matters: 2× CPU and 2× PHF work on exactly the most expensive inputs
(large all-foldable text — Greek/Cyrillic uppercase,
ßruns).fold_caseis a documented per-call hot path and backsDisarmStr::fold_case. - Fix direction: Fold in a single pass: build optimistically and track whether
any char differed; return
Cow::Borrowedif not. Or havefold_case_intoreturn achangedbool so detect and fold fuse. - Severity note: The Optimise analyst rated this Critical; verification confirms the double-probe but it is a linear 2× constant factor (not super-linear), so it is reported as High.
- Confidence: High. Verify with a bench on ~1 MB of
Ω.
H-P2 · is_zalgo cannot early-exit — always walks the whole NFD stream · High [VERIFIED]¶
- Location:
src/zalgo.rs:57-63(is_zalgo) →:31-46(max_combining_run) - What:
is_zalgoonly needs "does any run exceedthreshold?" butmax_combining_runcomputes the maximum run over the entire string before comparing. No break on first violation past the ASCII gate. - Why it matters: DoS shape — a 4-mark zalgo burst prepended to a 10 MB benign tail forces a full NFD walk of all 10 MB though the verdict was settled in the first few chars. Perf + DoS.
- Fix direction: Give
is_zalgoa streaming loop overtext.nfd()that returnstruethe instantcurrent_run > threshold. - Confidence: High. Verify:
is_zalgo("a" + "\u{0300}"*4 + huge_tail)scales with tail length today; should be ~constant after.
H-P3 · strip_zalgo_into runs NFD→temp→NFC unconditionally for all non-ASCII · High [VERIFIED, with correctness caveat]¶
- Location:
src/zalgo.rs:83-109(tempStringat:91,out.extend(filtered.nfc())at:108) - What: Any non-ASCII input pays a full NFD decompose into a heap temp, then a
full NFC recompose, plus an input-sized temp allocation — even with zero marks (plain
CJK, emoji,
café). The fast path covers only pure ASCII. - Why it matters: Most non-ASCII text needs no stripping yet pays the two most
expensive ops in the module plus an extra allocation. Hit on every
strip_obfuscation/normalize_user_inputcall. Perf + DoS-adjacent. - Fix direction: Add a streaming pre-check for "any run > max_marks"; if none, skip the round-trip.
- ⚠ Caveat surfaced during verification: The documented contract is "operates in
NFD space and recomposes to NFC," so the output is always NFC. A naive
"no excess marks →
push_strthe input verbatim and return" optimization would skip normalization and return non-NFC input unchanged — a behaviour change. The early-out is only safe if the input is already NFC (or you keep NFC-ing the result). Implement the fast path so it preserves the NFC guarantee, and add a property test assertingstrip_zalgo(x)is always NFC. - Confidence: High that the round-trip is unconditional for non-ASCII.
H-P4 · Preset pipelines allocate a fresh String per stage · High [VERIFIED]¶
- Location:
src/presets.rs—security_clean:147,ml_normalize:173,catalog_key:235,search_key:280,sort_key:376,normalize_user_input:431,strip_obfuscation:482 - What: Each preset chains the returning forms (
nfkc_normalize,strip_bidi,normalize_confusables,transliterate_impl().into_owned(),strip_accents,fold_case,collapse_whitespace, …); every stage allocates and returns a newStringand the previous is dropped — even no-op stages.Pipeline::process(src/pipeline.rs:214) already solved this with a two-buffer ping-pong (#236 item 7); the presets never adopted it. (Verifiedsecurity_cleanandsearch_keydirectly.) - Why it matters:
search_key/sort_key/catalog_keyare the documented per-call short-string hot paths indocs/performance.md. A 6-stage preset does ~6 heap allocs + 6 full copies where the engine does ~2. On short inputs the per-call allocation count dominates the boundary-crossing budget the docs optimize for. - Fix direction: Refactor presets onto the ping-pong (the
*_intovariants already exist for normalize/strip_accents/fold_case/confusables/strip_bidi/collapse/zalgo/ demojize), or express them asPipelinespecs routed throughprocess. - Confidence: High. Verify by counting allocations through
search_key(...)vs an equivalentPipeline::process.
H-P5 · grapheme_width_opts pays two binary searches per ASCII char (no ASCII fast path) · High [VERIFIED]¶
- Location:
src/width.rs:84-93—base_emoji(EMOJI_PRESENTATION_RANGESsearch) andbase_class(width_class) computed eagerly - What: For every cluster, both the emoji-presentation range search and the
width-class search run, even for ASCII/Latin-1 bases (
cp < 0x300) that can never be emoji or wide.terminal_width_optscalls this per cluster. No early-out at the top. - Why it matters: For long pure-ASCII input (identifiers, URLs, usernames — the dominant input for this library), every char costs ~2·log₂(table) comparisons instead of O(1). Constant-factor multiplier on long/adversarial input.
- Fix direction: Top-of-function guard:
if (base as u32) < 0x300 && rest.is_empty() { return ascii_width }, skipping both searches. - Confidence: High. Verify with a Criterion bench on a long ASCII string.
Medium / Low performance items (not individually re-verified — analyst confidence noted)¶
- M-P1
confusables::normalize_confusables_intohas no borrowed/no-op fast path; thePipelineCONFUSABLES step (pipeline.rs:291) always returnsOk(true)and full-copies even on no-op input, unlike the TRANSLITERATE branch. (Medium, High confidence) - M-P2
Pipeline::processdoestext.to_owned()up front (pipeline.rs:224) before any step decides to mutate — wasteful on empty/ASCII-passthrough pipelines. (Medium) - M-P3
detect_scriptsbuilds aHashSetper call (scripts.rs:9-21) to dedup ≤~3 scripts; a linearVec::containsavoids the alloc. (Medium, High confidence) - M-P4
truncate_to_graphemesreserves the full input length (grapheme.rs:43) regardless ofmax_graphemes— O(len) memory to keep a tiny result. (Medium) - M-P5
strip_control_chars[_into]/strip_zero_width_chars_into(non-ASCII branch) lackreserve(whitespace.rs:71-84,:94-103). (Medium) - L-P1 slugify step-5 custom-regex stage force-allocates on no match (
slugify.rs:533-535, also:437) —replace_allalready returns a borrowingCow. (Low) - L-P2 slugify rebuilds the
safe_charsHashSetper call (slugify.rs:546-551) instead of caching like the stopword set. (Low) - L-P3
filter_stopwordssave_orderbranch materializes a fullVec<&str>+ triple walk (slugify.rs:622-631). (Low) - L-P4
fold_case_intoover-reserves 10% on every non-ASCII input (case_fold.rs:54-57). (Low) - L-P5
list_langsclones + re-sorts all builtins per call even when nothing is registered (tables/mod.rs:~438). (Low)
Polish (documentation & API)¶
H-D1 · Rust doc asserts a Python slugify lang-validation that does not exist · High [VERIFIED]¶
- Location:
src/api/text.rs:279-281(doc onapi::slugify), vssrc/slugify.rs:357(with_lang"best-effort; not validated") andpython/disarm/_api.py(slugifypasseslangstraight to the core) - What: The doc says "This differs from the Python
slugify, whose convenience wrapper eagerly validateslangand raises." It does not — Python forwardslangunvalidated to the core, which silently falls back to the default transliterator for an unknown code. No binding raises on a bad sluglang. The doc is internally contradictory (it correctly says Rust doesn't validate, then wrongly says Python does). - Why it matters: A user relies on a
ValueError/DisarmErrorthat never fires — a silent-fallback footgun in exactly the spot the doc advertises as a guardrail. - Fix direction: State that both Rust and Python treat slug
langas best-effort/unvalidated; point atlist_langs()for a pre-check. - Confidence: High (all three sites verified).
H-D2 · Python slugify docstring lists an exception it never raises · High [VERIFIED]¶
- Location:
python/disarm/_api.py:576-577 - What: The
Raises:block saysDisarmError: … invalid regex_pattern or unknown lang code. An unknownlangdoes not raise (per H-D1); onlyregex_patterndoes. - Fix direction: Drop "or unknown
langcode" from theRaises:clause; keep theregex_patterncase. - Confidence: High.
H-D3 · Node docs say getPipeline is "not yet surfaced" — but it is exported · High [VERIFIED]¶
- Location:
docs/node/api.md:351-352(and the subset note at:8) vsbindings/node/index.ts:286(export function getPipeline) andbindings/node/src/lib.rs:596(get_pipelineshim) - What: The Stability note disclaims "not the full
getPipelineregistry", butgetPipeline(profile)and thePipelineclass are fully exported and implemented in the Node package. There is also no## Pipelinereference section indocs/node/api.mdfor this shipped, public API. (mlNormalizeand the fluentTextbuilder genuinely are Node-absent — verified — so leave those in the note.) - Why it matters: Node users are told to "reach for another binding" for functionality that ships in the Node package. Clearest doc-vs-code defect in the binding docs.
- Fix direction: Remove
getPipelinefrom the "not yet surfaced" list and add a## Policy pipelinessection documentinggetPipeline(profile)andPipeline#process. - Confidence: High.
Medium / Low polish items¶
- M-D1 Stale
#[allow(dead_code)]+ inaccurate comment onErrorRepr::code()(src/error.rs:445-451): the method is reachable via the publicError::code()(:700, exercised atapi/mod.rs:258), so theallowsuppresses nothing and the comment misdescribes the wiring. Boy-Scout fix. (Medium, High confidence) - M-D2 Public
Error::code()(error.rs:700) andError::kind()(:648) lack#[must_use], unlike the rest of the Layer-2 surface. (Medium) - M-D3 The output-encoder / encoding / log-injection / pipeline family (
escape_html,percent_encode,detect_encoding,decode_to_utf8,strip_log_injection,display_clean,ml_normalize,normalize_user_input,list_profiles,list_langs,reverse_langs,is_ascii) is Python-exposed but absent from Node and Ruby. Node has a (now partly stale) note; Ruby has no "not surfaced" note at all, so the omission — including the security-relevantstrip_log_injection— is undocumented. Either surface the high-value ones or add an accurate "not surfaced in this binding" note todocs/ruby/api.md(and sync the Node note). (Medium, High confidence on the gap) - L-D1
is_confusable/demojizedeprecation docstrings (_api.py:1655-1657) read as if theDeprecationWarningalways fires, but the code only warns when the removed param is explicitly passed (the correct, quieter behaviour). Add "when explicitly passed". (Low)
Correctness¶
M-C1 · slugify word-boundary truncation can leave a trailing partial/whole separator (custom multi-char separator only) · Medium [VERIFIED]¶
- Location:
src/slugify.rs:657-667(truncate_at_boundary), reached from:592-594 - What: The non-
word_boundarypath (:596-602) explicitly strips a trailing separator afterfloor_char_boundarytruncation. Theword_boundarypath (truncate_at_boundary) does not: whentruncated.rfind(separator)returnsNoneit returnstruncatedverbatim. With a multi-byte separator,floor_char_boundarycan cut inside the separator (e.g. separator"--", slug"ab--cd",max_length=3→"ab-"), andrfind("--")then misses, yielding a trailing"-". Verified: the default single-char"-"is unaffected (a trailing single-char separator is always re-found byrfindand cut), so this manifests only with a custom multi-char separator — which is why theslugify_output_charsetproptest (default separator only) doesn't catch it. - Why it matters: Violates the "no trailing separator" slug invariant and breaks re-slugify idempotence for callers using a multi-char separator.
- Fix direction: After
truncate_at_boundary, strip a trailing (possibly partial) separator the same way the non-word-boundary branch does; extend the proptest to a multi-char separator. - Confidence: Medium-High. Reproduce with
SlugConfig { separator: "--", max_length: 3, word_boundary: true, .. }on input slugifying to"ab--cd".
Low correctness items¶
- L-C1
collapse_whitespacedoesn't reset the space-run state across a preserved control char (whitespace.rs:43-48,strip_control=falsemode):"a \x01 b"→"a \x01b", dropping a separating space. Security presets always strip controls, so not a security concern. (Low) - L-C2
strip_modifier_suffixcuts the CLDR name at the first": "(emoji.rs:286-293), which over-truncates any base name containing": "; only the opt-instrip_modifiers=truepath. Worth a quick scan of the emoji table for affected names. (Low)
Security findings (Harden)¶
No Critical or High vulnerabilities. The four hardening areas flagged at the start (panic-as-DoS, resource amplification, injection-neutralization, binding-boundary safety) are each already addressed with deliberate, tested controls (verified during Pass 0 and the security pass):
- Panic surface: essentially the entire 200+
unwrapcount is#[cfg(test)]. The non-test production unwraps in the transform modules are provably guarded —transliterate.rs:634(is_mapped-gated),context.rs:93,104(slice already bounds-checked to exactly 2/4 bytes; the binary dict parser fails closed on malformed input),slugify.rs:466/reverse.rs:95(loop-invariant /if letguarded). - Amplification: the one real vector (caller-registered replacement values) is
capped incrementally during construction at
tables/mod.rs:672, so the over-limit buffer is never fully realized.MAX_BATCH_SIZE,MAX_REGEX_PATTERN_BYTES,MAX_REGEX_DFA_BYTES(RegexBuilder::size_limit), and the registration caps are all enforced at the entry points. - Injection neutralization:
log_injection.rscovers C0/C1 (so NEL U+0085, CSI U+009B), LS U+2028, PS U+2029, DEL, and validates the replacement can't reintroduce a neutralized char;strip_bidicovers the full UAX#9 set + ALM + soft hyphen + deprecated format controls + interlinear annotation marks (regression-tested);filename.rscollapses..before and after transliteration and handles post-truncation Windows reserved names. - FFI boundaries: all three shims are thin, return
Result→host-exception (nounwrapacross FFI), reject negative lengths before theusizecast (Nodei64checked_size), and themin_confidencerange/NaN check lives in the core (encoding.rs:101), not just the Python wrapper.
Low / advisory:
- L-S1 hostname.rs:151-159 feeds best-effort-decoded bytes from a malformed ACE
label into the advisory canonical / scripts fields. It cannot flip the
already-fail-closed suspicious=true verdict, so there's no impact on the asset the
threat model stands behind — but analysis.canonical of a malformed IDN shouldn't be
treated as authoritative. Consider skipping/marking-untrusted the canonical form on
the error arm. (Low)
- L-S2 recover_lock poison recovery (lib.rs:333-373) returns the inner guard
rather than propagating. Recovered std collections are valid-but-unspecified (never
UB — unsafe is forbidden), and current mutators don't panic mid-critical-section, so
this is a latent invariant to watch, not a present bug. (Low/informational)
Out of scope per THREAT_MODEL.md (not bugs): no input-size cap on transforms
(caller's responsibility); confusables outside the bundled TR39 table; whole-script and
multi-character (rn→m) spoofs; NFKC unmasking ASCII metacharacters
(<script>→<script>) — escape_html / percent_encode are terminal output
encoders the caller applies at the sink; pure-ASCII XSS/SQL payloads passing through.
Recommended priority order¶
- H-D1 / H-D2 / H-D3 (docs) — cheap, and they are correctness claims a user will trust; H-D1 touches a security-adjacent silent fallback the doc currently advertises as a guardrail. Highest value-per-effort.
- H-P2 + H-P3 (zalgo early-exit / round-trip) — the two DoS-adjacent perf items on long input; implement H-P3 with the NFC caveat above and a property test.
- H-P4 (preset allocation) — biggest steady-state perf win; the
*_intomachinery already exists, so it's a refactor onto an existing pattern. - H-P1 (fold_case single-pass) and H-P5 (width ASCII fast path) — constant-factor wins on hot paths.
- M-C1 (slug trailing separator) — correctness, but only on a non-default configuration; pair the fix with a multi-char-separator proptest.
- M-D1 / M-D2 (Boy-Scout: stale
allow, missing#[must_use]) and the remaining Medium/Low items as area is touched, per the project's broken-windows rule.
What was checked and found clean¶
Transliterate engine (ASCII passthrough → Cow::Borrowed, word-at-a-time ASCII-run
skipping, bounded capacity estimator, single-pass strict mode), table lookup layer
(atomic gates, prebuilt aho-corasick, no lock on the common path), normalize (ASCII
fast path, early-exit is_normalized), scripts (binary-searched ranges, short-circuit
mixed-script), grapheme/width segmentation correctness, confusables_cow,
emoji trie walk, Pipeline::process ping-pong, filename reserved-name handling
(incl. post-truncation), cross-binding default parity (separator, lowercase,
entities/decimal/hexadecimal, strip_control/strip_zero_width, filename defaults,
negative-size rejection), and build.rs codegen (repo-controlled TSV, compile-time
ASCII assertion).
Generated by the HAI-SDLC hardening sequence — analysis only. No source files were
modified. Findings cite path:line at commit fd7a54f; line numbers may drift as the
tree changes.