Changelog
All notable changes to this project will be documented in this file.
The format follows Keep a Changelog.
Version numbers use the MAJOR.MINOR.PATCH shape but follow disarm's own
release policy — patch = fixes/cleanups/docs, minor = features
or major refactors, and the major component denotes support status, not API
compatibility (see RELEASING.md).
Project renamed translit → disarm (#264). Historical entries below predate the rename and refer to the old identity (translit-rs on PyPI, the translit import package, the _translit native module); they are left unchanged because they were accurate for their release. Entries from this point on use the disarm identity.
[Unreleased]
Added
- Strip invisible & non-interchange code points in the security presets (#413). The presets a service puts in front of an LLM, a logger, or a denylist now neutralize the dominant 2024–25 "ASCII smuggling" channels and the adjacent non-interchange classes that survive NFKC and the existing zero-width passes: the Unicode Tags block (U+E0000–U+E007F, including the previously-missed U+E0001), variation selectors, the Combining Grapheme Joiner (U+034F, a denylist-evasion blocker), noncharacters, and the Private Use Area; the Braille Pattern Blank (U+2800) now folds to a space rather than surviving as invisible padding. None of this is a blanket delete: a well-formed emoji subdivision flag (U+1F3F4…U+E007F) is preserved, and display_clean keeps the VS15/VS16 presentation selectors after a base and preserves the PUA (icon fonts), while the comparison presets (security_clean, normalize_user_input, strip_obfuscation) strip it. Four standalone helpers (strip_tags, strip_variation_selectors, strip_noncharacters, strip_pua) are exposed across the Rust core and the Python, Node, and Ruby bindings for composing policy directly. Output change: the comparison presets now remove these classes; idempotency is preserved (a terminal NFC recomposes any base+mark adjacency a strip creates).
- Bidi-direction conflict detection (has_bidi_conflict, #412). A new primitive that flags text mixing strong left-to-right and strong right-to-left characters: the precondition for Unicode Bidi display-reordering and the structural signal behind "BiDi Swap"-style spoofs (an LTR brand label stacked on an RTL domain, varonis.com.ו.קום). Unlike a U+202x override check, it fires on the real letters. Derived from disarm's own script ranges (no new table); exposed across the Rust core (disarm::api::has_bidi_conflict) and the Python (has_bidi_conflict, Text.has_bidi_conflict), Node (hasBidiConflict) and Ruby (Disarm.bidi_conflict?) bindings.
- HostnameAnalysis direction fields (#412). The Python HostnameAnalysis gains bidi_conflict (folded into suspicious), cross_label_script (the broader, non-folded cross-label fact), and label_scripts (per-label resolved scripts, left to right) for position-aware caller policy.
- Anomaly detection: has_anomalies/inspect_anomalies (#389). An out-of-place-character detector: it flags text disguising a real word via a cross-script homoglyph, leet, single-letter segmentation, a zero-width / bidi control, or zalgo, and reports a technical fact, not intent (like the hostname analysis). Built on the core's own primitives plus a caller-supplied common-word lexicon (used only by the leet/segmentation branches; the others are script-agnostic). Exposed across the Rust core (disarm::api) and the Python, Ruby, and Node bindings, with a per-language usage page. A dated defensive publication, published as prior art so the method stays freely usable.
- Reusable anomaly lexicon handle (Lexicon). The binding has_anomalies/inspect_anomalies functions rebuilt a hash set from the caller's word list on every call; a new opaque Lexicon class lets callers build the set once and reuse it across many calls (disarm.Lexicon(words) in Python, new Lexicon(words) in Node, Disarm::Lexicon.new(words) in Ruby). Both functions accept either the raw word collection (unchanged, back-compatible) or a Lexicon. The Rust core already amortizes this (it takes &HashSet<String>), so this closes the gap only the FFI bindings had.
- Node.js docs + doc-example gate (#44). A docs/node/ getting-started page and API reference plug into the language-neutral structure (#50), with Node.js added to the Getting started and API Reference nav. Every Node // => example is executed against the built addon by scripts/check_doc_node_examples.mjs (the Node analogue of the Sybil/Rust/Ruby doc gates), wired into the node CI job (which now also triggers on docs/**), so the examples can't rot.
- Node.js binding (#44). A new bindings/node/ napi-rs addon exposes the pure-Rust core to Node with a fully-typed, idiomatic TypeScript surface: camelCase functions, options objects with sensible defaults, string-union token types, and a DisarmError/DisarmInvalidArgument class hierarchy. It covers the full plain-function surface (transliterate, confusables, slugify, normalization, text cleaning, graphemes, filenames, reverse/untranslatable, script analysis) and ships .d.ts types. Two layers, like the gem: a raw napi shim (src/lib.rs) under a hand-written index.ts. Built + vitest-tested in CI against the in-repo core (the #374 drift gate, now node/"Node checks passed"), with a publish-node.yml release workflow (per-platform prebuilds + npm provenance) so npm i disarm needs no Rust toolchain.
- Ruby: filename, reverse-transliteration, and script-analysis ops (#375). Completes the plain-function parity backfill: sanitize_filename (platform:/max_length:/preserve_extension:), reverse_transliterate(lang:) (:el/:ru/:uk), find_untranslatable (→ { char:, offset: } hashes), detect_scripts, mixed_script?, and inspect_auto_lang (→ a :script/:chosen_lang/:reason/:discriminators_hit hash), all thin wrappers over the core disarm::api.
- Ruby: grapheme-cluster operations (#375). The binding gains grapheme_len, grapheme_split, grapheme_truncate, grapheme_width, and terminal_width: user-perceived-character counting/splitting/truncation and East Asian Width display measurement (ambiguous_wide: false by default), thin wrappers over the core disarm::api. Continues the Ruby↔core parity backfill (#375) and unblocks the graphemes Ruby docs.
- Ruby: normalization + text-cleaning primitives (#375). The binding gains normalize/normalized? (NFC/NFD/NFKC/NFKD), collapse_whitespace, strip_control_chars, strip_zero_width_chars, strip_bidi, and strip_zalgo/zalgo?, the first batch of the Ruby↔core parity backfill (#375), which unblocks honest normalization/text-cleaning Ruby docs. Each is a thin keyword-argument wrapper over the core disarm::api, carrying the core's defaults (normalize(form: :nfc), strip_zalgo(max_marks: 2), zalgo?(threshold: 3)).
- CI: the Ruby binding is built and RSpec'd against the local core on every PR (#374). A new ruby job in ci.yml compiles the gem (Ruby 3.1–3.3) and runs rake spec against the in-repo core, not the published one, on any PR that touches the binding or the core it wraps. It injects a CI-only [patch.crates-io] redirect so an unreleased core API change is actually exercised; the registry-core build in publish-ruby.yml is unchanged. A core change that breaks the gem (like the 0.10 tuple→struct return that shipped a broken gem, #364–#367) now fails the new "Ruby checks passed" gate on the PR that introduces it, not silently at release.
- CI: the docs' Rust and Ruby usage examples are now executed gates (#50). The per-language usage tabs are no longer illustrative: each is run in CI, the way the Python tabs already are (Sybil). scripts/check_doc_rust_examples.py extracts every rust doc block, compiles and runs it against the pure core with #; scripts/check_doc_ruby_examples.rb evals every Ruby # => line against the freshly-built gem. The Rust gate runs in the Doc tests job; the Ruby gate runs in the Ruby workflow, now also triggered on docs/**. Catches the signature/output drift that the tabs introduced (which had shipped as non-compiling Rust until this gate).
- Ruby: transliterate now accepts a lang: language profile. Previously the Ruby binding's transliterate exposed only scheme:, so it could not reach the core's per-language profiles (a parity gap vs Python/Rust). lang: accepts a String or Symbol and composes with scheme:, e.g. Disarm.transliterate("Київ", lang: :uk) # => "Kyiv". Implemented over the core's Transliterate builder via a generalized _transliterate_opts shim.
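
To make the first entry above concrete, here is a minimal sketch of a strip over the code-point classes it names, written against plain Python. It is an assumption-laden illustration, not disarm's implementation: the names strip_invisible_sketch and smuggle and the range table are hypothetical, and the sketch is a blanket delete, so it does not reproduce the preset-specific exceptions (emoji subdivision flags, VS15/VS16 after a base, icon-font PUA).

```python
# Illustrative sketch only, not disarm's implementation: a blanket strip that
# ignores the preset-specific exceptions (emoji flags, VS15/VS16, icon-font PUA).

# Hypothetical range table assembled from the Unicode block definitions.
_STRIP_RANGES = (
    (0x034F, 0x034F),      # Combining Grapheme Joiner
    (0xFE00, 0xFE0F),      # variation selectors VS1..VS16
    (0xE0100, 0xE01EF),    # variation selectors supplement
    (0xE0000, 0xE007F),    # Unicode Tags block
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
)


def _is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def strip_invisible_sketch(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0x2800:  # Braille Pattern Blank folds to a space
            out.append(" ")
        elif _is_noncharacter(cp) or any(lo <= cp <= hi for lo, hi in _STRIP_RANGES):
            continue  # drop the code point entirely
        else:
            out.append(ch)
    return "".join(out)


def smuggle(payload: str) -> str:
    # Re-encode ASCII as Tags-block characters, the "ASCII smuggling" channel.
    return "".join(chr(0xE0000 + ord(c)) for c in payload)


hidden = "hello" + smuggle("ignore previous instructions")
cleaned = strip_invisible_sketch(hidden)
```

Here hidden renders like the visible prefix alone in most UIs but carries the payload as Tags-block characters; the sketch removes that payload and leaves ordinary text, accents included, untouched.
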
Changed
- Re-point Greek small letter iota U+03B9 to the i-class, reverting #343 (#436). #343 had re-pointed the bare iota from i/і to the l/vertical-bar class (l/ӏ) to unify {ι, ӏ, ا}. That split the iota family, since the accented iotas U+03AF (ί) and U+03CA (ϊ) still folded to i in the same table, and was shadowed in the security presets: security_clean and strip_obfuscation run NFKC first, which decomposes the accented iotas to bare iota, so under #343 the whole family folded to l there. It also contradicted the upstream Unicode TR39 mapping (03B9 → 0069, i.e. i) and missed the dominant spoof: normalize_confusables("bιtcoin") returned "bltcoin" instead of colliding with "bitcoin". The bare iota now folds to i (latin)/і (cyrillic), consistent with its accented forms, so the entire iota family folds to the i-class under normalize_confusables and the NFKC-first presets, and the ι-for-i spoof is caught (bιtcoin → bitcoin). The genuine full-height bars, namely ӏ (palochka U+04CF), ا (alef U+0627), and the U+2502/U+FFE8 bars (#245), stay in the l-class. The only confusable-table change is the single iota row in each target.
- security_clean and normalize_user_input no longer neutralize path separators (#431, reverses #248). The presets previously rewrote / and \ to _ and collapsed .. runs so the output was safe to drop into a filesystem path. That is sink-specific output sanitization, out of scope for the canonicalization presets per THREAT_MODEL.md, and it corrupted legitimate input: URLs, file paths, and any /- or \-bearing string came back mangled ("https://example.com/path" → "https:__example.com_path"). The presets now pass separators through verbatim. Migration: if you fed preset output straight into a filesystem path, defend traversal at the sink instead by calling sanitize_filename on the final path component or validating against your own allowlist. A confusable fraction/division slash that NFKC folds to a real / is still normalized to / (that is canonicalization working as intended); it is just no longer rewritten away. The internal neutralize_path_separators helper is removed.
- collapse_whitespace folds the full whitespace set and the blank-rendering code points; control/zero-width stripping is now a separate step (#433). collapse_whitespace was category-driven and also deleted controls and zero-width characters inline. It now folds whitespace only, to a single space, over an explicit core-defined set: the line controls (TAB/LF/VT/FF/CR), the information separators (U+001C–U+001F), NEL, the Zs/Zl/Zp spaces, and a blank-rendering set that category detection cannot reach, namely the U+2800 Braille blank and the Hangul fillers U+115F/U+1160/U+3164/U+FFA0 (e.g. aㅤb → a b). Breaking: collapse_whitespace drops its strip_control/strip_zero_width parameters (Rust, Python, Node, Ruby); it no longer deletes anything. Compose strip_control_chars/strip_zero_width_chars before it for the old behaviour; the presets do this internally, so their output is unchanged except for the line-control fix below. strip_control_chars now preserves the whitespace controls (CR/VT/FF/NEL/U+001C–U+001F) so the fold can turn them into a space; it still removes NUL, DEL, and the rest of the C0/C1 block. The PRESETS metadata now lists the explicit strip_control/strip_zero_width steps.
- security_clean now caps combining marks (anti-zalgo, #429). The preset left zalgo-stacked tokens intact, so a mark-stacked admin did not match its base form in a denylist/dedup comparison that security_clean is meant to canonicalize. It now caps combining marks at 2 per base (the same threshold normalize_user_input already used), removing abusive stacking while preserving legitimate diacritics: security_clean stays accent-preserving (café → café, Việt → Việt; full accent folding remains in search_key/sort_key). The cap runs after the invisible/control strip so a stripped character between marks cannot split a run and hide the count (#121), and idempotency is verified by the raw-equality property test. Output change: inputs with more than two stacked marks per base are now capped.
- is_suspicious_hostname and has_anomalies now flag bidi-direction conflicts (#412). These detectors strengthen as disarm grows. A hostname that mixes strong-LTR and strong-RTL characters (the "BiDi Swap" shape, e.g. varonis.com.ו.קום) is now flagged suspicious via the new bidi_conflict signal; previously it slipped past mixed_script (which is per-label) and was only caught incidentally, if at all. The anomaly detector gains a bidi_mixed finding kind for a token mixing strong-LTR and strong-RTL letters: it is the precise, reorder-capable subset of mixed_script and additionally catches non-Latin RTL mixes (e.g. Cyrillic+Hebrew) the Latin-anchored mixed_script rule could not see. Behaviour change: some inputs that previously reported mixed_script (Latin+Hebrew/Arabic) now report bidi_mixed, and some that reported clean now flag. bidi_conflict=False / no bidi_mixed is not a safety guarantee.
- sort_key now preserves base accented characters (#99.1). sort_key is documented as a collation key (accented forms should stay distinct so the accent survives for ordering), but it shared search_key's full transliteration pass, so it ASCII-folded every accent ("Über" → "uber") and produced output identical to search_key. It now transliterates only non-Latin scripts, preserving Latin accents (sort_key("Über") → "über", sort_key("Café") → "café") while still folding Cyrillic/Greek/etc. to a consistent Latin form ("Война и мир" → "voyna i mir"). search_key and catalog_key are unchanged: they still fold accents for exact-match lookup and dedup. A language profile no longer expands an accented Latin letter in a sort key (sort_key("Über", lang="de") is "über", not "ueber"). Output change: persisted sort keys for accented-Latin input will differ from 0.10 and should be regenerated. Applies across the Rust core and the Python, Ruby, and Node bindings.
- Docs: synced the public XMR benchmark claims to the v2 note (#399). The README, the docs landing page, the adversarial-defense page, and the unidecode-migration guide led with the v1 curated-set headline (XMR = 1.000 on the hand-curated pairs). They now lead with the v2 broad-sample measurement over the 1,314 single-codepoint TR39 sources whose skeleton is a single Latin letter: instance XMR 0.634 / 0.682 (95% CI) with ~95% per-source coverage (stated as a distinct quantity), plus the NFKC (0.103) and TR39-skeleton-oracle (1.000, by construction) baselines, citing the v2 DOI 10.5281/zenodo.20618323. The curated 1.000 is retained only as a labeled sanity check, and the curated set is described correctly (18 hand-curated Cyrillic pairs; the 19 Greek pairs were a separate experiment). CITATION.cff is bumped to 0.11.0 with the note DOI.
- Docs: Node.js usage tabs across the guide pages (#44). The twelve guide pages that carry Python/Rust/Ruby tabs now also show a runnable Node tab, 38 tabs in all, matching the Ruby coverage. Every Node example is executed against the built addon by the doc gate (scripts/check_doc_node_examples.mjs).
- Docs: completed the language-neutral restructure (#50). The Adversarial-Text Defense concept page now shows Python/Rust/Ruby usage tabs (no bare Python), and the stale untabbed user-guide/getting-started.md was removed in favour of the per-language getting-started guides (now linked from the index nav). With every published binding carrying install + quickstart + API and mkdocs build --strict clean, all four #50 acceptance criteria are met.
- Docs: Ruby usage tabs across the guide pages unblocked by the parity backfill (#375/#50). The normalization, text-cleaning, graphemes, filenames, and language-detection guides now show a runnable Ruby tab beside Python and Rust, 17 tabs in all. Every Ruby example is executed against the built gem by the doc gate, so the tabs cannot rot.
- Docs: language-neutral scaffold, the first phase of the docs restructure (#50). Reshaped the documentation IA toward "language-neutral concept core + per-language specifics": a neutral landing headline (no longer "for Python") that routes by ecosystem; per-language Getting started pages under docs/python/, docs/rust/, and docs/ruby/; a shared docs/concepts/which-function.md concept page (lifting the #328 decision table into the neutral layer); and an mkdocs.yml nav reorganized into Getting started / Concepts / Guide / API Reference (Python · Rust) / Architecture / Migration / Reference / Project. Folded six previously orphaned pages into the nav. No library behaviour change; the per-topic concept/usage split and per-language example tabs land in following phases.
- Docs/metadata: scope transliterate() vs the TR39 confusable functions (#328). The headline identity led with "TR39 confusable analysis", while the most discoverable function, transliterate(), performs the opposite mapping: phonetic BGN/PCGN romanization (Cyrillic р → r), not TR39 visual confusable folding (р → p). Clarified across every entry point with no behaviour change: the identity one-liner (README, docs/index.md, Cargo.toml, pyproject.toml, mkdocs.yml, CITATION.cff) now says visual confusable analysis and phonetic transliteration; a new "Which function do I want?" decision table sits near the top of the README and docs landing page; and transliterate()'s docstring (hence docs/api/transforms.md) and the README Quick Start block now state it is romanization, not homoglyph defense, pointing to normalize_confusables()/strip_obfuscation() for the latter.
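
The bidi-direction conflict signal described above can be approximated with the standard library. This is a sketch under a stated assumption: it reads strong direction from the UCD Bidi_Class via unicodedata, whereas the entry says disarm derives it from its own script ranges. The function name has_bidi_conflict_sketch is hypothetical.

```python
import unicodedata

# Assumption: strong direction comes from the UCD Bidi_Class exposed by the
# standard library; disarm itself uses its own script ranges instead.
_STRONG_LTR = {"L"}
_STRONG_RTL = {"R", "AL"}


def has_bidi_conflict_sketch(text: str) -> bool:
    """Return True when text mixes strong-LTR and strong-RTL characters."""
    seen_ltr = seen_rtl = False
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in _STRONG_LTR:
            seen_ltr = True
        elif cls in _STRONG_RTL:
            seen_rtl = True
        if seen_ltr and seen_rtl:
            return True
    return False


# The "BiDi Swap" hostname shape from the entry, spelled with escapes so the
# logical order is unambiguous: varonis.com. + Hebrew vav + "." + Hebrew label.
swap_host = "varonis.com.\u05d5.\u05e7\u05d5\u05dd"
flagged = has_bidi_conflict_sketch(swap_host)
```

Digits, dots, and spaces are not strong characters, so a purely Hebrew label with numbers does not trip the check, while a Cyrillic+Hebrew mix does. As the entry notes, a negative result is not a safety guarantee.
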
Deprecated
- Presets renamed to mechanism names; old names deprecated (#430). The three
presets whose *_clean/normalize_user_input names overpromised safety (flagged
as documentation defects in THREAT_MODEL.md) are renamed to names that
describe their mechanism. The rename is byte-stable (old(x) == new(x) for all
inputs):

| Old name (deprecated) | New name |
|---|---|
| security_clean | canonicalize |
| display_clean | strip_format |
| normalize_user_input | canonicalize_strict |

The old names remain as deprecated aliases across every binding: Rust (free
functions + DisarmStr methods, #[deprecated(since = "0.11.0")]), Python
(each emits a DeprecationWarning; the Text builder's .security_clean() /
.display_clean() methods and the PRESETS keys are aliased too), Node
(securityClean, @deprecated), and Ruby (Disarm.security_clean, warns with
category: :deprecated). They will be removed in 1.0. catalog_key,
search_key, sort_key, ml_normalize, and strip_obfuscation are
unchanged.
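
The alias pattern described for Python (old name emits a DeprecationWarning and returns exactly what the new name returns) can be sketched as follows. The functions below are hypothetical stand-ins, not disarm's code: canonicalize_sketch has a placeholder body so the example runs without the library.

```python
import warnings


def canonicalize_sketch(text: str) -> str:
    # Placeholder body standing in for the renamed preset.
    return text.strip()


def security_clean_sketch(text: str) -> str:
    # Deprecated alias: warn, then delegate so old(x) == new(x) for all inputs.
    warnings.warn(
        "security_clean is deprecated; use canonicalize",
        DeprecationWarning,
        stacklevel=2,
    )
    return canonicalize_sketch(text)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    result = security_clean_sketch("  admin ")
```

Callers who want to find remaining uses of the old names before 1.0 can run their test suite with DeprecationWarning promoted to an error (for example via python -W error::DeprecationWarning).
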
Fixed
- Digit confusables fold to their digit, not a look-alike letter (#439). The confusable maps mapped many non-ASCII digit sources to letters or punctuation: Arabic-Indic ٠ → ., ١ → l, ٥ → o, Devanagari/Bengali/NKO zeros → o/O, and the Unicode 16 outlined digits →O/→l. The root cause: gen_confusables.py classifies digits via unicodedata, so running it under a Python whose Unicode table is older than the bundled confusables.txt silently mis-folds any digit that table doesn't yet know. The generator now (a) folds every Nd digit source to its canonical ASCII digit and (b) refuses to run under a Unicode table older than the data (warning on any mismatch). The maps are regenerated: every digit spoof now canonicalizes to the plain digit (٠/०/→0), keeping numbers numeric (the llm_guardrail "digits are never remapped to letters" guarantee).
- sort_key/search_key/catalog_key are now idempotent across scripts and cases (#419). The transliterating key presets ran transliterate before fold_case, so a cased letter whose folded form is in the table but whose original is not (e.g. the Georgian Mtavruli capital Ჱ, U+1CB1, is absent from the table but folds to Mkhedruli ჱ, which transliterates to he) only transliterated on the second pass, violating f(f(x)) == f(x). fold_case now runs before transliterate so both passes see the same form. search_key/catalog_key additionally fold again after transliterate, since full transliteration can emit uppercase ASCII (£ → GBP, № → No) that the pre-fold can't reach; those keys are now lowercase and stable. Output change: a few currency/symbol inputs that previously produced uppercase keys now fold to lowercase. Idempotency is pinned by per-preset property tests.
- security_clean/normalize_user_input idempotency on duplicate combining marks (#434, #416 residual). A duplicate combining mark broke the single NFC → confusables → NFC sandwich: NFC composed only one mark onto the base, the TR39 fold dropped it, and the recomposing NFC reattached the spare mark, re-creating a foldable composed character the next call would consume, so f(f(x)) != f(x) ("c"+◌̧+◌̧ → "ç" then "c"). The confusable fold is now iterated to a fixed point (each pass removes ≥1 mark, so it converges in a couple of iterations), making both presets true fixed points. The #416 Hypothesis idempotency property is re-broadened and the normalize_user_input Rust proptest strengthened from nfc-modulo to raw equality.
- Line controls no longer join tokens in collapse_whitespace (#433). TAB and LF folded to a space, but VT, FF, CR, NEL, and the information separators (U+001C–U+001F) were deleted, so a + CR + b became ab while a + LF + b became a b. All of them are Unicode whitespace; deleting them was an invisible-join (coalescence) vector. They now all fold to a single space, so a\rb → a b. The blank-rendering Braille and Hangul fillers, which category detection passed straight through, are folded too.
- security_clean/sort_key idempotency on invisible-separated combining marks (#416). When an invisible code point separated a base character from a combining mark (e.g. "a" + U+200B + combining acute + "b"), the leading NFKC passed over the still-separated mark and the later zero-width strip then left the base and mark adjacent but decomposed, so the composed form appeared only on the second call, violating the documented f(f(x)) == f(x) invariant (which THREAT_MODEL.md classifies as a vulnerability). An NFC pass after the strips now recomposes the adjacency on the first call, in the Rust core, so every binding inherits it. For security_clean a second, deeper cause was also fixed: TR39 confusable skeletoning is not normalization-stable (it drops the diacritic on some composed accented letters, ç → c and ø → o, but not the decomposed form, and can emit a decomposed skeleton like Ý → Y+◌́), so the confusable fold is now sandwiched between two NFC passes and the pipeline is a verified fixed point under a strengthened raw-equality proptest. Output change: for these previously non-idempotent inputs the first call now returns the composed NFC form. sort_key was affected only because it began preserving accents in #411 (search_key/catalog_key, which fold accents away, were never affected). A separate, pre-existing sort_key non-idempotency (transliterate-before-fold-case on a case pair) is tracked in #419.
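
The #416 failure mode and its fix can be reproduced with the standard library alone. The two pipelines below are deliberately tiny, hypothetical stand-ins (NFKC, then a zero-width strip, optionally followed by NFC); they are not the disarm presets, and the zero-width set is an assumption chosen for the example.

```python
import unicodedata

# Assumed zero-width set for the sketch; the real presets cover more classes.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_zero_width(text: str) -> str:
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def pipeline_before(text: str) -> str:
    # NFKC first, strip afterwards: the strip can create a new base+mark
    # adjacency that nothing recomposes on this call.
    return strip_zero_width(unicodedata.normalize("NFKC", text))


def pipeline_after(text: str) -> str:
    # Same steps plus a terminal NFC, which recomposes the adjacency at once.
    return unicodedata.normalize("NFC", pipeline_before(text))


# "a" + ZERO WIDTH SPACE + COMBINING ACUTE ACCENT + "b"
x = "a\u200b\u0301b"

first = pipeline_before(x)        # base and mark adjacent but still decomposed
second = pipeline_before(first)   # composed only on the second call
fixed_once = pipeline_after(x)
fixed_twice = pipeline_after(fixed_once)
```

With the terminal NFC, the first call already returns the composed form, so applying the pipeline again changes nothing.
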
Internal
- Dependency-freshness audit across every manifest + full dependabot coverage. Dependabot only watched the root cargo/uv/actions manifests, so the binding crates rotted a full major unseen (napi 2→3, magnus 0.7→0.8). .github/dependabot.yml now watches every manifest: the core crate and both binding workspaces (cargo), the Node package (npm), and the Ruby bundle (bundler). A new dev-time scripts/audit_dependencies.py audits all of them against their registries in one command (--strict to fail on a major lag), run weekly by the dependency-audit workflow. The guard makes any future config gap visible instead of silent. See DEPENDENCY_UPGRADES.md. The DCO check now exempts trusted GitHub App bots (*[bot] authors, e.g. dependabot[bot]), matching the official DCO app's default, so dependabot's PRs can finally satisfy branch protection and auto-merge instead of every bump being silently blocked.
- The Tier 3 exhaustive+formal gate now guards every publish, not just PyPI/crates.io (#159, #395). The pre-publish regimen, i.e. the exhaustive Rust domain tests (#[ignore]) and the Python formal invariants (@pytest.mark.formal), moved out of an inline job in publish.yml into a reusable workflow_call workflow (.github/workflows/tier3.yml) that all four publish paths depend on: the PyPI wheel, the crates.io core, the RubyGems gem, and the npm addon. Previously only the wheel and the core were gated, so a release whose core failed the exhaustive net could still ship the bindings. Also wired the exhaustive grapheme-integrity suite (exhaustive_grapheme, #174) into the gate alongside exhaustive_transliterate; it was documented "run before release" but had never actually been in the release workflow.
- Binding publish workflows build against the in-repo core on non-publish events (#374, #396). publish-ruby.yml's test job and publish-node.yml's build job compiled the binding against the published core, so a pre-release binding that calls a core API not yet on crates.io (e.g. has_anomalies before this release) failed to build on every PR/push, leaving main red until the matching core shipped. They now apply the same CI-only [patch.crates-io] redirect to the in-repo core that ci.yml's drift gate uses, but only on push/pull_request; on release/workflow_dispatch the shipped gem and prebuilt addon still build against the published core, unchanged.
- Node binding: bumped vitest 3 → 4, dropping a vulnerable dev-only esbuild (#392, #394). The Node binding's test runner pulled in esbuild 0.27.7, a dev-only transitive dependency that was never part of the published npm package, which carried two HIGH advisories (GHSA-gv7w-rqvm-qjhr, GHSA-g7r4-m6w7-qqqr). vitest 4 pulls vite 8, which demotes esbuild to an optional peer dependency, so the vulnerable package drops out of the resolved tree entirely (npm audit reports zero vulnerabilities). The Node test matrix is unchanged (20/22).
[0.10.0] - 2026-06-15
The multi-language milestone (epic #326): disarm becomes a publishable,
pyo3-free Rust crate with a first-class idiomatic Rust API, gains a Ruby
binding, and adds opt-in diagnostic logging — all over a single shared
pure-Rust core. The Python package is unchanged for callers (same import disarm
surface); the work is the core extraction and the new non-Python surfaces.
Added
- Pure-Rust core, published to crates.io (#38, #42). The default build is now the pyo3-free core (default = []); the Python extension is the opt-in extension-module feature, so cargo add disarm pulls a clean Rust library with no libpython in its dependency tree (enforced by a CI gate: the default cargo tree -e no-dev tree must contain no pyo3, matched case-insensitively). The codebase is organized in three layers: Layer-1 pub(crate) algorithm cores, Layer-2 the public disarm::api, and Layer-3b the feature-gated pyo3 shims, all consuming one implementation.
- Idiomatic Rust API (disarm::api) (#352, #361, #362). The semver-governed crates.io surface: typed enums (TargetScript, Scheme, NormalizationForm, UrlComponent, Platform, ReverseLang) that each round-trip via as_str/Display/FromStr; the Transliterate builder with Scheme/OnUnknown (which carries its replacement in the Replace(String) variant); an opaque Error with a stable ErrorKind/code(); Cow<'_, str> borrow-on-no-op returns; a graphemes() iterator; the SlugConfig builder; the DisarmStr extension trait for method-call syntax; named #[non_exhaustive] struct returns (EncodingDetection, DecodedText, HostnameAnalysis, Untranslatable; no anonymous tuples); and a guarded process-global registration API (register_lang/register_replacements/remove_replacement/clear_replacements/seal_registrations) that enforces the registration cap and the one-way seal latch. Two contract tests fail CI if a pub fn ever returns a tuple or a token enum loses its round-trip.
- Ruby bindings: the disarm RubyGem (#45, #357). A magnus-based native extension wrapping the pure-Rust core (no Python), with an idiomatic Ruby surface: keyword arguments with defaults, symbol tokens (:latin, :strict_iso9, …), a single transliterate(text, scheme:), and a Disarm::Error < StandardError hierarchy. Precompiled platform gems (Linux x86_64/aarch64, macOS x86_64/arm64, Windows) install with no local Rust toolchain.
- Opt-in, binding-neutral diagnostic logging (#208, #358). Behind the log/log-content features (off by default), the core emits structured records at API boundaries via the log facade, at zero cost when off (the macros compile to nothing) and never inside a per-codepoint hot loop (enforced by a source-scan test). Default-level records carry metadata only (lengths, counts, flags, durations, error codes; never input or output content, enforced by a redaction sentinel test); the log-content TRACE escape hatch routes its truncated samples through disarm's own strip_log_injection (dogfooding) so a log line can never forge a record.
Changed
- Native module renamed disarm._disarm → disarm._core (#42). The public Python API is unchanged: callers import disarm. The native module name is an implementation detail the public surface doesn't require; the package's own internals (and the type-stub drift checks) reference disarm._core directly, so any consumer reaching into it should update the path.
Fixed
- Confusables: cross-script ASCII folds and additive Greek/Cyrillic pairs (#341, #342, #343), plus the halfwidth vertical form U+FFE8 residue (#245).
- Terminal width: corrected the additivity-across-space precondition (#279).
Security
- HAI-SDLC hardening pass over the Rust core (#360): a deep multi-pass review
(0 critical / 0 high) actioned into 21 fixes — tightened a hostname IPv6-literal
zone-id check, added limit-rejection logging, a unique-slug truncation-error fix,
and an allocation-free
is_normalized, among others.
Internal
- Wired Tier 3 (exhaustive + formal) into the release/publish gate (#159, epic #326). publish.yml now runs a tier3 job on the release/publish trigger that executes the exhaustive Rust domain tests (cargo test --no-default-features --test exhaustive_transliterate -- --ignored) and the Python formal invariants (pytest -m formal, against a freshly built wheel). Every wheel/sdist build job and the publish job needs: it, so a Tier-3 failure blocks the upload to PyPI, closing the gap where these tiers were a manual pre-release step. They remain excluded from fast PR CI; the #[ignore]/@pytest.mark.formal markers are untouched.
- Split the 1,200-line src/api.rs into cohesive submodules (api/{safety,text,transliterate,presets}.rs) re-exported from api/mod.rs, with the DisarmStr trait in the hub (#361). No public-path change.
- translit-rs 0.8.2 redirect shim published so the old PyPI name points users at disarm (#264 follow-up).
[0.9.1] - 2026-06-13
Added
- strip_log_injection(text, *, replacement='\ufffd', keep_tab=False) (#307). A stateless, character-level encoder that makes untrusted text safe to write as a log line: it replaces CR/LF/NEL/LS/PS (record forging), NUL/C0/C1 controls (parser corruption), and ESC/DEL (terminal hijack via ANSI escapes) with replacement (default U+FFFD). \t is neutralized by default (keep_tab=False) to block TSV/logfmt column injection. Idempotent; ASCII-clean fast-path returns the original object; never emits a raw CR/LF/ESC. It owns the log-record and operator-terminal sinks but makes no HTML-log-viewer-safety claim (that is stored XSS; encode at the viewer with escape_html) and is not a log4shell defense (see Threat Model).
- escape_html(text) and percent_encode(text, *, component) output encoders (#311). Standalone terminal encoders applied at the output sink, deliberately not TextPipeline/PROFILES steps (a pipeline is context-free; baking encoding in invites double-encoding and wrong-context escaping). escape_html escapes the five HTML metacharacters for element/quoted-attribute context (ASCII fast-path returns the original object; not idempotent by design). percent_encode does RFC 3986 percent-encoding for a required Component (PATH/SEGMENT/QUERY/FORM; UTF-8 byte-based, ASCII output, FORM uses space → +). Both are mechanism-named and carry the #306 scope-boundary discipline: they are the narrow, context-pinned exception to "disarm is not an output sanitizer," not a general XSS/injection defense (see Threat Model).
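
The behaviour listed for strip_log_injection can be sketched in a few lines of plain Python. This is an approximation built only from the description above (which code points are replaced, the keep_tab switch, and the clean fast path); the name strip_log_injection_sketch is hypothetical and the real encoder may differ in detail.

```python
def strip_log_injection_sketch(
    text: str, *, replacement: str = "\ufffd", keep_tab: bool = False
) -> str:
    def is_unsafe(ch: str) -> bool:
        cp = ord(ch)
        if ch == "\t":
            return not keep_tab          # TAB is neutralized unless kept
        return (
            cp < 0x20                    # C0 controls: NUL, CR, LF, ESC, ...
            or 0x7F <= cp <= 0x9F        # DEL and C1 controls (includes NEL)
            or cp in (0x2028, 0x2029)    # LINE / PARAGRAPH SEPARATOR
        )

    if not any(is_unsafe(ch) for ch in text):
        return text                      # clean input: return the same object
    return "".join(replacement if is_unsafe(ch) else ch for ch in text)


# A forged second record and an ANSI colour escape, both neutralized.
forged = "login ok\n2026-01-01 admin granted"
safe_line = strip_log_injection_sketch(forged)
safe_ansi = strip_log_injection_sketch("\x1b[31mred")
```

Because the default replacement U+FFFD is not itself an unsafe character, applying the sketch to its own output is a no-op, which matches the idempotency the entry claims.
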
Changed (breaking)¶
- Renamed `is_safe_hostname()` → `is_suspicious_hostname()` and inverted its boolean. The old name asserted a safety it cannot guarantee: `safe=True` only meant "no mixed-script label and no bundled-table confusable found," yet whole-script spoofs and out-of-table confusables still returned `safe=True` (the false-assurance pattern #306/#308/#309 removed elsewhere, but as a literal `safe` boolean a caller branches on). The function now returns `(suspicious, analysis)` where `suspicious=True` means a problem was detected; the result struct `SafeHostnameDetails` → `HostnameAnalysis`, field `safe` → `suspicious` (inverted). The granular `scripts`/`mixed_script`/`has_confusables`/`canonical` fields are unchanged. No alias; invert call sites: `safe, d = is_safe_hostname(h)` → `suspicious, a = is_suspicious_hostname(h)`. (#313)
- Renamed policy profile `web_input_sanitize` → `normalize_web_input`. Follows the `sanitize_user_input` → `normalize_user_input` rename: "sanitize" wrongly implied output/injection safety, and was especially misleading here because this profile is lighter than `normalize_user_input()` (NFKC + confusables only; no bidi/zero-width/control/zalgo stripping). Use `get_pipeline("normalize_web_input")`. No alias is kept.
- Renamed `sanitize_user_input()` → `normalize_user_input()`. The old name implied output sanitization (injection safety); this preset performs input Unicode normalization only and is not an XSS/SQL defense (see Threat Model). The `PRESETS` registry key changes to match (`"normalize_user_input"`). No alias is kept; update call sites directly.
Documentation¶
- Stated the XSS/injection scope boundary explicitly (#306): README, the docs site, and THREAT_MODEL now say plainly that disarm normalizes input and is not an output sanitizer — it performs no HTML/JS/SQL/shell escaping and never replaces context-aware output encoding at the sink (NFKC can even surface ASCII metacharacters from fullwidth lookalikes). This boundary is the conceptual basis for the renames and the new output encoders in this release.
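The parenthetical about NFKC can be checked with the standard library alone. The sketch below uses `unicodedata.normalize` and `html.escape` as stand-ins (neither is disarm's API) to show that normalization can produce ASCII metacharacters, so escaping still has to happen at the output sink.

```python
import html
import unicodedata

# Fullwidth lookalikes: U+FF1C and U+FF1E are not ASCII metacharacters.
untrusted = "\uff1cscript\uff1ealert(1)\uff1c/script\uff1e"

normalized = unicodedata.normalize("NFKC", untrusted)
print(normalized)  # the fullwidth brackets are now ASCII '<' and '>'

# Context-aware encoding at the HTML sink is still required after normalization.
print(html.escape(normalized, quote=True))
```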
Security¶
- Supply-chain hardening (#260): added `cargo deny` (license allow-list, banned/wildcard crates, crates.io-only sources via `deny.toml`) to the required "Rust checks passed" gate, alongside the existing `cargo audit`. Releases now attach a CycloneDX SBOM (`*.cdx.json`) of the Rust dependency graph, and PyPI distributions carry PEP 740 build-provenance attestations via OIDC Trusted Publishing. Verification is documented in SECURITY.md.
- Bumped `pyo3` 0.24 → 0.29, resolving two upstream advisories: `GHSA-36hh-v3qg-5jq4` (HIGH: out-of-bounds read in `nth`/`nth_back` for `PyList`/`PyTuple` iterators) and `GHSA-chgr-c6px-7xpp` (MEDIUM: missing `Sync` bound on `PyCFunction::new_closure` closures). Includes the binding-layer API migration the bump requires (GIL `with_gil`/`allow_threads` → `attach`/`detach`, `PyObject` → `Py<PyAny>`, `downcast_exact` → `cast_exact`); no functional change to any transform. (#315)
Internal¶
- Docs: build the MkDocs site in CI and deploy to Cloudflare Pages (served at the unchanged `docs.disarm.dev`), replacing the Read the Docs trigger. `mkdocs build --strict` runs in GitHub Actions (Python-only; mkdocstrings parses source statically); push to `main` deploys production, PRs get preview deploys. Legacy `/en/latest/*` URLs 301 to root via `docs/_redirects`. Removed `.readthedocs.yaml` and `RTD_TOKEN`. (#314)
- CI: replaced the custom `conversations-resolved.yml` workflow with GitHub's native "Require conversation resolution before merging" branch-protection setting. The bespoke "Conversations resolved" status check (#55) was flaky: stale check runs lingered after threads were resolved and blocked otherwise-green PRs. Behavior is unchanged (unresolved review threads still block merge), now enforced by the built-in gate instead of a workflow + required status check.
[0.9.0] — 2026-06-11¶
The first release under the disarm name — the continuation of translit-rs
(last released as 0.8.1). See #264 for the rename rationale. The 0.0.0 entries
on PyPI / crates.io / npm are name-reservation placeholders, not releases; 0.9.0
is the first functional disarm release.
Changed¶
- Renamed the project from `translit` to `disarm` (#264). This unifies the distribution and import names under a single `disarm`:
  - PyPI distribution `translit-rs` → `disarm`; `import translit` → `import disarm`.
  - Native module `translit._translit` → `disarm._disarm`; crate `translit` → `disarm`.
  - Console script `translit` → `disarm`.
- Breaking: the public base exception `TranslitError` → `DisarmError` (the subclasses `InvalidArgumentError` / `ResourceLimitError` / `UnsupportedError` keep their names). `DisarmError` remains a `ValueError` subclass, so `except ValueError` keeps working.
- Breaking: the context-dictionary environment variable `TRANSLIT_DICT_DIR` → `DISARM_DICT_DIR`.
- Canonical URLs moved to `https://disarm.dev` / `https://docs.disarm.dev`; the repository moved to `https://github.com/raeq/disarm`.
Fixed¶
- `uv.lock` now declares `requires-python = ">=3.10"`, matching `pyproject.toml` (it had drifted to `>=3.9` after the 3.10 floor landed in #277).
[0.8.1] — 2026-06-11¶
The final translit-rs release and the close of the 0.8 performance-hardening
arc. The project continues as disarm from 0.9.0 (#264); 0.8.1 exists to
publish honest, production-true benchmark numbers before the rename.
Changed¶
- Benchmarks now run in the fresh-string regime (#277, #302): every timed call receives a newly constructed `str`, the way production traffic always does. The prior cached-object measurement let CPython's per-object `AsUTF8` cache hide ~105–137 ns/call of UTF-8 encode cost that only `translit` pays (pure-Python comparators never call `AsUTF8`), flattering it. JSON records now carry `regime: fresh-string/v2`; pre-flip history is the cached `v1` regime and must not be compared across regimes.
- README short-string figures updated to the measured fresh-regime values: ~17× vs Unidecode (Latin), ~14× (mixed scripts), ~13× (Cyrillic/Greek); ~65 ns ASCII passthrough; the four-cell Unidecode-own sweep still holds (~1.3× on Unidecode's strongest case to ~25×), with a methodology note explaining the regime.
[0.8.0] — 2026-06-11¶
A performance and hardening release. The headline is a benchmark-gated
optimisation programme (#233) that makes short-string transliterate roughly
15–21× faster than Unidecode (up from ~7–9×) and beats Unidecode on its
own benchmark, while shrinking the library's static and resident memory.
Alongside it, a Unicode-security hardening sweep tightens is_safe_hostname,
the security presets, and the stateful slugifiers. Most changes are
behaviour-preserving; the exceptions are called out under Upgrade notes.
Upgrade notes¶
- Minimum Python is now 3.10 (was 3.9). The extension targets the stable-ABI floor `abi3-py310`, so a single wheel runs on 3.10+ and the per-call Python→Rust path crosses the boundary only once (#277). Python 3.9 wheels are no longer produced.
- `is_safe_hostname` now flags every mixed-script label as unsafe (#254), not only the four Latin-paired high-risk combinations. A label combining two scripts with no Latin confusable (e.g. Greek + Cyrillic) previously reported `safe=True`; it now returns `safe=False`. This also flags benign combinations (e.g. Latin + CJK); read the `mixed_script`/`scripts` fields if you need a more permissive policy. The check fails closed by design.
- Security presets no longer synthesise path separators (#248): confusable characters that normalise to `/`, `\`, or `..` can no longer pass through the security/filename presets to forge path structure.
- `rag_ingest` now runs the confusables step (#258): Unicode homoglyph spoofs are canonicalised during RAG ingestion instead of surviving it. Output of the `rag_ingest` preset may change for homoglyph-bearing input.
- Stateful slugifiers validate `lang` at construction (`Slugify`, `UniqueSlugify`), closing the gap the 0.7.0 validation pushdown missed (#257); an invalid `lang=` now raises instead of being silently ignored. `UniqueSlugify` also honours property mutations made after construction (#249).
- Auto-language discriminator behaviour was reconciled with its documented contract (#253); auto-detection results may differ for a few ambiguous inputs.
- Correctness edge cases fixed (#255), which may change output: reverse transliteration of all-caps digraphs and a `grapheme_truncate` overflow case.
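The "every mixed-script label" rule can be approximated with the standard library. Python has no built-in Unicode Script property, so this sketch infers a script from the first word of each letter's Unicode name, which is a rough stand-in and an assumption of this example; `label_scripts` and `is_mixed_script` are hypothetical names, not library functions.

```python
import unicodedata


def label_scripts(label: str) -> set[str]:
    """Approximate the scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if not ch.isalpha():
            continue                       # digits and hyphens carry no script here
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])   # e.g. "LATIN", "GREEK", "CYRILLIC", "CJK"
    return scripts


def is_mixed_script(label: str) -> bool:
    """Fail closed: any label with more than one script is flagged."""
    return len(label_scripts(label)) > 1


print(is_mixed_script("example"))        # single script
print(is_mixed_script("p\u0430ypal"))    # Latin plus Cyrillic U+0430
print(is_mixed_script("\u03b1\u0431"))   # Greek plus Cyrillic, no Latin involved
```

Under this rule a benign Latin + CJK label is flagged too, which matches the fail-closed behaviour the note describes.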
Performance¶
- Short-string `transliterate`: ~15–21× faster than Unidecode (#277). A call now crosses the Python→Rust boundary exactly once with Rust-side keyword defaults, extracts UTF-8 zero-copy, and returns already-ASCII input as the original `str` object via a borrowed `Cow`, roughly 70 ns with no allocation.
- Beats Unidecode on its own benchmark (#281): translit wins all four cells of Unidecode's `expect_ascii`/`expect_nonascii` × ASCII/non-ASCII matrix, including Unidecode's strongest (ASCII-passthrough) case.
- Smaller static tables (#237): the default BMP transliteration table became a two-level page-table + interned-blob trie (~1 MB → ~58 KB), hanzi→pinyin a dense interned array (~600 KB → ~50 KB), and the 11,172 Hangul romanisations a single packed blob. No runtime data loading; no `unsafe`.
- Zero-copy context dictionaries (#238): the Arabic/Persian/Hebrew dictionaries are read once and indexed by `(offset, len)` spans instead of parsed into nested `HashMap`s of owned strings, roughly halving their resident memory. Lookup is binary search; the two-step bigram path allocates no per-token key.
- Linear-time scanning via Aho-Corasick (#242): global and slug replacements use longest/first-match automata instead of repeated per-position probing; the `UniqueSlugify` collision counter is amortised; and multi-codepoint emoji are matched through a code-point trie.
- Per-character hot-loop improvements: resolve-once language tables, block-table dispatch, ASCII-run skipping (#235); fewer copies on the ASCII/identity path (#236); chunked batch extraction that caps peak memory (#239); single-pass strict mode, O(u)→O(1) in time and space (#240); further ASCII fast-paths and removal of O(n·k) scans (#252).
- A benchmark harness with a deterministic iai-callgrind estimated-cycle gate guards every PR against regressions in CI (#234).
Note: the batch (`list[str]`) API's advantage over a Python loop has narrowed for short strings now that a scalar call is ~70 ns; for tiny inputs it is at rough parity. Its durable value is the single GIL-released crossing (thread parallelism), not a raw per-call speedup. See `docs/performance.md`.
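The two-level page-table idea behind the smaller BMP table can be sketched as follows. This illustrates the general technique only: the page size, the `build_table`/`lookup` names, and the sample mappings are assumptions of this example, and the real table is a generated Rust structure.

```python
PAGE_BITS = 8                      # assumed page size: 256 code points per page
PAGE_SIZE = 1 << PAGE_BITS


def build_table(mapping: dict[int, str]):
    """Build (index, pages): index maps a high byte to a shared page number."""
    pages: list[tuple[str, ...]] = []
    page_ids: dict[tuple[str, ...], int] = {}
    index: list[int] = []
    for hi in range(0x10000 >> PAGE_BITS):          # BMP only
        base = hi << PAGE_BITS
        page = tuple(mapping.get(base + lo, "") for lo in range(PAGE_SIZE))
        if page not in page_ids:                    # identical pages are stored once
            page_ids[page] = len(pages)
            pages.append(page)
        index.append(page_ids[page])
    return index, pages


def lookup(index, pages, ch: str) -> str:
    """Two array reads: pick the page from the high bits, the slot from the low bits."""
    cp = ord(ch)
    if cp > 0xFFFF:
        return ""
    return pages[index[cp >> PAGE_BITS]][cp & (PAGE_SIZE - 1)]


index, pages = build_table({0x00E9: "e", 0x0416: "Zh", 0x03B1: "a"})
print(len(pages), lookup(index, pages, "\u0416"))
```

Most of the 256 high-byte slots point at one shared empty page, which is where the size reduction in a sparse table comes from.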
Added¶
- `TextPipeline(preset=…)` constructor and related new-surface ergonomics (#259).
- CLI: `slugify` honours `--lang`; the `strip_bidi`/`strip_zalgo` steps are exposed; error output is cleaned up (#250).
- The `errors` parameter annotation now includes `"strict"` in the callable-module and `Text` wrappers (#247).
Changed¶
- `docs/performance.md` rewritten so every claim is CI-executed (Sybil) or linked to a recorded measurement, with a stated margin policy, varied scenarios, a prominent "where we are slower" section, and a credit paragraph for Unidecode and its lineage (#291).
Internal¶
- Resource-limit constants centralised in a single `src/limits.rs` module so the library's resource posture has one audit surface (#256).
- Cross-cutting Rust-core helpers (`apply_replacements`, `emit_warning`) de-duplicated (#251).
- Incorrect docstring examples in the Python wrapper modules corrected (#246).
[0.7.0] — 2026-06-10¶
A feature and architecture release. Headlines: a unified, catchable exception
hierarchy; terminal column-width measurement (terminal_width /
grapheme_width); native errors="strict" transliteration; LLM/RAG
guardrail pipeline presets; and a substantial push of validation and
configuration logic down into the Rust core, so the upcoming multi-language
bindings inherit one behaviour instead of reimplementing it. Most changes are
behaviour-preserving; the exceptions are called out under Upgrade notes.
Upgrade notes¶
- Exceptions now form a hierarchy. Every library error subclasses `TranslitError`, with `InvalidArgumentError`, `ResourceLimitError`, and `UnsupportedError` beneath it. `TranslitError` remains a `ValueError` subclass, so existing `except ValueError` keeps working. Several error message strings were enriched/standardised (#186, #187): code matching exact message text may need updating; code matching exception types is unaffected.
- `lang=` is validated even for ASCII input (#197). A binding-side ASCII fast path previously skipped language validation, so `transliterate("abc", lang="zz")` silently returned the input; it now raises `InvalidArgumentError`, matching how non-ASCII input always behaved.
- `slugify_filename` / `Slugify(safe_chars=…)` output corrected (see Fixed): `slugify_filename("My Report.pdf")` now returns `"My_Report.pdf"`, not `"My.Report_pdf"`. Output for inputs that use `safe_chars` may change.
- New modes: `errors="strict"` for `transliterate` (#184) and `decode_to_utf8(strict=True)` (#189).
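The hierarchy described above can be mirrored with stand-in classes to show why existing handlers keep working. The class bodies and the `parse_lang` function are hypothetical; only the class names and the `ValueError` base come from the notes above.

```python
class TranslitError(ValueError):
    """Stand-in for the library's base exception (a ValueError subclass)."""


class InvalidArgumentError(TranslitError):
    pass


class ResourceLimitError(TranslitError):
    pass


class UnsupportedError(TranslitError):
    pass


def parse_lang(lang: str) -> str:
    """Hypothetical validator, used only to raise one of the errors."""
    if lang not in {"ru", "el", "de"}:
        raise InvalidArgumentError(f"unknown lang: {lang!r}")
    return lang


# The same error is caught by the old handler, the new base, and the subclass.
for handler in (ValueError, TranslitError, InvalidArgumentError):
    try:
        parse_lang("zz")
    except handler as exc:
        print(handler.__name__, "caught", type(exc).__name__)
```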
Added¶
- `terminal_width` / `grapheme_width` (#224): terminal column width per grapheme cluster (UAX #11 East Asian Width). Wide/fullwidth and emoji-presented clusters are 2 columns; combining marks, controls, and zero-width characters are 0. Ambiguous characters are 1 by default, or 2 with `ambiguous_wide=True`. Width data is generated at build time from the pinned UCD (no runtime data, no `unsafe`). Measures cells, not pixels; tabs are not expanded.
- `errors="strict"` + `find_untranslatable` (#184): strict transliteration raises on the first untranslatable character (reporting it and its byte offset); `find_untranslatable` returns all of them without raising.
- Guardrail pipeline presets (#139): `TextPipeline` gains `strip_bidi` and `strip_zalgo` steps and the `llm_guardrail`/`rag_ingest` named profiles for LLM/RAG input sanitisation.
- `get_pipeline` / `list_profiles` (#229): the named policy-profile registry now lives in the Rust core; the Python helpers are thin wrappers over it.
- `decode_to_utf8(strict=True)` (#189): raise on lossy/replacement decoding instead of silently substituting U+FFFD.
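The width rules listed for `terminal_width` can be approximated with `unicodedata`. This sketch works per code point rather than per grapheme cluster, so it is a simplification (emoji sequences and presentation selectors are not modelled); `column_width` is a hypothetical name for this example.

```python
import unicodedata


def column_width(text: str, *, ambiguous_wide: bool = False) -> int:
    """Sum per-code-point column widths (no grapheme clustering)."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf", "Cc"):
            continue                                  # combining, format, control: 0 columns
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            total += 2                                # wide and fullwidth
        elif eaw == "A":
            total += 2 if ambiguous_wide else 1       # ambiguous: caller's choice
        else:
            total += 1
    return total


print(column_width("abc"), column_width("\u65e5\u672c\u8a9e"))
```

Because control characters count as zero, a tab contributes no columns here, consistent with "tabs are not expanded".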
Changed¶
- Unified exception hierarchy (#183): the Python error surface is a `TranslitError` base with categorised subclasses; sites that previously raised bare `ValueError` are unified (foundation laid in 0.6.3 via #181).
- Validation moved into the Rust core (#185, #217, #229, #230, #231): enum validation, the `transliterate()` argument-conflict matrix, non-negative `max_length`/`max_graphemes` checks, `safe_chars`, and `min_confidence` range-checking now live in the core, so other bindings enforce the identical contract without reimplementing it. The Python layer keeps only type guards.
- Actionable error messages (#186, #187): weak messages now name the offending value, list valid options, and suggest a "did you mean…?" where applicable; message style is standardised across the surface.
- Error cause chains (#188): wrapped errors surface the underlying cause via `__cause__` rather than flattening it into the message.
- `TextPipeline` step ordering (#174) is derived from a single source of truth, removing drift between configuration and execution order.
- All-ASCII preset fast path (#198): presets skip the NFKC pass for pure-ASCII input (behaviour-preserving).
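The cause-chain behaviour corresponds to Python's `raise ... from ...` form. A minimal sketch, with a hypothetical `DecodeFailure` wrapper and `decode_strict` function standing in for a library call:

```python
class DecodeFailure(ValueError):
    """Hypothetical wrapper error for this sketch."""


def decode_strict(data: bytes) -> str:
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # 'raise ... from exc' stores the original error on __cause__
        # instead of folding its text into the new message.
        raise DecodeFailure("input is not valid UTF-8") from exc


try:
    decode_strict(b"\xff\xfe")
except DecodeFailure as err:
    print(repr(err), "<-", type(err.__cause__).__name__)
```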
Fixed¶
- `slugify_filename` / `Slugify(safe_chars=…)` preserved safe characters at the wrong positions: `slugify_filename("My Report.pdf")` returned `"My.Report_pdf"` instead of the awesome-slugify-correct `"My_Report.pdf"`. `safe_chars` are now handled natively in the Rust core: kept verbatim and treated as word characters so they hold their position (#156, #230). The prior test only covered a dot-free input, so the bug was uncaught; regression tests now cover filenames with extensions, multiple dots, and `UniqueSlugify` + `max_length`.
- `slugify(default=…)` is now sanitised through the same slug pipeline (so a caller-supplied fallback cannot smuggle path-traversal or URL metacharacters into output documented as URL-safe), threads through the stateful `Slugifier`/`UniqueSlugifier` forms, and a negative `max_length` now raises a catchable `InvalidArgumentError` on both the scalar and batch paths instead of an uncatchable `OverflowError` (#193, #169).
- Low-severity hardening bundle (#200): eight small robustness fixes (bounds, overflow, and edge-case handling) gathered into one pass.
Security¶
- The RustSec advisory audit (`cargo-audit`) now blocks merge via the required "Rust checks passed" gate on every PR, because an advisory can land on a dependency without any code change here (#195).
Removed¶
- Docker image build/publish and its Trivy CVE scan (#138). translit is a `pip install`-first library; previously published images remain as historical artifacts, but no new ones are produced. Install the CLI via `pip install translit-rs`.
Documentation¶
- Executable cookbook (#154, #91, #140, #156, #172): a Sybil doc-test harness with a CI gate, unidecode→translit migration recipes, an "LLM pipelines" page, a tokenizer-preprocessing page, and an anti-rot lint that turned 307 decorative `# =>` claims into checked assertions.
- Normalize-first canonicalisation recipe (#174) and a formal-verification assurance taxonomy (#223: proof-by-exhaustion / structural / property-tested, tagging each I1–I7 invariant), plus grapheme-integrity property tests (#174).
- The project adopted the Developer Certificate of Origin (#165); all commits are signed off. The custom-emoji-provider 9-codepoint window cap is now documented (#199).
[0.6.3] — 2026-06-08¶
A correctness, maintenance, and architecture-foundation release. No output-affecting
changes — every fix is behaviour-preserving and the one new public behaviour
(slugify(default=...)) is opt-in. Headline: a pure-Rust error model is now in place,
laying the foundation for the multi-language bindings on the roadmap.
Upgrade notes¶
- No output-affecting changes. Existing output and every exception type/message are unchanged.
- New opt-in: `slugify(text, default="…")` returns the fallback when the input has no sluggable characters (emoji / punctuation / zero-width) instead of `""`. `default=None` (the default) preserves the prior empty-string behaviour.
Added¶
- `slugify(default=...)`: opt-in fallback for inputs that would otherwise slug to the empty string, closing an empty-slug routing hazard (#97).
Fixed¶
- `PRESETS["strip_obfuscation"]` metadata now reflects the real pipeline order (`confusables` runs after `demojize`), matching `src/presets.rs` (#141).
- Lock-poison recovery now emits a Python `UserWarning` naming the recovered table, instead of a silent stderr line (#117).
- `docs/api/exceptions.md` corrected: `TranslitError` inherits from `ValueError` (not `Exception`), and every example message string now matches the real output (#182).
Changed (internal — behaviour-preserving)¶
- Error model (#181, part of #180): a pure-Rust `Error` enum (`thiserror`) with a stable `code()` per variant and a single `From<Error> for PyErr` boundary; ~35 error sites migrated off in-core `PyErr` construction. Removes the core↔PyO3 coupling and lays the foundation for non-Python bindings. Python exception types and messages are unchanged.
- Dependencies: `phf`/`phf_codegen` 0.11 → 0.13, `criterion` 0.5 → 0.8, `chardetng` 0.1 → 1.0, each migrated and verified behaviour-preserving (#146, #153, #164).
- `build.rs` now auto-discovers language override tables: adding a language is just dropping in a `translit_lang_*.tsv` (#74).
- Generated `.pyi` stubs are now guarded by a stub/binary signature drift-check, which caught and fixed 18 stale stub signatures (#76).
Maintenance¶
- Split `python/translit/__init__.py` (2,683 lines) into `_api.py` + `_presets.py` (#73).
- Split `tests/integration_transliterate.rs` by script family (#75).
- Process: a required "Conversations resolved" merge gate (#55); a documented dependency-upgrade methodology with Dependabot cooldown + auto-merge (`DEPENDENCY_UPGRADES.md`, `RELEASING.md`).
[0.6.2] — 2026-06-07¶
A correctness, security, performance and maintenance release triaged from a
post-0.6.1 issue sweep (#101–#132). No public API removed; one small new public
behaviour (slugify(save_order=True) now functions). Two output-affecting
fixes — see Upgrade notes.
Upgrade notes (output-affecting)¶
- `slugify(save_order=True)` was an accepted no-op; it now strips only leading/trailing stopwords (preserving interior word order), matching python-slugify (#118). If you passed `save_order=True`, slug output changes.
- `decode_to_utf8` default `min_confidence` `0.5` → `0.95` (#103). The old default was inert (the detector only reports `0.50`/`0.95`, and `0.50 < 0.50` is false), so it never rejected. It now requires high confidence by default; pass `min_confidence=0.0` to accept any guess. (No practical change today: the detector currently always reports `0.95`.)
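The "inert default" argument is a small piece of arithmetic that can be written out. The sketch assumes, as the note states, that a guess is rejected only when its confidence is strictly below the threshold; `is_rejected` and `REPORTED_LEVELS` are hypothetical names for this example.

```python
def is_rejected(confidence: float, min_confidence: float) -> bool:
    """Reject a guess only when it falls strictly below the threshold."""
    return confidence < min_confidence


REPORTED_LEVELS = (0.50, 0.95)   # the two values the note says the detector reports

for threshold in (0.5, 0.95):
    rejected = [c for c in REPORTED_LEVELS if is_rejected(c, threshold)]
    print(f"min_confidence={threshold}: rejects {rejected}")
```

With a threshold of 0.5 neither reported level is rejected, so the old default could never fire; at 0.95 the 0.50 level is rejected.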
Fixed¶
- #102: `UniqueSlugify` no longer panics across the FFI boundary on a multibyte separator + small `max_length` (byte slice landed mid-codepoint; now uses `floor_char_boundary`).
- #101: context bigram disambiguation tier was unreachable (it reset on every inter-word space); it now resets only on hard boundaries, so the tier fires in normal prose.
- #104: `set_emoji_provider` now obeys `seal_registrations()` (the provider swap previously defeated the seal).
- #103: `decode_to_utf8` default confidence now actually gates (see notes).
- #107: a corrupt context dictionary now reports a distinct "corrupt" error instead of the misleading "not found" remedy (`DictState` enum).
- #121: `PRESETS["sanitize_user_input"]` now reflects the real pipeline order (strip invisibles before zalgo); Python registry and Rust doc aligned.
- #129: `Text.transliterate()` stub now declares the `tones`/`context` parameters the implementation accepts.
- #131: `Slugify(uids=...)` emits a correct wrong-class warning rather than a spurious deprecation warning.
- #122: disambiguated the `_compat` `should_warn` nested ternary.
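The mid-codepoint slice behind #102 can be reproduced on UTF-8 bytes in Python. The function below is a Python analogue of the boundary search named in that entry, written for illustration; `truncate_utf8` is a hypothetical helper, not a library function.

```python
def floor_char_boundary(data: bytes, index: int) -> int:
    """Largest position <= index that does not split a UTF-8 sequence."""
    if index >= len(data):
        return len(data)
    # Bytes of the form 0b10xxxxxx are continuation bytes: step back past them.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


def truncate_utf8(text: str, max_bytes: int) -> str:
    data = text.encode("utf-8")
    return data[:floor_char_boundary(data, max_bytes)].decode("utf-8")


# U+2014 is three bytes long, so a 2-byte budget cannot include any part of it.
print(truncate_utf8("a\u2014b", 2))
```

A naive `data[:2].decode("utf-8")` on the same input raises `UnicodeDecodeError`, which is the Python counterpart of the panic the fix removed.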
Security¶
- #105: added a `cargo audit` (RustSec advisory) CI job and a `cargo` Dependabot ecosystem.
- #132: added a Trivy CVE scan of the published image to the release workflow (SARIF → Security tab, fails on fixable HIGH/CRITICAL) + `.trivyignore`.
- #106: Rust diagnostics now route through Python `warnings` instead of bare `eprintln!`, so applications can capture/suppress them.
Performance (output-preserving)¶
- #108 codepoint-range diacritic checks in `tokenize()`; #109 `mem::take` per token boundary; #110 single `ch.nfkc()` pass on the NFKC fallback; #111 lowered `MAX_CAPACITY_HINT` 256 MiB → 8 MiB; #112/#113 emoji matching uses stack buffers + a fixed sliding window (no per-char `Vec`/`String`); #114 slugify uses `Cow` (no eager `to_owned`); #115 context `tokenize()` returns borrowed (`Cow`) slices of the input, with zero per-token allocation (Rust API: the crate-internal `context::Token.text` changed from `String` to `Cow<'_, str>`; no effect on the Python API); #116 clamped the `ContextDict` capacity hint.
Maintenance¶
- #118 implemented `slugify(save_order=True)`; #119 `SlugConfig::from_pyargs` dedupes the four slugify PyO3 entrypoints; #120 `_build_slug_kwargs` helper; #123 seal-enforcement docs on each `tables::` mutator; #124 infallibility comments; #125 typed `_CallableModule.__call__` kwargs; #126 corrected `recover_lock` doc; #127 documented the lazy-import workaround; #128 renamed `_mutation_generation` → `_registration_generation`; #130 annotated the defence-in-depth conflict check.
[0.6.1] — 2026-06-07¶
A bug-fix and test-hardening release. No public API was removed and no new public names were added. One fix changes key output for inputs containing invisible characters — see Upgrade notes.
Upgrade notes (output-affecting fix)¶
search_key/catalog_key/sort_keynow strip bidi overrides and soft-hyphen / format characters (#93). Previously a value stored with an invisible character (e.g."password","usertxt") produced a different key from its clean equivalent, so dedup and lookup silently missed. The new key is the correct one; if you persist these keys, regenerate any that were computed over text that could contain invisible characters.
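The key mismatch described above is easy to reproduce with the standard library. In this sketch `naive_key` and `stripped_key` are hypothetical stand-ins, and dropping every format character (category `Cf`) is an assumption that may be broader than what the library strips; the real key functions also do more than this.

```python
import unicodedata


def naive_key(text: str) -> str:
    """NFKC plus casefold only: invisible format characters survive."""
    return unicodedata.normalize("NFKC", text).casefold()


def stripped_key(text: str) -> str:
    """NFKC, then drop format characters (category Cf), then casefold."""
    normalized = unicodedata.normalize("NFKC", text)
    visible = "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")
    return visible.casefold()


stored = "pass\u00adword"        # contains an invisible SOFT HYPHEN (U+00AD)
print(naive_key(stored) == naive_key("password"))
print(stripped_key(stored) == stripped_key("password"))
```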
Fixed¶
- #93: key functions (`search_key`/`catalog_key`/`sort_key`) leaked bidi and soft-hyphen characters, so visually-identical inputs produced non-colliding keys. They now `strip_bidi` after NFKC, matching the other canonicalization presets.
- #82: Greek reverse transliteration (`transliterate(text, target="el")`) left literal Latin letters in the output (`"psychi"` → `"ψyχη"`). The forward direction romanizes Υ/υ as `Y`/`y` (including the ου/αυ/ευ diphthongs), so the `el` reverse table now maps `Y`/`y` back to Greek; round-trips no longer leak Latin letters.
- #69: `transliterate()` resolved conflicting kwargs differently for `str` vs `list` input (one path silently dropped `target`, the other `context`). Conflicts are now checked once, before the dispatch, so both raise identically: `context`+`target` and `context`+`tones` raise `ValueError`.
- #72: `translit.unidecode()` now mirrors the Unidecode 1.3 signature `unidecode(string, errors="ignore", replace_str="?")`, mapping Unidecode's `errors` modes (`ignore`/`replace`/`preserve`/`strict`) onto the native error handling, instead of raising `TypeError` on those kwargs.
- #95: Greek Extended polytonic capitals for omicron/upsilon/omega/rho were corrupted, emitting unrelated Latin letters (`Ὅμηρος` → `Xmiros`, `Ὑγίεια` → `Pgieia`). Corrected all 50 affected entries to the proper base romanization, consistent with the monotonic forms (`Ὅμηρος` → `Omiros`).
- #99.3: a typo'd `form=`/`errors=` value now raises even for pure-ASCII input. Previously the ASCII fast-path returned before reaching Rust, so the bad enum silently no-opped on ASCII and only raised on the first non-ASCII string. Validation now runs before the fast-path in `normalize()` and `transliterate()`.
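The ordering fix in #99.3 (validate first, then take the fast path) can be shown with a small stand-in built on `unicodedata`. `normalize_checked` and `VALID_FORMS` are hypothetical names for this sketch and do not reflect the library's internals.

```python
import unicodedata

VALID_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}


def normalize_checked(text: str, *, form: str = "NFC") -> str:
    # Validate first, so a typo raises regardless of the input's content.
    if form not in VALID_FORMS:
        raise ValueError(f"invalid form {form!r}; expected one of {sorted(VALID_FORMS)}")
    if text.isascii():
        return text          # fast path: ASCII is unchanged by all four forms
    return unicodedata.normalize(form, text)


try:
    normalize_checked("abc", form="NFKX")
except ValueError as exc:
    print("raised:", exc)
```

Swapping the two `if` statements reproduces the old behaviour: the typo would pass silently for `"abc"` and surface only on the first non-ASCII input.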
Performance¶
- #70: the batch entry points (`transliterate`, `slugify`, `normalize`, `strip_accents` on `list[str]`) now release the GIL around their pure-Rust compute loop via `py.allow_threads`. Multi-threaded callers processing large batches now get real parallelism (~1.8× wall-clock with two threads) instead of serialising on the interpreter lock. Output is unchanged. Documented in the new "Concurrency (GIL)" section of `docs/performance.md`.
Documentation¶
- #94: `strict_iso9` is no longer described as "ISO 9:1995". It emits ASCII digraphs (ж→zh, ч→ch, ш→sh), not the standard's diacritics (ž/č/š); translit tables are ASCII-only by design. Docstrings, the data-file header, and the docs now describe it as a scholarly ASCII (ISO 9-style) transliteration and warn it is not ISO 9-conformant. No behavior change.
- #98: `docs/user-guide/transliteration.md` no longer instructs users to `pip install translit-rs[arabic|hebrew|context]` (those empty extras were removed in 0.6.0); it now documents the `bootstrap_dicts.sh` / `TRANSLIT_DICT_DIR` path, matching the README and the runtime error message.
- #99.1 / #99.2: fixed two false docstrings: `sort_key` no longer claims to preserve accents (it folds them via transliteration, coinciding with `search_key`), and `slugify` no longer documents a `pretranslate` kwarg it never had.
- #84: corrected the README throughput table (Cyrillic ~106M chars/sec, slugify ~712K slugs/sec on commodity 4-vCPU hardware) and added a hardware/methodology footnote; added a matching variance note to `docs/performance.md`.
- #77: fixed the `Text` fluent-builder docstring example (`normalize` is keyword-only: `.normalize(form="NFC")`), reconciled the language-profile count (README now agrees with the docs at 83), and documented the `context` kwarg in the `transliterate()` docstring.
Internal / tests¶
- #78: added adversarial coverage for the raw-bytes decode path (`detect_encoding` / `decode_to_utf8`): deterministic hostile-byte cases in CI plus a Hypothesis `st.binary()` fuzz suite proving no-panic and invariant-preservation. Documented in `THREAT_MODEL.md` that the decode path has no input-size cap (caller's responsibility, per the 0.6.0 cap removal).
- #79: added a single-vs-batch kwarg parity regression test across the full kwarg matrix and a multi-script corpus (the `tones` batch drop fixed in 0.6.0 can no longer recur silently).
[0.6.0] — 2026-06-07¶
A hardening and bug-fix release. Two new opt-in helpers (dedup_batch,
make_cached_transliterator) make this a minor bump; no public API was
removed. Several fixes change output for specific inputs — read Upgrade
notes before upgrading if you cache or persist transliterator/normalizer output.
Upgrade notes (output-affecting fixes)¶
Each of these was a bug; the new output is the correct one. If you store or cache results that were keyed on the old (buggy) behaviour, regenerate them:
- `register_replacements()` now actually applies. It was a silent no-op: the registered table was never consulted. Registered replacements now take effect across `transliterate()` (scalar, list, and `context=True`). If you registered replacements and (knowingly or not) relied on them being ignored, output changes.
- `transliterate(list, tones=True)` now returns toned pinyin (was silently toneless on the list path); `transliterate(list, target=…, tones=True)` now raises `ValueError` for the forward-only parameter (was silently ignored).
- `normalize_confusables(text, target="cyrillic")` no longer maps characters onto invisible combining marks (28 such mappings removed).
- `strip_obfuscation` now folds intra-Latin ASCII homoglyphs (`þ→p`, `ſ→f`, `ı→i`, …) and is idempotent; `sanitize_user_input` is idempotent for control/invisible characters between combining marks; `demojize` no longer inserts a stray space after a tab/newline that precedes an emoji.
- Context-aware transliteration (`context=True`, ar/fa/he) distribution changed. The empty `arabic`/`hebrew`/`context` pip extras have been removed (they never installed anything). The ~37 MB dictionaries are no longer tracked in git, and are not shipped in the wheel. Context mode now loads dictionaries from `$TRANSLIT_DICT_DIR` (build them with `scripts/bootstrap_dicts.sh`); alternatively, use the `embed-dicts` Cargo feature for a self-contained build. A packaged pip-installable distribution is tracked in #56/#60.
- `decode_to_utf8` default `min_confidence` changed `0.0` → `0.5`. Low-confidence encoding guesses are now rejected by default instead of silently accepted; pass `min_confidence=0.0` to restore the old behaviour. (#66)
- Unknown `lang` codes now raise instead of silently falling back (#68). A typo'd code (`lang="RU"`, `lang="russian"`) used to behave exactly like `lang=None`, producing quietly-wrong output, while `errors=`/`form=` rejected bad values. `transliterate`, `slugify`, `sanitize_filename`, `catalog_key`, `search_key`, `sort_key`, and `ml_normalize` now raise `TranslitError` listing the valid codes. `"auto"`, the `nb`/`nn`/`da` aliases, and `register_lang()` codes are accepted. (`target=` already validated.)
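The "longest-match pre-pass (no cascade)" behaviour that `register_replacements()` is described as having can be sketched with a single regex pass. This is an illustration of the matching rule only, assuming non-empty keys; `replace_longest_match` is a hypothetical name and the library's own implementation is not shown here.

```python
import re


def replace_longest_match(text: str, table: dict[str, str]) -> str:
    """Single left-to-right pass: longest key wins, output is never rescanned."""
    if not table:
        return text
    # Longer keys first, so the regex alternation prefers the longest match.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], text)


# No cascade: the 'b' produced from 'a' is not replaced again by the b->c rule.
print(replace_longest_match("ab", {"a": "b", "b": "c"}))
```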
Changed¶
- No library-imposed input-size limit (#80, #65). The 10 MiB input cap on `transliterate`, `normalize`, `fold_case`, and the preset pipelines has been removed: it was paternalistic, inconsistently applied (the ASCII fast path bypassed it; `slugify`/`normalize_confusables`/`strip_zalgo` never had it), and the threat model already disclaims DoS. All operations are linear time and memory; bounding untrusted input is the caller's responsibility, documented in the threat model and docstrings. The single retained size guard is the `register_replacements` output amplification bound (a tiny input can expand to an enormous string via a caller-registered value, an amplification a caller's own input check cannot foresee). Backward-compatible: only previously-rejected large inputs now succeed.
- External wording: capability, not promise. Security-relevant features are now described as mechanisms (TR39 confusable mapping, bidi/zalgo stripping, hostname analysis) rather than outcome guarantees. Package descriptions, README, and docs no longer claim to "prevent"/"neutralize" attacks or achieve "perfect" recovery; the XMR benchmark figure is always stated with its tested-pairs scope. Engineering rigor is held to a high internal bar (see below); the external surface promises nothing it cannot measure.
Added¶
- dedup_batch(texts, …): transliterate a list, processing each distinct value once and mapping the results back (a large win for repeated/categorical data; ~146× on a high-locality column). Stateless, so there is no cache to invalidate; unique values are chunked at the 100k batch cap. (#31)
- make_cached_transliterator(maxsize=…, …): opt-in LRU-cached single-string transliterator with options fixed at construction. Self-invalidating: the next call after any register_lang / register_replacements / remove_replacement / clear_replacements clears the cache (via an internal table-generation counter), so it never serves stale results. Never enabled by default. (#31)
- THREAT_MODEL.md: defines in-scope mechanisms, explicit out-of-scope items (confusables outside the bundled TR39 table, whole-script and multi-character confusables, Unicode-version skew, semantic attacks, DoS), and a vulnerability-vs-known-limitation policy, grounded in the literature (Holgers 2006, Deng 2020, BitAbuse 2025).
- SECURITY.md rewritten on real footing: supported-version policy stated, triage scope defined, and linked to the threat model.
- Security-invariant property tests + fuzzing. proptest invariants in Rust (src/presets.rs) assert no-panic, idempotence, and "no bidi/format control survives" for strip_obfuscation / security_clean / sanitize_user_input / strip_bidi across the Unicode input space; a deterministic, CI-gating adversarial attack-corpus regression (tests/test_attack_corpus.py: homoglyph / zalgo / invisible / bidi / combined, XMR-style); and a cargo-fuzz harness (fuzz/) for continuous coverage-guided fuzzing of the defense pipelines.
- Confusable coverage for intra-Latin homoglyphs of basic ASCII letters
(e.g.
þ→p,ſ→f,ı→i,ƒ→f,Ɩ→l,ꜱ→s). The TR39 generator previously skipped all Latin-script sources for the Latin target, dropping ~83 genuine homoglyphs of A–Z/a–z;normalize_confusables/strip_obfuscationnow fold them. Single-letter Latin confusable coverage of UTS#39 is now complete. - Pinned
data/confusables.txt(UTS#39 17.0.0) as the reproducible, version- controlled input forscripts/gen_confusables.py(--downloadrefreshes it), and atests/test_confusable_coverage.pygate against Unicode-version drift.
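The dedup_batch entry above describes processing each distinct value once and mapping the results back. A minimal sketch of that idea in plain Python might look like the following; dedup_map and the upper-casing stand-in transform are hypothetical names chosen for illustration, not the library's implementation, and the 100k chunking is omitted.

```python
from typing import Callable, Dict, List


def dedup_map(texts: List[str], transform: Callable[[str], str]) -> List[str]:
    """Apply `transform` once per distinct value, then map results back in order."""
    results: Dict[str, str] = {}
    for text in texts:
        if text not in results:  # first time this value is seen
            results[text] = transform(text)
    return [results[text] for text in texts]


calls: List[str] = []


def counting_upper(text: str) -> str:
    """Stand-in for an expensive transliteration call; records each invocation."""
    calls.append(text)
    return text.upper()


column = ["berlin", "paris", "berlin", "berlin", "paris"]
out = dedup_map(column, counting_upper)
```

Because the result dictionary lives only for the duration of one call, nothing persists between calls, which is one way to read the "stateless" property mentioned in the entry.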
Fixed¶
- register_replacements() was a silent no-op: the global table was stored but never consulted by transliterate(). It now applies as a longest-match pre-pass (no cascade) across the scalar, list, and context=True forward paths, including ASCII-keyed replacements that previously bypassed Rust via the Python fast path. (#51)
- tones= on the list/batch path was dropped: transliterate(["北京"], tones=True) returned toneless pinyin while the scalar path returned toned, and transliterate([...], target=…, tones=True) silently ignored the forward-only parameter instead of raising. Both now match the scalar path. (#14, #15)
- normalize_confusables(target="cyrillic") emitted invisible combining marks: 28 mappings folded a visible character onto a combining Cyrillic-Extended mark (an obfuscation vector). The generator now excludes combining-mark targets. (#24)
- script_info("CanadianAboriginal")["context_aware"] raised KeyError: the entry omitted a required ScriptMeta field; a completeness guard now prevents recurrence. (#18)
- Context path skipped strict_iso9 / gost7034 mutual-exclusion validation: transliterate(text, context=True, strict_iso9=True, gost7034=True) now raises ValueError like the non-context path; the missing-dictionary error hint is now language-specific (he → hebrew). (#18)
- demojize inserted a stray space after a tab/newline preceding an emoji ("a\t😀" → "a\t grinning face"); it now checks for any whitespace. (#12)
- Compatibility digit variants fold to digits, not letters (#89). The
confusables table mapped Mathematical Alphanumeric digits
𝟎/𝟏(and the other four families, plus superscripts) to the look-alike lettersO/l, sonormalize_confusables("𝟏𝟎")gave"lO"andstrip_obfuscationcorrupted digit runs. The generator now folds any character whose NFKC form is an ASCII digit to that digit. They remain detected as confusable (is_confusable), but canonicalize to the correct number. (ASCII0/1were already unaffected.) - NFKC-compatible Latin is recovered instead of dropped to
[?](#81). Mathematical Alphanumeric Symbols (𝕳𝖊𝖑𝖑𝖔 𝟙𝟚𝟛→Hello 123), presentation ligatures (fi/fl→fi/fl), and superscripts (x²→x2) now transliterate: an unmapped non-ASCII char is NFKC-decomposed and re-tried before the error fallback. This matches unidecode/anyascii and closes a filter-evasion ("fancy text") gap. Purely additive — only chars that were previously[?]are affected; emoji (no ASCII decomposition) still map to[?]. - Defense pipelines are now idempotent (bugs found by the property tests):
strip_obfuscation: emoji whose CLDR name contains typographic punctuation (e.g.👒→woman’s hat, U+2019’) weren't folded because confusables ran before demojize; a second pass folded’→'. Confusables now runs after demojize.sanitize_user_input: an invisible or control character between combining marks (e.g. soft-hyphen, NUL) split a mark-run, so removing it after zalgo-capping merged runs that a second pass then capped differently. Bidi, zero-width, and control characters are now stripped before zalgo-capping.- Build-time and doc corrections:
build.rsnow rejects malformed\u{…}escapes in TSV data; embedded-dictionary parse errors are logged (not silently dropped); and numerous stale docstrings/comments were corrected (script_to_langreturns ISO 639-1 or 639-3;normalize()ASCII fast-path; list single-Rust-call caveats).
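The #81 and #89 entries above both rely on one idea: when a character has no direct mapping, try its NFKC form before falling back to the error marker. A rough sketch of that retry step using only the standard library follows; TABLE and to_ascii_sketch are hypothetical stand-ins, and the real table and fallback logic live in the Rust core.

```python
import unicodedata

# Hypothetical stand-in for a transliteration table (assumption for illustration).
TABLE = {"\u00e9": "e", "\u00df": "ss"}


def to_ascii_sketch(text: str, fallback: str = "[?]") -> str:
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)
        elif ch in TABLE:
            out.append(TABLE[ch])
        else:
            # Retry via the NFKC form before giving up on the character.
            folded = unicodedata.normalize("NFKC", ch)
            out.append(folded if folded.isascii() else fallback)
    return "".join(out)
```

With this shape, mathematical digits fold to ASCII digits rather than look-alike letters, and characters without an ASCII compatibility form (such as emoji) still reach the fallback.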
Security¶
- seal_registrations() / registrations_sealed() (#64, high). The register_lang / register_replacements APIs mutate process-global tables consulted by every transliterate / slugify / catalog_key / … call, so in a multi-tenant or web process one import or request handler could silently alter everyone's canonicalization. seal_registrations() is a one-way latch: after it is called, register/remove/clear raise TranslitError. The registration APIs are now documented as startup-only/single-writer. Separately, a poisoned lock no longer resets registrations to defaults (a panic in one thread could previously wipe another caller's registered languages); it now recovers the data as-is.
- is_safe_hostname now decodes IDN / xn-- labels (#63, high). Previously an xn-- ACE label was pure ASCII → single-script → reported safe, so the on-the-wire form of the IDN homograph attack (a Cyrillic xn--80ak6aa92e.com "apple" spoof) sailed through: the exact blind spot for a library marketing idn/anti-spoofing. ACE labels are now UTS#46-decoded (via the idna crate) before script/confusable analysis; a malformed ACE label is treated as unsafe. Non-xn-- labels are untouched (no false positives on, e.g., my_host.local).
- is_safe_hostname fails closed (#67.1). A confusable-check error no longer silently degrades to "not confusable" (unwrap_or(false)) → "safe"; it now marks the hostname unsafe.
- strip_bidi / display_clean now also strip deprecated format controls (U+206A–U+206F) and interlinear annotation marks (U+FFF9–U+FFFB) (#67.2), which were previously only handled as transliteration-table entries.
- NFKC×confusables composition pinned (#67.3). Added a regression test fixing
the exact set of NFKC-ASCII results that
normalize_confusablesre-maps (`→',"→'',|→l) so a data/ordering change — e.g. reintroducing digit→letter — fails loudly; and that presets resolve NFKC/TR39 conflicts (ſ→s) via NFKC. - Context dictionaries are no longer loaded from a CWD-relative path (#61).
load_dict_from_fspreviously probed./data/{name}_dict.binfirst, so a process whose working directory an attacker influences (or where they can drop./data/) could inject a substitute dictionary and silently change ar/fa/he output. Dictionaries now load only from$TRANSLIT_DICT_DIR(explicit opt-in) or the crate's own absolutedata/path in source builds. - Supply-chain: corpus inputs are verified/pinned (#62). The Tashkeela corpus
archive is now checksum-verified before it feeds the builders (fail-closed — an
unpinned checksum aborts unless
ALLOW_UNVERIFIED_CORPUS=1), and the Project Ben Yehuda corpus is fetched at a pinned commit instead of an unpinned live HEAD.
- ContextDict::from_bytes is fully bounds-checked. A malformed or truncated context dictionary previously caused an out-of-bounds panic (the crate is unsafe_code = forbid, so a panic aborts the process). Every read is now bounds-checked and section offsets are validated; capacity hints are clamped. Added truncation/bogus-offset/u32::MAX-count unit tests. (#18)
- register_replacements expansion is bounded. Replacement values are caller-controlled and unbounded; a small input with a large value could expand past the transliterate input cap. Output is now bounded during construction and rejected once it would exceed MAX_TRANSLITERATE_INPUT_BYTES. (#51)
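To see why the xn-- blind spot in #63 matters, it can help to decode the ACE label by hand. The sketch below uses Python's built-in punycode codec rather than full UTS#46 processing, so it is only an approximation of what the entry describes; decode_ace_label and looks_ascii_only_after_decoding are hypothetical helper names, and the "any non-ASCII is suspicious" rule is a deliberate simplification of real script/confusable analysis.

```python
from typing import Optional


def decode_ace_label(label: str) -> Optional[str]:
    """Return the Unicode form of a hostname label, or None if it is malformed."""
    if not label.lower().startswith("xn--"):
        return label  # not an ACE label: leave it untouched
    try:
        return label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        return None  # malformed ACE label: the caller should fail closed


def looks_ascii_only_after_decoding(hostname: str) -> bool:
    """Deliberately crude check: every label must decode and stay ASCII."""
    for label in hostname.split("."):
        decoded = decode_ace_label(label)
        if decoded is None or not decoded.isascii():
            return False
    return True
```

The ASCII-only wire form xn--80ak6aa92e decodes to five Cyrillic letters, which is exactly the information a check that only inspects the raw label never sees.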
Internal / tests¶
- 170 deterministic tests were excluded from CI. A module-level
pytestmark = pytest.mark.hypothesisintest_filename_regressions.pyandtest_case_folding.py(filename-security and case-folding regressions) deselected the entire files under CI's-m "not hypothesis"filter; only ~10 were actual property tests. The mark is now scoped to the property-test class in each file, so the deterministic tests run in CI. (#12) - New tests:
register_replacements(unit + Hypothesis property), context-dict parser robustness,resolve_auto_langfor all 18 scripts added in v0.3.0+, and aSCRIPT_METAfield-completeness guard. - CI/workflow hygiene: concurrency group on secret-scan,
uv.lockin the benchmark path filter, and CodeQL no longer triggered by Rust-only changes.
[0.5.0] — 2026-06-06¶
Added¶
- Context-aware transliteration for abjad scripts (Arabic, Persian, Hebrew).
transliterate(text, context=True)uses dictionary-based vowel restoration with bigram context disambiguation to produce readable romanized text instead of consonant skeletons. - Arabic: Tashkeela corpus (65.7M words), 182K unigrams + 200K bigrams. Covers 99%+ of newspaper vocabulary.
- Hebrew: Project Ben Yehuda corpus (11.4M words), 227K unigrams + 200K bigrams. Covers literary Hebrew.
- Persian: 266 curated common words + optional Wiktionary expansion (14.9K entries available via harvester script).
- list_context_langs(): returns language codes that support context=True (currently ["ar", "fa", "he"]).
- LangMeta.context field: "full", "partial", or "none"; enables web/WASM clients to show/hide a context toggle per language.
- ScriptMeta.context_aware field: bool; enables a toggle per detected script.
- Dictionary build tooling:
  - scripts/build_arabic_dict.py: corpus-based Arabic dictionary builder
  - scripts/build_hebrew_dict.py: corpus-based Hebrew dictionary builder
  - scripts/build_persian_dict.py: curated vocabulary Persian builder
  - scripts/harvest_wiktionary_persian.py: Wiktionary Persian harvester
  - scripts/bootstrap_dicts.sh: reproducible bootstrap from zero with pinned checksums. All parameters auditable, no manual steps.
- Abjad transliteration documentation (
docs/user-guide/abjad-transliteration.md) covering all three languages, standards used, comparison with other systems. - pip extras:
pip install translit-rs[arabic],[hebrew],[context]for optional context dictionary installation. - Rust context engine (
src/context.rs): binary dictionary reader, Arabic/Hebrew tokenizer, three-tier resolve (bigram → unigram → context-free fallback), lazy-loaded global singletons viaOnceLock. - 28 context-aware tests (8 Arabic, 14 Persian, 6 Hebrew).
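The three-tier resolve mentioned above (bigram → unigram → context-free fallback) can be pictured with two dictionaries and a fallback function. This is only a sketch under stated assumptions: the placeholder tokens, the romanizations, and the names BIGRAMS, UNIGRAMS, resolve, and romanize are invented for illustration and do not reflect the binary dictionary format or the Rust engine.

```python
from typing import Dict, List, Optional, Tuple

# Toy data; the tokens and romanizations are illustrative assumptions only.
BIGRAMS: Dict[Tuple[str, str], str] = {("w1", "ktb"): "kutub"}
UNIGRAMS: Dict[str, str] = {"ktb": "kataba"}


def context_free(word: str) -> str:
    """Last-resort tier: return the consonant skeleton unchanged."""
    return word


def resolve(prev: Optional[str], word: str) -> str:
    """Three-tier lookup: bigram, then unigram, then context-free fallback."""
    if prev is not None and (prev, word) in BIGRAMS:
        return BIGRAMS[(prev, word)]
    if word in UNIGRAMS:
        return UNIGRAMS[word]
    return context_free(word)


def romanize(words: List[str]) -> List[str]:
    out: List[str] = []
    prev: Optional[str] = None
    for word in words:
        out.append(resolve(prev, word))
        prev = word  # the previous *source* token is the bigram context
    return out
```

The same skeleton resolves differently depending on its left neighbour, which is the disambiguation the bigram tier exists to provide.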
Changed¶
- Repositioning (docs + metadata only — no API or coverage changes). The project now leads with its differentiated, proven core: Unicode adversarial-text defense and canonicalization (TR39 visual confusable mapping), with standards-based Latin/Cyrillic/Greek transliteration as the supporting pillar and CJK/Indic/other scripts framed as best-effort, unidecode-compatible coverage.
- Rewrote the package description, keywords, and classifiers (added
Topic :: Security) acrosspyproject.toml,Cargo.toml, andmkdocs.ymlto surface the security use case for discovery. - Restructured
README.md/docs/index.mdto lead with defense; introduced an explicit three-tier coverage model (core / compatibility / best-effort). - Added an Adversarial-Text Defense guide (
docs/security/adversarial-defense.md) documenting the phonetic-vs-visual distinction, the XMR metric, and benchmark evidence; elevated security to a top-level docs navigation section. - Reframed the Unidecode migration guide: the
unidecodealias is for romanization compatibility, not security (it cannot reverse homoglyph attacks).
Fixed¶
- Linux x86_64 wheels are now built as
cp39-abi3instead of a version-specificcp38-cp38wheel. Previously the only published x86_64 Linux wheel targeted CPython 3.8, sopipfell back to a source build (requiring a Rust toolchain) on Linux x86_64 for Python 3.9+. The publish workflow now pins the build interpreter and guards against the regression. (#26) - Documentation: corrected the built-in language-profile count (inconsistently
reported as 64 in one place; now consistently 83), and fixed several homoglyph code
examples whose expected output was wrong (e.g. leading-character ordering in
strip_obfuscationexamples). All README/doc examples are now verified against the built library.
Security¶
- Pinned all third-party GitHub Actions to commit SHAs across the CI and release
workflows (resolves the CodeQL
actions/unpinned-tagfindings) and added.github/dependabot.ymlto keep them current. This hardens the release pipeline, which uses PyPI trusted publishing (id-token: write). - Bumped dev/docs dependencies flagged by Dependabot: Pygments → 2.20.0 and pytest → 9.0.3 (the pytest bump applies on Python ≥ 3.10; Python 3.9 stays on pytest 8.4.2, since pytest 9 requires ≥ 3.10). Both are development-only — the package has no runtime dependencies.
Notes¶
- No public API, language registry, or script coverage was removed. All existing imports, language codes, and the pinned API surface are unchanged.
[0.4.0] — 2026-03-29¶
Added¶
- strip_obfuscation() preset pipeline: maximum-strength text deobfuscation using TR39 confusable mapping (visual similarity). Neutralizes homoglyph spoofing, zalgo abuse, invisible character injection, and bidi attacks. Does NOT transliterate; chain with transliterate() explicitly if romanization is also needed. Pipeline: NFKC → strip_zalgo(max_marks=0) → confusables → strip_bidi → strip_zero_width → demojize → strip_accents → fold_case → collapse_whitespace.
- lang_info() and script_info() APIs: return structured metadata (display name, script, region) for any language code or script. Backed by LANG_META (83 entries) and SCRIPT_META (55 entries) with import-time drift assertions.
- 18 new language codes: ban (Balinese), bax (Bamum), bug (Buginese), chr (Cherokee), cjm (Cham), cop (Coptic), khb (Tai Lue), lis (Lisu), mni (Meitei), nod (Northern Thai), nqo (N'Ko), sat (Santali), su (Sundanese), syr (Syriac), tdd (Tai Le), tl (Tagalog), tzm (Tamazight), vai (Vai). Total: 83 languages.
- 10 new Script enum members: Bamum, Buginese, Cham, Lisu, MeeteiMayek, OlChiki, Sundanese, Tagalog, TaiTham, Tifinagh. Total: 57 scripts.
- Transliteration provenance documentation (
docs/provenance.md): per-block audit of which formal romanization standard each Unicode block follows. - API surface stability tests (
tests/test_api_stability.py): 133 tests locking down function signatures, class methods, enum members, TypedDicts, protocol interfaces, and__all__exports. - Mutation testing survivor killers (
tests/test_mutant_killers.py): 92 tests targeting forward-only parameter validation, default parameter sensitivity, pipeline step tuples, and boundary checks. - Language consistency audit (
scripts/audit_language_consistency.py): checks 11 registration points for Rust/Python/docs/test alignment. Wired into pre-push gate. - 283 empty-string mappings for combining marks and zero-width characters in
translit_default.tsv— these are now silently stripped instead of producing[?]. docs/index.mdis now generated fromREADME.mdviascripts/generate_docs_index.sh— single source of truth, no more drift.
Fixed¶
strip_obfuscation()homoglyph resolution: used phonetic transliteration (Cyrillic р→r, с→s) instead of TR39 visual confusable mapping (р→p, с→c). Removed transliterate from the pipeline; confusables now handles homoglyphs.- Combining marks produce
[?]:transliterate("n\u0303")returned"n[?]"instead of"n". Added empty-string TSV mappings for all Combining Diacritical Marks (U+0300–U+036F), Extended (U+1AB0–U+1AFF), Supplement (U+1DC0–U+1DFF), Symbols (U+20D0–U+20F0), and Half Marks (U+FE20–U+FE2F). - Zero-width characters produce
[?]:transliterate("a\u200Bb")returned"a[?]b". Added empty-string mappings for ZWS, ZWNJ, ZWJ, word joiner, BOM, soft hyphen, bidi marks, and line/paragraph separators. TextPipelineconfusable ordering: confusables ran before transliterate, creating mixed-script gibberish on Cyrillic/Greek input. Swapped execution order so transliterate runs first (matchingcatalog_keypreset).demojize()adjacent emoji concatenation:demojize("🔥🔥")returned"firefire"instead of"fire fire". Added space padding between adjacent emoji-to-text replacements.- SCRIPT_RANGES sort order: MeeteiMayek Extensions was misplaced, breaking
binary search for Ethiopic Extended-A. Added
test_script_ranges_sortedinvariant. - Tibetan incorrectly documented as Wylie: actual mappings use Indic-phonetic romanization (ཅ→cha, not Wylie's ca).
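The combining-mark and zero-width fixes above hinge on the difference between "mapped to the empty string" and "not mapped at all". A small sketch of that distinction, assuming a hypothetical miniature table (TABLE and transliterate_sketch are illustrative names, not the library's data or API):

```python
# Hypothetical miniature table: "" means "known character, emit nothing".
TABLE = {
    "\u0303": "",   # COMBINING TILDE
    "\u200b": "",   # ZERO WIDTH SPACE
    "\u00e9": "e",  # LATIN SMALL LETTER E WITH ACUTE
}


def transliterate_sketch(text: str, fallback: str = "[?]") -> str:
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)
        else:
            # dict.get separates "mapped to empty" from "not mapped at all";
            # only the latter produces the fallback marker.
            out.append(TABLE.get(ch, fallback))
    return "".join(out)
```

A truthiness test such as `TABLE.get(ch) or fallback` would reintroduce the bug, since an empty-string mapping is falsy.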
Changed¶
- BREAKING:
transliterate_batch(),slugify_batch(),normalize_batch(), andstrip_accents_batch()removed. The base functions now accept bothstrandlist[str]via@typing.overload. Pass a list to get batch processing:transliterate(["café", "naïve"])→["cafe", "naive"]. - BREAKING:
strip_obfuscation()no longer transliterates. Uses TR39 confusables (visual mapping) instead.lang=parameter removed. Chain withtransliterate()explicitly if romanization is also needed. - CI restructured: lint/test on PRs only (not push-to-main), hypothesis tests excluded (~4s vs ~46s), CodeQL moved to workflow file with path filtering, benchmarks split to own workflow.
- Pinned
ruff==0.15.4in CI andpyproject.tomlto prevent format drift. - Python 3.9 remains a supported runtime (
requires-python = ">=3.9", abi3-py39) but was removed from the release CI matrix; CI runs on Python 3.10+ because tests use PEP 604 (X | Y) syntax withoutfrom __future__ import annotations.
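The first breaking change above replaces the *_batch functions with base functions that accept either str or list[str] via @typing.overload. A sketch of that pattern is shown here with a simple accent stripper built on unicodedata; strip_accents_sketch is a hypothetical stand-in and not the library's strip_accents.

```python
import unicodedata
from typing import List, Union, overload


def _strip_one(text: str) -> str:
    """Decompose, then drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


@overload
def strip_accents_sketch(text: str) -> str: ...
@overload
def strip_accents_sketch(text: List[str]) -> List[str]: ...


def strip_accents_sketch(text: Union[str, List[str]]) -> Union[str, List[str]]:
    # The overloads only inform type checkers; dispatch happens at runtime here.
    if isinstance(text, str):
        return _strip_one(text)
    return [_strip_one(item) for item in text]
```

Checking for str first matters because a str is itself iterable; treating it as a sequence would silently process it character by character.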
[0.3.0] — 2026-03-28¶
Added¶
- Unicode coverage expansion: 2,553 new codepoints across 33 Unicode blocks,
bringing total
translit_default.tsventries from 6,633 to 9,186.
  Tier 1: Forms and extensions (~1,741 codepoints):
  - Fullwidth ASCII (FF01–FF5E): 94 characters, mechanical offset mapping
  - Halfwidth Hangul (FFA0–FFDC): 66 characters via compatibility jamo
  - Enclosed/Circled Alphanumerics (2460–24FF): 160 characters (①→1, Ⓐ→A)
  - Superscript/Subscript (2070–209F): 29 characters mapped to base forms
  - Roman Numerals (2160–2188): 41 characters (Ⅰ→I, Ⅱ→II, ... Ⅻ→XII)
  - Modifier Letters (02B0–02FF): 80 characters (ʰ→h, ʷ→w)
  - IPA/Phonetic Extensions (0250–02AF): 96 characters (ɑ→a, ʃ→sh, ŋ→ng)
  - Greek Extended (1F00–1FFF): 233 characters (polytonic → base Greek → Latin)
  - Hangul Jamo (1100–11FF): 256 individual jamo components
  - Kangxi Radicals (2F00–2FD5): 214 radical forms → pinyin via CJK decomposition
  - CJK Compatibility Ideographs (F900–FAFF): 472 characters → pinyin via canonical decomposition targets

  Tier 2: Living scripts (~812 codepoints):
  - Gap-filling for 7 partially-covered scripts: Balinese, Canadian Syllabics, Cherokee, Coptic, N'Ko, Syriac, Vai
  - 10 new abugida scripts with virama/inherent-vowel handling: Sundanese, Tai Tham, Cham, Batak, Buginese, Tagalog, Hanunoo, Buhid, Tagbanwa, Meetei Mayek
  - 4 new alphabetic/syllabic scripts: Tifinagh, Lisu, Ol Chiki, Bamum
- Unicode range constants for 12 new scripts in
src/unicode_ranges.rs:SUNDANESE,TAI_THAM,CHAM,BATAK,BUGINESE,TAGALOG,HANUNOO,BUHID,TAGBANWA,MEETEI_MAYEK,MEETEI_MAYEK_EXT. - 10 new
*_char_role() functions in src/transliterate.rs for abugida virama handling (Sundanese, Tai Tham, Cham, Batak, Buginese, Tagalog, Hanunoo, Buhid, Tagbanwa, Meetei Mayek).
- scripts/generate_unicode_expansion.py: reproducible generator script for all Tier 1 and Tier 2 TSV entries (1,310 lines).
- cargo-clippy pre-commit hook mirroring CI -D warnings to catch lints before push.
- Callable module: import translit; translit("Москва", lang="auto") now works as a shorthand for translit.transliterate(...). Uses in-place __class__ mutation to preserve unittest.mock.patch compatibility.
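The callable-module entry above relies on swapping a module object's __class__ in place. The sketch below shows the general technique on an in-memory module so it can run on its own; the module name demo, the shout function, and _CallableModule are all hypothetical, and a real package would presumably perform the swap on sys.modules[__name__] inside its __init__.py.

```python
import types


class _CallableModule(types.ModuleType):
    """Module subclass whose instances can be called like a function."""

    def __call__(self, text: str) -> str:
        # Looked up at call time, so a patched `shout` is picked up too.
        return self.shout(text)


# Stand-in module built in memory (assumption for illustration).
demo = types.ModuleType("demo")
demo.shout = lambda text: text.upper()  # hypothetical module-level function

# In-place class swap: the module object keeps its identity and attributes.
demo.__class__ = _CallableModule
```

Because the module object itself is never replaced, references held elsewhere (including ones a mocking tool has captured) still point at the same object, which is a plausible reason for preferring this over substituting a wrapper in sys.modules.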
Fixed¶
- Finnish transliteration: removed incorrect alias
fi→sv. Finnish ä/ö are independent phonemes (→a/o via default table), not ae/oe variants as in Swedish/German.Hämäläinennow correctly producesHamalainen. - Icelandic transliteration: removed incorrect ð→dh and Ð→Dh overrides. Default table already maps ð→d (ICAO/passport standard). Retained Æ→Ae override (differs from default AE). Icelandic override count reduced from 6 to 2.
- clippy
manual_range_patternslint inbuginese_char_role: collapsed0x1A17 | 0x1A18 | 0x1A19..=0x1A1Bto0x1A17..=0x1A1B. errors="preserve"dropping visible characters: characters with explicit empty-string TSV mappings (e.g. U+060E Arabic Poetic Verse Sign, U+30FC Katakana Prolonged Sound Mark) are now preserved instead of silently dropped whenerrors="preserve"is set.
Changed¶
is_indic()andindic_char_role()expanded to cover all 11 new Brahmic/abugida script ranges.lookup_lang(): Finnish no longer dispatches to Swedish override table; falls through to default.- Icelandic language TSV (
translit_lang_is.tsv) reduced from 6 to 2 entries. ml_normalizepreset: switched transliteration fromPreservetoIgnoreerror mode — ML pipelines need clean ASCII output, not preserved non-ASCII.
[0.2.0] — 2026-03-27¶
Added¶
- Exhaustive testing framework — three layers of machine-verifiable assurance:
- Compile-time assertions (
build.rs): all transliteration table values asserted ASCII-only, entry count sanity checks (Hanzi ≥20k, BMP ≥5k, confusables ≥1k). Build fails if any assertion is violated. - Exhaustive domain tests (Rust): 16 tests covering all 11,172 Hangul syllables, full BMP (63,488 codepoints) for ASCII output and idempotence, all 20,992 CJK ideographs, all 51 compatibility jamo, and structural verification of 15 Indic script blocks. Zero sampling gaps.
- Stated invariant specifications (Python): 7 stated invariants (I1–I7) verified via exhaustive enumeration and Hypothesis — ASCII passthrough, ASCII output, idempotence, no exceptions, determinism, input size bound, output length bound.
- Two-tier test architecture: formal tests gated behind
#[ignore](Rust) and@pytest.mark.formal(Python) so they don't slow everyday development. Run before release withcargo test -- --ignoredandpytest -m formal. - CLAUDE.md: project-level development guide for automated agents — documents build commands, test tiers, and code conventions.
list_scripts()function for programmatic script discovery.docs/formal-verification.md: specification document for exhaustive testing methodology.- Comprehensive overhaul of
docs/architecture/testing-guarantees.mdwith exhaustive testing differentiator analysis and alternative library comparison.
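As an illustration of what an exhaustive domain test looks like, the sketch below walks all 11,172 precomposed Hangul syllables and checks the standard arithmetic decomposition against unicodedata. It mirrors the idea of the Rust tests described above rather than reproducing them; decompose_syllable and check_all_syllables are hypothetical names.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11,172 precomposed syllables


def decompose_syllable(cp: int) -> str:
    """Arithmetic Hangul decomposition into leading, vowel, and trailing jamo."""
    index = cp - S_BASE
    lead = L_BASE + index // (V_COUNT * T_COUNT)
    vowel = V_BASE + (index % (V_COUNT * T_COUNT)) // T_COUNT
    trail = T_BASE + index % T_COUNT
    jamo = chr(lead) + chr(vowel)
    # T_BASE itself means "no trailing consonant".
    return jamo + chr(trail) if trail != T_BASE else jamo


def check_all_syllables() -> int:
    """Compare every syllable against unicodedata; return how many were checked."""
    checked = 0
    for cp in range(S_BASE, S_BASE + S_COUNT):
        assert decompose_syllable(cp) == unicodedata.normalize("NFD", chr(cp))
        checked += 1
    return checked
```

Since the whole domain is enumerated, there is no sampling gap to reason about; the cost is runtime, which is why such tests are typically gated behind a slower tier.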
Changed¶
IndicRoleenum andindic_char_role()/ script-specific char_role functions changed from private topubfor integration test access (parent modules remain#[doc(hidden)]).tables::hangulmodule changed frommodtopub modfor integration test access.- Hangul const assertions added:
JUNGSEONG_COUNT,JONGSEONG_COUNT, total syllable count, and compatibility jamo range verified at compile time. - Total test count: 2,900+ (up from 1,678 in 0.1.5).
[0.1.5] — 2026-03-27¶
Added¶
- Reverse transliteration:
transliterate(text, target="ru")converts Latin → native script for Russian, Ukrainian, and Greek. PHF tables generated at build time from inverted language TSV data. - Toned pinyin:
transliterate("北京", tones=True)returns"běi jīng"with tone marks. Toned readings sourced from UnihankMandarinfield for all 20,924 CJK Unified Ideographs. - ISO 9:1995 scholarly Cyrillic:
transliterate(text, strict_iso9=True)for scholarly romanization. GOST R 7.0.34 variant viagost7034=True. - Japanese Kunrei-shiki (
lang="ja-kunrei"): alternative romanization profile, bringing total language count to 65. - Ancient scripts: Coptic, Gothic, Old Italic, Runic, Ogham transliteration tables.
- CLI short aliases:
t(transliterate),s(slugify),n(normalize),p(pipeline),d(demojize) — e.g.translit t "café". - CLI
--targetflag:translit t --target ru "Moskva"for reverse transliteration. - CLI
--tones,--strict-iso9,--gost7034flags for transliterate subcommand. - CLI
--langflag for slugify subcommand. console_scriptsentry point:translitcommand available afterpip install translit-rs.docs/cli.md: comprehensive CLI documentation with piping, exit codes, examples.- Links section in README.md and docs/index.md for RTD ↔ GitHub cross-references.
Changed¶
- transliterate() API unified: reverse_transliterate() merged into transliterate() via the target parameter. Old function removed.
- transliterate_impl Rust signature now takes 7 arguments (added tones: bool).
- Updated benchmark numbers after the tones parameter addition (15–46% regression in transliteration hot path due to additional branch; throughput now 450M chars/sec Latin, 130M chars/sec Cyrillic).
- Performance documentation updated across 4 files to reflect current benchmark results.
Fixed¶
- clippy
format_push_stringlint inbuild.rs— replacedpush_str(&format!())withwrite!(). - clippy
unreadable_literalin PHF-generatedreverse_translit_phf.rs— suppressed via inner attribute insrc/reverse.rs. - All 219 integration test call sites updated for 7-argument
transliterate_impl.
[0.1.4] — 2026-03-25¶
Added¶
lang="auto"script-based language detection: Whenlang="auto"is passed totransliterate(),slugify(),TextPipeline,Slugifier, or any other call site, the library detects the dominant non-Latin script in the input and maps it to a default language code automatically. Maps 28 scripts to language codes (e.g. Cyrillic→ru, Han→zh, Hiragana/Katakana→ja, Thai→th). Zero overhead forlang=Noneor explicit lang codes.LANG_AUTOconstant ("auto") intranslit._enums.- Georgian transliteration (
lang="ka"): 114 TSV entries covering Mkhedruli, Mtavruli, and supplement ranges. BGN/PCGN national romanization. - Armenian transliteration (
lang="hy"): 86 TSV entries covering uppercase, lowercase, and 5 ligatures (U+FB13–FB17). BGN/PCGN romanization. - Sinhala transliteration (
lang="si"): 90 TSV entries. Extended Indic Brahmic engine range from0x0900..=0x0D7Fto0x0900..=0x0DFFwith dedicatedsinhala_char_role()function for Sinhala-specific offsets. - Thai transliteration (
lang="th"): 87 TSV entries using RTGS romanization. NewScriptClass::Taiwith tone-mark stripping and cancellation handling. - Lao transliteration (
lang="lo"): 67 TSV entries using BGN/PCGN romanization. Shares Tai engine with Thai via offset masking. - Ethiopic transliteration (
lang="am"): 307 TSV entries for Ge'ez alphasyllabary (34 consonant bases × 7 vowel orders + labialized forms + digits). Pure data addition — no engine changes needed. - Myanmar transliteration (
lang="my"): 89 TSV entries. Newmyanmar_char_role()for Brahmic engine with virama (U+1039) and asat (U+103A) support. Medials (U+103B–103E) classified as dependent vowels. - Khmer transliteration (
lang="km"): 110 TSV entries. Newkhmer_char_role()for Brahmic engine with coeng (U+17D2) as virama. All consonants normalized to inherent 'a' regardless of series. - Tibetan transliteration (
lang="bo"): 147 TSV entries. Newtibetan_char_role()for Brahmic engine with halanta (U+0F84) and subjoined consonants (U+0F90–0FBC). - Unicode range constants:
TIBETAN(0x0F00–0x0FFF),MYANMAR(0x1000–0x109F),KHMER(0x1780–0x17FF) insrc/unicode_ranges.rs. - Comprehensive test coverage: example-based tests for all 9 new scripts, property-based tests (hypothesis + proptest), multi-script mixture tests.
- Built-in language count: 51 → 60.
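The lang="auto" entry above describes detecting the dominant non-Latin script and mapping it to a default language code. A minimal sketch of that approach follows; the range table is a small hand-picked subset, the script-to-language pairs are the ones named in the entry, and SCRIPT_RANGES, DEFAULT_LANG, and detect_lang are hypothetical names rather than the library's internals.

```python
from collections import Counter
from typing import Optional

# Small illustrative subset of script ranges (assumed, not the library's table).
SCRIPT_RANGES = [
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
]
DEFAULT_LANG = {"Cyrillic": "ru", "Thai": "th", "Hiragana": "ja", "Katakana": "ja", "Han": "zh"}


def detect_lang(text: str) -> Optional[str]:
    """Return the default language for the most frequent known script, if any."""
    counts: Counter = Counter()
    for ch in text:
        cp = ord(ch)
        for lo, hi, script in SCRIPT_RANGES:
            if lo <= cp <= hi:
                counts[script] += 1
                break
    if not counts:
        return None  # Latin-only or unknown input: nothing to detect
    dominant, _ = counts.most_common(1)[0]
    return DEFAULT_LANG[dominant]
```

Latin letters, digits, and punctuation fall outside every range and are simply not counted, so mixed input such as a Cyrillic word followed by a year still resolves to the Cyrillic default.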
Changed¶
is_indic()extended to include Tibetan, Myanmar, and Khmer ranges for Brahmic abugida processing.indic_char_role()dispatches to script-specific functions for Sinhala, Tibetan, Myanmar, and Khmer codepoint ranges.
[0.1.3] — 2026-03-25¶
Added¶
- strip_control and strip_zero_width now work as independent pipeline steps without requiring collapse_whitespace=True. Previously they were silently ignored when collapse_whitespace was disabled.
- strip_control_chars() and strip_zero_width_chars() standalone Rust functions for filtering without whitespace collapsing.
- decimal and hexadecimal flags in SlugConfig are now functional. Setting decimal=False preserves &#NNN; entities; hexadecimal=False preserves &#xHHH; entities. Previously these flags were accepted but silently ignored.
- Rust integration tests: tests/integration_emoji.rs (10 tests), tests/integration_slugify.rs (20 tests), tests/integration_transliterate.rs (21 tests), tests/integration_whitespace.rs (12 tests).
Changed¶
- TextPipeline parameters strip_control and strip_zero_width changed from bool (default True) to bool | None (default None). When None, they inherit from collapse_whitespace: True if collapse_whitespace=True, False otherwise. Set explicitly to True for standalone use without collapse_whitespace. This is backward compatible: existing code that passes collapse_whitespace=True gets the same behavior as before.
- steps() now reports strip_control and strip_zero_width as separate entries when active, giving full visibility into pipeline behavior.
- Pipeline step order updated: normalize → confusables → demojize → strip_accents → transliterate → fold_case → strip_control → strip_zero_width → collapse_whitespace.
- Migrated from once_cell to std::sync::LazyLock / OnceLock; MSRV bumped to 1.80. Removed once_cell dependency.
- needs_cjk_space() match arm tightened from wildcard _ to explicit Ideograph | Hangul | Kana to match the call-site is_cjk guard.
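The bool | None change above uses None as an "inherit" sentinel. A sketch of how such tri-state parameters might resolve is shown below; active_steps is a hypothetical helper that only models the three trailing steps of the documented order and is not the TextPipeline implementation.

```python
from typing import List, Optional


def active_steps(
    collapse_whitespace: bool = False,
    strip_control: Optional[bool] = None,
    strip_zero_width: Optional[bool] = None,
) -> List[str]:
    # None means "inherit from collapse_whitespace"; an explicit bool wins.
    if strip_control is None:
        strip_control = collapse_whitespace
    if strip_zero_width is None:
        strip_zero_width = collapse_whitespace
    steps: List[str] = []
    if strip_control:
        steps.append("strip_control")
    if strip_zero_width:
        steps.append("strip_zero_width")
    if collapse_whitespace:
        steps.append("collapse_whitespace")
    return steps
```

Callers that only ever passed collapse_whitespace=True see the same three steps as before, while an explicit strip_control=True now works on its own.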
Fixed¶
- decode_entities() corrupting multi-byte UTF-8 characters (BUG-1). The function used bytes[i] as char, which treated each byte of a multi-byte sequence as a separate Latin-1 codepoint (e.g. café→cafÃ©). Now advances by full UTF-8 characters.
- decode_numeric_entity_skip() panicking on malformed &# followed by multi-byte UTF-8 (BUG-2). The skip function walked through continuation bytes looking for ;, landing inside a multi-byte character. Now stops at the first non-ASCII byte.
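The BUG-1 failure mode is easy to reproduce outside Rust: reading UTF-8 one byte at a time and treating each byte as its own code point is equivalent to decoding the data as Latin-1. The sketch below contrasts the two approaches; decode_bytewise and decode_by_character are hypothetical names used only for this illustration.

```python
def decode_bytewise(data: bytes) -> str:
    """Buggy pattern: treat every byte as its own code point (a Latin-1 view)."""
    return "".join(chr(b) for b in data)


def decode_by_character(data: bytes) -> str:
    """Fixed pattern: decode whole UTF-8 sequences."""
    return data.decode("utf-8")


raw = "caf\u00e9".encode("utf-8")  # b"caf\xc3\xa9": the last letter takes two bytes
```

Pure-ASCII input is identical under both functions, which is why this class of bug tends to survive tests that only use ASCII fixtures.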
Performance¶
- ASCII fast-path in
demojize_implanddemojize_rust: pure-ASCII text returns immediately withoutVec<char>allocation or emoji scanning. filter_stopwordsreplaced intermediateVec<_>+.join()with a pre-allocatedStringfold, removing one allocation per slugify call.
[0.1.2] — 2026-03-25¶
Added¶
- Python 3.14 support (classifier and CI test matrix).
ruff check --fixpre-commit hook for automatic lint fixing.- CI publish workflow using
pypa/gh-action-pypi-publishwith OIDC trusted publishers. - Multi-platform wheel builds: Linux (x86_64, aarch64), macOS (Intel, ARM64), Windows.
steps()method on_TextPipelinetype stub.
Changed¶
- Resolved all clippy pedantic warnings instead of suppressing them — reduced
lint suppressions from 48 to 22 (remaining are genuine PyO3 constraints).
Fixes include: combined identical match arms, replaced manual counters with
.enumerate(), moved item declarations before statements, usedclone_into(), merged identical branches, fixed doc comment formatting. - Widened
stopwordsandreplacementstype stubs from stricttuple/listtoSequencefor better mypy compatibility. - Applied
ruff formatto all Python source and test files. - Switched docs publish from deprecated
maturin uploadtopypa/gh-action-pypi-publish. - macOS Intel wheels now cross-compiled on ARM64 runner (macos-14) instead of deprecated macos-13.
- CI doctests now run against installed package (not source tree) with explicit
shell: bashfor Windows compatibility.
Fixed¶
TextPipeline.explain()doctest: output format isnormalize (NFC)notnormalize (form=NFC).from __future__ import annotationsplacement in test files (must follow module docstring, not precede it).- Malformed HTML entity test expectation:
decode_entities("&#xyz;")correctly returns"", not"yz;". - Rust benchmark CI: target
bench_corebinary explicitly to avoid passing Criterion flags to the test harness. - Ruff lint fixes: unsorted imports in
test_encoding.py, unused importis_mixed_scriptintest_security_invariants.py. - Read the Docs trigger workflow: simplified curl status handling, graceful
warning when
RTD_TOKENis missing. - Removed incorrect PyPy classifier (abi3 is CPython-only).
[0.1.1] — 2026-03-25¶
Added¶
- src/unicode_ranges.rs: named constants for all Unicode codepoint ranges used by the library, eliminating magic numbers scattered across modules.
- tests/test_concurrency.py: concurrent access tests for LANG_TABLES and HANGUL_CACHE, plus malformed Unicode input tests.
- Code coverage reporting in CI (pytest-cov, XML report uploaded as artifact).
- CLOCK$, KEYBD$, SCREEN$, COM0, LPT0 added to Windows reserved filename list.
- casefold() alias for fold_case(), matching str.casefold() naming.
- remove_accents() alias for strip_accents(), matching sklearn/ML ecosystem naming.
- Compatibility parameter aliases: replacement_text/max_len on sanitize_filename() (pathvalidate), greedy/preferred_aliases on is_confusable() (confusable_homoglyphs), delimiters on demojize() (emoji library).
- Complete API documentation for 19 previously undocumented exported functions: precompiled pipelines, grapheme clusters, encoding detection, Text builder, is_safe_hostname, demojize, strip_bidi, EmojiProvider protocol.
- Three new API reference pages: Precompiled Pipelines, Grapheme Clusters, Encoding.
- "Guides by role" section in docs/index.md and README.md.
- Performance section in README.md with benchmark numbers.
- Script enum documentation expanded from 28 to all 41 members.
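
For readers unfamiliar with Windows reserved device names, here is a minimal sketch of the kind of check such a list supports. It is an assumption-laden illustration, not the library's implementation: the helper is_reserved_windows_name and the exact matching rules (case-insensitive, extension ignored) are hypothetical, and the base set is only the commonly cited DOS device names.

```python
# Commonly cited DOS device names (assumed base set for this sketch).
_CLASSIC = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)
# The five names this release adds, taken from the entry above.
_ADDED = {"CLOCK$", "KEYBD$", "SCREEN$", "COM0", "LPT0"}
RESERVED = _CLASSIC | _ADDED


def is_reserved_windows_name(filename: str) -> bool:
    """Hypothetical check: is the part before the first dot a reserved name?"""
    stem = filename.split(".", 1)[0].rstrip(" ").upper()
    return stem in RESERVED


flagged = [name for name in ("clock$.txt", "COM0", "report.txt", "con")
           if is_reserved_windows_name(name)]
```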
Changed¶
- transliterate_impl refactored: capacity estimation extracted to estimate_capacity(), character classification to classify_char(), and CJK spacing logic to needs_cjk_space().
- All RwLock accesses now recover from lock poisoning using .unwrap_or_else(|e| e.into_inner()) instead of silently falling through.
- Lambda closures in _compat.py replaced with named inner functions for clarity.
- emoji.rs write!() call no longer uses .unwrap() (infallible, documented with a // SAFETY comment).
- MkDocs theme switched from material to readthedocs.
- All documentation references updated from "unirust" to "translit".
- Development status promoted from Alpha to Beta.
- Package renamed from translit to translit-rs on PyPI (interim until PEP 541 grants the translit name). Python import remains import translit.
Fixed¶
- Type stub _text.pyi imported from wrong module name (unirust → translit).
- Type stub _translit.pyi missing min_confidence parameter on _decode_to_utf8.
- Type stub _text.pyi missing grapheme_split, grapheme_truncate, catalog_key methods.
- security_clean() pipeline step order corrected in 5+ locations: strip_bidi runs before collapse_whitespace (matching Rust implementation).
- catalog_key() step order corrected: transliterate before strip_accents.
- Stale PyO3 boundary overhead corrected from ~4µs to ~240ns in docs and code comments.
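
To see why the order of the bidi and whitespace steps can matter, consider the sketch below. The two functions are simplified stand-ins written for this example (note the _sketch suffix); they are not the library's strip_bidi or collapse_whitespace, and the set of bidi controls is an assumed, commonly used subset. A bidi control sitting between two spaces hides the fact that the spaces are adjacent, so collapsing whitespace first leaves a double space behind.

```python
import re

# Assumed subset of bidi control characters for this illustration.
_BIDI = re.compile("[\u061c\u200e\u200f\u202a-\u202e\u2066-\u2069]")
_WHITESPACE = re.compile(r"\s+")


def strip_bidi_sketch(text: str) -> str:
    """Stand-in: delete bidi control characters."""
    return _BIDI.sub("", text)


def collapse_whitespace_sketch(text: str) -> str:
    """Stand-in: collapse whitespace runs to one space and trim the ends."""
    return _WHITESPACE.sub(" ", text).strip()


sample = "a \u202e b"  # space, RIGHT-TO-LEFT OVERRIDE, space

# Documented order: bidi controls are removed before whitespace is collapsed.
bidi_first = collapse_whitespace_sketch(strip_bidi_sketch(sample))
# Reversed order: the two spaces only become adjacent after collapsing ran.
whitespace_first = strip_bidi_sketch(collapse_whitespace_sketch(sample))
```

In this sketch, bidi_first is "a b", while whitespace_first keeps two spaces between the letters and would change again if the pipeline were run a second time.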
Deprecated¶
- translit._compat awesome-slugify compatibility layer (Slugify, UniqueSlugify, slugify_* instances): planned removal in v1.0.
[0.1.0] — 2026-01-01¶
Added¶
- Initial release.
- Unicode transliteration for 60 language profiles.
- Slugification, normalization, confusable detection, filename sanitization.
- Emoji demojization with ZWJ sequence support.
- Backward-compatible layers for Unidecode and awesome-slugify.