Ruby API reference

The Ruby surface is the singleton methods on the Disarm module plus the Disarm::Error hierarchy. Every method is a thin, idiomatic wrapper over the pure-Rust disarm core (no Python): keyword arguments carry the core's defaults, and scheme/target tokens are symbols (:latin, :default, …) or strings. The binding is a deliberate subset of the Rust disarm::api surface — see bindings/ruby/lib/disarm.rb for the authoritative definitions; if a topic isn't here, that binding doesn't expose it yet.

For install and a five-line tour, start with disarm for Ruby. The shared, language-neutral explanation of what each operation does lives under Concepts and Guide in the sidebar — this page is the Ruby call surface, not the rationale.

require "disarm"

Every example below is executed against the built gem in CI (the Ruby doc gate), so the documented outputs cannot rot.

Transliteration

Disarm.transliterate(text, scheme: :default, lang: nil)

Romanize Unicode text to ASCII. scheme: selects the standard — :default (general-purpose), :strict_iso9 (ISO 9:1995-style ASCII), or :gost7034 (GOST R 7.0.34). lang: applies a language profile on top of the scheme (sparse overrides — e.g. "de" maps üue, "uk" sharpens Ukrainian); nil means no profile. Both accept a String or Symbol.

This is phonetic romanization for legibility, not a security control — reach for normalize_confusables to defend against homoglyphs.

Disarm.transliterate("café")                       # => "cafe"
Disarm.transliterate("Москва")                     # => "Moskva"
Disarm.transliterate("Київ", lang: "uk")           # => "Kyiv"
Disarm.transliterate("Юрий", scheme: :strict_iso9) # => "Jurij"
Disarm.transliterate("Москва", scheme: :gost7034)  # => "Moskva"

Confusable folding

Disarm.normalize_confusables(text, target: :latin)

Fold cross-script confusables toward target: (:latin or :cyrillic) using the TR39 visual mapping. This is the homoglyph defence — it canonicalizes look-alikes (Cyrillic а → Latin a) rather than romanizing.

Disarm.normalize_confusables("раypal")             # => "paypal"

Disarm.confusable?(text, target: :latin)

Whether text contains any character confusable with target: (:latin or :cyrillic). A true is a positive finding; a false asserts only that none of the bundled confusables were found, not that the text is safe.

Disarm.confusable?("pаypal")                       # => true
Disarm.confusable?("paypal")                       # => false

Slugs

Disarm.slugify(text, separator: "-", lowercase: true, max_length: 0, …)

Generate a URL-safe slug. Mirrors the core's SlugConfig defaults; every option past text is keyword-only (separator:, lowercase:, max_length:, word_boundary:, save_order:, stopwords:, allow_unicode:, lang:, entities:, decimal:, hexadecimal:, safe_chars:).

Disarm.slugify("Héllo Wörld")                      # => "hello-world"
Disarm.slugify("café au lait")                     # => "cafe-au-lait"

Canonicalization primitives

Disarm.strip_accents(text)

Strip diacritics.

Disarm.strip_accents("café")                       # => "cafe"

Disarm.fold_case(text)

Full Unicode case fold — more aggressive than String#downcase (e.g. German ß folds to ss).

Disarm.fold_case("HELLO")                          # => "hello"
Disarm.fold_case("Straße")                         # => "strasse"

Disarm.demojize(text, strip_modifiers: false)

Replace emoji with their plain names. strip_modifiers: drops skin-tone / variation modifiers before naming.

Disarm.demojize("👍")                               # => "thumbs up"
Disarm.demojize("Café ☕")                          # => "Café hot beverage"

Deobfuscation & security presets

Disarm.strip_obfuscation(text)

Remove obfuscation — zero-width characters, bidi controls, combining-mark abuse, and TR39 homoglyphs — while keeping legible content. It does not transliterate; chain transliterate if you also need ASCII.

Disarm.strip_obfuscation("рroduсt")                # => "product"

Disarm.canonicalize(text)

Aggressive security cleaning: NFKC, confusable folding, bidi stripping, and whitespace collapse in one preset.

Disarm.canonicalize("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥")                 # => "Real text"

Collation & lookup keys

Stable keys for searching, sorting, and deduplicating text across cases, accents, and scripts. All three accept a lang: profile (a String or Symbol; nil means none) and raise Disarm::InvalidArgument on an unknown lang.

Disarm.search_key(text, lang: nil)

Case/accent/script-insensitive search-index key — fold to a single canonical form so "Köln" and "koln" collide in a lookup table.

Disarm.search_key("Köln")                          # => "koln"
Disarm.search_key("Café")                          # => "cafe"

Disarm.sort_key(text, lang: nil)

A collation/sort key that preserves base accented characters — unlike search_key it keeps the accent (so accented and unaccented forms stay distinct), while still folding non-Latin scripts to Latin.

Disarm.sort_key("café")                            # => "café"
Disarm.sort_key("Éclair")                          # => "éclair"

Disarm.catalog_key(text, lang: nil, strict_iso9: false)

Library-catalog deduplication key — search_key plus confusable folding. strict_iso9: selects the ISO 9:1995 Cyrillic scheme for transliteration.

Disarm.catalog_key("Толстой")                      # => "tolstoy"
Disarm.catalog_key("Толстой", strict_iso9: true)   # => "tolstoj"

Pipelines

Disarm.get_pipeline(profile)

Build a reusable Disarm::Pipeline for a named policy profile (e.g. "search_index"). The profile's steps are validated and assembled once at construction, so the returned handle's #process(text) can be called many times without re-resolving the profile. Raises Disarm::InvalidArgument on an unknown profile name.

pipe = Disarm.get_pipeline("search_index")
pipe.process("Café")                               # => "cafe"
pipe.process("Köln")                               # => "koln"

Hostname / IDN analysis

Disarm.suspicious_hostname?(host)

Whether the hostname looks like a mixed-script / confusable IDN spoof. As with confusable?, a false asserts nothing was found — it is not a safety guarantee. See the Threat Model for what is and isn't in scope.

Disarm.suspicious_hostname?("pаypal.com")          # => true
Disarm.suspicious_hostname?("example.com")         # => false

Normalization

Disarm.normalize(text, form: :nfc)

Apply a Unicode normalization form — :nfc (default), :nfd, :nfkc, or :nfkd (a Symbol or String, case-insensitive).

Disarm.normalize("fi", form: :nfkc)                 # => "fi"
Disarm.normalize("2²", form: :nfkc)                # => "22"

Disarm.normalized?(text, form: :nfc)

Whether text is already in normalization form: (default :nfc).

Disarm.normalized?("café", form: :nfc)             # => true
Disarm.normalized?("fi", form: :nfkc)               # => false

Text cleaning

Disarm.collapse_whitespace(text)

Fold every run of Unicode whitespace to a single ASCII space, and trim leading/trailing whitespace. Since #433 this folds whitespace only — the line controls (CR/VT/FF/NEL/U+001CU+001F) and the blank-rendering set (Braille blank, Hangul fillers) fold to a space rather than being deleted, so "a\rb""a b". It does not delete control or zero-width characters; use Disarm.strip_control_chars / Disarm.strip_zero_width_chars for that.

Disarm.collapse_whitespace("  a   b ")             # => "a b"

Disarm.strip_control_chars(text) · Disarm.strip_zero_width_chars(text) · Disarm.strip_bidi(text)

Remove, respectively, C0/C1 control characters (except tab/newline), zero-width characters (ZWSP/ZWNJ/ZWJ/word-joiner), and Unicode bidirectional controls — the invisible characters used to obfuscate or spoof text.

Disarm.strip_control_chars("a\u0007b")              # => "ab"
Disarm.strip_zero_width_chars("a\u200Bb")           # => "ab"
Disarm.strip_bidi("a\u202Eb")                       # => "ab"

Disarm.strip_tags(text) · Disarm.strip_variation_selectors(text) · Disarm.strip_noncharacters(text) · Disarm.strip_pua(text)

Strip the invisible / non-interchange code-point classes weaponized for "ASCII smuggling" into LLMs and adjacent hygiene (#413): the Unicode Tags block (preserving valid emoji flag sequences), every variation selector, every noncharacter, and the Private Use Area. These are the composable primitives behind the security presets, which strip them automatically.

Disarm.strip_tags("a\u{E0001}b")                    # => "ab"
Disarm.strip_variation_selectors("g\u{FE01}data")   # => "gdata"
Disarm.strip_noncharacters("a\u{FFFE}b")            # => "ab"
Disarm.strip_pua("a\u{E000}b")                      # => "ab"

Disarm.strip_zalgo(text, max_marks: 2) · Disarm.zalgo?(text, threshold: 3)

zalgo? flags "zalgo" — combining marks stacked past threshold: on a base character; strip_zalgo caps each base character at max_marks: combining marks.

Disarm.zalgo?("Z\u0301\u0301\u0301\u0301")                       # => true
Disarm.zalgo?(Disarm.strip_zalgo("Z\u0301\u0301\u0301\u0301"))  # => false

Grapheme clusters

Operate on user-perceived characters (grapheme clusters) rather than code points — an emoji, a flag, or a base-plus-combining-mark counts as one.

Disarm.grapheme_len(text)

Number of grapheme clusters (contrast String#length, which counts code points).

Disarm.grapheme_len("a👍b")                        # => 3
Disarm.grapheme_len("🇬🇧")                          # => 1

Disarm.grapheme_split(text)

Split into an array of grapheme-cluster strings.

Disarm.grapheme_split("a👍")                       # => ["a", "👍"]

Disarm.grapheme_truncate(text, max_graphemes)

Truncate to at most max_graphemes clusters, never cutting through one.

Disarm.grapheme_truncate("héllo", 3)               # => "hél"
Disarm.grapheme_truncate("a👍b👎", 2)               # => "a👍"

Disarm.grapheme_width(cluster, ambiguous_wide: false) · Disarm.terminal_width(text, ambiguous_wide: false)

Display width in terminal columns by East Asian Width — grapheme_width for a single cluster, terminal_width for a whole string. Pass ambiguous_wide: true to count ambiguous-width characters as two columns.

Disarm.grapheme_width("👍")                         # => 2
Disarm.terminal_width("a👍")                        # => 3

Filenames

Disarm.sanitize_filename(text, separator: "_", max_length: 255, platform: :universal, lang: nil, preserve_extension: true)

Turn arbitrary text into a filesystem-safe filename. platform: is :universal (default), :windows, or :posix; preserve_extension: keeps the final extension when truncating to max_length:. Raises Disarm::InvalidArgument on an unknown platform.

Disarm.sanitize_filename("My: report*.txt")         # => "My_report.txt"
Disarm.sanitize_filename("CON", platform: :windows) # => "_CON"

Reverse transliteration & untranslatable scan

Disarm.reverse_transliterate(text, lang:)

Reverse-transliterate Latin back to a native script. lang: is :el (Greek), :ru (Russian), or :uk (Ukrainian).

Disarm.reverse_transliterate("Moskva", lang: :ru)  # => "Москва"
Disarm.reverse_transliterate("Athina", lang: :el)  # => "Αθηνα"

Disarm.find_untranslatable(text, scheme: :default, lang: nil)

Every character with no romanization — the ones transliterate would replace — as { char:, offset: } hashes (byte offset), in order. scheme:/lang: mirror transliterate.

Disarm.find_untranslatable("a🜊")                   # => [{ char: "🜊", offset: 1 }]
Disarm.find_untranslatable("café")                 # => []

Script analysis

Disarm.detect_scripts(text) · Disarm.mixed_script?(text) · Disarm.bidi_conflict?(text)

The Unicode scripts present (first-appearance order, Common/Inherited excluded), whether more than one script is present, and whether the text mixes strong left-to-right and strong right-to-left characters — the "BiDi Swap" display-reorder precondition (fires on real letters, no U+202x override).

Disarm.detect_scripts("aМ")                        # => ["Latin", "Cyrillic"]
Disarm.mixed_script?("aМ")                         # => true
Disarm.bidi_conflict?("helloא")                    # => true  (Latin + Hebrew)
Disarm.bidi_conflict?("helloМ")                    # => false (both LTR)

Disarm.inspect_auto_lang(text)

Explain how lang: "auto" detection resolves text — a hash with :script and :chosen_lang (both nil if undetected), the :reason, and any :discriminators_hit.

Disarm.inspect_auto_lang("Москва") # => { script: "Cyrillic", chosen_lang: "ru", reason: "script_default", discriminators_hit: [] }

Disarm.lang_info(code) · Disarm.script_info(name)

Curated metadata for one language code (e.g. "de") or one script name (e.g. "Coptic"), each a hash with symbol keys. lang_info returns { name:, script:, region:, context: } (where :context is "none"/"partial"/"full"); script_info returns { name:, default_lang:, example:, context_aware: } (:default_lang is nil when none). Each raises Disarm::InvalidArgument on an unknown code/name.

Disarm.lang_info("de")[:name]                      # => "German"
Disarm.lang_info("de")[:script]                    # => "Latin"
Disarm.script_info("Coptic")[:default_lang]        # => "cop"
Disarm.script_info("Coptic")[:context_aware]       # => false

Disarm.list_scripts · Disarm.list_context_langs

Enumerate what disarm knows: list_scripts is every Unicode script as a stable UCD identifier (includes "Common"/"Inherited"), sorted by name; list_context_langs is the language codes with context-aware transliteration support, sorted by code. Both return an Array<String>.

Disarm.list_scripts.include?("Latin")              # => true
Disarm.list_context_langs                          # => ["ar", "fa", "he"]

Anomaly detection

Disarm.has_anomalies?(text, lexicon) · Disarm.inspect_anomalies(text, lexicon)

Flag text carrying out-of-place characters that disguise a real word — a cross-script homoglyph, leet, segmentation, a zero-width / bidi control, or zalgo. Reports a technical fact, not intent. lexicon is a common-word Array or Set (used only by the leet and segmentation branches). inspect_anomalies returns a hash with :anomalous, :kinds, :findings (each { kind:, token:, start:, end:, detail:, reason: }), and :reason. See Anomaly Detection for the detected classes.

Disarm.has_anomalies?("get fr33 now", ["free"])        # => true
Disarm.inspect_anomalies("paypаl", ["paypal"])[:kinds] # => ["mixed_script"]

Errors

Everything disarm raises descends from Disarm::Error < StandardError, so a single rescue Disarm::Error catches the whole surface. Bad input — an unknown scheme/target token, a non-String argument, a negative max_length — raises the more specific Disarm::InvalidArgument (itself a Disarm::Error), with the original native backtrace preserved.

Class Raised for
Disarm::Error Base class — rescue this to catch everything.
Disarm::InvalidArgument An invalid argument (bad scheme/target, wrong type, out-of-range option).
begin
  Disarm.transliterate("x", scheme: :klingon)
rescue Disarm::InvalidArgument => e   # also rescuable as Disarm::Error
  warn e.message
end

Stability

The Ruby gem version tracks the Rust crate and Python package numerically. The binding inherits the core's behavioural guarantees and limits verbatim — read the Threat Model before relying on it in a security context, and note that transliteration output is data-driven (Unicode tables, romanization standards) and can change across releases without being treated as a breaking change. Pin a version if you need byte-stable output.

Not surfaced in this binding (Python-only for now — compose the primitives, or use the Python package): the output encoders / encoding family (escape_html, percent_encode, strip_log_injection, detect_encoding, decode_to_utf8), the ml_normalize / strip_format / canonicalize_strict presets, and the list_profiles / list_langs / is_ascii helpers. In particular, strip_log_injection is a security-relevant control that is not available here — neutralize log-injection at the sink, or use the Python binding.