Text Processing

disarm offers two ways to compose multiple transforms: the fluent Text builder for readability and one-off processing, and TextPipeline for high-throughput batch use.

Text builder (recommended)

Wrap a string in Text, chain methods, and extract the result with .value or str(). Each method returns a new Text; instances are immutable, like a Python str.

Python / Rust

from disarm import Text

result = (Text("  Héllo   Straße  ")
    .normalize(form="NFC")
    .transliterate(lang="de")
    .fold_case()
    .collapse_whitespace()
    .value)
assert result == 'hello strasse'

Rust has no Text builder — chain the standalone operations (here via the DisarmStr extension trait and the Transliterate builder) instead.

use disarm::api::{self, NormalizationForm, Transliterate};
use disarm::DisarmStr;

let normalized = "  Héllo   Straße  ".normalize(NormalizationForm::Nfc);
let romanized = Transliterate::new().lang("de").run(&normalized);
let folded = romanized.fold_case();
let result = api::collapse_whitespace(&folded);
assert_eq!(result, "hello strasse"); // => "hello strasse"
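
To make the "each method returns a new Text" behavior concrete, here is a minimal standard-library sketch of an immutable fluent wrapper. `MiniText` and its three methods are hypothetical stand-ins written for illustration; they are not disarm's implementation, and they cover only normalization, case folding, and whitespace collapsing (no transliteration, so the accent survives).

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniText:
    """Hypothetical stand-in for a fluent text wrapper (not disarm's Text)."""

    value: str

    def normalize(self, form: str = "NFC") -> "MiniText":
        # Return a new wrapper; the original instance is left untouched.
        return MiniText(unicodedata.normalize(form, self.value))

    def fold_case(self) -> "MiniText":
        # str.casefold applies full Unicode case folding (e.g. ß -> ss).
        return MiniText(self.value.casefold())

    def collapse_whitespace(self) -> "MiniText":
        # split() with no arguments drops leading/trailing whitespace and
        # splits on runs, so joining with " " leaves single spaces.
        return MiniText(" ".join(self.value.split()))


base = MiniText("  H\u00e9llo   Stra\u00dfe  ")
result = base.normalize("NFC").fold_case().collapse_whitespace().value
```

Because the dataclass is frozen, `base` still holds the original string after the chain runs, which is the property that makes chaining safe.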

Ordering is explicit

Steps execute in the order you chain them. This gives full control — there is no hidden reordering.

Python / Rust

from disarm import Text

# Strip accents first, then transliterate the remainder
assert Text("café").strip_accents().transliterate().value == 'cafe'

# Transliterate first (accents handled by the transliteration table)
assert Text("café").transliterate().value == 'cafe'

use disarm::DisarmStr;

// Strip accents first, then transliterate the remainder
assert_eq!("café".strip_accents().transliterate(), "cafe"); // => "cafe"

// Transliterate first (accents handled by the transliteration table)
assert_eq!("café".transliterate(), "cafe"); // => "cafe"
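
Both chains above happen to produce the same output, so the following standard-library sketch shows a case where order changes the result. `strip_accents` and `ascii_only` here are hypothetical helpers built on `unicodedata`, not disarm functions; `ascii_only` simply drops non-ASCII characters, which is far cruder than real transliteration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # Decompose (NFD), then drop combining marks such as U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ascii_only(text: str) -> str:
    # Crude stand-in for transliteration: discard anything outside ASCII.
    return text.encode("ascii", "ignore").decode("ascii")


accents_first = ascii_only(strip_accents("caf\u00e9"))  # "cafe"
ascii_first = strip_accents(ascii_only("caf\u00e9"))    # "caf"
```

With the lossy step first, the accented letter is gone before the accent can be stripped, which is why a chain that runs in the written order is easier to reason about.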

Branching

Because each step returns a new Text, you can branch from a common base:

Python / Rust

from disarm import Text

base = Text("Héllo Wörld").normalize(form="NFC")

ascii_version = base.transliterate().value
assert ascii_version == 'Hello World'
lowered = base.fold_case().value
assert lowered == 'héllo wörld'
slug = base.transliterate().slugify().value
assert slug == 'hello-world'

In Rust, with no builder to hold the intermediate state, bind the common base to a variable and branch from it:

use disarm::api::{self, NormalizationForm, SlugConfig};
use disarm::DisarmStr;

let base = "Héllo Wörld".normalize(NormalizationForm::Nfc);

let ascii_version = base.transliterate();
assert_eq!(ascii_version, "Hello World"); // => "Hello World"
let lowered = base.fold_case();
assert_eq!(lowered, "héllo wörld"); // => "héllo wörld"
let slug = api::slugify(&base.transliterate(), &SlugConfig::new());
assert_eq!(slug, "hello-world"); // => "hello-world"

Available transforms

All 8 standalone transform functions are available as chainable methods:

Method	Description
`.normalize(form)`	Unicode normalization (NFC, NFD, NFKC, NFKD)
`.normalize_confusables()`	Replace homoglyphs with Latin equivalents
`.strip_accents()`	Remove combining diacritical marks
`.transliterate(lang=...)`	Unicode → ASCII transliteration
`.fold_case()`	Unicode case folding (ß→ss, İ→i̇, etc.)
`.collapse_whitespace()`	Fold whitespace runs to single spaces (fold-only)
`.slugify(separator=...)`	URL-safe slug generation
`.sanitize_filename()`	OS-safe filename sanitization
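
For readers unfamiliar with the underlying Unicode operations, this sketch shows what normalization forms and case folding do, using only Python's standard library. It illustrates the concepts behind `.normalize()` and `.fold_case()`; whether disarm matches the standard library on every input is an assumption this page does not confirm.

```python
import unicodedata

composed = "\u00e9"  # é as a single code point

nfd = unicodedata.normalize("NFD", composed)    # "e" + U+0301 combining acute
nfc = unicodedata.normalize("NFC", nfd)         # back to the single code point
nfkc = unicodedata.normalize("NFKC", "\ufb01")  # the "fi" ligature becomes "fi"

folded_eszett = "Stra\u00dfe".casefold()  # "strasse"
folded_dotted_i = "\u0130".casefold()     # "i" + U+0307 combining dot above
```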

Predicates

Predicates return their native type (for example a bool or a list of scripts) rather than a new Text, so they cannot be chained further:

Python / Rust

from disarm import Script, Text

t = Text("hello мир")
assert t.is_mixed_script()
assert t.detect_scripts() == [Script.LATIN, Script.CYRILLIC]
assert not t.is_ascii()

assert Text("café").transliterate().is_ascii()

use disarm::api;
use disarm::DisarmStr;

assert!("hello мир".is_mixed_script()); // => true
assert_eq!("hello мир".detect_scripts(), vec!["Latin", "Cyrillic"]); // => ["Latin", "Cyrillic"]
assert!(!api::is_ascii("hello мир")); // => false

assert!(api::is_ascii(&"café".transliterate())); // => true
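
One rough way to see how a mixed-script check can work is to bucket letters by the word that starts their Unicode character name. This is an illustrative heuristic with hypothetical helper names (`detect_scripts_sketch`, `is_mixed_script_sketch`); it is not how disarm detects scripts, and a production implementation would more likely use the Unicode Script property.

```python
import unicodedata


def detect_scripts_sketch(text: str) -> list[str]:
    """Return script-like name prefixes of letters, in first-seen order."""
    seen: list[str] = []
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation
        # e.g. "CYRILLIC SMALL LETTER EM" -> "CYRILLIC"
        prefix = unicodedata.name(ch, "UNKNOWN").split()[0]
        if prefix not in seen:
            seen.append(prefix)
    return seen


def is_mixed_script_sketch(text: str) -> bool:
    # More than one distinct prefix means letters from several scripts.
    return len(detect_scripts_sketch(text)) > 1


scripts = detect_scripts_sketch("hello \u043c\u0438\u0440")
```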

Extracting the result

Python / Rust

from disarm import Text

t = Text("café").transliterate()

assert t.value == 'cafe'       # property access
assert str(t) == 'cafe'        # str() conversion
assert len(t) == 4
assert t == "cafe"             # compares with str directly

In Rust there is no wrapper to unwrap: each operation returns the string (a Cow<str> or String) directly.

use disarm::DisarmStr;

let t = "café".transliterate();

assert_eq!(t, "cafe");   // => "cafe"
assert_eq!(t.len(), 4);  // => 4 (bytes)
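
The `(bytes)` note matters because `len` on a Python str counts code points while Rust's `str::len` counts UTF-8 bytes. The two agree for the ASCII result above but diverge for untransliterated input, as this standard-library check shows:

```python
original = "caf\u00e9"  # "café" with a precomposed é

code_points = len(original)                 # what Python's len() reports
utf8_bytes = len(original.encode("utf-8"))  # what Rust's str::len() reports

ascii_text = "cafe"
ascii_code_points = len(ascii_text)
ascii_utf8_bytes = len(ascii_text.encode("utf-8"))
```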

TextPipeline (batch processing)

TextPipeline is a pre-compiled, reusable processor. Configure it once at construction, then call it repeatedly. Operations run in a fixed order (see "Fixed execution order"), regardless of the order in which you pass the options.

Use this when processing large datasets where the same transform chain applies to every item.

from disarm import TextPipeline

pipe = TextPipeline(
    normalize="NFC",
    confusables=True,
    strip_accents=True,
    fold_case=True,
    collapse_whitespace=True,
)

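# Placeholder input; substitute your own iterable of strings
large_dataset = ["  Héllo   Wörld  ", "Ünïcödé  text"]

# Call repeatedly; construction cost is amortized across items
cleaned = [pipe(text) for text in large_dataset]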

Parameters

Parameter	Type	Default	Description
`normalize`	`str | None`	`None`	Normalization form: `"NFC"`, `"NFD"`, `"NFKC"`, `"NFKD"`
`transliterate`	`bool`	`False`	Convert to ASCII via transliteration tables
`lang`	`str | None`	`None`	Language profile for transliteration. Use `"auto"` to detect from script.
`confusables`	`bool`	`False`	Normalize confusable homoglyphs to Latin
`strip_accents`	`bool`	`False`	Remove diacritical marks
`fold_case`	`bool`	`False`	Unicode case folding
`collapse_whitespace`	`bool`	`False`	Normalize whitespace to single spaces
`strip_control`	`bool | None`	`None`	Strip control characters. Defaults to `True` when `collapse_whitespace=True`, `False` otherwise. Can be used independently.
`strip_zero_width`	`bool | None`	`None`	Strip zero-width characters. Defaults to `True` when `collapse_whitespace=True`, `False` otherwise. Can be used independently.
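
The `None` defaults for `strip_control` and `strip_zero_width` amount to a tri-state flag: an explicit `True` or `False` wins, and `None` inherits from `collapse_whitespace`. A minimal sketch of that resolution rule, using a hypothetical helper name (`resolve_flag`) rather than anything from disarm:

```python
from typing import Optional


def resolve_flag(explicit: Optional[bool], collapse_whitespace: bool) -> bool:
    """Resolve a None/True/False option against collapse_whitespace."""
    if explicit is None:
        # Unset: follow collapse_whitespace, as the table describes.
        return collapse_whitespace
    # Set explicitly: use the caller's choice regardless of other options.
    return explicit
```

Under this rule, passing `strip_control=True` with `collapse_whitespace=False` still strips control characters, which matches the "can be used independently" note in the table.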

Fixed execution order

Operations always execute in this order, regardless of how you specify them:

1. Normalize: Unicode normalization
2. Confusables: Replace homoglyphs
3. Demojize: Expand emoji to text
4. Strip accents: Remove combining marks
5. Transliterate: Convert to ASCII
6. Fold case: Case folding
7. Strip control: Remove control characters
8. Strip zero-width: Remove zero-width/invisible characters
9. Collapse whitespace: Whitespace normalization
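
A fixed execution order can be implemented by walking a canonical list of steps at construction time and keeping only the enabled ones, so keyword order never matters. The sketch below does this for four of the listed steps using the standard library; `MiniPipeline` and its helpers are hypothetical and are not disarm's implementation.

```python
import unicodedata
from typing import Callable, List, Optional


def _strip_accents(text: str) -> str:
    # Decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def _collapse_whitespace(text: str) -> str:
    return " ".join(text.split())


class MiniPipeline:
    """Hypothetical fixed-order pipeline (illustration only)."""

    def __init__(
        self,
        *,
        normalize: Optional[str] = None,
        strip_accents: bool = False,
        fold_case: bool = False,
        collapse_whitespace: bool = False,
    ) -> None:
        # Steps are appended in one canonical order, whatever order the
        # caller used for the keyword arguments.
        steps: List[Callable[[str], str]] = []
        if normalize is not None:
            steps.append(lambda s, form=normalize: unicodedata.normalize(form, s))
        if strip_accents:
            steps.append(_strip_accents)
        if fold_case:
            steps.append(str.casefold)
        if collapse_whitespace:
            steps.append(_collapse_whitespace)
        self._steps = tuple(steps)

    def __call__(self, text: str) -> str:
        # Apply the pre-built steps; no per-call configuration work.
        for step in self._steps:
            text = step(text)
        return text


pipe_a = MiniPipeline(normalize="NFC", strip_accents=True, fold_case=True, collapse_whitespace=True)
pipe_b = MiniPipeline(collapse_whitespace=True, fold_case=True, strip_accents=True, normalize="NFC")
```

Both instances hold the same step tuple, so they give identical output for any input even though the keyword arguments were written in opposite orders.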

When to use which

Scenario	Use
One-off text processing	`Text` builder
Ad-hoc chains with varying steps	`Text` builder
Processing a large dataset uniformly	`TextPipeline`
Need explicit control over step ordering	`Text` builder
Batch ETL / search index normalization	`TextPipeline`