User Guide

Core concepts and usage for each feature area.

Getting Started — install + quickstart for Python · Rust · Ruby
Adversarial-Text Defense — TR39 visual confusable mapping vs phonetic transliteration, the XMR benchmark, and why it matters
Transliteration — Unicode → ASCII with language profiles, plus reverse (Latin → native script)
Slugification — URL-safe slug generation, drop-in python-slugify replacement
Normalization — NFC / NFD / NFKC / NFKD Unicode normalization
Confusable Detection — TR39 homoglyph detection and normalization
Filename Sanitization — Cross-platform safe filenames
Text Cleaning — Accent stripping, case folding, whitespace collapse
Grapheme Clusters — User-perceived character counting, splitting, and truncation
Text Pipeline — Composable, pre-compiled multi-step processing
Language Support — Built-in profiles, auto-detection, custom profiles
Abjad Scripts — Context-aware Arabic, Persian, and Hebrew with dictionary-based vowel restoration
Language Detection — How lang="auto" works: script identification, character-level discrimination, fail-safe fallbacks

Policy Templates — Named institutional presets for libraries, web apps, ML, and more
CLI — Command-line usage, piping, and shell integration

Complete function signatures, parameters, and return types.

Overview — API reference index
Core Transforms — transliterate, slugify, normalize, sanitize_filename, strip_accents, strip_zalgo, fold_case, collapse_whitespace, demojize, strip_bidi (all accept str or list[str])
Precompiled Pipelines — canonicalize, ml_normalize, catalog_key, strip_format, search_key, sort_key, canonicalize_strict, PRESETS, get_pipeline, list_profiles
Classes — Text, Slugifier, UniqueSlugifier, TextPipeline, compatibility aliases
Predicates — detect_scripts, inspect_auto_lang, is_mixed_script, is_confusable, is_ascii, is_normalized, is_zalgo, is_suspicious_hostname
Grapheme Clusters — grapheme_len, grapheme_split, grapheme_truncate
Encoding Detection — detect_encoding, decode_to_utf8
Language Profiles — list_langs, register_lang, register_replacements
Enums & Types — Script, NF, EmojiProvider, type aliases, language constants
Exceptions — DisarmError

Language Reference — All languages: codes, names, reference texts, and per-language transliteration rule tables
Provenance — Standards and sources behind every transliteration mapping

Internal design documentation for contributors and advanced users.

Transliteration Engine — PHF lookup, language table chain, Indic virama handling
Data Tables — TSV format, build.rs code generation, compile-time PHF
Pipeline — TextPipeline internals, execution order, step bitflags
Emoji Engine — Emoji detection, provider system, pure-Rust path
Emoji Plugins — EmojiProvider protocol, custom providers
Security — Confusable detection, hostname validation, bidi stripping
Performance — Optimization strategies, PHF tables, batch amortization
Testing & Guarantees — Test philosophy, property-based testing, security invariants, CI matrix
Exhaustive Testing — Compile-time assertions, exhaustive domain coverage, stated invariants (I1–I7)
Transliteration Comparison — Character-level diff vs Unidecode and anyascii

Performance Overview — Benchmark results: throughput and per-call speedups vs Unidecode, python-slugify, and pathvalidate
Benchmark Suite — How to run benchmarks, Criterion and timeit configurations

Parameter-compatible replacements for existing libraries.