User Guide
Core concepts and usage for each feature area.
- Getting Started — install + quickstart for Python · Rust · Ruby
- Adversarial-Text Defense — TR39 visual confusable mapping vs phonetic transliteration, the XMR benchmark, and why it matters
- Transliteration — Unicode → ASCII with language profiles, plus reverse (Latin → native script)
- Slugification — URL-safe slug generation, drop-in python-slugify replacement
- Normalization — NFC / NFD / NFKC / NFKD Unicode normalization
- Confusable Detection — TR39 homoglyph detection and normalization
- Filename Sanitization — Cross-platform safe filenames
- Text Cleaning — Accent stripping, case folding, whitespace collapse
- Grapheme Clusters — User-perceived character counting, splitting, and truncation
- Text Pipeline — Composable, pre-compiled multi-step processing
- Language Support — Built-in profiles, auto-detection, custom profiles
- Abjad Scripts — Context-aware Arabic, Persian, and Hebrew with dictionary-based vowel restoration
- Language Detection — How `lang="auto"` works: script identification, character-level discrimination, fail-safe fallbacks
- Policy Templates — Named institutional presets for libraries, web apps, ML, and more
- CLI — Command-line usage, piping, and shell integration
API Reference
Complete function signatures, parameters, and return types.
- Overview — API reference index
- Core Transforms — `transliterate`, `slugify`, `normalize`, `sanitize_filename`, `strip_accents`, `strip_zalgo`, `fold_case`, `collapse_whitespace`, `demojize`, `strip_bidi` (all accept `str` or `list[str]`)
- Precompiled Pipelines — `security_clean`, `ml_normalize`, `catalog_key`, `display_clean`, `search_key`, `sort_key`, `normalize_user_input`, `PRESETS`, `get_pipeline`, `list_profiles`
- Classes — `Text`, `Slugifier`, `UniqueSlugifier`, `TextPipeline`, compatibility aliases
- Predicates — `detect_scripts`, `inspect_auto_lang`, `is_mixed_script`, `is_confusable`, `is_ascii`, `is_normalized`, `is_zalgo`, `is_suspicious_hostname`
- Grapheme Clusters — `grapheme_len`, `grapheme_split`, `grapheme_truncate`
- Encoding Detection — `detect_encoding`, `decode_to_utf8`
- Language Profiles — `list_langs`, `register_lang`, `register_replacements`
- Enums & Types — `Script`, `NF`, `EmojiProvider`, type aliases, language constants
- Exceptions — `DisarmError`
Reference
- Language Reference — All languages: codes, names, reference texts, and per-language transliteration rule tables
- Provenance — Standards and sources behind every transliteration mapping
Architecture
Internal design documentation for contributors and advanced users.
- Transliteration Engine — PHF lookup, language table chain, Indic virama handling
- Data Tables — TSV format, build.rs code generation, compile-time PHF
- Pipeline — TextPipeline internals, execution order, step bitflags
- Emoji Engine — Emoji detection, provider system, pure-Rust path
- Emoji Plugins — EmojiProvider protocol, custom providers
- Security — Confusable detection, hostname validation, bidi stripping
- Performance — Optimization strategies, PHF tables, batch amortization
- Testing & Guarantees — Test philosophy, property-based testing, security invariants, CI matrix
- Exhaustive Testing — Compile-time assertions, exhaustive domain coverage, stated invariants (I1–I7)
- Transliteration Comparison — Character-level diff vs Unidecode and anyascii
Benchmarks
- Performance Overview — Benchmark results: throughput and per-call speedups vs Unidecode, python-slugify, and pathvalidate
- Benchmark Suite — How to run benchmarks, Criterion and timeit configurations
Migration Guides
Parameter-compatible replacements for existing libraries.
- Migration Overview — Feature comparison matrix
- From Unidecode / text-unidecode — Drop-in `unidecode()` alias
- From python-slugify / awesome-slugify — Parameter-compatible `slugify()`
- From confusable_homoglyphs — Script detection and normalization
- From pathvalidate — Filename sanitization
- From anyascii — Language-aware transliteration
Other
- Limitations — Known constraints, edge cases, and design trade-offs