User Guide

Core concepts and usage for each feature area.

  • Getting Started — Installation and quickstart for Python, Rust, and Ruby
  • Adversarial-Text Defense — TR39 visual confusable mapping vs phonetic transliteration, the XMR benchmark, and why it matters
  • Transliteration — Unicode → ASCII with language profiles, plus reverse (Latin → native script)
  • Slugification — URL-safe slug generation, drop-in python-slugify replacement
  • Normalization — NFC / NFD / NFKC / NFKD Unicode normalization
  • Confusable Detection — TR39 homoglyph detection and normalization
  • Filename Sanitization — Cross-platform safe filenames
  • Text Cleaning — Accent stripping, case folding, whitespace collapse
  • Grapheme Clusters — User-perceived character counting, splitting, and truncation
  • Text Pipeline — Composable, pre-compiled multi-step processing
  • Language Support — Built-in profiles, auto-detection, custom profiles
  • Abjad Scripts — Context-aware Arabic, Persian, and Hebrew with dictionary-based vowel restoration
  • Language Detection — How lang="auto" works: script identification, character-level discrimination, fail-safe fallbacks

  • Policy Templates — Named institutional presets for libraries, web apps, ML, and more
  • CLI — Command-line usage, piping, and shell integration

API Reference

Complete function signatures, parameters, and return types.

  • Overview — API reference index
  • Core Transforms — transliterate, slugify, normalize, sanitize_filename, strip_accents, strip_zalgo, fold_case, collapse_whitespace, demojize, strip_bidi (all accept str or list[str])
  • Precompiled Pipelines — security_clean, ml_normalize, catalog_key, display_clean, search_key, sort_key, normalize_user_input, PRESETS, get_pipeline, list_profiles
  • Classes — Text, Slugifier, UniqueSlugifier, TextPipeline, compatibility aliases
  • Predicates — detect_scripts, inspect_auto_lang, is_mixed_script, is_confusable, is_ascii, is_normalized, is_zalgo, is_suspicious_hostname
  • Grapheme Clusters — grapheme_len, grapheme_split, grapheme_truncate
  • Encoding Detection — detect_encoding, decode_to_utf8
  • Language Profiles — list_langs, register_lang, register_replacements
  • Enums & Types — Script, NF, EmojiProvider, type aliases, language constants
  • Exceptions — DisarmError

Reference

  • Language Reference — All languages: codes, names, reference texts, and per-language transliteration rule tables
  • Provenance — Standards and sources behind every transliteration mapping

Architecture

Internal design documentation for contributors and advanced users.

  • Transliteration Engine — PHF lookup, language table chain, Indic virama handling
  • Data Tables — TSV format, build.rs code generation, compile-time PHF
  • Pipeline — TextPipeline internals, execution order, step bitflags
  • Emoji Engine — Emoji detection, provider system, pure-Rust path
  • Emoji Plugins — EmojiProvider protocol, custom providers
  • Security — Confusable detection, hostname validation, bidi stripping
  • Performance — Optimization strategies, PHF tables, batch amortization
  • Testing & Guarantees — Test philosophy, property-based testing, security invariants, CI matrix
  • Exhaustive Testing — Compile-time assertions, exhaustive domain coverage, stated invariants (I1–I7)
  • Transliteration Comparison — Character-level diff vs Unidecode and anyascii

Benchmarks

  • Performance Overview — Benchmark results: throughput and per-call speedups vs Unidecode, python-slugify, and pathvalidate
  • Benchmark Suite — How to run benchmarks, Criterion and timeit configurations

Migration Guides

Parameter-compatible replacements for existing libraries.


Other

  • Limitations — Known constraints, edge cases, and design trade-offs