Predicates
Functions that inspect text and return boolean or structured results without modifying the input.
detect_scripts
detect_scripts(text: str) -> list[Script]
Return the distinct Unicode scripts present in text, as a list in order of first appearance.
Examples:
>>> detect_scripts("Hello")
[Script.LATIN]
>>> detect_scripts("Hello Мир")
[Script.LATIN, Script.CYRILLIC]
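To make the "order of first appearance" behaviour concrete, here is a rough standard-library sketch. It is not how disarm resolves scripts: Python's `unicodedata` has no Script property, so the hypothetical helper `rough_scripts` falls back on the first word of each letter's Unicode name, which is only a heuristic.

```python
import unicodedata


def rough_scripts(text: str) -> list[str]:
    """Approximate the scripts in text, in order of first appearance.

    Assumption: the first word of a letter's Unicode name ("LATIN",
    "CYRILLIC", ...) stands in for its script. This is a heuristic,
    not the Unicode Script property.
    """
    seen: dict[str, None] = {}  # dicts preserve insertion order
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces and punctuation
        name = unicodedata.name(ch, "")
        if name:
            seen.setdefault(name.split()[0], None)
    return list(seen)


print(rough_scripts("Hello \u041c\u0438\u0440"))  # Latin letters, then Cyrillic
```

The insertion-ordered dict is what gives the list its first-appearance ordering while still dropping repeats.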
inspect_auto_lang
inspect_auto_lang(text: str) -> dict[str, str | list[str] | None]
Inspect how lang="auto" would resolve for the given text.
Use this to audit or log the detection decision made by the three-stage auto-detection pipeline.
Examples:
>>> inspect_auto_lang("Київ")["chosen_lang"]
'uk'
>>> inspect_auto_lang("Москва")["reason"]
'script_default'
from disarm import inspect_auto_lang
inspect_auto_lang("Київ")
# {'script': 'Cyrillic', 'chosen_lang': 'uk', 'reason': 'discriminator', 'discriminators_hit': ['ї']}
inspect_auto_lang("Москва")
# {'script': 'Cyrillic', 'chosen_lang': 'ru', 'reason': 'script_default', 'discriminators_hit': []}
inspect_auto_lang("hello")
# {'script': None, 'chosen_lang': None, 'reason': 'no_detection', 'discriminators_hit': []}
See Language Detection for details.
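The three `reason` values shown above can be illustrated with a toy model. This is only a sketch of the idea of a discriminator versus a script default, not the library's pipeline: the function `toy_inspect`, the discriminator set, and the Cyrillic-only scope are all assumptions made for illustration.

```python
import unicodedata

# Assumption: a tiny illustrative set of letters used in Ukrainian but not
# in Russian (ї, є, ґ, і). The real discriminator tables are not shown here.
UKRAINIAN_ONLY = set("\u0457\u0454\u0491\u0456")


def toy_inspect(text: str) -> dict:
    """Toy stand-in that mimics the documented result shape for Cyrillic."""
    has_cyrillic = any(
        unicodedata.name(ch, "").startswith("CYRILLIC") for ch in text
    )
    if not has_cyrillic:
        return {"script": None, "chosen_lang": None,
                "reason": "no_detection", "discriminators_hit": []}
    # Distinct discriminator letters, in order of first appearance.
    hits = [ch for ch in dict.fromkeys(text.lower()) if ch in UKRAINIAN_ONLY]
    if hits:
        return {"script": "Cyrillic", "chosen_lang": "uk",
                "reason": "discriminator", "discriminators_hit": hits}
    return {"script": "Cyrillic", "chosen_lang": "ru",
            "reason": "script_default", "discriminators_hit": []}


print(toy_inspect("\u041a\u0438\u0457\u0432")["reason"])
```

Logging the whole returned dictionary, rather than only `chosen_lang`, keeps the evidence for the decision next to the decision itself.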
is_mixed_script
is_mixed_script(text: str) -> bool
True if text contains characters from more than one Unicode script.
Examples:
>>> is_mixed_script("Hello")
False
>>> is_mixed_script("Hello Мир") # Latin + Cyrillic
True
has_bidi_conflict
has_bidi_conflict(text: str) -> bool
True if text mixes strong left-to-right and strong right-to-left characters.
This is the precondition for Unicode Bidi display reordering (UAX #9), the
structural signal behind "BiDi Swap"-style spoofs, where an LTR brand label
sits beside an RTL domain (e.g. "varonis.com.ו.קום"). Unlike a
bidi-override (U+202x) check, it fires on the real letters themselves. Latin,
Cyrillic, Greek and CJK are left-to-right; Hebrew, Arabic, Syriac, Thaana and
N'Ko are right-to-left; digits, punctuation and combining marks are neutral
and never create a conflict on their own.
A False result is not a safety guarantee.
Examples:
>>> has_bidi_conflict("hello")
False
>>> has_bidi_conflict("helloא") # Latin + Hebrew
True
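The strong-direction rule described above can be approximated with the standard library. The sketch below is an assumption about how such a check might look, not the library's code: the hypothetical helper `mixes_strong_directions` treats Bidi class `L` as strong left-to-right and `R` or `AL` as strong right-to-left, and ignores every other class.

```python
import unicodedata

STRONG_RTL = {"R", "AL"}  # Hebrew-style and Arabic-style strong RTL classes


def mixes_strong_directions(text: str) -> bool:
    """True if text has both a strong LTR and a strong RTL character."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    has_ltr = "L" in classes
    has_rtl = bool(classes & STRONG_RTL)
    # Digits (EN/AN), punctuation and combining marks (NSM) fall into
    # other classes, so they never trigger a conflict by themselves.
    return has_ltr and has_rtl


print(mixes_strong_directions("hello"))        # only strong LTR letters
print(mixes_strong_directions("hello\u05d0"))  # Latin plus Hebrew alef
```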
is_confusable
is_confusable(text: str, *, target_script: str = 'latin', greedy: bool | None = None, preferred_aliases: list[str] | None = None) -> bool
True if text contains characters confusable with target-script characters.
Examples:
>>> is_confusable("pаypal") # Cyrillic а looks like Latin a
True
>>> is_confusable("paypal") # all genuine Latin
False
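As a rough picture of what a confusable check does, the sketch below uses a tiny hand-picked table of Cyrillic letters that resemble Latin ones. The table and the helpers `has_lookalike` and `latin_skeleton` are hypothetical; the library's bundled data is far larger and is not reproduced here.

```python
# Assumption: five illustrative pairs only, not the bundled confusables table.
CYRILLIC_LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def has_lookalike(text: str) -> bool:
    """True if any character appears in the toy lookalike table."""
    return any(ch in CYRILLIC_LOOKALIKES for ch in text)


def latin_skeleton(text: str) -> str:
    """Replace each known lookalike with its Latin counterpart."""
    return text.translate(str.maketrans(CYRILLIC_LOOKALIKES))


spoofed = "p\u0430ypal"  # second letter is Cyrillic, not Latin
print(has_lookalike(spoofed), latin_skeleton(spoofed))
```

A table-driven check like this can only report lookalikes it knows about, which is why a False result says nothing about characters missing from the table.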
is_ascii
is_ascii(text: str) -> bool
True if all characters are in U+0000–U+007F.
Examples:
>>> is_ascii("hello 123")
True
>>> is_ascii("café")
False
is_normalized
is_normalized(text: str, *, form: NormalizationForm = 'NFC') -> bool
True if text is already in the specified normalization form.
Examples:
>>> is_normalized("café") # NFC by default
True
>>> is_normalized("e\u0301", form="NFC") # NFD decomposed
False
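For comparison, the same question can be asked of the standard library, which may help when checking results by hand. The wrapper name `already_normalized` is hypothetical; `unicodedata.is_normalized` itself requires Python 3.8 or later.

```python
import unicodedata


def already_normalized(text: str, form: str = "NFC") -> bool:
    """True if normalizing text to the given form would leave it unchanged."""
    return unicodedata.is_normalized(form, text)


composed = "caf\u00e9"   # precomposed e-acute: already NFC
decomposed = "e\u0301"   # e + combining acute: NFD, not NFC

print(already_normalized(composed))
print(already_normalized(decomposed, form="NFC"))
print(already_normalized(decomposed, form="NFD"))

# Equivalent, slower formulation: normalize and compare.
assert (unicodedata.normalize("NFC", decomposed) == decomposed) is False
```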
is_zalgo
is_zalgo(text: str, *, threshold: int = 3) -> bool
Detect whether text contains zalgo-style combining mark abuse.
Returns True if any base character has more than threshold
consecutive combining marks in NFD decomposition.
Examples:
>>> is_zalgo("café")
False
>>> is_zalgo("Việt Nam")
False
>>> is_zalgo("ḧ̸̡̢̧̛̗̱̜̼̯̞̙́̑̾̊̿̏̒̓̕ě̵̢̧̛̗̱̜̼̯̞̙̈́̑̾̊̿̏̒̓̕l̸̡̢̧̛̗̱̜̼̯̞̙̈́̑̾̊̿̏̒̓̕l̸̡̢̧̛̗̱̜̼̯̞̙̈́̑̾̊̿̏̒̓̕ơ̵̢̧̗̱̜̼̯̞̙̈́̑̾̊̿̏̒̓̕")
True
from disarm import is_zalgo
is_zalgo("café") # False (1 combining mark — normal)
is_zalgo("Việt Nam") # False (2 combining marks — normal)
# Zalgo: 'a' with 20 stacked combining graves
is_zalgo("a" + "\u0300" * 20) # True
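The threshold rule can be sketched with the standard library as follows. This is an illustration of the stated rule (more than `threshold` consecutive combining marks after NFD decomposition), not the library's code; the helpers `longest_mark_run` and `looks_like_zalgo` are hypothetical, and treating every character in a Unicode `M*` category as a combining mark is an assumption.

```python
import unicodedata


def longest_mark_run(text: str) -> int:
    """Length of the longest run of consecutive combining marks in NFD."""
    longest = current = 0
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):  # Mn, Mc or Me
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a base character ends the run
    return longest


def looks_like_zalgo(text: str, threshold: int = 3) -> bool:
    """True if some base character carries more than threshold marks."""
    return longest_mark_run(text) > threshold


print(longest_mark_run("Vi\u1ec7t Nam"))        # e + dot below + circumflex
print(looks_like_zalgo("a" + "\u0300" * 20))    # 20 stacked grave accents
```

Decomposing to NFD first matters: a precomposed letter such as ệ would otherwise count as zero marks instead of two.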
is_suspicious_hostname
is_suspicious_hostname(hostname: str) -> tuple[bool, HostnameAnalysis]
Flag a hostname as suspicious for Unicode homoglyph spoofing.
Returns (suspicious, analysis) where analysis is a
HostnameAnalysis with attributes:
- suspicious (bool): True if a problem was detected (mixed-script, a bundled-table confusable, or a bidi-direction conflict).
- scripts (list[str]): Unicode scripts found across all labels.
- mixed_script (bool): True if any single label contains more than one script.
- has_confusables (bool): True if confusable homoglyphs found.
- bidi_conflict (bool): True if the decoded hostname mixes strong left-to-right and strong right-to-left characters (the "BiDi Swap" reorder precondition). Folded into suspicious.
- cross_label_script (bool): True if the labels span more than one distinct script. Broader and noisier than bidi_conflict (it fires on benign IDN ccTLDs like google.рф), so it is not folded into suspicious; exposed for caller policy.
- label_scripts (list[list[str]]): per-label resolved scripts, left to right.
- canonical (str): Latin-normalized form of the hostname.
A hostname is flagged suspicious if any single label is mixed-script
(draws on more than one Unicode script, excluding Common/Inherited),
contains confusable homoglyphs, or has a bidi-direction conflict
(bidi_conflict). The mixed-script rule is conservative and fails closed:
it flags benign combinations such as Latin+CJK as well as spoofing ones, so a
caller wanting a more permissive policy can inspect the mixed_script and
scripts fields and decide for itself.
A False (not-suspicious) result is not a safety guarantee. It means
only that no mixed-script label, no confusable from the bundled TR39
table, and no bidi-direction conflict was found. Whole-script spoofs that
use no bundled-table confusable, and confusables outside the bundled table,
are out of scope (see the Threat Model) and report not-suspicious. Base
allow/deny decisions on the granular findings plus your own policy: a
detector can attest the presence of a problem, never the absence of all
problems.
Examples:
>>> suspicious, analysis = is_suspicious_hostname("google.com")
>>> suspicious
False
>>> analysis.canonical
'google.com'
HostnameAnalysis
The second element of the tuple returned by is_suspicious_hostname():
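| Attribute | Type | Description |
|---|---|---|
| suspicious | bool | True if a problem was detected (mixed-script, a bundled-table confusable, or a bidi-direction conflict) |
| scripts | list[str] | Unicode scripts found across all labels |
| mixed_script | bool | True if any single label contains more than one script |
| has_confusables | bool | True if confusable homoglyphs found |
| bidi_conflict | bool | True if the decoded hostname mixes strong left-to-right and strong right-to-left characters; folded into suspicious |
| cross_label_script | bool | True if the labels span more than one distinct script; not folded into suspicious |
| label_scripts | list[list[str]] | Per-label resolved scripts, left to right |
| canonical | str | Latin-normalized form of the hostname |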
from disarm import is_suspicious_hostname
suspicious, analysis = is_suspicious_hostname("google.com")
# suspicious = False, analysis.canonical = "google.com"
suspicious, analysis = is_suspicious_hostname("gооgle.com") # Cyrillic о's
# suspicious = True, analysis.mixed_script = True, analysis.has_confusables = True
A hostname is flagged suspicious if any single label is mixed-script (draws on more than one Unicode script, excluding Common/Inherited) or contains confusable homoglyphs, or if the hostname has a bidi-direction conflict. A not-suspicious result is not a safety guarantee: whole-script spoofs with no bundled-table confusable, and confusables outside the bundled table, are out of scope (see Threat Model). Branch on the granular fields plus your own policy.
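Branching on the granular fields might look roughly like the sketch below, which allows mixed Latin and CJK labels while still blocking confusables and bidi conflicts. Everything here is an assumption for illustration: `AnalysisStandIn` is a hypothetical stand-in for the documented HostnameAnalysis fields so the example runs without the library, `should_block` is not part of disarm, and the script names in `ALLOWED_MIX` are guesses at how `scripts` spells them.

```python
from dataclasses import dataclass


@dataclass
class AnalysisStandIn:
    """Hypothetical stand-in mirroring some documented HostnameAnalysis fields."""
    suspicious: bool
    scripts: list[str]
    mixed_script: bool
    has_confusables: bool
    bidi_conflict: bool = False


# Assumption: script names as they might appear in `scripts`.
ALLOWED_MIX = {"Latin", "Han", "Hiragana", "Katakana"}


def should_block(analysis) -> bool:
    """Permissive policy: tolerate Latin+CJK mixing, block everything else flagged."""
    if analysis.has_confusables or analysis.bidi_conflict:
        return True
    if analysis.mixed_script:
        # Block only if a script outside the allowed combination is present.
        return not set(analysis.scripts) <= ALLOWED_MIX
    return False


latin_cjk = AnalysisStandIn(True, ["Latin", "Han"], True, False)
spoof = AnalysisStandIn(True, ["Latin", "Cyrillic"], True, True)
print(should_block(latin_cjk), should_block(spoof))
```

With the real library, the second element of the tuple returned by is_suspicious_hostname() would take the place of the stand-in object. Note that `scripts` spans all labels, so a stricter policy might inspect `label_scripts` instead.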