Anomaly Detection

has_anomalies / inspect_anomalies flag text that carries out-of-place characters disguising a real word — a cross-script homoglyph, a bidi-direction conflict, leet, a single-letter segmentation, a zero-width / bidi control, or zalgo. Like is_suspicious_hostname, the detector reports a technical fact and leaves the malicious-or-not judgement to the caller — it never claims intent.

Defensive publication

This detector is described publicly as prior art so the method stays freely usable and cannot be patented by others. See issue #389 for the dated record.

Detected classes

Six branches fire, in order; the first four need no lexicon and are script-agnostic, so they port across writing systems.

Kind Fires on Spared (false-positive guards)
invisible a zero-width / formatting codepoint inside a Latin word emoji ZWJ sequences; ZWJ/ZWNJ joiners in Indic & Arabic; soft hyphen
bidi an LRO/RLO override anywhere, or an isolate inside a majority-Latin token (Trojan Source) bare directional marks; LRE..PDF embeddings (RTL text, hashtags)
zalgo excessive stacked combining marks ordinary accents
mixed_script Latin combined with Cyrillic or Greek in one token CJK / Thai / kaomoji; legitimate unit symbols (, µF)
bidi_mixed one token mixes strong left-to-right and strong right-to-left letters (varonisו), which can visually reorder ("BiDi Swap") — no U+202x override (that is bidi) single-direction text (all-LTR or all-RTL); digits are neutral
leet every out-of-place char substitutes a letter and the result is a common word (fr33free) a literal number that maps to no letter (win32, Power5, 21st, 3pm)
segmentation dense separators splitting single letters into a real word (v.i.a.g.r.a) multi-letter parts (6-foot-6); a lone separator (e-mail)

The leet and segmentation branches take a caller-supplied lexicon — a set of common words for the language being protected. The defining rule: a real leet attack substitutes a letter, whereas win32 carries a literal number that maps to no letter, so requiring every out-of-place character to be a real letter-substitution that yields a common word rejects the literals.

Usage

from disarm import has_anomalies, inspect_anomalies

words = {"free", "paypal"}

# leet: "fr33" decodes to "free"
assert has_anomalies("get fr33 now", words)
# a literal number is not a substitution, so "win32" is spared
assert not has_anomalies("the win32 api", words)

report = inspect_anomalies("log in to paypаl", {"paypal"})  # Cyrillic а
assert report.anomalous
assert report.kinds == ["mixed_script"]
assert report.findings[0].kind == "mixed_script"
use disarm::api::{self, AnomalyKind};
use std::collections::HashSet;

let words: HashSet<String> = ["free", "paypal"].iter().map(|s| s.to_string()).collect();

assert!(api::has_anomalies("get fr33 now", &words));
assert!(!api::has_anomalies("the win32 api", &words));

let report = api::inspect_anomalies("log in to paypаl", &words);
assert!(report.anomalous);
assert_eq!(report.kinds, vec![AnomalyKind::MixedScript]);
require "disarm"

# the lexicon is a common-word collection (Array or Set)
Disarm.has_anomalies?("get fr33 now", ["free"])  # => true
Disarm.has_anomalies?("the win32 api", ["free"]) # => false

Disarm.inspect_anomalies("log in to paypаl", ["paypal"])[:kinds] # => ["mixed_script"]
import { hasAnomalies, inspectAnomalies } from 'disarm'

// the lexicon is a Set or array of common words
hasAnomalies('get fr33 now', ['free'])  // => true
hasAnomalies('the win32 api', ['free']) // => false

inspectAnomalies('log in to paypаl', ['paypal']).kinds // => ['mixed_script']

The report

inspect_anomalies returns a report with anomalous, kinds (the anomaly kinds that fired, in first-appearance order), findings, and reason (the first finding's plain-language sentence). Each finding carries the offending kind, token, byte start/end span, detail (the codepoint, the scripts, or the decoded word), and its own reason.

A False result is not a safety guarantee — it means only that none of the six branches fired on the lexicon you supplied. Compose this with your own policy, as you would the hostname analysis.