AI Text Normalizer & Unicode Evasion Detector
Paste any text — get a 0–100 evasion score, a list of detected techniques (zero-width chars, Unicode tag block, BiDi, Zalgo, Cyrillic / Greek / Armenian / Cherokee homoglyphs, math & full-width blocks), and clean normalized output. Useful for stripping prompt-injection payloads from LLM input, scrubbing AI-generated watermarks, and catching homoglyph phishing. 100% browser-side — your text never leaves the page.
Techniques detected
Per-character breakdown (0) Expand
| # | Position | Type | Code point | Char |
|---|
What you'll use this for
LLM input scrubbing
Strip hidden prompt-injection payloads from text before pasting into ChatGPT, Claude, Gemini, Copilot — Unicode tag block and zero-width chars are the most common smuggling vectors.
AI watermark removal
Some AI providers embed zero-width signatures into generated text. This tool flags and removes them — useful for academic and legal compliance reviews.
Homoglyph phishing check
Detect Cyrillic / Greek / Cherokee / Armenian letters that look like Latin in URLs, usernames, brand names, and lookalike-domain emails.
Trojan Source (CVE-2021-42574)
BiDi override attacks reorder displayed code while keeping different logic underneath. This tool flags every BiDi character in the source.
Profanity-filter bypass
Detect words obfuscated with zero-width insertions (hello), Zalgo stacking, full-width, or math-block alphabet swaps.
Database / form sanitation
Pre-clean user-submitted text before storage — strip invisible characters that confuse SQL queries, search indexes, and exact-match lookups.
How to use the AI Text Normalizer
Paste your text
Use the Paste button (reads from clipboard) or type directly into the left panel. Detection runs live with a 200 ms debounce.
Read the evasion score
0–10 = clean. 11–30 = low concern (e.g. one stray BOM from a Word copy-paste). 31–60 = medium (multiple techniques). 61–100 = high (tag block, mass insertion, coordinated attack).
Inspect the techniques
Each detected category — tag block, zero-width, homoglyph, BiDi, Zalgo — appears in the list with a count. Click the per-character breakdown for the exact position, type, and codepoint of every flagged character.
Copy or download
The Output panel shows the cleaned text — all hidden chars removed, homoglyphs mapped to Latin, fancy spaces collapsed, NFKC applied. Copy to clipboard or download as .txt.
Under the hood
The normalizer makes 18 separate passes over the text, ordered so each pass cleans up what the previous one couldn't:
- Unicode Tag Block (U+E0000–U+E01FF) — 512 invisible code points that mirror ASCII. Originally for language tagging (deprecated 2007), now used to smuggle ASCII payloads.
- Zero-width characters — ZWSP, ZWNJ, ZWJ, LRM, RLM, BOM, word joiner, invisible math operators.
- Variation selectors — VS1–VS16 plus the supplementary block (E0100–E01EF).
- BiDi controls — LRE, RLE, PDF, LRO, RLO, FSI, LRI, RLI, PDI, ALM. Underpins the Trojan Source attack.
- Deprecated format characters — symmetric/Arabic-form shaping selectors (U+206A–U+206F).
- Interlinear, object replacement, noncharacters (U+FFF9–U+FFFF).
- Hangul fillers — U+115F, U+1160, U+3164, U+FFA0. Look like spaces but match different equivalence classes.
- Khmer invisible vowels, Mongolian FVS, musical invisibles, shorthand formats.
- Braille blank (U+2800) — treated as space.
- Line / paragraph separators (U+2028 / U+2029) → newline.
- C0 / C1 control characters — stripped except tab / LF / CR.
- Fancy Unicode spaces — NBSP, en/em quad, thin, hair, narrow no-break, ideographic, all → regular space.
- Zalgo / stacked combining diacritics — collapses runs of combining marks to a single mark.
- NFKC normalization — math blocks (𝐀, 𝓐, 𝔸), full-width (A), superscripts (²), fractions (½), ligatures.
- Homoglyph map — Cyrillic, Greek, Armenian, Cherokee, Coptic, Latin extensions, full-width, IPA, letterlike symbols → Latin.
- Whitespace collapse — multiple spaces → one, trim ends.
Detection emits a 0–100 evasion score weighted by severity (Unicode tag block scores 40, combining excess 35, zero-width 30, homoglyphs 30, math block 25, ...), plus a per-character breakdown with codepoint and type.
Frequently asked questions
Characters that look invisible (zero-width space, variation selectors, the Unicode tag block) or that look like Latin letters but are different code points (Cyrillic а vs Latin a). Attackers use these to smuggle prompt-injection instructions past LLM filters, bypass profanity detectors, and steal accounts via homoglyph-confusable URLs.
Never. All detection and normalization runs locally in JavaScript. Open DevTools → Network to verify — no request carries your text. Safe for sensitive content.
Tag block hits +40, Zalgo +35, zero-width / Cyrillic glyphs +30, variation selectors / math block +25, Greek glyphs / BiDi / interlinear +20, deprecated format / music invisible / shorthand +18, Hangul filler / Braille / full-width +15, C1 controls +12, soft hyphen / invisible format +10. Total capped at 100.
No. Greek in a math paper, Cyrillic in Russian text, Cherokee in indigenous content, Coptic in liturgical text — all legitimate. The tool flags them for review and only converts on normalization. Use the score and per-character table to make a judgment call.
Because combining diacritics are legitimately stacked in many scripts (Arabic, Hebrew, Indic, Vietnamese, IPA). The "excess" threshold fires only when combining marks exceed 15% of non-whitespace characters, which still allows benign usage but catches obvious Zalgo-bombs.