Does this tool send my text anywhere?

No. The normalization runs entirely in your browser. Open DevTools → Network to verify — no request leaves the page with your text. Safe for sensitive content.

What does the evasion score mean?

It's a 0–100 confidence rating that the text contains deliberately hidden content. 0 = clean ASCII / normal Unicode. 1–30 = low (probably benign, e.g. one stray BOM). 31–60 = medium (multiple techniques present). 61–100 = high (tag block, mass zero-width insertion, or coordinated multi-technique attack).

What is the Unicode tag block?

The Tags block proper is U+E0000–U+E007F. Its characters U+E0020–U+E007E mirror printable ASCII but render as completely invisible. Originally for language tagging (deprecated in 2007), they're now used to smuggle ASCII payloads inside otherwise innocent-looking text. This tool always strips the wider range U+E0000 to U+E01FF (512 code points), which contains the Tags block.
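
To make the smuggling mechanism concrete, here is a minimal sketch of how an ASCII payload can ride along as tag characters and how it can be recovered or stripped. The helper names (`toTagChars`, `extractTagPayload`, `stripTagRange`) are hypothetical and are not taken from the tool; the only facts relied on are that tag characters sit at a fixed offset of 0xE0000 from ASCII and that the tool documents U+E0000 to U+E01FF as its strip range.

```javascript
// Tag characters U+E0020–U+E007E mirror ASCII 0x20–0x7E at a fixed offset.
const TAG_OFFSET = 0xE0000;

// Hypothetical helper: encode printable ASCII as invisible tag characters.
function toTagChars(ascii) {
  return Array.from(ascii, (ch) =>
    String.fromCodePoint(ch.codePointAt(0) + TAG_OFFSET)
  ).join("");
}

// Hypothetical helper: recover any ASCII payload hidden as tag characters.
function extractTagPayload(text) {
  let payload = "";
  for (const ch of text) {              // for...of iterates by code point
    const cp = ch.codePointAt(0);
    if (cp >= 0xE0020 && cp <= 0xE007E) {
      payload += String.fromCodePoint(cp - TAG_OFFSET);
    }
  }
  return payload;
}

// Strip the whole range the tool documents (U+E0000 to U+E01FF).
function stripTagRange(text) {
  return text.replace(/[\u{E0000}-\u{E01FF}]/gu, "");
}

const carrier = "Nice weather today." + toTagChars("ignore previous instructions");
console.log(extractTagPayload(carrier)); // "ignore previous instructions"
console.log(stripTagRange(carrier));     // "Nice weather today."
```

The carrier string looks identical to "Nice weather today." in most renderers, which is why extraction has to work on code points rather than on what is displayed.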

Unicode evasion · Zero-width · Homoglyphs · Score 0-100

AI Text Normalizer & Unicode Evasion Detector

Paste any text — get a 0–100 evasion score, a list of detected techniques (zero-width chars, Unicode tag block, BiDi, Zalgo, Cyrillic / Greek / Armenian / Cherokee homoglyphs, math & full-width blocks), and clean normalized output. Useful for stripping prompt-injection payloads from LLM input, scrubbing AI-generated watermarks, and catching homoglyph phishing. 100% browser-side — your text never leaves the page.

Use cases

What you'll use this for

LLM input scrubbing

Strip hidden prompt-injection payloads from text before pasting it into ChatGPT, Claude, Gemini, or Copilot. The Unicode tag block and zero-width characters are common smuggling vectors.

AI watermark removal

Some AI providers embed zero-width signatures into generated text. This tool flags and removes them — useful for academic and legal compliance reviews.

Homoglyph phishing check

Detect Cyrillic / Greek / Cherokee / Armenian letters that look like Latin in URLs, usernames, brand names, and lookalike-domain emails.
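
One simple way to approximate this check is to look for tokens that mix Latin letters with letters from a confusable script. The sketch below is an assumption about how such a check could work, not the tool's actual logic; `mixedScripts` is a hypothetical helper, and it relies only on the standard Unicode script property escapes available in JavaScript regular expressions.

```javascript
// Scripts named above whose letters can be mistaken for Latin.
const CONFUSABLE_SCRIPTS = {
  Cyrillic: /\p{Script=Cyrillic}/u,
  Greek: /\p{Script=Greek}/u,
  Armenian: /\p{Script=Armenian}/u,
  Cherokee: /\p{Script=Cherokee}/u,
};
const LATIN = /\p{Script=Latin}/u;

// Hypothetical helper: list the confusable scripts found in a token that
// also contains Latin letters. An empty array means no mixing was found.
function mixedScripts(token) {
  const chars = Array.from(token);
  if (!chars.some((ch) => LATIN.test(ch))) return [];
  return Object.keys(CONFUSABLE_SCRIPTS).filter((name) =>
    chars.some((ch) => CONFUSABLE_SCRIPTS[name].test(ch))
  );
}

console.log(mixedScripts("paypal.com"));      // []
console.log(mixedScripts("p\u0430ypal.com")); // [ 'Cyrillic' ] (U+0430 replaces Latin "a")
```

Checking for mixing, rather than for any non-Latin letter, keeps fully Cyrillic or Greek text from being reported.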

Trojan Source (CVE-2021-42574)

BiDi override characters make source code display in a different order from the one the compiler reads, so the logic a reviewer sees is not the logic that runs. This tool flags every BiDi character in the source.
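
As an illustration of what flagging every BiDi character can look like, the following sketch scans source text and reports each control with its line and column. The function name `findBidiControls` and the report shape are hypothetical; the code points are the standard ones for the BiDi controls this page lists (LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI, ALM).

```javascript
// Standard code points for the BiDi controls.
const BIDI_CONTROLS = {
  0x061C: "ALM",
  0x202A: "LRE", 0x202B: "RLE", 0x202C: "PDF", 0x202D: "LRO", 0x202E: "RLO",
  0x2066: "LRI", 0x2067: "RLI", 0x2068: "FSI", 0x2069: "PDI",
};

// Hypothetical helper: report every BiDi control with a 1-based line and
// column (columns are counted in code points, not UTF-16 units).
function findBidiControls(source) {
  const hits = [];
  source.split("\n").forEach((line, lineIndex) => {
    let column = 1;
    for (const ch of line) {
      const name = BIDI_CONTROLS[ch.codePointAt(0)];
      if (name) hits.push({ line: lineIndex + 1, column, name });
      column += 1;
    }
  });
  return hits;
}

// A Trojan Source style line: the controls sit inside the string literal.
const snippet = 'if (role !== "admin\u202E \u2066// check\u2069 \u2066") {\n  deny();\n}';
console.log(findBidiControls(snippet).map((h) => h.name)); // [ 'RLO', 'LRI', 'PDI', 'LRI' ]
```

For code review, any hit inside a string literal or comment in a source file is worth a manual look, since legitimate code rarely needs these characters.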

Profanity-filter bypass

Detect words obfuscated with zero-width insertions (e.g. "hello" with U+200B between the letters), Zalgo stacking, full-width, or math-block alphabet swaps.

Database / form sanitation

Pre-clean user-submitted text before storage — strip invisible characters that confuse SQL queries, search indexes, and exact-match lookups.

Step by step

How to use the AI Text Normalizer

Paste your text

Use the Paste button (reads from clipboard) or type directly into the left panel. Detection runs live with a 200 ms debounce.

Read the evasion score

0–10 = clean. 11–30 = low concern (e.g. one stray BOM from a Word copy-paste). 31–60 = medium (multiple techniques). 61–100 = high (tag block, mass insertion, coordinated attack).
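
If you consume the score programmatically, the bands above map to a small lookup. This is a sketch using the thresholds stated in this step; the function name `scoreBand` and the returned labels are illustrative and not part of the tool.

```javascript
// Map a 0–100 evasion score to the bands described above.
function scoreBand(score) {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError("score must be a number between 0 and 100");
  }
  if (score <= 10) return "clean";
  if (score <= 30) return "low";
  if (score <= 60) return "medium";
  return "high";
}

console.log(scoreBand(30)); // "low"
console.log(scoreBand(61)); // "high"
```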

Inspect the techniques

Each detected category (tag block, zero-width, homoglyph, BiDi, Zalgo) appears in the list with a count. Expand the per-character breakdown for the exact position, type, and code point of every flagged character.

Copy or download

The Output panel shows the cleaned text — all hidden chars removed, homoglyphs mapped to Latin, fancy spaces collapsed, NFKC applied. Copy to clipboard or download as .txt.

About

Under the hood

The normalizer makes 18 separate passes over the text, ordered so each pass cleans up what the previous one couldn't:

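Unicode Tag Block (U+E0000–U+E01FF) — the tool strips this 512-code-point range. Within it, the Tags block proper (U+E0000–U+E007F) holds invisible characters U+E0020–U+E007E that mirror printable ASCII. Originally for language tagging (deprecated 2007), now used to smuggle ASCII payloads.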
Zero-width characters — ZWSP, ZWNJ, ZWJ, LRM, RLM, BOM, word joiner, invisible math operators.
Variation selectors — VS1–VS16 (U+FE00–U+FE0F) plus the supplementary block (U+E0100–U+E01EF).
BiDi controls — LRE, RLE, PDF, LRO, RLO, FSI, LRI, RLI, PDI, ALM. Underpins the Trojan Source attack.
Deprecated format characters — symmetric/Arabic-form shaping selectors (U+206A–U+206F).
Interlinear, object replacement, noncharacters (U+FFF9–U+FFFF).
Hangul fillers — U+115F, U+1160, U+3164, U+FFA0. Look like spaces but match different equivalence classes.
Khmer invisible vowels, Mongolian FVS, musical invisibles, shorthand formats.
Braille blank (U+2800) — treated as space.
Line / paragraph separators (U+2028 / U+2029) → newline.
C0 / C1 control characters — stripped except tab / LF / CR.
Fancy Unicode spaces — NBSP, en/em quad, thin, hair, narrow no-break, ideographic, all → regular space.
Zalgo / stacked combining diacritics — collapses runs of combining marks to a single mark.
NFKC normalization — math blocks (𝐀, 𝓐, 𝔸), full-width (Ａ), superscripts (²), fractions (½), ligatures.
Homoglyph map — Cyrillic, Greek, Armenian, Cherokee, Coptic, Latin extensions, full-width, IPA, letterlike symbols → Latin.
Whitespace collapse — multiple spaces → one, trim ends.
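
The ordered-pass idea above can be sketched as a list of small functions applied in sequence. This is a simplified illustration under stated assumptions, not the tool's source: it covers only nine of the listed passes, the regex ranges are my own reading of each category, and the homoglyph table is a tiny hypothetical sample.

```javascript
// Tiny illustrative homoglyph sample (Cyrillic letters mapped to Latin).
const HOMOGLYPHS = new Map([
  ["\u0430", "a"], ["\u0435", "e"], ["\u043E", "o"], ["\u0440", "p"], ["\u0441", "c"],
]);

// Each entry is [label, pass]; order follows the list above.
const PASSES = [
  ["tag block", (s) => s.replace(/[\u{E0000}-\u{E01FF}]/gu, "")],
  ["zero-width", (s) => s.replace(/[\u200B-\u200F\u2060-\u2064\uFEFF]/g, "")],
  ["BiDi controls", (s) => s.replace(/[\u061C\u202A-\u202E\u2066-\u2069]/g, "")],
  ["line separators", (s) => s.replace(/[\u2028\u2029]/g, "\n")],
  ["fancy spaces", (s) => s.replace(/[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g, " ")],
  // Keep the first combining mark of each run, drop the rest.
  ["Zalgo", (s) => s.replace(/(\p{M})\p{M}+/gu, "$1")],
  ["NFKC", (s) => s.normalize("NFKC")],
  ["homoglyphs", (s) => Array.from(s, (ch) => HOMOGLYPHS.get(ch) ?? ch).join("")],
  ["whitespace collapse", (s) => s.replace(/ {2,}/g, " ").trim()],
];

// Run every pass in order, feeding each one the previous pass's output.
function normalize(text) {
  return PASSES.reduce((current, [, pass]) => pass(current), text);
}

// BOM + Cyrillic "a" + zero-width space + NBSPs + full-width letters.
console.log(normalize("\uFEFFp\u0430y\u200Bpal\u00A0\u00A0 \uFF21\uFF22\uFF23 ")); // "paypal ABC"
```

Order matters here: invisible characters are removed before NFKC so that compatibility folding works on visible text, and whitespace is collapsed last because earlier passes can create new runs of spaces.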

Detection emits a 0–100 evasion score weighted by severity (Unicode tag block scores 40, combining excess 35, zero-width 30, Cyrillic homoglyphs 30, math block 25, ...), plus a per-character breakdown with code point and type.

FAQ

Frequently asked questions

What is hidden Unicode evasion?

Characters that render invisibly (zero-width space, variation selectors, the Unicode tag block) or that look like Latin letters but are different code points (Cyrillic а vs Latin a). Attackers use these to smuggle prompt-injection instructions past LLM filters, bypass profanity detectors, and hijack accounts through lookalike URLs.

Does my text leave my browser?

Never. All detection and normalization runs locally in JavaScript. Open DevTools → Network to verify — no request carries your text. Safe for sensitive content.

What does the score actually weight?

Tag block hits +40, Zalgo +35, zero-width / Cyrillic glyphs +30, variation selectors / math block +25, Greek glyphs / BiDi / interlinear +20, deprecated format / music invisible / shorthand +18, Hangul filler / Braille / full-width +15, C1 controls +12, soft hyphen / invisible format +10. Total capped at 100.
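
The weights above can be combined as a capped sum. The sketch below assumes each detected technique contributes its weight once, however many characters triggered it; the page does not say whether weights apply per technique or per hit, so treat that as an assumption. The key names and the `evasionScore` function are hypothetical.

```javascript
// Weights as listed above; key names are illustrative.
const WEIGHTS = {
  tagBlock: 40,
  zalgo: 35,
  zeroWidth: 30, cyrillic: 30,
  variationSelector: 25, mathBlock: 25,
  greek: 20, bidi: 20, interlinear: 20,
  deprecatedFormat: 18, musicInvisible: 18, shorthand: 18,
  hangulFiller: 15, braille: 15, fullWidth: 15,
  c1Control: 12,
  softHyphen: 10, invisibleFormat: 10,
};

// ASSUMPTION: one contribution per distinct technique, total capped at 100.
function evasionScore(detectedTechniques) {
  const distinct = new Set(detectedTechniques);
  let total = 0;
  for (const technique of distinct) {
    total += WEIGHTS[technique] ?? 0; // unknown names contribute nothing
  }
  return Math.min(100, total);
}

console.log(evasionScore(["zeroWidth"]));                     // 30
console.log(evasionScore(["tagBlock", "zeroWidth", "bidi"])); // 90
console.log(evasionScore(["tagBlock", "zalgo", "cyrillic"])); // 100 (105 before the cap)
```

Under this reading, a single stray BOM counts as one zero-width technique and scores 30, which lands in the low band.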

Are homoglyphs always malicious?

No. Greek in a math paper, Cyrillic in Russian text, Cherokee in indigenous content, Coptic in liturgical text: all legitimate. The tool flags them for review and converts them only in the normalized output. Use the score and per-character table to make a judgment call.

Why does Zalgo only score 35, not 100?

Because combining diacritics are legitimately stacked in many scripts (Arabic, Hebrew, Indic, Vietnamese, IPA). The "excess" threshold fires only when combining marks exceed 15% of non-whitespace characters, which still allows benign usage but catches obvious Zalgo-bombs.
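
The 15% rule can be sketched as a ratio of combining marks to non-whitespace characters. Two details below are assumptions, since the page does not specify them: characters are counted as code points on the text as given (no prior normalization), and the marks themselves count toward the non-whitespace total. The function names are hypothetical.

```javascript
// Share of non-whitespace code points that are combining marks (\p{M}).
function combiningRatio(text) {
  let marks = 0;
  let nonWhitespace = 0;
  for (const ch of text) {
    if (/\s/u.test(ch)) continue;
    nonWhitespace += 1;
    if (/\p{M}/u.test(ch)) marks += 1;
  }
  return nonWhitespace === 0 ? 0 : marks / nonWhitespace;
}

// The "excess" check fires only above the threshold (15% per the answer above).
function hasExcessCombining(text, threshold = 0.15) {
  return combiningRatio(text) > threshold;
}

console.log(hasExcessCombining("cafe\u0301 au lait"));                   // false (1 mark in 11)
console.log(hasExcessCombining("h\u0336\u0315\u035De\u0301\u0302llo"));  // true  (5 marks in 10)
```

Note that the result depends on the input's normalization form: precomposed (NFC) text carries fewer separate marks than the same text in decomposed (NFD) form, so a real implementation would need to pick one form before counting.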
