AI Text Normalizer & Unicode Evasion Detector

Paste any text — get a 0–100 evasion score, a list of detected techniques (zero-width chars, Unicode tag block, BiDi, Zalgo, Cyrillic / Greek / Armenian / Cherokee homoglyphs, math & full-width blocks), and clean normalized output. Useful for stripping prompt-injection payloads from LLM input, scrubbing AI-generated watermarks, and catching homoglyph phishing. 100% browser-side — your text never leaves the page.

    Use cases

    What you'll use this for

    LLM input scrubbing

    Strip hidden prompt-injection payloads from text before pasting into ChatGPT, Claude, Gemini, Copilot — Unicode tag block and zero-width chars are the most common smuggling vectors.
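To illustrate why the tag block works as a smuggling vector, here is a minimal sketch, not the tool's actual code. It relies on the fact that tag characters U+E0020–U+E007E shadow printable ASCII at a fixed offset of 0xE0000. The function names `revealTagPayload` and `stripTagBlock` are hypothetical.

```javascript
// Hypothetical helpers; names are illustrative only.
// Tag characters U+E0020–U+E007E shadow printable ASCII at an offset of 0xE0000.
const TAG_OFFSET = 0xE0000;

// Recover any ASCII text hidden in tag characters.
function revealTagPayload(text) {
  let hidden = "";
  for (const ch of text) { // for...of iterates by code point, not UTF-16 unit
    const cp = ch.codePointAt(0);
    if (cp >= 0xE0020 && cp <= 0xE007E) {
      hidden += String.fromCodePoint(cp - TAG_OFFSET);
    }
  }
  return hidden;
}

// Remove every code point in the Tags block (U+E0000–U+E007F).
function stripTagBlock(text) {
  return text.replace(/[\u{E0000}-\u{E007F}]/gu, "");
}

// Demo: visible text followed by an invisible "hi" encoded as tag characters.
const smuggled = "Summarize this." + String.fromCodePoint(0xE0068, 0xE0069);
console.log(revealTagPayload(smuggled)); // "hi"
console.log(stripTagBlock(smuggled));    // "Summarize this."
```

Revealing the payload before stripping it is useful for review: it shows what instruction an attacker tried to hide.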

    AI watermark removal

    Some AI providers embed zero-width signatures into generated text. This tool flags and removes them — useful for academic and legal compliance reviews.

    Homoglyph phishing check

    Detect Cyrillic / Greek / Cherokee / Armenian letters that look like Latin in URLs, usernames, brand names, and lookalike-domain emails.
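One common way to catch this kind of spoofing is a mixed-script check: a single word that combines Latin letters with letters from a lookalike script is suspicious. The sketch below uses Unicode script property escapes. It is an assumption about how such a check could look, not the tool's own implementation, and the helper names are hypothetical.

```javascript
// Illustrative sketch: flag a word that mixes Latin with a lookalike script.
const SCRIPT_TESTS = ["Latin", "Cyrillic", "Greek", "Armenian", "Cherokee"].map(
  (name) => [name, new RegExp(`\\p{Script=${name}}`, "u")]
);

// Return the scripts (from the list above) that appear in the word, in first-seen order.
function scriptsIn(word) {
  const found = new Set();
  for (const ch of word) {
    for (const [name, pattern] of SCRIPT_TESTS) {
      if (pattern.test(ch)) found.add(name);
    }
  }
  return [...found];
}

// A word is suspicious when Latin appears alongside at least one other script.
function isMixedScript(word) {
  const scripts = scriptsIn(word);
  return scripts.includes("Latin") && scripts.length > 1;
}

const spoofed = "p\u0430ypal.com"; // U+0430 is Cyrillic "а", not Latin "a"
console.log(scriptsIn(spoofed));          // [ 'Latin', 'Cyrillic' ]
console.log(isMixedScript(spoofed));      // true
console.log(isMixedScript("paypal.com")); // false
```

A word written entirely in Cyrillic passes this check, which is intended: pure non-Latin text is usually legitimate.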

    Trojan Source (CVE-2021-42574)

    BiDi override attacks reorder displayed code while keeping different logic underneath. This tool flags every BiDi character in the source.
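A rough sketch of such a scan, assuming a simple lookup table of the BiDi control code points. The function name `findBidiControls` and the output shape are hypothetical, and the reported index is a UTF-16 offset.

```javascript
// BiDi control characters by code point, with their standard abbreviations.
const BIDI_NAMES = new Map([
  [0x061C, "ALM"],
  [0x202A, "LRE"], [0x202B, "RLE"], [0x202C, "PDF"],
  [0x202D, "LRO"], [0x202E, "RLO"],
  [0x2066, "LRI"], [0x2067, "RLI"], [0x2068, "FSI"], [0x2069, "PDI"],
]);

// Return the position and name of every BiDi control in the source text.
function findBidiControls(source) {
  const hits = [];
  let index = 0; // UTF-16 offset into the string
  for (const ch of source) {
    const name = BIDI_NAMES.get(ch.codePointAt(0));
    if (name) hits.push({ index, name });
    index += ch.length;
  }
  return hits;
}

// A line in the style of the Trojan Source examples: the controls are
// invisible in most editors but change the displayed order of the text.
const line = 'if (role === "user\u202E \u2066// admin check\u2069 \u2066") {';
console.log(findBidiControls(line));
```

For source code, any hit is worth reviewing, since BiDi controls are rarely needed outside string literals and comments in right-to-left languages.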

    Profanity-filter bypass

    Detect words obfuscated with zero-width insertions (h​ello), Zalgo stacking, full-width, or math-block alphabet swaps.

    Database / form sanitation

    Pre-clean user-submitted text before storage — strip invisible characters that confuse SQL queries, search indexes, and exact-match lookups.

    Step by step

    How to use the AI Text Normalizer

    1. Paste your text

    Use the Paste button (reads from clipboard) or type directly into the left panel. Detection runs live with a 200 ms debounce.

    2. Read the evasion score

    0–10 = clean. 11–30 = low concern (e.g. one stray BOM from a Word copy-paste). 31–60 = medium (multiple techniques). 61–100 = high (tag block, mass insertion, coordinated attack).
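The bands above map directly onto a small lookup. This is a sketch only: the function name `scoreBand` and the returned labels are illustrative.

```javascript
// Map a 0–100 evasion score to the bands described above.
function scoreBand(score) {
  if (score <= 10) return "clean";
  if (score <= 30) return "low";
  if (score <= 60) return "medium";
  return "high";
}

console.log(scoreBand(30)); // "low"
console.log(scoreBand(75)); // "high"
```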

    3. Inspect the techniques

    Each detected category — tag block, zero-width, homoglyph, BiDi, Zalgo — appears in the list with a count. Click the per-character breakdown for the exact position, type, and codepoint of every flagged character.
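A per-character breakdown of this kind could be produced roughly as follows. This sketch covers only four categories, counts positions in code points (an assumption, since the page does not say how positions are counted), and uses hypothetical function and field names.

```javascript
// Illustrative classifier covering only a few of the categories named above.
function classify(cp) {
  if (cp >= 0xE0000 && cp <= 0xE007F) return "tag block";
  if ([0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF].includes(cp)) return "zero-width";
  if ((cp >= 0x202A && cp <= 0x202E) || (cp >= 0x2066 && cp <= 0x2069)) return "BiDi";
  if (cp >= 0xFE00 && cp <= 0xFE0F) return "variation selector";
  return null; // not flagged by this sketch
}

// Build one row per flagged character: position, type, and code point.
function breakdown(text) {
  const rows = [];
  let position = 0; // counted in code points (an assumption)
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    const type = classify(cp);
    if (type) {
      const codePoint = "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
      rows.push({ position, type, codePoint });
    }
    position += 1;
  }
  return rows;
}

// "hello" with a zero-width space after the "h" and a trailing BOM.
console.table(breakdown("h\u200Bello\uFEFF"));
```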

    4. Copy or download

    The Output panel shows the cleaned text — all hidden chars removed, homoglyphs mapped to Latin, fancy spaces collapsed, NFKC applied. Copy to clipboard or download as .txt.

    About

    Under the hood

    The normalizer makes 18 separate passes over the text, ordered so each pass cleans up what the previous one couldn't. The main stages, in order:

    1. Unicode Tag Block (U+E0000–U+E007F): 128 invisible code points, of which U+E0020–U+E007E mirror printable ASCII. Originally for language tagging (deprecated in Unicode 5.1), now used to smuggle ASCII payloads.
    2. Zero-width characters: ZWSP, ZWNJ, ZWJ, LRM, RLM, BOM, word joiner, invisible math operators.
    3. Variation selectors: VS1–VS16 (U+FE00–U+FE0F) plus the supplementary block (U+E0100–U+E01EF).
    4. BiDi controls: LRE, RLE, PDF, LRO, RLO, FSI, LRI, RLI, PDI, ALM. These underpin the Trojan Source attack.
    5. Deprecated format characters: symmetric-swapping, Arabic form shaping, and digit-shape selectors (U+206A–U+206F).
    6. Interlinear annotation, object replacement, and noncharacters (U+FFF9–U+FFFF).
    7. Hangul fillers: U+115F, U+1160, U+3164, U+FFA0. They look like spaces but match different equivalence classes.
    8. Khmer invisible vowels, Mongolian FVS, musical invisibles, shorthand formats.
    9. Braille blank (U+2800): treated as a space.
    10. Line / paragraph separators (U+2028 / U+2029) → newline.
    11. C0 / C1 control characters: stripped except tab / LF / CR.
    12. Fancy Unicode spaces: NBSP, en/em quad, thin, hair, narrow no-break, ideographic, all → regular space.
    13. Zalgo / stacked combining diacritics: collapses runs of combining marks to a single mark.
    14. NFKC normalization: math blocks (𝐀, 𝓐, 𝔸), full-width (Ａ), superscripts (²), fractions (½), ligatures.
    15. Homoglyph map: Cyrillic, Greek, Armenian, Cherokee, Coptic, Latin extensions, full-width, IPA, letterlike symbols → Latin.
    16. Whitespace collapse: multiple spaces → one, trim ends.
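The ordered-pass idea can be sketched as an array of small string transforms applied in sequence. The version below implements only a subset of the stages listed above, with a deliberately tiny homoglyph table. It is an illustration under those assumptions, not the tool's source, and the names `PASSES`, `HOMOGLYPHS`, and `normalizeText` are hypothetical.

```javascript
// Tiny illustrative homoglyph table; a real map would be far larger.
const HOMOGLYPHS = new Map([
  ["\u0430", "a"], // Cyrillic а
  ["\u0435", "e"], // Cyrillic е
  ["\u043E", "o"], // Cyrillic о
  ["\u03BF", "o"], // Greek ο
]);

// Each pass is a pure string -> string function; order matters.
const PASSES = [
  (s) => s.replace(/[\u{E0000}-\u{E007F}]/gu, ""),                   // tag block
  (s) => s.replace(/[\u200B-\u200F\u2060-\u2064\uFEFF]/g, ""),       // zero-width
  (s) => s.replace(/[\u061C\u202A-\u202E\u2066-\u2069]/g, ""),       // BiDi controls
  (s) => s.replace(/[\u2028\u2029]/g, "\n"),                         // line/paragraph separators
  (s) => s.replace(/[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g, " "), // fancy spaces
  (s) => s.replace(/(\p{M})\p{M}+/gu, "$1"),                         // keep one mark per run
  (s) => s.normalize("NFKC"),                                        // compatibility forms
  (s) => [...s].map((ch) => HOMOGLYPHS.get(ch) || ch).join(""),      // homoglyphs -> Latin
  (s) => s.replace(/ {2,}/g, " ").trim(),                            // whitespace collapse
];

function normalizeText(text) {
  return PASSES.reduce((current, pass) => pass(current), text);
}

// Cyrillic "а", a zero-width space, two NBSPs, and full-width "ＡＩ".
console.log(normalizeText("p\u0430y\u200Bpal\u00A0\u00A0\uFF21\uFF29")); // "paypal AI"
```

Running NFKC before the homoglyph map means the map only has to handle characters that NFKC leaves alone, such as Cyrillic and Greek lookalikes, while full-width and math-block letters are already folded to ASCII.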

    Detection emits a 0–100 evasion score weighted by severity (Unicode tag block scores 40, combining excess 35, zero-width 30, Cyrillic homoglyphs 30, math block 25, ...), plus a per-character breakdown with code point and type.

    FAQ

    Frequently asked questions

    What are Unicode evasion characters?

    Characters that look invisible (zero-width space, variation selectors, the Unicode tag block) or that look like Latin letters but are different code points (Cyrillic а vs Latin a). Attackers use these to smuggle prompt-injection instructions past LLM filters, bypass profanity detectors, and steal accounts via homoglyph-confusable URLs.

    Is my text ever sent to a server?

    Never. All detection and normalization runs locally in JavaScript. Open DevTools → Network to verify that no request carries your text. Safe for sensitive content.

    How is the evasion score calculated?

    Each detected technique adds a weight: tag block +40, Zalgo +35, zero-width / Cyrillic glyphs +30, variation selectors / math block +25, Greek glyphs / BiDi / interlinear +20, deprecated format / music invisible / shorthand +18, Hangul filler / Braille / full-width +15, C1 controls +12, soft hyphen / invisible format +10. The total is capped at 100.
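As a worked sketch of that arithmetic, the function below sums the weights and caps the total. It assumes each technique counts once no matter how many characters it matched, which the page does not state explicitly. The key names and the function name `evasionScore` are hypothetical, and only some of the weights are included.

```javascript
// Subset of the weights quoted above; key names are illustrative.
const WEIGHTS = {
  tagBlock: 40,
  zalgo: 35,
  zeroWidth: 30,
  cyrillic: 30,
  variationSelector: 25,
  mathBlock: 25,
  greek: 20,
  bidi: 20,
  hangulFiller: 15,
  fullWidth: 15,
  c1Control: 12,
  softHyphen: 10,
};

// Sum the weight of each distinct technique, capped at 100.
// Assumption: a technique counts once regardless of how many hits it had.
function evasionScore(techniques) {
  let total = 0;
  for (const name of new Set(techniques)) {
    total += WEIGHTS[name] || 0;
  }
  return Math.min(total, 100);
}

console.log(evasionScore(["zeroWidth"]));                      // 30
console.log(evasionScore(["bidi", "greek"]));                  // 40
console.log(evasionScore(["tagBlock", "zalgo", "zeroWidth"])); // 100 (105 before the cap)
```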

    Is every non-Latin character treated as an attack?

    No. Greek in a math paper, Cyrillic in Russian text, Cherokee in indigenous content, and Coptic in liturgical text are all legitimate. The tool flags them for review and only converts them on normalization. Use the score and per-character table to make a judgment call.

    Why isn't every combining mark flagged as Zalgo?

    Because combining diacritics are legitimately stacked in many scripts (Arabic, Hebrew, Indic, Vietnamese, IPA). The "excess" threshold fires only when combining marks exceed 15% of non-whitespace characters, which still allows benign usage but catches obvious Zalgo-bombs.
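The threshold check can be sketched as a ratio of combining marks to non-whitespace code points. Whether the marks themselves count in the denominator is not stated on the page, so this sketch assumes they do. The function names are hypothetical.

```javascript
// Fraction of non-whitespace code points that are combining marks.
// Assumption: the marks themselves are included in the denominator.
function combiningRatio(text) {
  let marks = 0;
  let nonWhitespace = 0;
  for (const ch of text) {
    if (/\s/u.test(ch)) continue;
    nonWhitespace += 1;
    if (/\p{M}/u.test(ch)) marks += 1;
  }
  return nonWhitespace === 0 ? 0 : marks / nonWhitespace;
}

// Fire only when marks exceed the 15% threshold described above.
function hasCombiningExcess(text, threshold = 0.15) {
  return combiningRatio(text) > threshold;
}

// One decomposed accent in ordinary text: 1 mark out of 11 characters.
console.log(hasCombiningExcess("cafe\u0301 au lait"));       // false
// Four stacked marks on a two-letter word: 4 marks out of 6 characters.
console.log(hasCombiningExcess("h\u0301\u0302\u0303\u0304i")); // true
```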
