HCODX/Text Similarity
100% browser-based · Levenshtein / Jaccard / cosine

Text Similarity

Compute the similarity between two texts with three metrics: Levenshtein ratio (character-level edit distance), Jaccard (shared word set), and cosine (word-frequency vector overlap). All three run locally in your browser.

Example

Two texts, three scores

Each metric measures a different kind of similarity. Use the one that matches your question — typos? content overlap? weighted vocabulary?

Text A
The quick brown fox jumps over the lazy dog.
Text B
A quick brown dog runs over the lazy fox.
Use cases

What you'll use this for

Wherever you need a quantitative answer to "how close are these two strings?"

Dedupe similar entries

Find near-duplicates in a list of titles, snippets, or descriptions.

Plagiarism detection

Spot suspiciously similar essays or code comments.

Prompt drift tracking

Detect when an LLM prompt has been edited away from its original version.

Version compare

Quick numeric answer to "how different is v2 from v1?"

Step by step

How to compare two texts

1. Paste two texts

Text A on the left, Text B on the right.

2. Read three scores

Levenshtein (char-level), Jaccard (word-set), and cosine (word-frequency vector). Higher = more similar.

3. Pick the metric that fits

Typo-level differences: Levenshtein. Word-set overlap: Jaccard. Word-frequency balance: cosine.

4. Iterate

Edit either text and watch the scores update live.

FAQ

Frequently asked questions

It depends. Levenshtein is best for typo-level differences — it counts the edit operations. Jaccard captures content overlap regardless of word order or repetition. Cosine weights repeated words and works well for longer texts where vocabulary balance matters.

Yes. Runs entirely in your browser, no signup.

Jaccard and cosine tokenize with a basic word regex (\w+) and lowercase before comparing. Levenshtein operates on raw characters — case-sensitive, whitespace-sensitive.
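
The tokenization described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual code: the function name `tokenize` is hypothetical, and Python's `\w` also matches non-ASCII letters, which may differ from what a browser regex engine does.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text, then extract runs of word characters (\\w+).

    Hypothetical helper for illustration; punctuation and whitespace
    act as separators and are dropped.
    """
    return re.findall(r"\w+", text.lower())


print(tokenize("The quick brown fox jumps over the lazy dog."))
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```

Note that an apostrophe is not a word character, so a contraction such as "don't" splits into two tokens under this regex.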

Jaccard and cosine are case-insensitive. Levenshtein is case-sensitive — lowercase both inputs first if you want a case-insensitive edit distance.

Similarity is symmetric — A↔B equals B↔A. There's no "reverse" because both texts are first-class inputs. Use the swap button to flip them.

About

About text similarity

"Text similarity" isn't one thing — it's a family of measures, each appropriate for a different question.

Levenshtein ratio

  • Operates on raw characters. Counts insertions, deletions, and substitutions needed to transform A into B.
  • Reported as 1 - (edits / max-length), so 100% = identical, 0% = nothing in common.
  • Best for short strings (titles, IDs, file names, single sentences).
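
As a rough illustration of the bullets above, the ratio can be computed with the standard dynamic-programming edit distance. The function names are hypothetical, and treating two empty strings as identical is an assumption rather than documented behavior of the tool.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    # Only the previous row of the dynamic-programming table is kept.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """1 - (edits / max-length), as described above."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))            # 3
print(levenshtein_ratio("Hello", "hello"))                  # 0.8
print(levenshtein_ratio("Hello".lower(), "hello".lower()))  # 1.0
```

The last two lines show the case sensitivity in practice: one differing capital letter costs one edit, and lowercasing both inputs first removes it.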

Jaccard similarity

  • Tokenizes both texts into word sets, then computes |A∩B| / |A∪B|.
  • Word order and repetition don't matter — only whether a word is present.
  • Best for content overlap on medium-length texts (paragraphs, descriptions).
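
A minimal sketch of the set computation above, using the example texts from this page. The function name is hypothetical, the `\w+` tokenizer follows the FAQ, and returning 1.0 for two empty texts is an assumption.

```python
import re


def jaccard_similarity(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over lowercase word sets."""
    set_a = set(re.findall(r"\w+", a.lower()))
    set_b = set(re.findall(r"\w+", b.lower()))
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty texts count as identical
    return len(set_a & set_b) / len(union)


text_a = "The quick brown fox jumps over the lazy dog."
text_b = "A quick brown dog runs over the lazy fox."
print(jaccard_similarity(text_a, text_b))  # 0.7

# Order and repetition are ignored: both texts reduce to {"the", "cat"}.
print(jaccard_similarity("the the the cat", "cat the"))  # 1.0
```

For the example pair, 7 words are shared out of 10 distinct words overall, which gives 0.7.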

Cosine similarity

  • Tokenizes both texts into term-frequency vectors, then computes the cosine of the angle between them.
  • Repeated words matter — texts with similar vocabulary balance score higher.
  • Best for longer texts where vocabulary distribution is meaningful.
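
The term-frequency comparison above might look like the following sketch. The function name is hypothetical, the `\w+` tokenizer follows the FAQ, and returning 0.0 when either text has no words is an assumption.

```python
import math
import re
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between two word-frequency vectors."""
    freq_a = Counter(re.findall(r"\w+", a.lower()))
    freq_b = Counter(re.findall(r"\w+", b.lower()))
    # Dot product over the words of A; a Counter returns 0 for missing words.
    dot = sum(count * freq_b[word] for word, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a text with no words matches nothing
    return dot / (norm_a * norm_b)


text_a = "The quick brown fox jumps over the lazy dog."
text_b = "A quick brown dog runs over the lazy fox."
print(round(cosine_similarity(text_a, text_b), 3))  # 0.804
```

Here "the" appears twice in Text A, so it carries more weight in A's vector than it would in a plain word set, which is why the cosine score differs from the Jaccard score for the same pair.
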