How the similarity checker works
This tool measures how similar two texts are by analysing their vocabulary overlap — not by comparing them to any external source. It runs entirely in your browser using client-side JavaScript, which means your text is never sent to any server. You paste two documents, click "Check Similarity," and get a percentage score within milliseconds.
The underlying algorithm has three stages. First, both texts are tokenised: split into individual words, converted to lowercase, stripped of punctuation, and filtered to remove common stop words (words like "the," "is," "and" that appear in almost all text and carry no discriminating information). The remaining meaningful tokens form the vocabulary for comparison.
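The tokenisation stage described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the `tokenise` function name and the tiny `STOP_WORDS` sample are assumptions made for the example, and a real stop-word list would be much longer.

```javascript
// Hypothetical stop-word sample for illustration only; not the tool's real list.
const STOP_WORDS = new Set(["the", "is", "and", "a", "an", "of", "to", "in", "it", "that"]);

// Lowercase the text, replace punctuation with spaces, split on whitespace,
// then drop empty strings and stop words.
function tokenise(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, " ") // anything that is not a letter, digit or space
    .split(/\s+/)
    .filter((word) => word.length > 0 && !STOP_WORDS.has(word));
}

console.log(tokenise("The model is trained, and the model improves."));
// [ 'model', 'trained', 'model', 'improves' ]
```

Note that this sketch splits contractions at the apostrophe ("don't" becomes "don" and "t"); whether the tool handles them differently is not stated here.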
Second, each text is converted into a weighted term vector based on TF-IDF. In this tool the weight is driven by Term Frequency (TF), which measures how often a word appears in a document relative to the total word count. This weighting ensures that words appearing many times in a text are treated as more representative of that text's content. The result is a numerical fingerprint of each document's vocabulary.
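As a rough sketch of the weighting step, a term-frequency vector can be built by counting each token and dividing by the total token count. The `termFrequencies` name and the use of a `Map` are assumptions for this example; the tool may store its vectors differently.

```javascript
// Build a term-frequency vector: token -> (occurrences / total tokens).
// Input is assumed to be an already tokenised, stop-word-filtered array.
function termFrequencies(tokens) {
  const counts = new Map();
  for (const token of tokens) {
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  const tf = new Map();
  for (const [token, count] of counts) {
    tf.set(token, count / tokens.length);
  }
  return tf;
}

const exampleTf = termFrequencies(["mortgage", "rate", "mortgage", "loan"]);
console.log(exampleTf.get("mortgage")); // 0.5 (2 of 4 tokens)
```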
Third, the two vectors are compared using cosine similarity, a standard mathematical technique that computes the cosine of the angle between two vectors in high-dimensional space. A cosine similarity of 1.0 (100%) means the two documents have an identical vocabulary distribution. A score of 0 (0%) means they share no meaningful words at all. Real documents covering the same topic typically land in the 15–50% range even when independently written, because they naturally use similar subject-specific vocabulary.
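Cosine similarity itself is the dot product of the two vectors divided by the product of their lengths. A minimal sketch, assuming each vector is a `Map` from term to weight (the `cosineSimilarity` name and the zero-vector handling are choices made for this example):

```javascript
// Cosine similarity between two sparse vectors stored as Map(term -> weight).
// Returns a value between 0 and 1 for non-negative weights.
function cosineSimilarity(vecA, vecB) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [term, weight] of vecA) {
    normA += weight * weight;
    if (vecB.has(term)) dot += weight * vecB.get(term);
  }
  for (const weight of vecB.values()) {
    normB += weight * weight;
  }
  // Assumption: an empty vector is treated as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const docA = new Map([["loan", 0.5], ["rate", 0.5]]);
const docB = new Map([["loan", 1.0]]);
console.log((cosineSimilarity(docA, docB) * 100).toFixed(1) + "%"); // 70.7%
```

Terms that appear in only one document contribute to that document's length but not to the dot product, which is why unshared vocabulary pulls the score down.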
The tool also runs an additional analysis to surface the most common shared multi-word phrases (bigrams and trigrams) between the two texts. If "machine learning model" appears in both documents, that phrase will appear in the "Top matching phrases" section. These phrase matches are more diagnostic than the overall score — a high overall score with no phrase matches suggests topical overlap, not copying. A lower overall score with multiple specific phrase matches is more suspicious.
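One plausible way to surface shared phrases is to collect every bigram and trigram from each token list and keep those that occur in both. The sketch below assumes the phrases are built from already tokenised text; the `ngramCounts` and `sharedPhrases` names, and the ranking rule (most frequent first, longer phrase first on ties), are illustrative assumptions rather than the tool's documented behaviour.

```javascript
// Count every n-gram (joined with spaces) in a token list.
function ngramCounts(tokens, n) {
  const counts = new Map();
  for (let i = 0; i + n <= tokens.length; i++) {
    const gram = tokens.slice(i, i + n).join(" ");
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Return phrases present in both token lists, for each requested n-gram size.
function sharedPhrases(tokensA, tokensB, sizes = [2, 3]) {
  const shared = [];
  for (const n of sizes) {
    const gramsA = ngramCounts(tokensA, n);
    const gramsB = ngramCounts(tokensB, n);
    for (const [gram, countA] of gramsA) {
      if (gramsB.has(gram)) {
        shared.push({ phrase: gram, count: Math.min(countA, gramsB.get(gram)) });
      }
    }
  }
  // Most frequent first; on ties, the longer phrase first.
  return shared.sort((x, y) => y.count - x.count || y.phrase.length - x.phrase.length);
}

const matches = sharedPhrases(
  ["machine", "learning", "model", "training"],
  ["new", "machine", "learning", "model"]
);
console.log(matches.map((m) => m.phrase));
// [ 'machine learning model', 'machine learning', 'learning model' ]
```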
What is TF-IDF? A plain-language explanation
TF-IDF stands for Term Frequency–Inverse Document Frequency. It is one of the oldest and most reliable techniques in information retrieval, originally developed in the 1970s for library science and document classification. Despite being decades old, it remains the backbone of many search engines, recommendation systems, and content-matching tools because it is fast, explainable, and requires no training data or machine learning models.
Term Frequency (TF) is simple: how often does a word appear in a document, relative to the document's total length? A 500-word article that uses the word "mortgage" 10 times has a TF of 2% for "mortgage." This is higher than the same word appearing once in a 2,000-word article (0.05% TF). Higher TF means the word is more central to that document's content.
Inverse Document Frequency (IDF) adds a correction for words that appear everywhere. The word "important" might appear frequently in many documents — its high TF would be misleading if we used TF alone, because "important" doesn't distinguish one document from another. IDF down-weights words that appear in many documents and up-weights words that appear in only a few. In a collection of medical papers, "patient" would have low IDF (it's everywhere), while "glomerulonephritis" would have high IDF (only appears in very specific contexts).
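For readers who want to see the idea in code, one common textbook formulation of IDF is the logarithm of the number of documents divided by the number of documents containing the term. The sketch below uses that formulation purely as an illustration; the `inverseDocumentFrequencies` name is hypothetical and real systems often add smoothing.

```javascript
// docs: an array of token arrays, one per document.
// Returns Map(term -> log(N / df)), where df is the number of documents containing the term.
function inverseDocumentFrequencies(docs) {
  const docFreq = new Map();
  for (const tokens of docs) {
    // Use a Set so a term is counted once per document, however often it repeats.
    for (const term of new Set(tokens)) {
      docFreq.set(term, (docFreq.get(term) || 0) + 1);
    }
  }
  const idf = new Map();
  for (const [term, df] of docFreq) {
    idf.set(term, Math.log(docs.length / df));
  }
  return idf;
}

const idf = inverseDocumentFrequencies([
  ["patient", "glomerulonephritis"],
  ["patient", "fever"],
  ["patient", "cough"],
]);
console.log(idf.get("patient")); // 0 (appears in every document)
console.log(idf.get("glomerulonephritis") > idf.get("patient")); // true
```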
In this tool, we use a simplified version: because we are comparing exactly two documents (not a large corpus), we focus on TF rather than corpus-level IDF. The stop-word filter performs a similar function to IDF — it removes common words that would add noise to the comparison. This is why the similarity score measures meaningful vocabulary overlap, not raw word-by-word matching.
| Similarity range | Typical meaning | Action |
|---|---|---|
| 0–19% | Low: documents cover different topics or use very different vocabulary | No concern |
| 20–39% | Moderate: documents share a topic or domain; natural overlap | Review phrase matches |
| 40–59% | Notable: significant vocabulary overlap; may be same source material or paraphrasing | Investigate further |
| 60–79% | High: strong vocabulary similarity; likely light paraphrase or close rewrite | Manual review needed |
| 80–100% | Very high: near-identical content; one text is likely derived from the other | Treat as a match |
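The bands in the table can be expressed as a simple lookup. This is only a sketch: the `interpretScore` name is hypothetical, and treating fractional scores such as 19.5% as belonging to the lower band is an assumption, since the table lists whole-number ranges.

```javascript
// Map a similarity percentage (0 to 100) to the band and action from the table.
// Assumption: each band runs up to, but not including, the next band's lower bound.
function interpretScore(percent) {
  if (percent < 20) return { level: "Low", action: "No concern" };
  if (percent < 40) return { level: "Moderate", action: "Review phrase matches" };
  if (percent < 60) return { level: "Notable", action: "Investigate further" };
  if (percent < 80) return { level: "High", action: "Manual review needed" };
  return { level: "Very high", action: "Treat as a match" };
}

console.log(interpretScore(45).action); // Investigate further
```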
Important limitations — read before drawing conclusions
This tool has clear limits. Understanding them will help you use it correctly and avoid false positives (flagging innocent similarity) or false negatives (missing real issues).
It does not check against the internet. This is the most important limitation. This tool has no network access. It cannot compare your text against published web pages, academic databases, news articles, or any external source. The score reflects only the similarity between the two texts you provided, nothing else. Two completely original documents on the same topic can score 30–50% simply because they use similar subject-specific vocabulary, even if neither author has seen the other's text.
It does not detect paraphrasing well. TF-IDF is good at detecting vocabulary overlap but weak at detecting sophisticated paraphrasing. If a writer rewrites every sentence using synonyms while preserving all the ideas and structure, the vocabulary vectors will diverge and the score will drop — even though the intellectual content is substantially borrowed. Human review (or a semantic similarity model like Originality.AI) is needed to catch heavy paraphrasing.
It is sensitive to document length. Very short texts (under 100 words) produce unreliable scores because there are too few tokens for the score to be stable: a handful of shared words can swing the result. A 50-word text that happens to share 10 words with another 50-word text will score very high, even if the overlap is coincidental. The minimum recommended length for a meaningful score is approximately 150–200 words per document.
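If you are scripting your own comparisons, a length check before scoring is easy to add. The sketch below is an assumption-laden example, not a feature of the tool: the `lengthWarning` and `countWords` names are hypothetical, and the 150-word default is simply the lower end of the recommended range above.

```javascript
// Lower bound of the 150–200 word recommendation; adjust to taste.
const MIN_RECOMMENDED_WORDS = 150;

// Count whitespace-separated words in a text.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Return a warning string if either text is too short, otherwise null.
function lengthWarning(textA, textB, minWords = MIN_RECOMMENDED_WORDS) {
  const shorter = Math.min(countWords(textA), countWords(textB));
  return shorter < minWords
    ? `Shortest text has ${shorter} words; scores for texts under ${minWords} words are unreliable.`
    : null;
}

console.log(lengthWarning("Only a few words here.", "word ".repeat(200)));
```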
Technical and domain-specific texts score higher naturally. Two independently written articles about "Python decorators" or "compound interest calculations" will share significant vocabulary — @property, __init__, self, rate, principal — regardless of copying. Apply a higher threshold before flagging technical documents. The phrase-matching feature is more diagnostic here: shared generic vocabulary is expected; shared specific examples or code snippets are suspicious.
Language support is English-optimised. The stop word list and tokenisation are optimised for English text. Non-English documents will still produce a score, but the stop word filtering will be less effective, which may inflate scores because common function words in other languages are not filtered out.
When to use this tool vs a web plagiarism checker
The right tool depends on what question you are trying to answer. There are two fundamentally different questions in the plagiarism space, and they require different tools.
Use this tool when: You want to compare a student draft against source material you already have. You want to compare two versions of a document to see how much has changed between drafts. You want to check if two pieces of content submitted by different writers are suspiciously similar to each other. You want to verify that an AI-assisted draft doesn't closely mirror a specific reference text you fed to the AI. You are a content manager checking whether a freelancer's submission matches a brief or a competitor's page (paste both into the tool). In all these cases, both documents are in your possession and you are doing a private comparison.
Use a web plagiarism checker (Grammarly, Originality.AI) when: You want to know if any content — your own or someone else's — exists elsewhere on the internet. You are checking a student essay for source matches to published academic papers or websites. You are checking AI-generated content for unintentional reproduction of web content. You are an editor verifying a freelancer's submission hasn't lifted paragraphs from existing articles. In these cases, you need a service with a web crawl database — which requires server-side processing and typically a paid subscription.
The two tools are complementary, not competing. A common professional workflow: run this tool first as a quick document-to-document comparison (free, instant, private), then use a web plagiarism checker such as Grammarly for web-source verification if the context warrants it.
Paid link — we may earn a commission if you upgrade. See our affiliate disclosure.
Frequently asked questions
Is this a real plagiarism checker?
How does the similarity score work?
What similarity percentage is considered too high?
What is the difference between this tool and Grammarly's plagiarism checker?
Can I use this to check AI-generated content for plagiarism?