Free Plagiarism Similarity Checker — Compare Two Documents

How the similarity checker works

This tool measures how similar two texts are by analysing their vocabulary overlap — not by comparing them to any external source. It runs entirely in your browser using client-side JavaScript, which means your text is never sent to any server. You paste two documents, click "Check Similarity," and get a percentage score within milliseconds.

The underlying algorithm has three stages. First, both texts are tokenised: split into individual words, converted to lowercase, stripped of punctuation, and filtered to remove common stop words (words like "the," "is," "and" that appear in almost all text and carry no discriminating information). The remaining meaningful tokens form the vocabulary for comparison.
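The tokenisation stage can be sketched roughly as follows. This is a minimal illustration, not the tool's actual source: the `tokenise` function name and the very short `STOP_WORDS` list are assumptions made for the example, and a real stop-word list would be far longer.

```javascript
// Hypothetical, abbreviated stop-word list (assumption for this sketch).
const STOP_WORDS = new Set(["the", "is", "and", "a", "an", "of", "to", "in", "on", "it"]);

// Lowercase the text, replace punctuation with spaces, split on whitespace,
// then drop empty strings and stop words.
function tokenise(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ") // assumption: keep ASCII letters and digits only
    .split(/\s+/)
    .filter((word) => word.length > 0 && !STOP_WORDS.has(word));
}

console.log(tokenise("The cat is on the mat, and the dog barked!"));
// [ 'cat', 'mat', 'dog', 'barked' ]
```

Because this sketch treats every non-alphanumeric character as a separator, a contraction such as "doesn't" splits into two tokens; a production tokeniser would likely handle apostrophes more carefully.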

Second, each text is converted into a weighted term vector. Term Frequency (TF) measures how often a word appears in a document relative to the total word count, so words that appear many times in a text are treated as more representative of that text's content. The tool uses this TF weighting as a simplified form of TF-IDF, for the reasons given in the TF-IDF section. The result is a numerical fingerprint of each document's vocabulary.
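As a rough sketch of the term-frequency step, assuming the text has already been tokenised into an array of words (the `termFrequency` name is hypothetical):

```javascript
// Build a term-frequency map: occurrences of each token divided by total tokens.
function termFrequency(tokens) {
  const tf = new Map();
  for (const token of tokens) {
    tf.set(token, (tf.get(token) || 0) + 1);
  }
  // Convert raw counts into relative frequencies.
  for (const [token, count] of tf) {
    tf.set(token, count / tokens.length);
  }
  return tf;
}

const exampleTf = termFrequency(["mortgage", "rate", "mortgage", "loan"]);
console.log(exampleTf.get("mortgage")); // 0.5
console.log(exampleTf.get("rate"));     // 0.25
```

A `Map` keyed by term keeps the vector sparse: only words that actually occur in the document take up space.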

Third, the two vectors are compared using cosine similarity — a standard mathematical technique that measures the angle between two vectors in high-dimensional space. A cosine similarity of 1.0 (100%) means the two documents have an identical vocabulary distribution. A score of 0% means they share no meaningful words at all. Real documents covering the same topic typically land in the 15–50% range even when independently written, because they naturally use similar subject-specific vocabulary.
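Cosine similarity is the dot product of the two vectors divided by the product of their lengths. A minimal sketch, assuming each document is represented as a `Map` from term to weight (the function and variable names are illustrative only):

```javascript
// Cosine similarity between two sparse vectors stored as Map<term, weight>.
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [term, weight] of a) {
    if (b.has(term)) dot += weight * b.get(term);
  }
  // Euclidean length of a vector.
  const norm = (v) => Math.sqrt([...v.values()].reduce((sum, w) => sum + w * w, 0));
  const denominator = norm(a) * norm(b);
  // An empty document has zero length; treat that case as no similarity.
  return denominator === 0 ? 0 : dot / denominator;
}

const docA = new Map([["mortgage", 2], ["rate", 1]]);
const docB = new Map([["mortgage", 2], ["rate", 1]]);
const docC = new Map([["python", 1], ["decorator", 1]]);

console.log(cosineSimilarity(docA, docB)); // identical distributions: approximately 1
console.log(cosineSimilarity(docA, docC)); // no shared terms: 0
```

Multiplying the result by 100 gives the percentage form used on this page. Because the measure depends only on the angle between the vectors, scaling every weight in one document by the same factor leaves the score unchanged.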

The tool also runs an additional analysis to surface the most common shared multi-word phrases (bigrams and trigrams) between the two texts. If "machine learning model" appears in both documents, that phrase will appear in the "Top matching phrases" section. These phrase matches are more diagnostic than the overall score — a high overall score with no phrase matches suggests topical overlap, not copying. A lower overall score with multiple specific phrase matches is more suspicious.
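One simple way to find shared phrases is to list every contiguous n-word sequence in each text and keep those that occur in both. The sketch below assumes pre-tokenised input; `ngrams` and `sharedPhrases` are hypothetical names, and the tool's own ranking of "top" phrases may differ.

```javascript
// Return all contiguous n-word sequences from a token list.
function ngrams(tokens, n) {
  const result = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    result.push(tokens.slice(i, i + n).join(" "));
  }
  return result;
}

// Phrases of length n that occur in both token lists, each listed once.
function sharedPhrases(tokensA, tokensB, n) {
  const inB = new Set(ngrams(tokensB, n));
  return [...new Set(ngrams(tokensA, n))].filter((phrase) => inB.has(phrase));
}

const docOne = ["machine", "learning", "model", "predicts", "house", "prices"];
const docTwo = ["train", "machine", "learning", "model", "labelled", "data"];

console.log(sharedPhrases(docOne, docTwo, 2)); // [ 'machine learning', 'learning model' ]
console.log(sharedPhrases(docOne, docTwo, 3)); // [ 'machine learning model' ]
```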

What is TF-IDF? A plain-language explanation

TF-IDF stands for Term Frequency–Inverse Document Frequency. It is one of the oldest and most reliable techniques in information retrieval, originally developed in the 1970s for library science and document classification. Despite being decades old, it remains the backbone of many search engines, recommendation systems, and content-matching tools because it is fast, explainable, and requires no training data or machine learning models.

Term Frequency (TF) is simple: how often does a word appear in a document, relative to the document's total length? A 500-word article that uses the word "mortgage" 10 times has a TF of 2% for "mortgage." This is higher than the same word appearing once in a 2,000-word article (0.05% TF). Higher TF means the word is more central to that document's content.

Inverse Document Frequency (IDF) adds a correction for words that appear everywhere. The word "important" might appear frequently in many documents — its high TF would be misleading if we used TF alone, because "important" doesn't distinguish one document from another. IDF down-weights words that appear in many documents and up-weights words that appear in only a few. In a collection of medical papers, "patient" would have low IDF (it's everywhere), while "glomerulonephritis" would have high IDF (only appears in very specific contexts).
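For reference, the weighting described above is commonly written as follows. This is the standard unsmoothed textbook definition, shown as an illustration rather than as this tool's exact formula; N is the number of documents in the collection and df(t) is the number of documents that contain term t.

```latex
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

With only two documents, a term that appears in both gets idf = log(2/2) = 0 under this definition, which is one way to see why corpus-level IDF adds little to a two-document comparison.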

In this tool, we use a simplified version: because we are comparing exactly two documents (not a large corpus), we focus on TF rather than corpus-level IDF. The stop-word filter performs a similar function to IDF — it removes common words that would add noise to the comparison. This is why the similarity score measures meaningful vocabulary overlap, not raw word-by-word matching.

Similarity range	Typical meaning	Action
0–19%	Low: documents cover different topics or use very different vocabulary	No concern
20–39%	Moderate: documents share a topic or domain; natural overlap	Review phrase matches
40–59%	Notable: significant vocabulary overlap; may be same source material or paraphrasing	Investigate further
60–79%	High: strong vocabulary similarity; likely a light paraphrase or close rewrite	Manual review needed
80–100%	Very high: near-identical content; one text is likely derived from the other	Treat as a match
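If you are scripting around scores like these, the bands in the table above can be encoded as a small lookup. The `interpretScore` function is a hypothetical helper written for this example, not part of the tool.

```javascript
// Map a percentage score (0-100) to the band and suggested action from the table.
function interpretScore(percent) {
  if (percent < 20) return { level: "Low", action: "No concern" };
  if (percent < 40) return { level: "Moderate", action: "Review phrase matches" };
  if (percent < 60) return { level: "Notable", action: "Investigate further" };
  if (percent < 80) return { level: "High", action: "Manual review needed" };
  return { level: "Very high", action: "Treat as a match" };
}

console.log(interpretScore(34).action); // Review phrase matches
```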

Important limitations — read before drawing conclusions

This tool is honest about what it can and cannot do. Understanding its limitations will help you use it correctly and avoid false positives (flagging innocent similarity) or false negatives (missing real issues).

It does not check against the internet. This is the most important limitation. This tool has no network access, so it cannot compare your text against published web pages, academic databases, news articles, or any external source. The score reflects only the similarity between the two texts you provided. Two completely original documents on the same topic can score 30–50% simply because they use similar subject-specific vocabulary, even if neither author has seen the other's work.

It does not detect paraphrasing well. TF-IDF is good at detecting vocabulary overlap but weak at detecting sophisticated paraphrasing. If a writer rewrites every sentence using synonyms while preserving all the ideas and structure, the vocabulary vectors will diverge and the score will drop — even though the intellectual content is substantially borrowed. Human review (or a semantic similarity model like Originality.AI) is needed to catch heavy paraphrasing.
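The effect is easy to reproduce with a toy example. The sketch below uses a compact word-count cosine similarity and assumes the inputs are already lowercased with stop words removed; the `similarity` function and the sample sentences are invented for illustration.

```javascript
// Compact cosine similarity over word counts. Raw counts give the same result
// as relative frequencies because cosine similarity ignores vector scale.
function similarity(textA, textB) {
  const count = (text) => {
    const m = new Map();
    for (const w of text.split(/\s+/).filter(Boolean)) m.set(w, (m.get(w) || 0) + 1);
    return m;
  };
  const a = count(textA);
  const b = count(textB);
  let dot = 0;
  for (const [w, n] of a) dot += n * (b.get(w) || 0);
  const norm = (m) => Math.sqrt([...m.values()].reduce((s, n) => s + n * n, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

const original = "rising interest rates reduce housing demand";
const copied = "rising interest rates reduce housing demand sharply";
const paraphrased = "climbing borrowing costs lower property appetite";

console.log(similarity(original, copied));      // about 0.93: near-verbatim copy
console.log(similarity(original, paraphrased)); // 0: same idea, no shared words
```

The paraphrased sentence carries the same meaning as the original yet scores zero, because a vocabulary-based comparison has no notion of synonyms.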

It is sensitive to document length. Very short texts (under 100 words) produce unreliable scores because there are too few tokens for the score to be stable. A 50-word text that happens to share 10 words with another 50-word text will score very high, even if the overlap is coincidental. The minimum recommended length for a meaningful score is approximately 150–200 words per document.
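If you automate comparisons, a simple length guard can flag scores that should not be trusted. This is a sketch only: the helper names are hypothetical and the 150-word threshold is taken from the guidance above.

```javascript
// Threshold from the recommendation above; the constant name is hypothetical.
const MIN_RECOMMENDED_WORDS = 150;

// Count whitespace-separated words, ignoring empty strings.
function countWords(text) {
  return text.split(/\s+/).filter(Boolean).length;
}

// Return a warning string if either text is too short, otherwise null.
function lengthWarning(textA, textB) {
  const shortest = Math.min(countWords(textA), countWords(textB));
  return shortest < MIN_RECOMMENDED_WORDS
    ? `Shortest document has ${shortest} words; scores under ${MIN_RECOMMENDED_WORDS} words are unreliable.`
    : null;
}

console.log(lengthWarning("too short to compare", "also quite short"));
```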

Technical and domain-specific texts score higher naturally. Two independently written articles about "Python decorators" or "compound interest calculations" will share significant vocabulary — @property, __init__, self, rate, principal — regardless of copying. Apply a higher threshold before flagging technical documents. The phrase-matching feature is more diagnostic here: shared generic vocabulary is expected; shared specific examples or code snippets are suspicious.

Language support is English-optimised. The stop word list and tokenisation are tuned for English text. Non-English documents will still produce a score, but stop word filtering will be less effective, so common function words in other languages pass through unfiltered and may inflate the score.

When to use this tool vs a web plagiarism checker

The right tool depends on what question you are trying to answer. There are two fundamentally different questions in the plagiarism space, and they require different tools.

Use this tool when: You want to compare a student draft against source material you already have. You want to compare two versions of a document to see how much has changed between drafts. You want to check if two pieces of content submitted by different writers are suspiciously similar to each other. You want to verify that an AI-assisted draft doesn't closely mirror a specific reference text you fed to the AI. You are a content manager checking whether a freelancer's submission matches a brief or a competitor's page (paste both into the tool). In all these cases, both documents are in your possession and you are doing a private comparison.

Use a web plagiarism checker (Grammarly, Originality.AI) when: You want to know if any content — your own or someone else's — exists elsewhere on the internet. You are checking a student essay for source matches to published academic papers or websites. You are checking AI-generated content for unintentional reproduction of web content. You are an editor verifying a freelancer's submission hasn't lifted paragraphs from existing articles. In these cases, you need a service with a web crawl database — which requires server-side processing and typically a paid subscription.

The two tools are complementary, not competing. A common professional workflow: run this tool first as a quick document-to-document comparison (free, instant, private), then use Grammarly for web-source verification if the context warrants it.

For real web plagiarism detection: Grammarly Premium

Grammarly scans against 16 billion web pages and academic papers. It tells you exactly which source a passage matches, with a direct link to the original. Start with a free account; plagiarism checking requires Premium.

Try Grammarly Free →

Paid link — we may earn a commission if you upgrade. See our affiliate disclosure.

Frequently asked questions

Is this a real plagiarism checker?

No — and we want to be upfront about that. This tool checks the text similarity between two documents you paste side-by-side. It does not scan the internet, academic databases, or any external source. A high similarity score means the two texts you pasted share vocabulary and phrases — it does not mean either document is plagiarised from a web source. For actual web plagiarism detection (comparing your text against billions of published pages), you need a dedicated tool like Grammarly Premium or Originality.AI.

How does the similarity score work?

The tool uses TF-IDF (Term Frequency–Inverse Document Frequency) analysis combined with cosine similarity. Both documents are tokenised (split into words, stop words removed), each word is weighted by how often it appears relative to the total text, and then the two weighted vectors are compared using the cosine similarity formula. The result is a percentage from 0% (no meaningful overlap) to 100% (near-identical content). It also extracts bigrams (two-word phrases) and trigrams (three-word phrases) to surface the most common shared multi-word sequences.

What similarity percentage is considered too high?

There is no universal threshold because the right answer depends entirely on context. Academic institutions typically flag anything above 15–25% for manual review, though some allow higher for technical fields with shared terminology. In SEO content writing, two pages covering the same topic will naturally share 20–40% vocabulary without any copying involved — synonyms, headings, and topic-specific terms overlap legitimately. A similarity score above 60% between two documents that are supposed to be different is a strong signal worth investigating, but it is not automatically proof of plagiarism. Use this tool for a quick first screen, then apply judgment.

What is the difference between this tool and Grammarly's plagiarism checker?

The difference is fundamental. This tool only compares the two texts you paste — it has no internet access, no database, and no awareness of anything outside your two inputs. Grammarly's plagiarism checker (Premium feature) scans your text against 16 billion web pages and academic papers in real time to find source matches. Grammarly will tell you "this passage matches content at nytimes.com." This tool will only tell you "these two texts you gave me are 34% similar." Both are useful for different purposes. Use this tool when you want to compare a draft against a reference you already have. Use Grammarly when you need to verify originality against the wider web.

Can I use this to check AI-generated content for plagiarism?

This tool can check AI-generated text for similarity against another document you provide — for example, comparing an AI draft against source material you fed to the AI prompt. What it cannot do is detect whether content is AI-generated (that requires a separate AI detection tool like Originality.AI), nor can it compare AI output against web sources (use Grammarly for that). For a complete workflow when using AI writing tools: run AI detection first (Originality.AI), then web plagiarism check (Grammarly), and use this tool only for document-to-document comparison where both inputs are in hand.

Write original content — and prove it. Grammarly Premium detects plagiarism against 16B+ web sources and provides clarity, grammar, and tone improvements in one tool. Free to start.
