How to Check Text Similarity Between Two Documents Using Python?

Text similarity refers to the process of determining how similar two pieces of text are in terms of meaning. This task is crucial in various Natural Language Processing (NLP) applications, such as information retrieval, sentiment analysis, and plagiarism detection, among others.

Python, with its rich set of libraries, provides multiple approaches to compute text similarity. This blog will walk you through the basics of NLP and text similarity, explore the applications of text similarity, explain WordNet and NLTK, and implement a text similarity checker in Python using the WordNet-based approach.

What is Natural Language Processing (NLP) and Text Similarity?

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and computational linguistics that focuses on enabling computers to understand, interpret, and generate human language. The complexity of human language makes NLP a challenging yet fascinating field, combining elements from computer science, linguistics, and cognitive psychology.

Text similarity is a key concept within NLP, referring to the task of measuring how similar two pieces of text are. The measurement can be lexical, syntactic, or semantic. Lexical similarity deals with the surface-level features of the text, such as word overlap, while syntactic similarity considers the structure and grammatical patterns. Semantic similarity, on the other hand, is about understanding the meaning and context of the text, aiming to compare texts based on their underlying meanings.
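To make the distinction concrete, here is a tiny sketch of a purely lexical measure (Jaccard overlap of word sets). It only looks at surface word overlap, which is exactly what the semantic approaches later in this post try to go beyond:

def jaccard_similarity(text1, text2):
    """Lexical similarity: fraction of words shared by the two texts."""
    words1, words2 = set(text1.lower().split()), set(text2.lower().split())
    return len(words1 & words2) / len(words1 | words2)

print(jaccard_similarity("I like books", "I like reading"))  # 0.5 - high word overlap
print(jaccard_similarity("I am happy", "I feel great"))       # 0.2 - similar meaning, little overlap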

Applications of Text Similarity

Text similarity is vital in various applications, such as information retrieval, document clustering, and duplicate detection. For example, search engines use text similarity algorithms to rank web pages based on the relevance of their content to a given query. Beyond search, there are several other applications of text similarity across different domains:

  • Plagiarism Detection: One of the most prominent uses of text similarity is in plagiarism detection. By comparing a student’s assignment with a vast database of existing documents, educators can determine whether the content is original or copied.
  • Semantic Analysis: In semantic analysis, understanding the meaning of words, phrases, and sentences is crucial. Text similarity measures help in tasks like sentiment analysis by identifying similar sentiments expressed in different ways. For example, recognizing that “I am happy” and “I feel great” convey the same sentiment can improve the accuracy of sentiment analysis models.
  • Product Reviews Sentiment Analysis: Businesses often analyze customer feedback to gauge public opinion about their products. Text similarity tools help cluster similar reviews together, making it easier to identify common themes or issues.
  • Finding Patterns in Text: Text similarity is also used in pattern recognition within large corpora. For instance, in text classification tasks, similar documents are grouped, allowing for more efficient and accurate classification.
  • Recommendation Systems: Suggesting similar products or content based on text descriptions and product reviews.

These are just a few of the many domains in which text similarity plays a role.

What are WordNet, Synsets, and NLTK?

WordNet is a large lexical database of English developed at Princeton University under the direction of psychologist George A. Miller. Unlike a dictionary, WordNet is more akin to a thesaurus and organizes English words into sets of synonyms known as synsets. Each synset represents a distinct concept and is interlinked with other synsets through various semantic relationships like antonymy (opposites), hyponymy (subset relationships), and meronymy (part-whole relationships).

A synset (short for “set of synonyms”) is the basic unit in WordNet, representing a single concept. For example, the words “car,” “automobile,” and “auto” all belong to the same synset as they refer to the same concept. Each synset is associated with a part of speech (noun, verb, adjective, or adverb) and is accompanied by a definition and example sentences, providing context for its usage.

NLTK (Natural Language Toolkit) is a powerful library in Python that provides easy-to-use interfaces to over 50 corpora and lexical resources, including WordNet. NLTK offers a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. With NLTK, you can easily perform complex NLP tasks, including working with WordNet to explore semantic relationships between words.

By using WordNet and NLTK together, you can perform semantic analysis, word sense disambiguation, and text similarity checks based on the meanings of words rather than just their surface forms.

Understanding WordNet Synsets and NLTK-based Methods

When it comes to measuring text similarity in Python, traditional approaches often rely on WordNet and NLTK. These methods are particularly useful for beginners to grasp the basics of semantic similarity before moving on to more advanced techniques.

Some useful terms:

  • Tokens are the individual units of text, which can be words, subwords, characters, or symbols, depending on the level of granularity required for a particular NLP task. The process of splitting text into these smaller units is called Tokenization.
  • Part-of-Speech (POS) Tagging is a fundamental task in Natural Language Processing (NLP) that refers to the process of marking up words in a text as corresponding to a particular part of speech, such as nouns, verbs, adjectives, adverbs, etc. Each word (token) is tagged with a label that represents its grammatical role and its context.

To utilize WordNet effectively, we first need to understand the context in which a word is used. This is done by part-of-speech (POS) tagging, which identifies whether a word is used as a noun, verb, adjective, etc.

Example Code

import nltk
from nltk.corpus import wordnet as wn

# Ensure you have the necessary datasets downloaded
nltk.download('wordnet')

# Find and print synsets for a word
word = 'car'

synsets = wn.synsets(word)

for synset in synsets:
    print("Synset:", synset)
    print("POS Tag:", synset.pos())
    print("Definition:", synset.definition())
    print("Examples:", synset.examples())
    print("Lemmas:", [lemma.name() for lemma in synset.lemmas()])
    print()


'''
# Output:

Synset: Synset('car.n.01')
POS Tag: n
Definition: a motor vehicle with four wheels; usually propelled by an 
internal combustion engine
Examples: ['he needs a car to get to work']
Lemmas: ['car', 'auto', 'automobile', 'machine', 'motorcar']

Synset: Synset('car.n.02')
POS Tag: n
Definition: a wheeled vehicle adapted to the rails of railroad
Examples: ['three cars had jumped the rails']
Lemmas: ['car', 'railcar', 'railway_car', 'railroad_car']

Synset: Synset('car.n.03')
POS Tag: n
Definition: the compartment that is suspended from an airship and that 
carries personnel and the cargo and the power plant
Examples: []
Lemmas: ['car', 'gondola']

Synset: Synset('car.n.04')
POS Tag: n
Definition: where passengers ride up and down
Examples: ['the car was on the top floor']
Lemmas: ['car', 'elevator_car']

Synset: Synset('cable_car.n.01')
POS Tag: n
Definition: a conveyance for passengers or freight on a cable railway
Examples: ['they took a cable car to the top of the mountain']
Lemmas: ['cable_car', 'car'] 


'''

While the wn.synsets function returns every synset registered for a word, it doesn't perform POS tagging. WordNet is a static lexical database that organizes words into synsets; it doesn't analyze running text to determine which part of speech a word plays in a given sentence.
Each synset simply carries the part of speech it was recorded with, which is why every sense of "car" above is tagged "n" (noun): WordNet lists only noun senses for that word. If a word has noun, verb, and adjective senses, wn.synsets returns all of them unless you restrict the lookup with the pos argument.

Since WordNet doesn't tag text, external NLP tools like NLTK, spaCy, or other machine learning models are used to tag the text first. These tools analyze the sentence, determine the POS of each word in context, and then you can use that information to query WordNet's synsets and perform more targeted analyses.
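For example, once an external tagger has told you the part of speech, you can restrict the WordNet lookup to that part of speech via the pos argument (a small illustrative snippet; the exact synsets returned may differ slightly between WordNet versions):

from nltk.corpus import wordnet as wn

# All synsets for 'book', regardless of part of speech (nouns and verbs mixed)
print(wn.synsets('book'))

# Only the verb senses, e.g. "book a flight"
print(wn.synsets('book', pos=wn.VERB))

# Only the noun senses
print(wn.synsets('book', pos=wn.NOUN))
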
Part-of-Speech (POS) Tagging using NLTK Library

Here’s a brief example demonstrating how to use NLTK to tokenize a sentence and perform POS (Part-Of-Speech) tagging.

import nltk

# Download the tokenizer data and the POS tagger model.
# 'punkt' is used by word_tokenize, and 'averaged_perceptron_tagger'
# is a pre-trained tagging model provided by NLTK.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Tokenize the sentence, i.e. split it into individual words (tokens).
text = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")

# Perform POS tagging
# pos_tag function assigns a part-of-speech tag to each token. 
# These tags indicate the grammatical role of each word in the sentence.
pos_tags = nltk.pos_tag(text)

# Print the POS tags
print(pos_tags)



'''

# Output: (list of tokens along with their POS tags)


[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), 
('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), 
('dog', 'NN')]


- `DT`: Determiner
- `JJ`: Adjective
- `NN`: Noun, singular or mass
- `VBZ`: Verb, 3rd person singular present
- `IN`: Preposition or subordinating conjunction


'''

NLTK is a fundamental tool for anyone working in natural language processing and provides extensive resources to start with various text analysis tasks.

'''

Here is the list of Penn Treebank part-of-speech tags returned by NLTK's
pos_tag function, with the corresponding explanations:

CC - Coordinating conjunction
CD - Cardinal number
DT - Determiner
EX - Existential there
FW - Foreign word
IN - Preposition or subordinating conjunction
JJ - Adjective
JJR - Adjective, comparative
JJS - Adjective, superlative
LS - List item marker
MD - Modal
NN - Noun, singular or mass
NNS - Noun, plural
NNP - Proper noun, singular
NNPS - Proper noun, plural
PDT - Predeterminer
POS - Possessive ending
PRP - Personal pronoun
PRP$ - Possessive pronoun
RB - Adverb
RBR - Adverb, comparative
RBS - Adverb, superlative
RP - Particle
SYM - Symbol
TO - to
UH - Interjection
VB - Verb, base form
VBD - Verb, past tense
VBG - Verb, gerund or present participle
VBN - Verb, past participle
VBP - Verb, non-3rd person singular present
VBZ - Verb, 3rd person singular present
WDT - Wh-determiner
WP - Wh-pronoun
WP$ - Possessive wh-pronoun
WRB - Wh-adverb

These tags are used in the `pos_tag` function in NLTK to classify each word in a 
given text into one of these categories based on its definition and context in the sentence.


'''

Semantic Similarity Measure using WordNet Synsets (Python Code)

This method utilizes the structured lexical database of WordNet to find synsets (sets of cognitive synonyms) for words in a document, and computes similarity scores based on relationships between these synsets. The technique incorporates elements of Natural Language Processing (NLP) by using the NLTK library’s tools for tokenization and part-of-speech tagging to accurately assign WordNet synsets to words.

The similarity measures employed, such as ‘Path’ similarity (path_similarity) or other WordNet-based metrics like ‘Wu-Palmer’ similarity (wup_similarity), focus on the semantic connections between words as defined in WordNet.
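As a quick taste before building the full checker, the snippet below compares two noun synsets with both metrics (this assumes the WordNet data is already downloaded; the exact scores can vary slightly across WordNet versions):

from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

print("Path similarity:", dog.path_similarity(cat))      # typically around 0.2
print("Wu-Palmer similarity:", dog.wup_similarity(cat))  # typically around 0.86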

To begin with, ensure you have the necessary libraries installed:

pip install numpy nltk

You’ll also need to download some NLTK resources.

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

This is necessary for the script to function correctly, particularly if these resources have not been previously downloaded to your system.

Here’s why each is important:

  • nltk.download('punkt'): This command downloads the Punkt tokenizer models used by word_tokenize to split each document into individual tokens before tagging.
  • nltk.download('wordnet'): This command downloads the WordNet database, which is essential for accessing synsets. WordNet is used in the script to find synsets for words based on their usage and part of speech, which are critical for measuring semantic similarity between documents.
  • nltk.download('averaged_perceptron_tagger'): This command downloads the data needed for NLTK’s POS (part-of-speech) tagger, which uses the averaged perceptron algorithm. This information is crucial because the synsets retrieved depend on the part of speech of each word.

Without these resources, the NLTK functions word_tokenize(), nltk.pos_tag(), and wn.synsets() used in the script wouldn’t be able to operate, as they rely on the Punkt models, the POS tagging model, and the WordNet data, respectively. So, if you’re setting up this script on a new machine or in a new environment, these download commands are vital to ensure that all necessary NLTK resources are available.

Python Code Implementation

We’ve developed a Python class, TextSimilarityChecker, that encapsulates all the necessary steps to compute the similarity between two documents. Here’s an overview of its functionality:

  1. Tokenization and POS Tagging: Break down each document into words and tag them with their respective POS.
  2. Synset Conversion: Convert each word and tag into a synset. If multiple synsets are available, select the first one as a representative.
  3. Similarity Calculation: For each synset in the first document, find the most similar synset in the second document based on predefined metrics (Path similarity or Wu-Palmer similarity).
  4. Symmetric Similarity: Since similarity can be directional, we average the similarity from document one to two and from two to one to get a symmetric similarity score.

Here is the Python class that performs these operations:

import numpy as np
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize


class TextSimilarityChecker:

    @staticmethod
    def convert_tag(tag):
        """
        Converts the part-of-speech tag from NLTK's format to WordNet's format.

        NLTK's pos_tag function assigns part-of-speech tags to words in a text.
        It follows the Penn Treebank tagset, whose tags begin with a letter
        that identifies the broad word class:

        'N' for nouns
        'J' for adjectives
        'R' for adverbs
        'V' for verbs

        WordNet's synsets, on the other hand, also use single-letter tags to
        represent parts of speech, but they follow a different convention:

        'n' represents nouns
        'a' represents adjectives
        'r' represents adverbs
        'v' represents verbs

        Args:
            tag (str): The part-of-speech tag from NLTK.

        Returns:
            str: The corresponding WordNet tag, or None if no matching tag is found.
        """
        tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
        return tag_dict.get(tag[0].upper())

    @staticmethod
    def doc_to_synsets(doc):
        """
        Converts a document into a list of synsets (sets of synonyms).
        Tokenizes and tags the words in the document, then finds the
        first synset for each word/tag combination. If a synset is not
        found for that combination, it is skipped.

        wn.synsets(token, syntag) function is used to retrieve a
        list of synsets associated with a given word (token) and
        part of speech (syntag). It helps in finding the
        lexical entries based on the word and its context
        in the sentence, which is then used to compute
        semantic similarities between words.

        Args:
            doc (str): The document to convert into synsets.

        Returns:
            list: A list of synsets corresponding to the words in the document.
        """

        tokens = word_tokenize(doc)  # Break down the document into individual words (tokens)
        tags = nltk.pos_tag(tokens)  # Tag each token with a part of speech (POS) using NLTK's tagging functions.

        synsets_list = []

        for token, tag in tags:
            # Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets
            syntag = TextSimilarityChecker.convert_tag(tag)
            if syntag:
                syns = wn.synsets(token, syntag)
                if syns:
                    synsets_list.append(syns[0])

        return synsets_list

    @staticmethod
    def similarity_score(synsets1, synsets2, distance_type='path'):
        """
        Computes the similarity score between two lists of synsets.
        For each synset in the first list (synsets1), finds the most
        similar synset in the second list (synsets2) and calculates
        the similarity score. The score is normalized by the number of synsets.

        Args:
            synsets1 (list): List of synsets from the first document.
            synsets2 (list): List of synsets from the second document.
            distance_type (str): Type of distance metric to use ('path' or 'wup').

        Returns:
            float: The normalized similarity score.
        """
        scores = []
        for synset1 in synsets1:
            max_score = 0
            for synset2 in synsets2:
                score = synset1.path_similarity(synset2) if distance_type == 'path' \
                    else synset1.wup_similarity(synset2)
                if score and score > max_score:
                    max_score = score
            if max_score > 0:
                scores.append(max_score)

        return np.mean(scores) if scores else 0


    def document_similarity(self, doc1, doc2):
        """
        Calculates the symmetrical similarity between two documents (doc1, doc2).
        This is done by computing the similarity score of doc1 onto doc2
        and vice versa, then averaging these two scores.

        Why Two Calls to similarity_score?
        Non-Symmetrical Nature of Similarity Metrics:
        Many similarity metrics are not symmetrical. This means
        that the similarity score from synsets1 to synsets2 might
        differ from synsets2 to synsets1.

        By calculating both similarity_score(synsets1, synsets2)
        and similarity_score(synsets2, synsets1),
        you account for any potential asymmetry.

        Averaging for Symmetry: Averaging the two scores provides a
        symmetrical measure of similarity.
        This means that document_similarity(doc1, doc2)
        will equal document_similarity(doc2, doc1), which is a desirable
        property for many applications.

        Args:
            doc1 (str): The first document.
            doc2 (str): The second document.

        Returns:
            float: The averaged similarity score between the two documents.
        """
        synsets1 = TextSimilarityChecker.doc_to_synsets(doc1)
        synsets2 = TextSimilarityChecker.doc_to_synsets(doc2)

        return (TextSimilarityChecker.similarity_score(synsets1, synsets2) +
                TextSimilarityChecker.similarity_score(synsets2, synsets1)) / 2


def main():
    doc1 = 'I like books'
    doc2 = 'I like reading'

    similarity_checker = TextSimilarityChecker()
    similarity_score = similarity_checker.document_similarity(doc1, doc2)

    print(f"Document Similarity Score: {similarity_score}")


if __name__ == "__main__":
    main()
    

Output

Document Similarity Score: 0.5608974358974359

Code Explanation

A brief overview:

  • doc_to_synsets: This method converts the document into a list of synsets by tokenizing the text, tagging each token with its part of speech, and then finding the appropriate synset for each token.
  • similarity_score: This method calculates the similarity between two sets of synsets. It compares each synset in the first set with all synsets in the second set and finds the highest similarity score for each comparison.
  • document_similarity: This method calculates the symmetrical similarity between two documents by averaging the similarity scores in both directions.
  • convert_tag: This utility function converts the part-of-speech tags from NLTK’s format to WordNet’s format.

How does similarity_score() work?

The similarity_score() function in the TextSimilarityChecker class is designed to compute the semantic similarity between two sets of WordNet synsets. Here’s a detailed explanation of how this function works:

def similarity_score(synsets1, synsets2, distance_type='path'):

Parameters

  • synsets1 and synsets2: These are lists of synsets corresponding to the words in two different documents. Each synset represents a set of synonyms that share a common meaning.
  • distance_type: This parameter determines the type of similarity measure to use. The default is 'path', which uses the path-based similarity method; the other option supported in this implementation is 'wup' (Wu-Palmer similarity).
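For instance, assuming the TextSimilarityChecker class above is defined and the NLTK resources are downloaded, you could call the method directly like this (a usage sketch; the exact scores depend on your tagger and WordNet versions):

synsets1 = TextSimilarityChecker.doc_to_synsets('I like books')
synsets2 = TextSimilarityChecker.doc_to_synsets('I like reading')

# Default path-based similarity
print(TextSimilarityChecker.similarity_score(synsets1, synsets2))

# Wu-Palmer similarity instead of the default 'path'
print(TextSimilarityChecker.similarity_score(synsets1, synsets2, distance_type='wup'))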

Different Types of Similarity Measures in WordNet

WordNet provides several methods to measure the semantic similarity between synsets. These methods mainly leverage the hierarchical structure of WordNet, where words are organized in terms of hypernyms (more general terms) and hyponyms (more specific terms). Here’s an overview of the most commonly used similarity measures provided by WordNet in NLTK:

Path Similarity

  • How It Works: Measures the shortest path in the hypernym/hyponym taxonomy between two synsets. The score is computed as 1 / (1 + length of the shortest path between the synsets), so it ranges from values near 0 (distantly related synsets) up to 1 (identical synsets).
  • Use Cases: Useful when you need a simple and quick measure of similarity that doesn’t require any additional information about the corpus.

Wu-Palmer Similarity (WUP)

  • How It Works: Measures the depth of the two synsets in the WordNet taxonomies, along with the depth of the least common subsumer (LCS), which is the deepest node that is an ancestor of both synsets. The formula is 2 * depth(LCS) / (depth(s1) + depth(s2)).
  • Use Cases: Effective in cases where you need to consider the actual location of the synsets within the taxonomy, providing a balance between depth and path length. It’s good for tasks where more emphasis is placed on the specific branches of the taxonomy being compared.

Lin Similarity

  • How It Works: An information-content based measure that compares the information content of the least common subsumer (LCS) with that of the two synsets: 2 * IC(LCS) / (IC(s1) + IC(s2)), where IC(s) = -log P(s) and P(s) is the probability of encountering an instance of the synset in a large corpus.
  • Use Cases: Best used when corpus-specific information content is available, making it suitable for applications where differentiation based on rarity or commonality of usage in a specific corpus is important.

Choosing the right measure often depends on the specific requirements of the application, the nature of the text data, and the availability of corpus-specific resources for calculating information content.
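To compare the three measures side by side, here is a small sketch. Note that Lin similarity needs an information-content file; the Brown-corpus file 'ic-brown.dat' used below ships with NLTK's wordnet_ic resource (the scores you get depend on your WordNet and corpus versions):

import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Information-content data computed from the Brown corpus
nltk.download('wordnet_ic')
brown_ic = wordnet_ic.ic('ic-brown.dat')

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

print("Path:     ", dog.path_similarity(cat))
print("Wu-Palmer:", dog.wup_similarity(cat))
print("Lin:      ", dog.lin_similarity(cat, brown_ic))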

Advantages and Limitations of the WordNet-Based Approach

Using WordNet synsets for semantic similarity measures is not necessarily outdated, but it is considered less advanced compared to newer methods. Here are some of the advantages and limitations of the above WordNet-based approach:

Advantages

  • Semantic Awareness: Unlike simple lexical similarity measures, the WordNet-based approach considers the meaning of words, allowing for more accurate comparisons between texts with similar semantics but different surface forms.
  • Part-of-Speech Sensitivity: The approach distinguishes between different parts of speech, improving the accuracy of the similarity measure.
  • Robustness to Synonyms: WordNet’s synsets allow the approach to recognize synonyms, making it more robust to variations in wording.
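A minimal illustration of this last point (in current WordNet versions, 'car' and 'automobile' share the synset Synset('car.n.01'), though the ordering of returned synsets can vary):

from nltk.corpus import wordnet as wn

car = wn.synsets('car')[0]           # Synset('car.n.01')
auto = wn.synsets('automobile')[0]   # also Synset('car.n.01')

print(car == auto)                   # True: same synset, same concept
print(car.path_similarity(auto))     # 1.0: identical synsets are maximally similar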

Limitations

  • Coverage Limitations: WordNet does not cover every word in the English language, and its static nature means it may not include newly coined words or slang.
  • Context Sensitivity: The approach does not account for the broader context of the words within the document, limiting its ability to handle polysemy (words with multiple meanings).
  • Computational Complexity: The process of comparing synsets for every word in two documents can be computationally expensive, particularly for large texts.

While the WordNet-based approach remains a valuable tool for semantic similarity tasks, it is often outperformed by modern methods that incorporate deeper contextual understanding and are more scalable.

What are Modern Alternatives?

Modern alternatives to the WordNet-based approach leverage word embeddings and deep learning models to achieve more accurate and context-aware text similarity measures. Here are some modern alternative approaches:

  • Word2Vec: This algorithm generates word embeddings by analyzing the context in which words appear. Words that appear in similar contexts are mapped to nearby vectors in a high-dimensional space, allowing for more nuanced similarity comparisons.
  • FastText: FastText builds on Word2Vec by incorporating subword information, making it better at handling out-of-vocabulary words and morphological variations. This approach is particularly useful for languages with rich morphology.
  • BERT (Bidirectional Encoder Representations from Transformers): BERT is a deep learning model that generates context-sensitive word embeddings, meaning that the same word can have different representations based on its context. BERT has significantly improved performance on various NLP tasks, including text similarity, by capturing the nuances of language better than static word embeddings.
  • GPT (Generative Pre-trained Transformer): Like BERT, GPT is a transformer-based model that generates context-aware embeddings. GPT is particularly effective at generating text and understanding complex language patterns.

Transformer models, including BERT, GPT, and T5, have set new benchmarks in NLP by using self-attention mechanisms to capture contextual relationships within texts more effectively than traditional approaches like WordNet.
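To give a flavor of this, here is a minimal sketch using the sentence-transformers library, which wraps pretrained transformer models for sentence embeddings (this is a separate dependency, installed with pip install sentence-transformers; 'all-MiniLM-L6-v2' is just one commonly used small model):

from sentence_transformers import SentenceTransformer, util

# Load a small pretrained sentence-embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ['I like books', 'I like reading']
embeddings = model.encode(sentences)

# Cosine similarity between the two sentence embeddings
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"Transformer-based similarity: {score.item():.4f}")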

While WordNet-based methods provide a solid introduction to text similarity and are still useful in some applications, modern alternatives like word embeddings and transformer models offer superior performance, especially in tasks that require understanding context and handling large-scale data.

Text similarity is a critical task in NLP, with applications ranging from plagiarism detection to sentiment analysis. Traditional approaches using WordNet and NLTK provide a strong foundation for beginners, offering semantic awareness and a robust introduction to text similarity. However, as NLP continues to evolve, modern alternatives like word embeddings and transformer models offer more accurate and scalable solutions, particularly for complex and resource-intensive tasks.
