
Semantic Similarity, LSA and WordNet

May 20, 2020



Semantic similarity:

Similarity is a metric that quantifies how alike two objects are; to compute it, we first need to identify the objects' features. In NLP it is used to group words with similar meanings, and it is a building block in natural language understanding tasks.

The similarity measure is a value that ranges from 0 to 1: a lower score means the two objects are less alike, while a value closer to 1 means they are more similar.

Measures:

  1. Euclidean distance: uses the Pythagorean theorem to compute the straight-line distance between two objects, i.e. the length of the path connecting them.
  2. Manhattan distance: the sum of the absolute differences of their Cartesian coordinates.
  3. Cosine similarity (well suited to sparse vectors/data): the normalised dot product of two vectors, i.e. the cosine of the angle of separation between them. A sketch of all three follows this list.
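
As a rough sketch, all three measures can be computed with NumPy (the two example vectors here are made up for illustration):

import numpy as np

# Two hypothetical feature vectors (illustrative values only)
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 0.0, 1.0])

euclidean = np.linalg.norm(u - v)   # straight-line distance
manhattan = np.sum(np.abs(u - v))   # sum of absolute coordinate differences
cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))  # cosine of the angle between u and v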

WordNet

A semantic dictionary of words, interlinked by semantic relations. It includes rich linguistic information such as part of speech, synonyms, and antonyms. It is machine-readable and widely used. WordNet arranges concepts in a hierarchy under a single root, and many similarity measures are defined over this hierarchy.
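
For example, NLTK exposes this hierarchy directly; a small sketch (requires nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

deer = wn.synset('deer.n.01')   # first noun sense of "deer"
print(deer.hypernyms())         # immediate parent, e.g. [Synset('ruminant.n.01')]
print(deer.root_hypernyms())    # top of the noun hierarchy: [Synset('entity.n.01')]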

Measures for calculating the nearest (most similar) words:


import nltk
from nltk.corpus import wordnet as wn

# Look up the first noun sense of each word (synsets used below)
deer = wn.synset('deer.n.01')
elk = wn.synset('elk.n.01')
horse = wn.synset('horse.n.01')
giraffe = wn.synset('giraffe.n.01')

1. Path similarity: based on the shortest path between two concepts in the hypernym hierarchy. The similarity score is inversely related to the path distance: PathSim(u, v) = 1 / (1 + dist(u, v)).

PathSim(deer, elk) = 1/2 = 0.5

PathSim(deer, giraffe) = 1/3 = 0.33

deer.path_similarity(elk)    # 0.5
deer.path_similarity(horse)  # smaller value: horse is farther from deer in the hierarchy

2. Lowest common subsumer (LCS): find the closest ancestor that both concepts share in the hierarchy; an NLTK sketch follows the examples below.

LCS(deer, elk)  = deer

LCS(deer, giraffe) = ruminant
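
In NLTK, the LCS corresponds to the lowest common hypernym; continuing with the synsets defined above:

deer.lowest_common_hypernyms(elk)      # [Synset('deer.n.01')]
deer.lowest_common_hypernyms(giraffe)  # [Synset('ruminant.n.01')]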

3. Lin similarity: based on the information content of the LCS of the two concepts, estimated from corpus frequencies.

LinSim(u, v) = 2 * log P(LCS(u, v)) / (log P(u) + log P(v))

# Information content statistics from the Brown corpus
# (requires nltk.download('wordnet_ic'))
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
deer.lin_similarity(elk, brown_ic)    # high: the LCS (deer) is very specific
deer.lin_similarity(horse, brown_ic)  # lower: the shared ancestor is more generic

4. Collocations and distributional similarity: two words that frequently appear in similar contexts are more likely to be semantically related.

Distributional similarity measures the strength of association between words by how frequently they occur together (collocate).
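
As a sketch, NLTK's collocation finder can surface strongly associated word pairs, here scored by pointwise mutual information (PMI) over the Brown corpus (requires nltk.download('brown')):

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.corpus import brown

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(brown.words())
finder.apply_freq_filter(10)   # ignore bigrams seen fewer than 10 times
print(finder.nbest(bigram_measures.pmi, 10))  # ten most strongly associated pairs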

Latent Semantic Analysis (LSA):

Function:

It’s an unsupervised learning method: a fully automatic mathematical/statistical technique for extracting and inferring relations of expected contextual usage of words in passages of discourse. Using matrix operations on a term-document matrix (typically singular value decomposition), we can obtain relations among words. This algorithm is useful when we want to determine related words from context.
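
A minimal LSA sketch with scikit-learn: build a term-document matrix and factor it with truncated SVD (the toy documents below are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy corpus
docs = [
    "deer and elk graze in the forest",
    "the elk is a large deer",
    "horses gallop across the open plain",
    "a horse and rider cross the plain",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # document-term matrix
svd = TruncatedSVD(n_components=2)   # keep 2 latent dimensions
doc_vecs = svd.fit_transform(X)      # documents in the latent space
term_vecs = svd.components_.T        # terms in the latent space

Words that occur in similar contexts (such as deer and elk above) should end up close together in the latent space.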



Thanks for reading.

Reference: https://www.coursera.org/lecture/python-text-mining/semantic-text-similarity-DpNWl






