Similarity is a measure of how alike two objects are. To compute it, we first need to identify the features of the objects being compared. For text, it is used to group words with similar meanings, and it is a building block for natural language understanding tasks.
The similarity measure is a value that ranges from 0 to 1. A lower score means the two objects are less similar; a higher score means they are more similar.
WordNet is a semantic dictionary of words, interlinked by semantic relations. It includes rich linguistic information such as part of speech, synonyms, and antonyms. It is machine-readable and extensively used. It arranges concepts in a hierarchy under a parent root, and many similarity measures make use of this hierarchy.
Measures to calculate nearest words:
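import nltk
from nltk.corpus import wordnet as wn
# Look up the first noun sense of each word (needs nltk.download('wordnet'))
deer = wn.synset('deer.n.01')
elk = wn.synset('elk.n.01')
horse = wn.synset('horse.n.01')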
1. Path similarity is based on the shortest path between two concepts in the hierarchy. The similarity is inversely related to the path distance: similarity = 1 / (distance + 1).
path_similarity(deer, elk) = ½ = 0.5
path_similarity(deer, giraffe) = ⅓ = 0.33
deer.path_similarity(elk)
deer.path_similarity(horse)
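To see where the numbers 0.5 and 0.33 come from without loading WordNet, here is a minimal sketch on a tiny hand-made hierarchy. The `PARENT` table and the helper names are assumptions made up for illustration; the real WordNet graph is far larger, and a synset there can have more than one parent.

```python
# Hypothetical mini-taxonomy (child -> parent); not the real WordNet graph.
PARENT = {
    "elk": "deer",
    "deer": "ruminant",
    "giraffe": "ruminant",
}

def ancestors(node):
    """Return the chain [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def path_distance(a, b):
    """Count the edges on the shortest path between a and b in the tree."""
    steps_from_b = {node: i for i, node in enumerate(ancestors(b))}
    for steps_from_a, node in enumerate(ancestors(a)):
        if node in steps_from_b:
            # First shared ancestor on the way up from a: join the two legs.
            return steps_from_a + steps_from_b[node]
    raise ValueError("the two concepts share no ancestor")

def path_similarity(a, b):
    """Similarity = 1 / (shortest path distance + 1)."""
    return 1 / (path_distance(a, b) + 1)

print(path_similarity("deer", "elk"))                # 0.5
print(round(path_similarity("deer", "giraffe"), 2))  # 0.33
```

In this toy tree, elk sits one edge below deer (distance 1), while deer and giraffe meet at ruminant (distance 2).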
2. Lowest common subsumer (LCS) - The closest ancestor shared by both concepts.
LCS(deer, elk) = deer
LCS(deer, giraffe) = ruminant
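The LCS lookup can be sketched the same way: collect the ancestors of one concept, then walk up from the other until the first shared node. The `PARENT` table below is a hypothetical three-entry hierarchy chosen to match the two examples above, not WordNet itself.

```python
# Hypothetical mini-taxonomy (child -> parent), for illustration only.
PARENT = {"elk": "deer", "deer": "ruminant", "giraffe": "ruminant"}

def ancestors(node):
    """Return [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lowest_common_subsumer(a, b):
    """Return the deepest node that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):  # walks upward from b, so the first hit is the lowest
        if node in seen:
            return node
    return None  # no shared ancestor

print(lowest_common_subsumer("deer", "elk"))      # deer
print(lowest_common_subsumer("deer", "giraffe"))  # ruminant
```

A concept counts as its own ancestor here, which is why the LCS of deer and elk is deer.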
3. Lin similarity - This measure is based on the information content of the LCS of two concepts.
LinSim(u, v) = 2 * log P(LCS(u, v)) / (log P(u) + log P(v))
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
deer.lin_similarity(elk, brown_ic)
deer.lin_similarity(horse, brown_ic)
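To make the formula concrete, here is a small sketch that evaluates it directly. The probabilities are made-up placeholders, not values from the Brown corpus; in NLTK they come from the information-content file passed to `lin_similarity`.

```python
import math

def lin_similarity(p_lcs, p_u, p_v):
    """Lin similarity from the corpus probabilities of u, v and their LCS."""
    return 2 * math.log(p_lcs) / (math.log(p_u) + math.log(p_v))

# Hypothetical probabilities, chosen only to illustrate the arithmetic.
p_deer = 0.002
p_elk = 0.0005

# In the example hierarchy LCS(deer, elk) = deer, so P(LCS) = P(deer).
score = lin_similarity(p_deer, p_deer, p_elk)
print(round(score, 2))
```

Two properties are worth noticing: when u, v, and their LCS are the same concept the score is 1, and when the LCS is so general that its probability is 1 (the root), the numerator is log 1 = 0 and the score drops to 0.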
4. Collocations and distributional similarity - Two words that frequently appear in similar contexts are more likely to be semantically related.
The strength of association between two words can be measured by how frequently they occur together (collocate).
This is an unsupervised approach: a fully automatic mathematical/statistical technique for extracting and inferring relations of expected contextual usage of words in passages of discourse. Using matrix operations, we can obtain relations among words. It is useful when we want to infer related words from their context.
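As a rough illustration of the distributional idea, the sketch below counts the words that appear within a small window around each word and compares the resulting count vectors with cosine similarity. The toy sentences, the window size, and the function names are all assumptions made for this example; a real system would use a large corpus and usually a weighting such as pointwise mutual information.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus, for illustration only.
SENTENCES = [
    "the deer ate grass in the forest",
    "the elk ate grass in the meadow",
    "the car drove on the road",
    "the truck drove on the highway",
]

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = context_vectors(SENTENCES)
print(cosine(vectors["deer"], vectors["elk"]))  # same contexts, so close to 1
print(cosine(vectors["deer"], vectors["car"]))  # only "the" in common, so lower
```

Here deer and elk share the contexts "the", "ate", and "grass", so they score higher than deer and car, which share only "the".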
Thanks for reading.
Reference: https://www.coursera.org/lecture/python-text-mining/semantic-text-similarity-DpNWl