Embeddings#

Word vectors, or word embeddings, are numerical representations of words as points in a multidimensional space, typically stored as rows of an embedding matrix. Their purpose is to give a computer system a representation of a word it can actually work with: computers cannot process raw text directly, but they can process numbers quickly and efficiently, so each word is converted into a vector of numbers.

(Figure: words mapped to vectors in a multidimensional embedding space.)
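
As a toy illustration (the vectors below are made-up numbers, not taken from any real model), a word embedding is just a lookup that maps each word to a fixed-length list of numbers:

# Hypothetical 3-dimensional embeddings with made-up values, purely to show the idea
toy_embeddings = {
    "king":  [0.8, 0.3, 0.1],
    "queen": [0.7, 0.4, 0.1],
    "cat":   [0.1, 0.9, 0.5],
}

print(toy_embeddings["king"])  # the word "king" becomes the vector [0.8, 0.3, 0.1]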

This code will load pre-trained word embeddings, reduce their dimensionality to 3 using PCA, and plot them in a 3D space using matplotlib.

import gensim.downloader as api
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection (optional on recent matplotlib)

# Load pre-trained GloVe word vectors (Wikipedia + Gigaword, 100 dimensions)
model = api.load("glove-wiki-gigaword-100")  # or "word2vec-google-news-300", or "glove-wiki-gigaword-50"

# List of words to visualize
words = ['king', 'queen', 'man', 'woman', 'paris', 'france', 'cat', 'cats']

# Get word vectors
word_vectors = [model[word] for word in words]

# Reduce dimensions to 3 using PCA
pca = PCA(n_components=3)
reduced_vectors = pca.fit_transform(word_vectors)

# Plotting
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

for word, vec in zip(words, reduced_vectors):
    ax.scatter(vec[0], vec[1], vec[2])
    ax.text(vec[0], vec[1], vec[2], word)

ax.set_xlabel('PCA1')
ax.set_ylabel('PCA2')
ax.set_zlabel('PCA3')
plt.show()
(Output: 3D scatter plot of the PCA-reduced word vectors.)

Plurals point in the same direction#

(Figure: singular/plural pairs, e.g. cat/cats, point in the same direction in the embedding space.)
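
A quick way to sanity-check this claim is to compare the singular-to-plural offset vectors of two different nouns. The sketch below reuses the glove-wiki-gigaword-100 vectors loaded above (dog/dogs is an extra pair that is not part of the earlier word list); the exact number depends on the model, but the cosine between the two offsets should come out clearly positive, i.e. the offsets point in roughly the same direction.

import numpy as np
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")

# Offset vectors that take each singular form to its plural
cat_offset = model["cats"] - model["cat"]
dog_offset = model["dogs"] - model["dog"]

# Cosine of the angle between the two offsets: close to 1 means "same direction"
cosine = np.dot(cat_offset, dog_offset) / (np.linalg.norm(cat_offset) * np.linalg.norm(dog_offset))
print(cosine)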

Related or similar words (same direction and closer together)#

(Figure: related words, e.g. paris/france, point in similar directions and lie close together.)
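
gensim can list the nearest neighbours of a word directly; under the hood this ranks every other vector by cosine similarity (defined later on this page). A small sketch, again assuming the glove-wiki-gigaword-100 model loaded earlier:

import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")

# Five closest vectors to "paris" by cosine similarity
print(model.most_similar("paris", topn=5))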

Dot products and similarity#

The dot product of two vectors measures how well they align. Geometrically, it is positive when the vectors point in the same direction (plurals, king-queen, etc.), zero when they are perpendicular, and negative when they point in opposite directions.
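
In coordinates, the dot product is the sum of the element-wise products of the two vectors, and it equals the product of their lengths and the cosine of the angle between them:

\[ A \cdot B = \sum_{i=1}^{n} A_i B_i = ||A|| \, ||B|| \cos(\theta) \]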

(Figure: dot products of aligned, perpendicular, and opposite vectors.)

The highest dot product, dot(c,c), comes from two vectors that are exactly identical. In contrast, because a and b have no overlapping non-zero components, their dot product is zero. The components of b and e partially oppose each other, so their dot product is negative.

import numpy as np

# Toy 4-dimensional feature vectors
a = [ 0, 1, 0, 1]
b = [ 1, 0, 1, 0]
c = [ 0, 1, 1, 0]
d = [.5, 0,.5, 0]
e = [ 0, 1,-1, 0]

print(f"a.b  = {np.dot(a,b)}")
print(f"b.c  = {np.dot(b,c)}")
print(f"b.d  = {np.dot(b,d)}")
print(f"c.c  = {np.dot(c,c)}")
print(f"b.e  = {np.dot(b,e)}")
a.b  = 0
b.c  = 1
b.d  = 1.0
c.c  = 2
b.e  = -1

Cosine similarity#

Given two n-dimensional vectors of attributes, A and B, the cosine similarity \(\cos(\theta)\) is computed from their dot product and their magnitudes. The resulting similarity ranges from \(-1\), meaning exactly opposite, to \(1\), meaning exactly the same, with \(0\) indicating orthogonality (decorrelation); in-between values indicate intermediate similarity or dissimilarity.

\[ \cos(\theta) = \frac{A \cdot B}{||A|| \, ||B||} \]

Let's use the pre-trained GloVe word-vector model glove-wiki-gigaword-50 and compute the similarity between various word pairs using cosine similarity.

import gensim.downloader as api
import numpy as np

def cosine_similarity(vec_a, vec_b):
    """
    Calculate the cosine similarity between two vectors.
    
    Args:
    - vec_a: A numpy array representing the first vector.
    - vec_b: A numpy array representing the second vector.
    
    Returns:
    - Cosine similarity as a float.
    """
    dot_product = np.dot(vec_a, vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    return dot_product / (norm_a * norm_b)

# Load pre-trained GloVe word vectors (50 dimensions)
model = api.load("glove-wiki-gigaword-50")

# Words whose vectors we will compare
words = ['king', 'queen', 'man', 'woman', 'paris', 'france', 'cat', 'cats', 'angry', 'happy']

# Get word vectors
word_vectors = [model[word] for word in words]

embeddings = dict(zip(words, word_vectors))

king_king = cosine_similarity(embeddings["king"], embeddings["king"])
king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
king_cat = cosine_similarity(embeddings["king"], embeddings["cat"])
king_man = cosine_similarity(embeddings["king"], embeddings["man"])
king_woman = cosine_similarity(embeddings["king"], embeddings["woman"])
king_paris = cosine_similarity(embeddings["king"], embeddings["paris"])
france_paris = cosine_similarity(embeddings["france"], embeddings["paris"])
cat_cats = cosine_similarity(embeddings["cat"], embeddings["cats"])
angry_happy = cosine_similarity(embeddings["angry"], embeddings["happy"])

print("King vector: ", embeddings["king"])
print(f"Cosine similarity between king_king: ", king_king)
print(f"Cosine similarity between king_queen: ", king_queen)
print(f"Cosine similarity between king_cat: ", king_cat)
print(f"Cosine similarity between king_man: ", king_man)
print(f"Cosine similarity between king_woman: ", king_woman)
print(f"Cosine similarity between king_paris: ", king_paris)
print(f"Cosine similarity between france_paris: ", france_paris)
print(f"Cosine similarity between cat_cats: ", cat_cats)
print(f"Cosine similarity between angry happy: ", angry_happy)
King vector:  [ 0.50451   0.68607  -0.59517  -0.022801  0.60046  -0.13498  -0.08813
  0.47377  -0.61798  -0.31012  -0.076666  1.493    -0.034189 -0.98173
  0.68229   0.81722  -0.51874  -0.31503  -0.55809   0.66421   0.1961
 -0.13495  -0.11476  -0.30344   0.41177  -2.223    -1.0756   -1.0783
 -0.34354   0.33505   1.9927   -0.04234  -0.64319   0.71125   0.49159
  0.16754   0.34344  -0.25663  -0.8523    0.1661    0.40102   1.1685
 -1.0137   -0.21585  -0.15155   0.78321  -0.91241  -1.6106   -0.64426
 -0.51042 ]
Cosine similarity between king_king:  1.0
Cosine similarity between king_queen:  0.7839044
Cosine similarity between king_cat:  0.38628823
Cosine similarity between king_man:  0.5309377
Cosine similarity between king_woman:  0.41133782
Cosine similarity between king_paris:  0.30651847
Cosine similarity between france_paris:  0.8025329
Cosine similarity between cat_cats:  0.78652704
Cosine similarity between angry happy:  0.5709519
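
As expected, closely related pairs such as king/queen, france/paris, and cat/cats have the highest cosine similarities, while a weakly related pair such as king/paris scores much lower.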