NLP – Word Embedding


Word Embeddings are numerical representations of text, and the same text may have several different numerical representations. But before we dive into the details of Word Embeddings, one question should be asked: why do we need Word Embeddings?

As it turns out, many Machine Learning algorithms and almost all Deep Learning architectures are incapable of processing strings or plain text in their raw form. They require numbers as inputs to perform any sort of job, be it classification, regression or anything else. And with the huge amount of data that exists in text format, it is imperative to extract knowledge from it and build applications. Some real-world applications of text processing are sentiment analysis of reviews (by Amazon and others) and document or news classification or clustering (by Google and others).

Let us now define Word Embeddings formally. A Word Embedding generally maps a word to a vector by way of a dictionary. Let us break this sentence down into finer details to have a clear view.

Take a look at this example: sentence = "Word Embeddings are Word converted into numbers"

A word in this sentence may be "Embeddings" or "numbers" etc.

A dictionary may be the list of all unique words in the sentence. So, a dictionary may look like: ['Word', 'Embeddings', 'are', 'converted', 'into', 'numbers']

A vector representation of a word may be a one-hot encoded vector, with 1 at the position of the word in the dictionary and 0 everywhere else. The vector representation of "numbers" in this format according to the above dictionary is [0,0,0,0,0,1] and that of "converted" is [0,0,0,1,0,0].
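The one-hot scheme above can be sketched in a few lines of Python. This is only an illustration: the helper name `one_hot` and the choice to order the dictionary by first appearance are assumptions, not something prescribed above.

```python
# Minimal one-hot encoding sketch for the example sentence.
sentence = "Word Embeddings are Word converted into numbers"

# Build the dictionary: unique words, in order of first appearance (an assumption).
dictionary = []
for token in sentence.split():
    if token not in dictionary:
        dictionary.append(token)


def one_hot(word, vocab):
    """Return a list with 1 at the word's index in vocab and 0 elsewhere."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector


print(dictionary)
print(one_hot("numbers", dictionary))    # [0, 0, 0, 0, 0, 1]
print(one_hot("converted", dictionary))  # [0, 0, 0, 1, 0, 0]
```

Note that every vector is as long as the dictionary and any two different words are orthogonal, so this representation carries no notion of similarity between words.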

This is just a very simple method to represent a word in vector form. Let us look at different types of Word Embeddings or Word Vectors and their advantages and disadvantages relative to one another.

Prediction based Word Embedding:

Mikolov et al. introduced word2vec to the NLP community. These methods are prediction based in the sense that they assign probabilities to words, and they proved to be state of the art for tasks like word analogies and word similarities. They could also solve analogies such as King - man + woman = Queen, a result that was considered almost magical. So let us look at the word2vec model used today to generate word vectors.

Word2vec is not a single algorithm but a combination of two techniques: CBOW (Continuous Bag of Words) and the Skip-gram model. Both are shallow neural networks that map one or more input words to a target that is also one or more words. Both techniques learn weights which act as word vector representations. Let us discuss both these methods separately and gain intuition into their working.

Skip Gram Model :


[Figure: screenshot from the YouTube video "Lecture 2 Word Vector Representations word2vec"]

Here the 2nd layer is the projection layer. It computes the projection (dot product) of the input word vector w[t] with the output word vectors. The outputs are the context words surrounding the input word. The model tries to maximize the log-likelihood of observing the context words given the input word, so that the vector of an input word scores highly against the vectors of the words that actually appear around it.
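To make the objective concrete, here is a minimal sketch of the quantity skip-gram tries to maximize. It assumes a full softmax over the vocabulary, a window of one word on each side, and small random vectors; the names `skipgram_pairs`, `log_prob`, `in_vecs` and `out_vecs` are hypothetical, and no training step is shown.

```python
import math
import random


def skipgram_pairs(tokens, window=1):
    """Return (input word, context word) pairs for each neighbour within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def log_prob(center, context, in_vecs, out_vecs):
    """log p(context | center) using a softmax over dot products."""
    v = in_vecs[center]
    scores = {w: sum(a * b for a, b in zip(u, v)) for w, u in out_vecs.items()}
    log_norm = math.log(sum(math.exp(s) for s in scores.values()))
    return scores[context] - log_norm


random.seed(0)
tokens = "I like deep learning".split()
vocab = sorted(set(tokens))
dim = 3  # toy dimensionality, chosen only for illustration

# Each word gets an input vector and a separate output (context) vector.
in_vecs = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
out_vecs = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

pairs = skipgram_pairs(tokens, window=1)
log_likelihood = sum(log_prob(c, o, in_vecs, out_vecs) for c, o in pairs)

print(pairs)
print(log_likelihood)  # training would adjust the vectors to increase this value
```

In practice the full softmax is expensive for large vocabularies, which is the motivation for approximations such as negative sampling.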

SkipGram with Negative Sampling/Word Embedding :


Frequency based word embedding :

Window based co-occurrence matrix

Example corpus:
• I like deep learning.
• I like NLP.
• I enjoy flying.
counts    I  like  enjoy  deep  learning  NLP  flying  .
I         0  2     1      0     0         0    0       0
like      2  0     0      1     0         1    0       0
enjoy     1  0     0      0     0         0    1       0
deep      0  1     0      0     1         0    0       0
learning  0  0     0      1     0         0    0       1
NLP       0  1     0      0     0         0    0       1
flying    0  0     1      0     0         0    0       1
.         0  0     0      0     1         1    1       0
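The table can be reproduced with a short script. This sketch assumes a symmetric window of one word on each side, treats the full stop as its own token, and fixes the vocabulary order to match the table; the function name `cooccurrence` is hypothetical.

```python
# Sentences are pre-tokenized with spaces so that "." is a separate token.
corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
index = {word: i for i, word in enumerate(vocab)}


def cooccurrence(sentences, index, window=1):
    """Count how often each pair of words appears within `window` positions."""
    size = len(index)
    counts = [[0] * size for _ in range(size)]
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[word]][index[tokens[j]]] += 1
    return counts


X = cooccurrence(corpus, index)
for word, row in zip(vocab, X):
    print(f"{word:<9}", *row)
```

Because the window is symmetric, the resulting matrix is symmetric as well, and most of its entries are zero even for this tiny corpus.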

Problems with simple co-occurrence vectors

  1. Increase in size with vocabulary
  2. Very high dimensional: require a lot of storage
  3. Subsequent classification models have sparsity issues
  4. Models are less robust

Solution: Low dimensional vectors
• Idea: store “most” of the important information in a fixed, small number of dimensions: a dense vector
• Usually 25 – 1000 dimensions, similar to word2vec
• How to reduce the dimensionality?

One answer: apply Singular Value Decomposition (SVD) to the co-occurrence matrix X and keep only the dimensions with the largest singular values.
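As a rough sketch of this step, the snippet below runs NumPy's `numpy.linalg.svd` on the example co-occurrence matrix and keeps the top k = 2 dimensions. The value of k and the choice to scale the left singular vectors by the singular values are assumptions made for illustration; other conventions are possible.

```python
import numpy as np

vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]

# Co-occurrence matrix X from the example corpus (rows and columns follow vocab).
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
], dtype=float)

# Factorize X = U * diag(S) * Vt, with singular values sorted in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of dimensions to keep (illustrative choice)
word_vectors = U[:, :k] * S[:k]  # one dense k-dimensional row per word

for word, vec in zip(vocab, word_vectors):
    print(f"{word:<9}", np.round(vec, 3))
```

Each word now has a dense 2-dimensional vector instead of a sparse 8-dimensional row. On a realistic vocabulary the full SVD becomes costly, which is one reason prediction-based methods became popular.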

GloVe: Global Vectors for Word Representation :

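GloVe combines the strengths of matrix factorization methods, which use global co-occurrence counts, with those of local context window methods such as the skip-gram model.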
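For reference, the objective from the GloVe paper can be written as a weighted least-squares fit to the log co-occurrence counts. The notation here is an assumption carried over from that paper: X_ij is the co-occurrence count of words i and j, w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are bias terms, V is the vocabulary size, and f is a weighting function that limits the influence of very frequent pairs.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

The dot product term plays the role of a matrix factorization of log X, while the vectors themselves are learned by gradient-based optimization as in word2vec.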



