…continuing to work my way through Andrew Ng’s deep learning classes on Coursera. Week 2 of the Sequence Models class focuses on word embeddings.

Word embeddings

A word embedding is a projection of natural language words into a dense vector space, with words represented by vectors of 50 to 300 dimensions or more. In some mysterious way, that vector space encodes relationships between words based on their co-occurrence in a large corpus: relationships like countries and their capitals, verbs and their tenses, words and their comparative and superlative forms (big, bigger, biggest), or the gender-specificity of words like brother and sister.
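
To make the analogy idea concrete, here's a minimal sketch, assuming `embeddings` is a Python dict mapping words to NumPy vectors (e.g. loaded from pretrained GloVe files), of finding the word whose vector best completes "man is to king as woman is to ___" via cosine similarity:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, embeddings):
    """Return the word d (other than a, b, c) whose vector is closest to b - a + c.

    The offset between related word pairs tends to be roughly constant,
    so king - man + woman lands near queen.
    """
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = cosine_similarity(target, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# analogy("man", "king", "woman", embeddings)  # ideally "queen"
```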

Word2Vec is one algorithm for learning word embeddings. Another is GloVe (Global Vectors for Word Representation), with pretrained embeddings available for download from the Stanford NLP group.
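
The Stanford downloads are plain text files, one word per line followed by its vector components, so loading them takes only a few lines of Python. A rough sketch (the file name below refers to the 50-dimensional set trained on Wikipedia and Gigaword):

```python
import numpy as np

def load_glove(path):
    """Read GloVe vectors from a plain-text file: one word per line,
    followed by its space-separated vector components."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# embeddings = load_glove("glove.6B.50d.txt")
```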

Sentiment analysis

Word embeddings can help with tasks like sentiment analysis, particularly in cases where you can extend the reach of a small training set by leveraging embeddings from a large corpus. RNNs help characterize phrases like “Not feeling happy” or “Completely lacking in good atmosphere and good taste” where memory is needed to pick up the negation of positive words.
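
One common way to wire this up, sketched below with tf.keras rather than the course assignments, is to initialize an Embedding layer from the pretrained vectors, freeze it, and let a small LSTM plus a sigmoid output learn the sentiment classification. The sizes here are placeholders:

```python
import numpy as np
import tensorflow as tf

# Placeholder sizes for illustration.
vocab_size = 10_000     # words in the tokenizer's vocabulary
embedding_dim = 50      # matches the pretrained GloVe vectors
max_len = 100           # each example padded/truncated to this many tokens

embedding_layer = tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                            trainable=False)  # frozen

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_len,)),
    embedding_layer,
    tf.keras.layers.LSTM(64),                        # memory over the whole phrase
    tf.keras.layers.Dense(1, activation="sigmoid"),  # positive vs. negative
])

# Copy pretrained vectors into the frozen embedding layer. In practice
# embedding_matrix would be built by looking up each vocabulary word in
# the GloVe dictionary; zeros here are just a stand-in.
embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype=np.float32)
embedding_layer.set_weights([embedding_matrix])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Freezing the embedding layer is what lets a small labeled training set benefit from the much larger unlabeled corpus the embeddings were trained on.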

Debiasing

Interestingly, word embeddings make visible some of the unwanted cultural baggage we’re all carrying around with us. See the paper Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.
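
The core "neutralize" step in that paper (also covered in the course exercise) projects out the component of a word vector that lies along a bias direction such as she minus he. A rough NumPy sketch, assuming `embeddings` maps words to vectors:

```python
import numpy as np

def neutralize(e, g):
    """Remove the component of word vector e along the bias direction g,
    leaving only the part orthogonal to it."""
    projection = (np.dot(e, g) / np.dot(g, g)) * g
    return e - projection

# g = embeddings["woman"] - embeddings["man"]
# embeddings["programmer"] = neutralize(embeddings["programmer"], g)
```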

Textio is a Seattle company that’s made a product out of helping people see the structure, desirable or undesirable, underneath the language used in job descriptions and elsewhere. Hey rockstar ninja code-warriors, ready to work hard and play hard? 🙄