Beyond vectors: efficiency in Natural Language Processing
NYU MSDS student Raul Delgado Sanchez talks diffusion maps
When you text a friend saying ‘I’ll fall you later’, how does your iPhone know to correct ‘fall’ to ‘call’? Auto-correct owes its prowess to a field that continues to grow in importance among computer scientists, and is an especially lively area of study at our very own Center for Data Science: Natural Language Processing (NLP).
Generally speaking, part of NLP research involves estimating ‘the joint probability distribution of words’ in a language. In other words: researchers working in English, for example, use algorithms to analyze large collections of English documents and texts, and calculate which words most frequently appear beside each other in various contexts, or which words share semantic similarity (synonyms). After identifying dominant word patterns in the English language, researchers can then write programs that predict what word is likely to come next in a sentence or a paragraph (the ‘probability distribution’ at work).
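The idea of predicting the next word from counted co-occurrences can be sketched in a few lines. This is a deliberately tiny illustration (a bigram model over a made-up three-sentence corpus), not the sophisticated models the article describes:

```python
# Toy sketch: estimate bigram probabilities from a tiny made-up corpus
# and predict the most likely next word. Real NLP systems use far larger
# corpora and far richer models; this only illustrates the counting idea.
from collections import Counter, defaultdict

corpus = (
    "i will call you later . "
    "i will call you tomorrow . "
    "i will fall asleep soon ."
).split()

# Count how often each word follows each preceding word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most probable next word and its estimated probability."""
    counts = following[word]
    total = sum(counts.values())
    best, n = counts.most_common(1)[0]
    return best, n / total

# 'call' follows 'will' twice in the corpus, 'fall' only once,
# so the model prefers 'call' -- the same intuition behind auto-correct.
print(predict_next("will"))
```

Under this toy model, `predict_next("will")` returns `'call'` with probability 2/3, which is exactly the kind of evidence an auto-correct system uses to flag ‘fall’ as an unlikely word after ‘I’ll’.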
NLP research not only makes features like auto-correct possible, but also has extraordinary implications for academic research. For example, sophisticated NLP programs could eventually help literary scholars or historians fill in the missing words in aged, damaged, or illegible manuscripts.
Today, there are a number of approaches to capturing and understanding language patterns in NLP. A popular method is word2vec, which transforms words into vectors. Words that often appear close to each other or share semantic similarity end up occupying nearby positions on a graph like the one below, which depicts word vectors related to ‘good’ and ‘bad’ words.
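The intuition behind such vectors — that words used in similar contexts get similar representations — can be demonstrated without word2vec itself. The sketch below builds crude co-occurrence vectors from a handful of invented sentences and compares them with cosine similarity; word2vec learns denser vectors with a neural network, but the ‘similar contexts, similar vectors’ principle is the same:

```python
# Illustrative sketch, not word2vec itself: represent each word by its
# co-occurrence counts with every other word in a small window, then
# compare words with cosine similarity. Words that appear in similar
# contexts ('movie' and 'film' below) end up with similar vectors.
import math
from collections import Counter

sentences = [
    "the movie was good",
    "the movie was great",
    "the movie was bad",
    "the film was good",
    "the film was terrible",
]

vocab = sorted({w for s in sentences for w in s.split()})

def context_vector(target, window=2):
    """Count words appearing within `window` positions of `target`."""
    counts = Counter()
    for s in sentences:
        words = s.split()
        for i, w in enumerate(words):
            if w == target:
                lo, hi = max(0, i - window), min(len(words), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[words[j]] += 1
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# 'movie' and 'film' occur in near-identical contexts, so their vectors
# are far more similar than those of 'movie' and 'good'.
sim_movie_film = cosine(context_vector("movie"), context_vector("film"))
sim_movie_good = cosine(context_vector("movie"), context_vector("good"))
```

On this toy corpus, `sim_movie_film` comes out well above `sim_movie_good`, mirroring the clusters of related words that a word2vec plot displays.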