Feature engineering in NLP is the process of transforming raw text into meaningful features that improve the performance of machine learning models. It is essential because most machine learning algorithms require numerical input, while NLP tasks work with unstructured text. The process typically includes the following key steps:
- Text Preprocessing: This step involves cleaning and preparing the raw text data.
- Feature Extraction: In NLP, feature extraction involves converting text data into numerical features that can be used by machine learning algorithms. Common techniques for feature extraction include:
- TRADITIONAL FEATURE ENGINEERING MODELS
- Bag of Words (BoW): Representing text as a matrix of word frequencies or presence/absence indicators.
- N-grams: Capturing sequences of N words to consider local context.
- Term Frequency-Inverse Document Frequency (TF-IDF): Assigning weights to words based on their importance in a document relative to a corpus of documents.
- A dedicated example shows TF-IDF vectorization in use.
- Document Similarity: document similarity (or document distance) quantifies how similar or dissimilar two or more text documents are. Cosine similarity is one of the most common metrics, so I've written a script that calculates and compares the cosine similarity of three pairs of matrices. It does this using both a library implementation and a custom "from scratch" version of the cosine similarity function.
- Document Clustering: document clustering is a text analysis technique that groups similar documents together based on their content. This process makes it easier to identify patterns and themes within a large set of documents. To illustrate how this works, I've created a script that performs hierarchical clustering on a collection of documents. It begins by measuring the similarity between the documents using their TF-IDF representations. Then, it constructs a hierarchical tree structure (dendrogram) to visualize how the documents are grouped based on their similarity. For the sake of simplicity, I've used only four sentences as documents.
- Topic Models: they are statistical models used to discover hidden thematic structures within a collection of documents. They enable the identification and analysis of common topics or themes. As an example, I've chosen the Latent Dirichlet Allocation (LDA) algorithm and implemented a script that performs topic modeling on the text data. This script returns a Pandas DataFrame containing the topic distribution for each document in the corpus.
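
As a rough illustration of the TF-IDF and cosine similarity ideas described above, here is a minimal from-scratch sketch in plain Python. It is not the script from this repository: the helper names (`tokenize`, `tf_idf_vectors`, `cosine_similarity`), the three sample sentences, and the smoothed IDF variant are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Assumption: raw term counts for TF and a smoothed IDF,
    ln((1 + N) / (1 + df)) + 1. Other weighting schemes are equally valid.
    """
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[t] * idf[t] for t in vocabulary])
    return vocabulary, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention used here: an empty vector matches nothing.
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, used only for this illustration.
documents = [
    "the sun is a star",
    "the sun is bright",
    "cats chase mice",
]
vocabulary, vectors = tf_idf_vectors(documents)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(f"doc0 vs doc1: {sim_related:.3f}")
print(f"doc0 vs doc2: {sim_unrelated:.3f}")
```

The first two sentences share several words, so their similarity is positive, while the third shares none with the first and scores zero. The same pairwise similarities could serve as the input to a clustering step.
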
- ADVANCED FEATURE ENGINEERING MODELS
- Word2Vec: This technique transforms words into numerical vectors so that computers can process them effectively. It is part of the vector space model family: words are represented as vectors in such a way that similar words lie close to each other in the space. To show how this works, I've created a script that uses the Gensim library to train a Word2Vec model on a given corpus of text, then finds the vector representation of the word 'sun' (for example) and identifies the top 3 words most similar to 'sun' based on the learned word embeddings.
- CBOW (Continuous Bag of Words): it is a word embedding model that learns to predict a word from its context words in a sentence, creating word vectors that represent words' meanings and relationships. To provide a simple example, I've developed a script that tokenizes and preprocesses Shakespeare's "Hamlet", generates training data for a CBOW embedding model, trains the model, and then extracts the word embeddings learned by predicting the target word from its context words.
- Skip-Gram: it is the mirror image of CBOW: the model learns to predict the context words given a target word, by analyzing the contexts in which words appear within a substantial body of text. In simpler terms, it captures the semantic connections between words. To illustrate, I've developed a script that performs tasks akin to the CBOW example. However, in this case, the word embeddings are learned by predicting the context words from the target word.
- Gensim: it is a library that provides an efficient implementation of the Word2Vec model. It's widely used for tasks like text document similarity, topic extraction, and word vector representations. To illustrate, I've developed a script that trains a Gensim Word2Vec model on a text corpus, extracts word embeddings, and finds words similar to a specified target word, with adjustable model parameters.
- GloVe: it stands for 'Global Vectors' and is an unsupervised learning model that can be used to obtain dense word vectors, similar to Word2Vec. To illustrate how it works, I've created a script that processes text and generates word embeddings using both CBOW and GloVe. CBOW captures word meanings based on local context, while GloVe analyzes word relationships using corpus-wide co-occurrence statistics. Finally, the script performs word similarity and analogy tasks, such as measuring the similarity between 'king', 'queen', and 'woman', if these words are present in the model's vocabulary.

In the 'FeatureEngineering' directory, you can find simple examples of how to use these techniques.
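
To make the difference between CBOW and Skip-Gram concrete, the sketch below builds the training examples each model would see from the same token sequence. It is only an illustration in plain Python, not one of the scripts in this repository: the function names (`context_windows`, `cbow_examples`, `skipgram_examples`), the window size, and the sample sentence are assumptions, and no model is actually trained.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield target, context


def cbow_examples(tokens, window=2):
    """CBOW: the context words are the input, the target word is the label."""
    return [(context, target) for target, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    """Skip-Gram: the target word is the input, each context word is a label."""
    return [
        (target, context_word)
        for target, context in context_windows(tokens, window)
        for context_word in context
    ]


# Hypothetical sample sentence, used only for this illustration.
tokens = "to be or not to be".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print("CBOW (context -> target):")
for context, target in cbow:
    print(f"  {context} -> {target}")

print("Skip-Gram (target -> context word):")
for target, context_word in skipgram:
    print(f"  {target} -> {context_word}")
```

CBOW produces one example per position, with the whole context as input, whereas Skip-Gram produces one example per (target, context word) pair, so it yields more training examples from the same text.
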