How do Alexa, Google Assistant, and Google Home understand what we humans say or ask them? How does the first robot citizen understand and speak English with humans? It all starts with natural language processing. Over the last couple of years we humans have generated a huge amount of text data, and it is very important for machines to understand and process that text if you want to give them the power of artificial intelligence. In this post, we will learn how to prepare text data for machine learning with scikit-learn.
Machines cannot understand English or any text data by default. Text data needs special preparation before you can feed it to a machine to predict something from it. That preparation includes several steps, such as removing stop words, correcting spelling mistakes, removing meaningless words, removing rare words, and many more.
In natural language processing terms, we will perform three major steps: feature extraction, text pre-processing, and advanced text pre-processing. These steps include several sub-steps as well, which we will see down the road.
1. Feature extraction and text pre-processing
The first step of preparing text data is applying feature extraction and basic text pre-processing. This stage involves several steps, as follows:
- Removing Punctuations
- Removing HTML tags
- Special Characters removal
- Removing AlphaNumeric words
- Tokenization
- Removal of Stopwords
- Removing most Frequent words and Rare words
- Correcting Spelling mistakes
- Lower casing
- Stemming
- Lemmatization
Before we apply feature extraction and text pre-processing, we will first load our dataset using the pandas library. Here we will use Amazon's food review dataset, available on Kaggle. We will also drop the duplicate rows present in the dataset.
# -*- coding: utf-8 -*- """ How to Prepare Text Data for Machine Learning with scikit-learn """ import pandas as pd # Importing the dataset dataset = pd.read_csv('Reviews.csv') dataset = dataset.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
1.1 Removing Punctuations
The first step is removing punctuation from the text. For humans it adds value, but for a machine it isn't really useful. Notice that the snippet below also expands contractions (for example, "won't" becomes "will not") so that stripping the apostrophes later does not destroy meaning. Use the below code snippet:
# -*- coding: utf-8 -*- """ How to Prepare Text Data for Machine Learning with scikit-learn """ import re # specific phrase = re.sub(r"won't", "will not", review) phrase = re.sub(r"can\'t", "can not", review) # general phrase = re.sub(r"n\'t", " not", review) phrase = re.sub(r"\'re", " are", review) phrase = re.sub(r"\'s", " is", review) phrase = re.sub(r"\'d", " would", review) phrase = re.sub(r"\'ll", " will", review) phrase = re.sub(r"\'t", " not", review) phrase = re.sub(r"\'ve", " have", review) phrase = re.sub(r"\'m", " am", review)
1.2 Removing HTML tags
Okay, this one is obvious, right? When you get text data from web scraping, it is very common to end up with HTML tags in your dataset. HTML is for decorating text in web pages, which is not helpful for model building. In general, removing HTML tags is a good practice to follow. Use the below code:
# -*- coding: utf-8 -*- """ How to Prepare Text Data for Machine Learning with scikit-learn """ from bs4 import BeautifulSoup review = 'I <strong>really</strong> like this food' soup = BeautifulSoup(review, 'lxml') print(soup.get_text()) Output: > I really like this food
1.3 Special Characters removal
You might find words or characters in the dataset that contain special characters, which are not helpful in NLP. The best example I can give you is the use of hashtags in comments. Removing special characters means keeping only alphabetic characters in the text data. Use the below code snippet to remove special characters:
# -*- coding: utf-8 -*- """ How to Prepare Text Data for Machine Learning with scikit-learn """ import re review = 'I really like this food #foodlove #eatfit' review = re.sub('[^a-zA-Z]', ' ', review) print(review) Output: > I really like this food foodlove eatfit
1.4 Removing AlphaNumeric words
Again, AlphaNumeric words don’t help in building a predictive model. These words don’t have meaning, so it’s better to get rid of them as well. Use the below code,
# -*- coding: utf-8 -*- """ How to Prepare Text Data for Machine Learning with scikit-learn """ import re review = 'I really like this food11333' review = re.sub('\S*\d\S*', '', review).strip() print(review) Output: > I really like this
1.5 Tokenization
Tokenization means parsing your text into a list of words (tokens). It is the basis for other pre-processing steps, such as removing stop words, which is our next point. Use the below code for tokenization:
# -*- coding: utf-8 -*- """ How to Prepare Text Data for Machine Learning with scikit-learn """ review = 'This Food is Very Good' review = review.split() Output: > ['this', 'food', 'is', 'very', 'good']
1.6 Removal of Stopwords
Stopwords should be removed from the text data. These are commonly occurring words such as "is", "am", "are", and so on. Use the below code to remove the stopwords:
# -*- coding: utf-8 -*- """ How to Prepare Text Data for Machine Learning with scikit-learn """ import nltk nltk.download('stopwords') from nltk.corpus import stopwords review = 'This food is very good' review = ' '.join(review for review in review.split() if review not in set(stopwords.words('english'))) print(review) Output: > This food good
1.7 Removing most Frequent words and Rare words
In the above section we removed stop words, which are very common in any text data. There are also other words that occur very frequently in a corpus; often they appear so many times that their presence becomes useless for telling documents apart.
With rare words it is the opposite: they occur so seldom that they are drowned out by the noise from frequent words and other general words. Below is code to find the most frequent words and the rare words; a sketch of how to actually drop them follows the snippet.
# -*- coding: utf-8 -*- """ How to Prepare Text Data for Machine Learning with scikit-learn """ import nltk review = 'I love this, this is so cool; you will love it' allWords = nltk.tokenize.word_tokenize(review) allWordDist = nltk.FreqDist(w.lower() for w in allWords) print(allWordDist.items())
1.8 Correcting spelling mistakes
There are situations where you will find a lot of spelling mistakes in text data, so correcting them is one of the useful steps in text pre-processing. Note that if you have a large dataset, it takes time to correct all the misspelled words. Use the below code to correct spelling mistakes:
# -*- coding: utf-8 -*- """ How to Prepare Text Data for Machine Learning with scikit-learn """ from textblob import TextBlob text = "This lobe this foad" print(TextBlob(text).correct()) Output: > I love this road
There is one step you should perform before this: sometimes you will find shorthand versions of words, for example "txt" used for "text". These words should be expanded before applying spelling correction, because the correction algorithm might otherwise convert them into a different word, as sketched below.
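A minimal sketch of such an expansion step (the mapping below is a hypothetical example; build one that fits your data):

```python
# Expand common shorthand before running spelling correction.
short_forms = {'txt': 'text', 'u': 'you', 'gr8': 'great'}  # hypothetical mapping

review = 'this txt is gr8'
review = ' '.join(short_forms.get(word, word) for word in review.split())
print(review)
# this text is great
```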
1.9 Converting words into lower case
One of the most important steps is converting words into lower case. This will reduce duplicate copies of the same word if they are in different cases.
# -*- coding: utf-8 -*- """ How to Prepare Text Data for Machine Learning with scikit-learn """ review = 'This Food is Very Good' review = review.lower() Output: > this food is very good
1.10 Stemming
Stemming helps reduce the size of the sparse matrix. It strips suffixes such as 'ed', 'ing', and 'y' from words. For example, if your text data contains both "loved" and "loving", each gives the same positive hint, so keeping two variations of the same word is not useful.
Basically, the idea is to reduce each word to its root form. There is one more technique, called Lemmatization, which is more effective than stemming; we will see it in the next section. Use the below code for stemming:
# -*- coding: utf-8 -*- """ Prepare Text Data for Machine Learning """ # For Stemming import nltk nltk.download('stopwords') ps = PorterStemmer() review = 'I loved this food that much, that I took sometime from my hectic schedule and wrote this review' review = ' '.join([ps.stem(word) for word in review.lower().split()]) print(review) Output: > i love thi food that much, that i took sometim from my hectic schedul and wrote thi review
1.11 Lemmatization
Lemmatization is more effective than stemming. It removes the inflectional endings of the word by using the vocabulary and morphological analysis of words.
You can use stemming or lemmatization depending on your use case, but I would suggest you use lemmatization over stemming.
# -*- coding: utf-8 -*- """ Prepare Text Data for Machine Learning with scikit-learn """ # For Lemmatization nltk.download('wordnet') from nltk.stem.wordnet import WordNetLemmatizer lmtzr = WordNetLemmatizer() review = 'I loved this food that much, that I took sometime from my hectic schedule and wrote this review' review = ' '.join([lmtzr.lemmatize(word, 'v') for word in review.lower().split()]) print(review) Output: > i love this food that much, that i take sometime from my hectic schedule and write this review
2. Prepare text data for machine learning using advanced text pre-processing
So far, we have learned and understood basic text pre-processing; now let's look at advanced text pre-processing and what we can achieve with it. We will learn how to encode text data into integer and floating-point values so that it can be fed to machine learning algorithms. In this section, we will cover the below-listed topics,
- Bag-of-Words Model
- Uni-grams, Bi-grams, and N-grams
- TF-IDF
2.1 Bag-of-Words Model
The Bag-of-Words model, or BoW for short, is an advanced text pre-processing technique which is very easy to understand and widely used in document classification. BoW creates a set of words, in other words a dictionary of words, from the documents. It then converts that dictionary of words into a vector, where each word is a separate dimension. This is achieved by assigning each word a unique number.
For example, suppose I have two comments on a food product, as shown below,
C1 = This Food is delicious and I loved it
C2 = This Food is tasty and I liked it
The corresponding vectors V1 and V2 of the above two comments will be as follows.
For comment C1, This Food is delicious and I loved it, the vector is shown below:
|    | This | Food | is | delicious | and | I | loved | it | tasty | liked |
|----|------|------|----|-----------|-----|---|-------|----|-------|-------|
| V1 | 1    | 1    | 1  | 1         | 1   | 1 | 1     | 1  | 0     | 0     |
For comment C2, This Food is tasty and I liked it, the vector is shown below:
|    | This | Food | is | delicious | and | I | loved | it | tasty | liked |
|----|------|------|----|-----------|-----|---|-------|----|-------|-------|
| V2 | 1    | 1    | 1  | 0         | 1   | 1 | 0     | 1  | 1     | 1     |
In the above toy example the sparsity of the matrix is low, because there are very few zeros. The lower the sparsity of your matrix, the better your model tends to perform.
Sparsity means the proportion of zero entries in the matrix.
In a real dataset this will not be the case: the vocabulary is far larger than any single review, so the matrix ends up mostly zeros. Try to keep the sparsity of the matrix as low as you can. Use the below code for BoW; a quick way to measure the sparsity of the resulting matrix follows the snippet.
# -*- coding: utf-8 -*- """ How to Prepare Text Data for Machine Learning with scikit-learn """ # Bag of Word Model from sklearn.feature_extraction.text import CountVectorizer corpus = [ 'this food be delicious and i love it', 'this food be tasty and i like it' ... ... ... ... ] #set max_features accroding to your size of dataset cv = CountVectorizer(max_features = 1500) X = cv.fit_transform(corpus).toarray()
2.2 Uni-grams, Bi-grams and N-grams
Uni-grams are the default used in the BoW model when creating vectors from text data, although you can specify which n-grams the BoW model should use. The idea is to capture meaning from the text, and in some cases uni-grams alone are not enough.
The uni-grams for the review C1, This Food is delicious and I loved it, are shown below:
|    | This | Food | is | delicious | and | I | loved | it |
|----|------|------|----|-----------|-----|---|-------|----|
| V1 | 1    | 1    | 1  | 1         | 1   | 1 | 1     | 1  |
Remember, in uni-grams each word is considered a dimension. When you create bi-grams of the same review, each pair of two adjacent words forms a bi-gram, unlike uni-grams where each single word forms a gram. The bi-grams for the above review will look like this,
|    | This Food | Food is | is delicious | … | … | … | loved it |
|----|-----------|---------|--------------|---|---|---|----------|
| V1 | 1         | 1       | 1            | 1 | 1 | 1 | 1        |
In N-grams, N can be any number, and as N increases the dimensionality of the vector also increases. It is advised not to increase N too much, or you will fail to capture the general case.
In the below snippet we are specifying n-grams up to tri-grams in the BoW model.
# -*- coding: utf-8 -*- """ How to Prepare Text Data for Machine Learning with scikit-learn """ # Bag of Word Model With Tri-grams from sklearn.feature_extraction.text import CountVectorizer corpus = [ 'this food be delicious and i love it', 'this food be tasty and i like it' ... ... ... ... ] #set max_features accroding to your size of dataset cv = CountVectorizer(ngram_range=(1,3), max_features = 1500) X = cv.fit_transform(corpus).get_shape()
Explanation:
- In the CountVectorizer() class we pass the value (1, 3) to the ngram_range parameter; ngram_range expects a tuple.
- BoW will now build uni-grams, bi-grams, and tri-grams, since we specified the range from 1 to 3 (see the short example after this list).
- Mind you, this will take a lot of time to execute if you have a big dataset, and it will also increase the dimensionality of the matrix.
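To see what those extra features actually look like, here is a tiny assumed example (not from the original post) that fits a CountVectorizer with ngram_range=(1, 3) on a single sentence and prints the vocabulary it builds:

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 3))
cv.fit(['this food is delicious'])
print(sorted(cv.vocabulary_))
# ['delicious', 'food', 'food is', 'food is delicious', 'is', 'is delicious',
#  'this', 'this food', 'this food is']
```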
2.3 TF-IDF ( Term Frequency-Inverse Document Frequency )
Term Frequency
The term frequency of a word is the number of times that word occurs in a particular review divided by the total number of words in that review. Hence, the term frequency of a word with respect to a single review is always between 0 and 1. Mathematically, you can represent term frequency as,
TF = (Number of times word W occurs in the particular row) / (number of words in that row)
Inverse Document Frequency
Inverse document frequency is calculated from the occurrence of a word across the complete data corpus. It comes from information retrieval; basically, it measures how rare a word is across the set of documents.
IDF = log( (number of rows in the document corpus) / (number of rows in which the word is present) )
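To make both formulas concrete, here is a small hand computation (an illustrative sketch of my own) for the word "delicious" in a two-review corpus. Note that scikit-learn's TfidfVectorizer uses a smoothed and normalized variant, so its numbers will differ slightly:

```python
import math

corpus = [
    'this food is delicious and i love it',  # review 1
    'this food is tasty and i like it',      # review 2
]

word = 'delicious'
review_words = corpus[0].split()

# TF: occurrences of the word in the review / total words in the review
tf = review_words.count(word) / len(review_words)          # 1 / 8 = 0.125

# IDF: log(number of reviews / number of reviews containing the word)
containing = sum(1 for doc in corpus if word in doc.split())
idf = math.log(len(corpus) / containing)                   # log(2 / 1) ≈ 0.693

print(tf, idf, tf * idf)
```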
But why use TF-IDF? Good question. As I said earlier, TF-IDF is an information retrieval technique used to weight word frequencies in a particular document corpus. With the help of scikit-learn you can compute these weights using the TfidfVectorizer class. Use the below code for the same,
# -*- coding: utf-8 -*- """ How to Prepare Text Data for Machine Learning with scikit-learn """ from sklearn.feature_extraction.text import TfidfVectorizer corpus = [ 'this food be delicious and i love it', 'this food be tasty and i like it' ... ... ... ... ] # Creating the transform cv = TfidfVectorizer() # Get the shape of Vector x = cv.fit_transform(corpus).get_shape()
Conclusion
In this article, we covered the building blocks of how to prepare text data for machine learning and deep learning models. With the help of all these techniques, you can make your models much better. You can also apply parameter tuning and observe the difference in the results.
In the next article, we will do sentiment analysis with the help of these techniques. If you have any questions regarding this topic, ask them in the comment section below. If you want to read more machine learning articles, subscribe to our newsletter, and I'll see you in the next post.