Sentiment analysis is one of the most widely used branches of natural language processing. With the help of sentiment analysis, we can determine whether a text expresses positive or negative sentiment, and this is done using both NLP and machine learning. Sentiment analysis is also called opinion mining. In this article, we will learn about NLP sentiment analysis in Python.
From reducing churn to increasing product sales, creating brand awareness, and analyzing customer reviews to improve products, these are some of the vital applications of sentiment analysis. Here, we will implement a machine learning model that predicts the sentiment of customer reviews, and we will cover the below-listed topics:
- The problem statement
- Feature extraction and Text pre-processing
- Advanced text preprocessing
- Sentiment Analysis by building a machine learning model
The problem statement
One of the popular applications of sentiment analysis is predicting the sentiment of customer reviews. This is helpful in banking, eCommerce, and in fact in all domains where you sell a product to customers. As I said earlier, we will implement a machine learning model that predicts positive or negative sentiment based on product reviews.
Basically, we will create a machine learning model that predicts whether a new incoming customer review is positive or negative. For this article, we will use Amazon's food review dataset, available at Kaggle.
Feature extraction and Text pre-processing
Machines cannot understand English or any other text data by default. Text data needs special preparation before you can give it to a machine to predict something from it. That special preparation includes several steps, such as removing stop words, correcting spelling mistakes, removing meaningless words, removing rare words, and many more.
The first step in preparing text data is applying feature extraction and basic text pre-processing, which consists of the following steps:
- Removing Punctuations
- Removing HTML tags
- Special Characters removal
- Removing AlphaNumeric words
- Tokenization
- Removal of Stopwords
- Lower casing
- Lemmatization
=> Before we apply feature extraction and text pre-processing, we will first load our dataset using the pandas library.
=> Create a file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*- """ NLP sentiment analysis in python """ import pandas as pd # Importing the dataset dataset = pd.read_csv('Reviews.csv')
Let's import the libraries for text pre-processing; we will use them throughout the remaining steps.
=> We have imported bs4 for removing HTML tags from the text.
=> The re library will help in removing alphanumeric words and special characters.
=> And as always, the nltk library is useful in so many ways; we will find out how we can use it later down the road. Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*- """ NLP sentiment analysis in python """ from bs4 import BeautifulSoup import re import nltk nltk.download('stopwords') from nltk.corpus import stopwords from nltk.stem.wordnet import WordNetLemmatizer
Removing Punctuations
The next step is to write the code for the above-listed techniques, and we will start with punctuation. Punctuation adds value for human readers, but it is not really useful for the machine. We begin with apostrophes: rather than simply deleting them, the function below expands the common contractions they appear in; the remaining punctuation is handled by the special-characters step later on.
=> The re library will be helpful for handling punctuation here. Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

def removeApostrophe(review):
    # Expand common contractions; each substitution builds on the previous one
    phrase = re.sub(r"won't", "will not", review)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
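As a quick sanity check, we can call the function on a made-up review (the sample text below is illustrative only):

print(removeApostrophe("I can't believe it isn't fresh"))
# Expected output: I can not believe it is not fresh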
Removing HTML tags
When your text data comes from web scraping, it is very common to end up with HTML tags in your dataset. HTML tags decorate text on web pages, which is not helpful in model building.
=> Here we will use the bs4 library to remove HTML tags.
=> In general, removing HTML tags is a good practice to follow. Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

def removeHTMLTags(review):
    soup = BeautifulSoup(review, 'lxml')
    return soup.get_text()
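A quick illustration with a made-up snippet (this assumes the lxml parser is installed, e.g. via pip install lxml):

print(removeHTMLTags("<br />Great <b>taste</b>!"))
# Expected output: Great taste!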
Special Characters removal
You might find words or characters in the dataset that contain special characters, which are not helpful in NLP. The best example I can give you is the use of hashtags in comments.
=> To remove special characters we will use the re library.
=> Open the file nlp.py and write down the below code into it to remove special characters.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

def removeSpecialChars(review):
    # Replace everything except letters with a space
    return re.sub('[^a-zA-Z]', ' ', review)
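As a quick illustration (the input is made up):

print(removeSpecialChars("Loved it!!! #yummy :)"))
# Every non-letter becomes a space, leaving roughly: Loved it  yummy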
Removing AlphaNumeric words
Again, alphanumeric words (words containing digits) don't help in building a predictive model. These words carry no meaning, so it's better to get rid of them as well.
=> To remove alphanumeric words we will again use the re library.
=> Open the file nlp.py and write down the below code into it to remove alphanumeric words.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

def removeAlphaNumericWords(review):
    # Drop any token that contains a digit
    return re.sub(r"\S*\d\S*", "", review).strip()
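As a quick illustration (the input is made up):

print(removeAlphaNumericWords("Ordered 2x of item b0012, great value"))
# Tokens containing digits ("2x", "b0012,") are removed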
Tokenization, Removing Stopwords, Lowercasing, and Lemmatization
In this section we will perform tokenization, stopword removal, lowercasing, and lemmatization. Tokenization means parsing your text into a list of words. It also enables the other pre-processing steps, such as removing stop words, which is our next point.
Stopwords are commonly occurring words in text data, for example is, am, are and so on; they carry little signal, so they should be removed.
One of the most important steps is converting words into lower case. This reduces duplicate copies of the same word that differ only in casing.
Lemmatization removes the inflectional endings of a word by using the vocabulary and morphological analysis of words.
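For instance, with the 'v' (verb) part-of-speech hint that we will pass below, the WordNet lemmatizer maps inflected verb forms back to their base form:

from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
print(lmtzr.lemmatize("tasted", 'v'))  # taste
print(lmtzr.lemmatize("loving", 'v'))  # love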
=> We will create a doTextCleaning() function, which will use the above-created methods, as shown in the code below.
=> In this function we will also perform tokenization, stopword removal, lowercasing, and lemmatization.
=> Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

def doTextCleaning(review):
    review = removeHTMLTags(review)
    review = removeApostrophe(review)
    review = removeAlphaNumericWords(review)
    review = removeSpecialChars(review)

    # Lower casing
    review = review.lower()

    # Tokenization
    review = review.split()

    # Removing Stopwords and Lemmatization
    lmtzr = WordNetLemmatizer()
    review = [lmtzr.lemmatize(word, 'v') for word in review
              if not word in set(stopwords.words('english'))]

    review = " ".join(review)
    return review
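Before cleaning the whole dataset, it is worth spot-checking the pipeline on a single made-up review:

print(doTextCleaning("I won't buy this again!<br />It wasn't fresh :("))
# Expected output (roughly): buy fresh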
Creating the Document Corpus and Advanced text preprocessing
In this section we will make use of all the functions we have created so far and perform advanced text preprocessing on the reviews.
Now we will create the document corpus, on which we will apply the Bag of Words model. The document corpus is the collection of all reviews in the dataset.
=> In the below code we create the corpus array and loop over our dataset (tqdm simply shows a progress bar).
=> Inside the loop we call the doTextCleaning() function, which returns the cleaned text review. Once we receive the cleaned and preprocessed text, we append it to the corpus array.
=> Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

from tqdm import tqdm  # progress bar; install with `pip install tqdm`

corpus = []
for index, row in tqdm(dataset.iterrows()):
    review = doTextCleaning(row['Text'])
    corpus.append(review)
The next step is to perform advanced text preprocessing on the reviews, which converts them into numeric vectors that we can use to build our machine learning model.
Using the document corpus we create a Bag of Words model, applying tri-grams along the way. The Bag of Words model builds a dictionary of words from the documents and then converts each document into a vector, where each word (or n-gram) is a separate dimension.
N-grams control how those word dimensions are created from the document corpus. Uni-grams are the default method in the BoW model when creating vectors from text data, but you can specify which n-grams to use; here we will include everything up to tri-grams.
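To see what those n-gram dimensions look like, here is a tiny illustration with two made-up sentences (on older scikit-learn versions, use get_feature_names() instead):

from sklearn.feature_extraction.text import CountVectorizer

demo = CountVectorizer(ngram_range=(1, 3))
demo.fit(["good dog food", "bad dog food"])
print(demo.get_feature_names_out())
# ['bad' 'bad dog' 'bad dog food' 'dog' 'dog food' 'food' 'good' 'good dog' 'good dog food']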
=> First things first, import the CountVectorizer transform from the scikit-learn library.
=> We create the transform using CountVectorizer with tri-grams.
=> You can specify which grams you want with the ngram_range parameter; ngram_range=(1,3) includes uni-, bi-, and tri-grams.
=> At the end, we have the feature matrix X and the label vector y to create the machine learning model with. Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer

# Creating the transform with tri-grams;
# max_features caps the vocabulary at the most frequent n-grams (tune as needed)
cv = CountVectorizer(ngram_range=(1, 3), max_features=2000)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 6].values  # column 6 holds the review score used as the label
Building the NLP sentiment analysis machine learning model
Now the last part of the NLP sentiment analysis is to create the machine learning model. In this article, we will use the Naive Bayes classification model. I have written a separate post on the Naive Bayes classification model; do read it if you are not familiar with the topic.
As of now, we have the feature matrix X and the label vector y. The first step in creating a machine learning model is splitting the dataset into a training set and a test set. Using the training set we will create a Naive Bayes classification model, and with the test set we can check its performance.
=> In the below code, first we import the train_test_split API to split the data into test and training sets.
=> We import the GaussianNB class to create a Naive Bayes classification model.
=> After creating the Naive Bayes classifier, we fit the training set to it. Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

# Splitting the dataset into the Training set and Test set
# (sklearn.cross_validation was removed in newer versions; use model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB

# Creating Naive Bayes classifier
classifier = GaussianNB()

# Fitting the training set into the Naive Bayes classifier
classifier.fit(X_train, y_train)
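The article does not include an explicit evaluation step, but a minimal sketch using standard scikit-learn metrics would look like this:

from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))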
NLP sentiment analysis In Action
Now that our model is ready to predict sentiments from reviews, why not write some code to test it? By doing this we will see how well our model predicts the result, which is our end goal. The steps are very straightforward here:
=> First we create a predictNewReview() function, which asks for a review on the command line and then uses the above-created classifier to predict its sentiment.
=> As soon as predictNewReview() receives a new review, it runs the whole text cleaning process via the doTextCleaning() function.
=> Once the text cleaning is performed, we convert the review into a numeric vector using the BoW model transform.
=> After the conversion, the Naive Bayes classification model predicts the result via the classifier.predict() method. Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

# Predict sentiment for a new Review
def predictNewReview():
    newReview = input("Type the Review: ")
    if newReview == '':
        print('Invalid Review')
    else:
        newReview = doTextCleaning(newReview)
        reviewVector = cv.transform([newReview]).toarray()
        prediction = classifier.predict(reviewVector)
        if prediction[0] == 1:  # assumes the label 1 encodes a positive review
            print("Positive Review")
        else:
            print("Negative Review")
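Calling the function starts an interactive prompt; the predicted label of course depends on the model you trained:

predictNewReview()
# Type the Review: The chips were stale and tasteless
# Negative Review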
Conclusion
In this post, we studied how to perform sentiment analysis on real-world data. We applied basic text pre-processing to the reviews and then performed advanced text pre-processing with n-grams.
We converted the text to numeric vectors and then used a machine learning model to predict the sentiment; here we used the Naive Bayes classification model to predict the sentiment of any given review. For now, that's it for this article.
If you like this article, share it on your social media and spread the word about it. Also, I would like to know which machine learning model you have used for sentiment analysis!
If you have any questions, ask in the comment box below. Till then, happy NLP.
Comments
Q: The code is taking too much time to process. It has been 1 hour but my data is still not processed.
A: What is your PC configuration? If things don't work for you, try with the first 1,000 records, then 2,000 and 5,000 records, just to check whether you get any errors. If everything is fine and you get no errors, let your system train the model overnight; in the morning your model should be ready to use.
PS: Just a small suggestion, you might want to save the trained model.
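A minimal sketch of that suggestion, using joblib (the file names here are arbitrary):

import joblib

joblib.dump(classifier, 'sentiment_model.pkl')   # save the trained classifier
joblib.dump(cv, 'bow_transform.pkl')             # save the fitted CountVectorizer too
classifier = joblib.load('sentiment_model.pkl')  # reload later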
Q: I get a 'NameError: name 'tqdm' is not defined'. Not really sure why. Any insight?
A: Even after importing tqdm as shown below?
from tqdm import tqdm