Sentiment analysis is one of the most widely used branches of natural language processing. With the help of sentiment analysis, we can determine whether a text expresses positive or negative sentiment, and this is done using both NLP and machine learning. Sentiment analysis is also called opinion mining. In this article, we will learn about NLP sentiment analysis in Python.
From reducing churn to increasing product sales, creating brand awareness, and analyzing customer reviews to improve products, these are some of the vital applications of sentiment analysis. Here, we will implement a machine learning model that predicts the sentiment of customer reviews, and we will cover the below-listed topics:
- The problem statement
- Feature extraction and Text pre-processing
- Advanced text preprocessing
- Sentiment Analysis by building a machine learning model
The problem statement
One of the popular applications of sentiment analysis is predicting the sentiment of customer reviews. This is helpful in banking, eCommerce, and in fact in all domains where you sell a product to customers. As I said earlier, we will implement a machine learning model that predicts positive or negative sentiment based on product reviews.
Basically, we will create a machine learning model that predicts whether a new incoming customer review is positive or negative. For this article, we will use Amazon's food review dataset, available at Kaggle.
Feature extraction and Text pre-processing
Machines cannot understand English or any other text data by default. Text data needs special preparation before you can give it to a machine to predict something from it. That special preparation includes several steps, such as removing stop words, correcting spelling mistakes, removing meaningless words, removing rare words, and many more.
The first step in preparing text data is applying feature extraction and basic text pre-processing, which consists of the following steps:
- Removing Punctuations
- Removing HTML tags
- Special Characters removal
- Removing AlphaNumeric words
- Tokenization
- Removal of Stopwords
- Lower casing
- Lemmatization
=> Before we apply feature extraction and text pre-processing, we will first load our dataset using the pandas library.
=> Create a file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*- """ NLP sentiment analysis in python """ import pandas as pd # Importing the dataset dataset = pd.read_csv('Reviews.csv')
Let's import the libraries for text pre-processing; we will use them throughout the remaining steps.
=> We have imported bs4 for removing HTML tags from the text.
=> The re library will help in removing alphanumeric words and special characters.
=> And as always, the nltk library is useful in so many ways; we will find out how we can use it later down the road. Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*- """ NLP sentiment analysis in python """ from bs4 import BeautifulSoup import re import nltk nltk.download('stopwords') from nltk.corpus import stopwords from nltk.stem.wordnet import WordNetLemmatizer
Removing Punctuations
The next step is to write the code for the above-listed techniques, and we will start with punctuation. Punctuation adds value for human readers, but it is not really useful for the machine. We begin with apostrophes: rather than simply deleting them, the function below expands the common contractions they appear in; the remaining punctuation is handled by the special-characters step later on.
=> The re library will be helpful for handling punctuation here. Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

def removeApostrophe(review):
    # Expand common contractions; each substitution builds on the previous one
    phrase = re.sub(r"won't", "will not", review)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
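As a quick sanity check, we can call the function on a made-up review (the sample text below is illustrative only):

print(removeApostrophe("I can't believe it isn't fresh"))
# Expected output: I can not believe it is not fresh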
Removing HTML tags
When your text data comes from web scraping, it is very common to end up with HTML tags in your dataset. HTML tags decorate text on web pages, which is not helpful in model building.
=> Here we will use the bs4 library to remove HTML tags.
=> In general, removing HTML tags is a good practice to follow. Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

def removeHTMLTags(review):
    soup = BeautifulSoup(review, 'lxml')
    return soup.get_text()
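A quick illustration with a made-up snippet (this assumes the lxml parser is installed, e.g. via pip install lxml):

print(removeHTMLTags("<br />Great <b>taste</b>!"))
# Expected output: Great taste!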
Special Characters removal
You might find words or characters in the dataset that contain special characters, which are not helpful in NLP. The best example I can give you is the use of hashtags in comments.
=> To remove special characters we will use the re library.
=> Open the file nlp.py and write down the below code into it to remove special characters.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

def removeSpecialChars(review):
    # Replace everything except letters with a space
    return re.sub('[^a-zA-Z]', ' ', review)
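As a quick illustration (the input is made up):

print(removeSpecialChars("Loved it!!! #yummy :)"))
# Every non-letter becomes a space, leaving roughly: Loved it  yummy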
Removing AlphaNumeric words
Again, alphanumeric words (words containing digits) don't help in building a predictive model. These words carry no meaning, so it's better to get rid of them as well.
=> To remove alphanumeric words we will again use the re library.
=> Open the file nlp.py and write down the below code into it to remove alphanumeric words.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

def removeAlphaNumericWords(review):
    # Drop any token that contains a digit
    return re.sub(r"\S*\d\S*", "", review).strip()
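As a quick illustration (the input is made up):

print(removeAlphaNumericWords("Ordered 2x of item b0012, great value"))
# Tokens containing digits ("2x", "b0012,") are removed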
Tokenization, Removing Stopwords, Lowercasing, and Lemmatization
In this section we will perform tokenization, stopword removal, lowercasing, and lemmatization. Tokenization means parsing your text into a list of words. It also enables the other pre-processing steps, such as removing stop words, which is our next point.
Stopwords are commonly occurring words in text data, for example is, am, are and so on; they carry little signal, so they should be removed.
One of the most important steps is converting words into lower case. This reduces duplicate copies of the same word that differ only in casing.
Lemmatization removes the inflectional endings of a word by using the vocabulary and morphological analysis of words.
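For instance, with the 'v' (verb) part-of-speech hint that we will pass below, the WordNet lemmatizer maps inflected verb forms back to their base form:

from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
print(lmtzr.lemmatize("tasted", 'v'))  # taste
print(lmtzr.lemmatize("loving", 'v'))  # love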
=> We will create a doTextCleaning() function, which will use the above-created methods, as shown in the code below.
=> In this function we will also perform tokenization, stopword removal, lowercasing, and lemmatization.
=> Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

def doTextCleaning(review):
    review = removeHTMLTags(review)
    review = removeApostrophe(review)
    review = removeAlphaNumericWords(review)
    review = removeSpecialChars(review)

    # Lower casing
    review = review.lower()

    # Tokenization
    review = review.split()

    # Removing Stopwords and Lemmatization
    lmtzr = WordNetLemmatizer()
    review = [lmtzr.lemmatize(word, 'v') for word in review
              if not word in set(stopwords.words('english'))]

    review = " ".join(review)
    return review
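Before cleaning the whole dataset, it is worth spot-checking the pipeline on a single made-up review:

print(doTextCleaning("I won't buy this again!<br />It wasn't fresh :("))
# Expected output (roughly): buy fresh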
Creating the Document Corpus and Advanced text preprocessing
In this section we will make use of all the functions we have created so far and perform advanced text preprocessing on the reviews.
Now we will create the document corpus, on which we will apply the Bag of Words model. The document corpus is the collection of all reviews in the dataset.
=> In the below code we create the corpus array and loop over our dataset (tqdm simply shows a progress bar).
=> Inside the loop we call the doTextCleaning() function, which returns the cleaned text review. Once we receive the cleaned and preprocessed text, we append it to the corpus array.
=> Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

from tqdm import tqdm  # progress bar; install with `pip install tqdm`

corpus = []
for index, row in tqdm(dataset.iterrows()):
    review = doTextCleaning(row['Text'])
    corpus.append(review)
The next step is to perform advanced text preprocessing on the reviews, which converts them into numeric vectors that we can use to build our machine learning model.
Using the document corpus we create a Bag of Words model, applying tri-grams along the way. The Bag of Words model builds a dictionary of words from the documents and then converts each document into a vector, where each word (or n-gram) is a separate dimension.
N-grams control how those word dimensions are created from the document corpus. Uni-grams are the default method in the BoW model when creating vectors from text data, but you can specify which n-grams to use; here we will include everything up to tri-grams.
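To see what those n-gram dimensions look like, here is a tiny illustration with two made-up sentences (on older scikit-learn versions, use get_feature_names() instead):

from sklearn.feature_extraction.text import CountVectorizer

demo = CountVectorizer(ngram_range=(1, 3))
demo.fit(["good dog food", "bad dog food"])
print(demo.get_feature_names_out())
# ['bad' 'bad dog' 'bad dog food' 'dog' 'dog food' 'food' 'good' 'good dog' 'good dog food']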
=> First things first, import the CountVectorizer transform from the scikit-learn library.
=> We create the transform using CountVectorizer with tri-grams.
=> You can specify which grams you want with the ngram_range parameter; ngram_range=(1,3) includes uni-, bi-, and tri-grams.
=> At the end, we have the feature matrix X and the label vector y to create the machine learning model with. Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer

# Creating the transform with tri-grams;
# max_features caps the vocabulary at the most frequent n-grams (tune as needed)
cv = CountVectorizer(ngram_range=(1, 3), max_features=2000)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 6].values  # column 6 holds the review score used as the label
Building the NLP sentiment analysis machine learning model
Now the last part of the NLP sentiment analysis is to create the machine learning model. In this article, we will use the Naive Bayes classification model. I have written a separate post on the Naive Bayes classification model; do read it if you are not familiar with the topic.
As of now, we have the feature matrix X and the label vector y. The first step in creating a machine learning model is splitting the dataset into a training set and a test set. Using the training set we will create a Naive Bayes classification model, and with the test set we can check its performance.
=> In the below code, first we import the train_test_split API to split the data into test and training sets.
=> We import the GaussianNB class to create a Naive Bayes classification model.
=> After creating the Naive Bayes classifier, we fit the training set to it. Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

# Splitting the dataset into the Training set and Test set
# (sklearn.cross_validation was removed in newer versions; use model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB

# Creating Naive Bayes classifier
classifier = GaussianNB()

# Fitting the training set into the Naive Bayes classifier
classifier.fit(X_train, y_train)
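The article does not include an explicit evaluation step, but a minimal sketch using standard scikit-learn metrics would look like this:

from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))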
NLP sentiment analysis In Action
Now that our model is ready to predict sentiments from reviews, why not write some code to test it? By doing this we will see how well our model predicts the result, which is our end goal. The steps are very straightforward here:
=> First we create a predictNewReview() function, which asks for a review on the command line and then uses the above-created classifier to predict its sentiment.
=> As soon as predictNewReview() receives a new review, it runs the whole text cleaning process via the doTextCleaning() function.
=> Once the text cleaning is performed, we convert the review into a numeric vector using the BoW model transform.
=> After the conversion, the Naive Bayes classification model predicts the result via the classifier.predict() method. Open the file nlp.py and write down the below code into it.
nlp.py:
# -*- coding: utf-8 -*-
"""
NLP sentiment analysis in python
"""

# Predict sentiment for a new Review
def predictNewReview():
    newReview = input("Type the Review: ")
    if newReview == '':
        print('Invalid Review')
    else:
        newReview = doTextCleaning(newReview)
        reviewVector = cv.transform([newReview]).toarray()
        prediction = classifier.predict(reviewVector)
        if prediction[0] == 1:  # assumes the label 1 encodes a positive review
            print("Positive Review")
        else:
            print("Negative Review")
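Calling the function starts an interactive prompt; the predicted label of course depends on the model you trained:

predictNewReview()
# Type the Review: The chips were stale and tasteless
# Negative Review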
Conclusion
In this post, we studied how to perform sentiment analysis on real-world data. We applied basic text pre-processing to the reviews and then performed advanced text pre-processing with n-grams.
We converted the text to numeric vectors and then used a machine learning model to predict the sentiment; here we used the Naive Bayes classification model to predict the sentiment of any given review. For now, that's it for this article.
If you like this article, share it on your social media and spread the word about it. Also, I would like to know which machine learning model you have used for sentiment analysis!
If you have any questions, ask in the comment box below. Till then, happy NLP.
Comments
Q: The code is taking too much time to process. It has been 1 hour but my data is still not processed.
A: What is your PC configuration? If things don't work for you, try with the first 1,000 records, then 2,000 and 5,000 records, just to check whether you get any errors. If everything is fine and you get no errors, let your system train the model overnight; in the morning your model should be ready to use.
PS: Just a small suggestion, you might want to save the trained model.
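A minimal sketch of that suggestion, using joblib (the file names here are arbitrary):

import joblib

joblib.dump(classifier, 'sentiment_model.pkl')   # save the trained classifier
joblib.dump(cv, 'bow_transform.pkl')             # save the fitted CountVectorizer too
classifier = joblib.load('sentiment_model.pkl')  # reload later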
Q: I get a 'NameError: name 'tqdm' is not defined'. Not really sure why. Any insight?
A: Even after importing tqdm as shown below?
from tqdm import tqdm