NLP — SENTIMENT ANALYSIS ON IMDB MOVIE DATASET

Archit Choudhary
4 min read · Sep 28, 2020

Use NLP for performing Sentiment Analysis

First of all, let's import all the necessary packages for IMDB sentiment analysis:
NumPy and Pandas for data analysis and manipulation; Seaborn, Matplotlib, and WordCloud for data visualization;
scikit-learn for model training; and NLTK, TextBlob, and spaCy for NLP operations such as tokenization and stemming.
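A sketch of those imports, restricted to the packages the later steps actually use (Seaborn, WordCloud, TextBlob, and spaCy can be added the same way):

```python
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
```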

The Large Movie Review Dataset contains a collection of 50,000 reviews from IMDB, with an even number of positive and negative reviews. The authors considered only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included. In the official release the dataset is divided evenly into a training set and a test set of 25,000 labeled reviews each; here we work with all 50,000 rows in a single CSV file.

  • Let’s read the dataset using the pd.read_csv method and print its top five rows using the head() method.
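A minimal sketch of the loading step, using a two-row in-memory stand-in for the CSV (the real file name, assumed here to be 'IMDB Dataset.csv', depends on your download):

```python
from io import StringIO

import pandas as pd

# Two-row stand-in; with the real file use pd.read_csv('IMDB Dataset.csv')
sample_csv = StringIO(
    "review,sentiment\n"
    "A wonderful little production.,positive\n"
    "Basically a terrible film.,negative\n"
)
imdb_data = pd.read_csv(sample_csv)
print(imdb_data.head())  # top five rows (here, only two)
```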

Next, perform a few basic operations on the dataset to get a feel for it.
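For example, shape, describe(), and info() give a quick overview (sketched here on a tiny stand-in frame):

```python
import pandas as pd

# Stand-in for the 50,000-row frame loaded from the CSV
imdb_data = pd.DataFrame({
    "review": ["A wonderful little production.", "Basically a terrible film."],
    "sentiment": ["positive", "negative"],
})
print(imdb_data.shape)       # (rows, columns)
print(imdb_data.describe())  # count / unique / top / freq for object columns
imdb_data.info()             # dtypes and non-null counts
```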


The .value_counts() method counts the data points in the sentiment column. There are equal numbers of negative and positive reviews, so the dataset is balanced.
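A sketch of the count check (on the real column there are 25,000 of each label):

```python
import pandas as pd

# Stand-in frame with an equal number of each label
imdb_data = pd.DataFrame({"sentiment": ["positive", "negative"] * 3})
counts = imdb_data["sentiment"].value_counts()
print(counts)  # equal counts -> balanced dataset
```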


Let’s split the dataset into train (train_reviews, train_sentiments) and test (test_reviews, test_sentiments) sets: 40 thousand rows for model training and the remaining 10 thousand rows for testing.
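The split is positional: the first 40,000 rows train, the last 10,000 test. A runnable sketch with a synthetic 50,000-row stand-in frame:

```python
import pandas as pd

# Synthetic stand-in for the real 50,000-row frame from the CSV
imdb_data = pd.DataFrame({
    "review": ["some review text"] * 50000,
    "sentiment": ["positive", "negative"] * 25000,
})

train_reviews = imdb_data.review[:40000]
train_sentiments = imdb_data.sentiment[:40000]
test_reviews = imdb_data.review[40000:]
test_sentiments = imdb_data.sentiment[40000:]

print(train_reviews.shape, train_sentiments.shape)
print(test_reviews.shape, test_sentiments.shape)
```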

[O/P]: (40000,) (40000,)
(10000,) (10000,)

Now split each review into tokens using ToktokTokenizer(), and set up the English stop-word list via nltk.corpus.stopwords.words(‘english’).
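A sketch of the tokenizer in action (the stop-word line is shown as a comment because it requires a one-time corpus download):

```python
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
tokens = tokenizer.tokenize("This movie was surprisingly good")
print(tokens)

# One-time download, then load the English stop words:
# nltk.download('stopwords')
# stopword_list = nltk.corpus.stopwords.words('english')
```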

Let’s clean the data and remove unwanted noise. The review column contains unwanted text that may decrease the performance of the model:
the dataset is littered with HTML tags, special symbols, etc.
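A regex sketch of the HTML-stripping step (the pattern below is a simple stand-in; a parser such as BeautifulSoup would also work):

```python
import re

def strip_html(text):
    # Drop HTML tags such as <br /> that litter the raw reviews
    return re.sub(r'<[^>]+>', ' ', text)

print(strip_html("Great film.<br /><br />Loved it."))
```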

Next, remove special characters such as /, &, *, (), and > from the dataset.
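One way to do this is a character-class regex that keeps only letters, digits, and whitespace (an assumption about the exact pattern used):

```python
import re

def remove_special_characters(text):
    # Keep letters, digits and spaces; drop /, &, *, (), > and the rest
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

print(remove_special_characters("What a movie!! (truly) & more >"))
```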

Then set the stop words from the English vocabulary, and filter them out of the dataset.
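A sketch of the filtering step. The tiny stop list here is illustrative; in the notebook it comes from nltk.corpus.stopwords.words('english'):

```python
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
# Illustrative stand-in for the full NLTK English stop-word list
stopword_list = {"is", "a", "an", "the", "and", "was"}

def remove_stopwords(text):
    # Tokenize, drop stop words (case-insensitively), and rejoin
    tokens = [tok.strip() for tok in tokenizer.tokenize(text)]
    filtered = [tok for tok in tokens if tok.lower() not in stopword_list]
    return ' '.join(filtered)

print(remove_stopwords("the movie was a complete mess"))
```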


A machine understands only numeric data, but our data is text, so let’s convert it into numeric form using CountVectorizer and TfidfVectorizer.
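A sketch of both vectorizers on a toy corpus. The parameter choices below, including ngram_range=(1, 3), are assumptions — an n-gram range like that would explain the six-million-column feature matrices in the output further down:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

train_reviews = ["good movie good plot", "bad movie terrible acting"]
test_reviews = ["good acting bad plot"]

# Bag of words
cv = CountVectorizer(min_df=0.0, max_df=1.0, binary=False, ngram_range=(1, 3))
cv_train = cv.fit_transform(train_reviews)   # fit on train only
cv_test = cv.transform(test_reviews)         # reuse the same vocabulary

# TF-IDF features
tv = TfidfVectorizer(min_df=0.0, max_df=1.0, use_idf=True, ngram_range=(1, 3))
tv_train = tv.fit_transform(train_reviews)
tv_test = tv.transform(test_reviews)

print('Tfidf_train:', tv_train.shape)
print('Tfidf_test:', tv_test.shape)
```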

[O/P]:Tfidf_train: (40000, 6209089)
Tfidf_test: (10000, 6209089)

Label and transform the sentiment data using LabelBinarizer().
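LabelBinarizer maps the two string labels to a 0/1 column vector; a sketch on four stand-in labels (the full column gives the (50000, 1) shape below):

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
# negative -> 0, positive -> 1 (classes are sorted alphabetically)
sentiment_data = lb.fit_transform(["positive", "negative", "positive", "negative"])
print(sentiment_data.shape)
```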

[O/P]: (50000, 1)

Now our data is almost ready for training, but there is one more step: split the binarized sentiment labels into the same train and test portions.
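The same positional split as before, applied to the label array (sketched with a synthetic stand-in for the binarized column):

```python
import numpy as np

# Stand-in for the (50000, 1) array produced by LabelBinarizer
sentiment_data = np.array([0, 1] * 25000).reshape(-1, 1)

train_sentiments = sentiment_data[:40000]
test_sentiments = sentiment_data[40000:]
print(train_sentiments.shape, test_sentiments.shape)
```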


Now it’s time to train the model. I’ll start with logistic regression, using different hyperparameters. After building the model we check its performance; if the model does not work well, hyperparameter tuning can be used to boost its accuracy.
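A sketch of the bag-of-words model fit on a toy corpus. The hyperparameter values below (penalty, C, max_iter, random_state) are illustrative assumptions, not necessarily the article's exact settings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_reviews = ["good movie", "great film", "bad movie", "terrible film"]
train_sentiments = [1, 1, 0, 0]  # 1 = positive, 0 = negative

cv = CountVectorizer()
cv_train = cv.fit_transform(train_reviews)

# Assumed hyperparameters; tune C / max_iter if the model underperforms
lr = LogisticRegression(penalty='l2', C=1.0, max_iter=500, random_state=42)
lr_bow = lr.fit(cv_train, train_sentiments)
```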


It’s time for prediction: feed the test data to the models. I make two kinds of predictions:
1 — Predicting with the model trained on bag-of-words features
2 — Predicting with the model trained on TF-IDF features
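Both prediction paths on a toy corpus (the vectorizers, model settings, and tiny data are stand-ins for the notebook's 40,000/10,000 split):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_reviews = ["good movie", "great film", "bad movie", "terrible film"]
train_sentiments = [1, 1, 0, 0]
test_reviews = ["great movie", "terrible acting"]

# 1 - bag-of-words prediction
cv = CountVectorizer()
lr_bow = LogisticRegression().fit(cv.fit_transform(train_reviews), train_sentiments)
lr_bow_predict = lr_bow.predict(cv.transform(test_reviews))

# 2 - tf-idf prediction
tv = TfidfVectorizer()
lr_tfidf = LogisticRegression().fit(tv.fit_transform(train_reviews), train_sentiments)
lr_tfidf_predict = lr_tfidf.predict(tv.transform(test_reviews))

print(lr_bow_predict)
print(lr_tfidf_predict)
```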

[O/P] : [0 0 0 ... 0 1 1]
[0 0 0 ... 0 1 1]

Let’s check the accuracy score for both predictions.
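accuracy_score compares the predictions against the true test labels; a sketch on stand-in vectors (the real ones have 10,000 entries):

```python
from sklearn.metrics import accuracy_score

# Stand-in labels and predictions; 3 of 4 correct -> 0.75
test_sentiments = [1, 0, 1, 0]
lr_bow_predict = [1, 0, 0, 0]

lr_bow_score = accuracy_score(test_sentiments, lr_bow_predict)
print('lr_bow_score :', lr_bow_score)
```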

[O/P]: lr_bow_score : 0.7512
lr_tfidf_score : 0.75
[O/P]:
              precision    recall  f1-score   support

    Positive       0.75      0.75      0.75      4993
    Negative       0.75      0.75      0.75      5007

    accuracy                           0.75     10000
   macro avg       0.75      0.75      0.75     10000
weighted avg       0.75      0.75      0.75     10000

              precision    recall  f1-score   support

    Positive       0.74      0.77      0.75      4993
    Negative       0.76      0.73      0.75      5007

    accuracy                           0.75     10000
   macro avg       0.75      0.75      0.75     10000
weighted avg       0.75      0.75      0.75     10000

[O/P]:
[[3768 1239]
 [1249 3744]]
[[3663 1344]
 [1156 3837]]

Now visualize the most frequent positive and negative words using word clouds.

