NLP — SENTIMENT ANALYSIS ON IMDB MOVIE DATASET

Archit Choudhary
4 min read · Sep 28, 2020

Use NLP for performing Sentiment Analysis

First of all, let's import all the necessary packages for IMDB sentiment analysis:
NumPy and Pandas for data analysis and manipulation; Seaborn, Matplotlib, and WordCloud for data visualization;
scikit-learn for model training; and NLTK, TextBlob, and spaCy for NLP operations such as tokenization and stemming.
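A sketch of those imports, restricted to the packages the later steps actually use (Seaborn, WordCloud, TextBlob, and spaCy can be added the same way):

```python
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
```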

The Large Movie Review Dataset contains a collection of 50,000 reviews from IMDB, with an even number of positive and negative reviews. The authors considered only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included. In the official release the dataset is divided evenly into a training set and a test set of 25,000 labeled reviews each; here we work with all 50,000 rows in a single CSV file.

  • Let’s read the dataset using the pd.read_csv method and print its top five rows using the head() method.
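A minimal sketch of the loading step, using a two-row in-memory stand-in for the CSV (the real file name, assumed here to be 'IMDB Dataset.csv', depends on your download):

```python
from io import StringIO

import pandas as pd

# Two-row stand-in; with the real file use pd.read_csv('IMDB Dataset.csv')
sample_csv = StringIO(
    "review,sentiment\n"
    "A wonderful little production.,positive\n"
    "Basically a terrible film.,negative\n"
)
imdb_data = pd.read_csv(sample_csv)
print(imdb_data.head())  # top five rows (here, only two)
```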

Next, perform a few basic operations on the dataset to get a feel for it.
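For example, shape, describe(), and info() give a quick overview (sketched here on a tiny stand-in frame):

```python
import pandas as pd

# Stand-in for the 50,000-row frame loaded from the CSV
imdb_data = pd.DataFrame({
    "review": ["A wonderful little production.", "Basically a terrible film."],
    "sentiment": ["positive", "negative"],
})
print(imdb_data.shape)       # (rows, columns)
print(imdb_data.describe())  # count / unique / top / freq for object columns
imdb_data.info()             # dtypes and non-null counts
```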


The .value_counts() method counts the data points in the sentiment column. There are equal numbers of negative and positive reviews, so the dataset is balanced.
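A sketch of the count check (on the real column there are 25,000 of each label):

```python
import pandas as pd

# Stand-in frame with an equal number of each label
imdb_data = pd.DataFrame({"sentiment": ["positive", "negative"] * 3})
counts = imdb_data["sentiment"].value_counts()
print(counts)  # equal counts -> balanced dataset
```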


Let’s split the dataset into train (train_reviews, train_sentiments) and test (test_reviews, test_sentiments) sets: 40 thousand rows for model training and the remaining 10 thousand rows for testing.
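The split is positional: the first 40,000 rows train, the last 10,000 test. A runnable sketch with a synthetic 50,000-row stand-in frame:

```python
import pandas as pd

# Synthetic stand-in for the real 50,000-row frame from the CSV
imdb_data = pd.DataFrame({
    "review": ["some review text"] * 50000,
    "sentiment": ["positive", "negative"] * 25000,
})

train_reviews = imdb_data.review[:40000]
train_sentiments = imdb_data.sentiment[:40000]
test_reviews = imdb_data.review[40000:]
test_sentiments = imdb_data.sentiment[40000:]

print(train_reviews.shape, train_sentiments.shape)
print(test_reviews.shape, test_sentiments.shape)
```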

[O/P]: (40000,) (40000,)
(10000,) (10000,)

Now split each review into tokens using ToktokTokenizer(), and set up the English stop-word list via nltk.corpus.stopwords.words(‘english’).
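A sketch of the tokenizer in action (the stop-word line is shown as a comment because it requires a one-time corpus download):

```python
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
tokens = tokenizer.tokenize("This movie was surprisingly good")
print(tokens)

# One-time download, then load the English stop words:
# nltk.download('stopwords')
# stopword_list = nltk.corpus.stopwords.words('english')
```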

Let’s clean the data and remove unwanted noise. The review column contains unwanted text that may decrease the performance of the model:
the dataset is littered with HTML tags, special symbols, etc.
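A regex sketch of the HTML-stripping step (the pattern below is a simple stand-in; a parser such as BeautifulSoup would also work):

```python
import re

def strip_html(text):
    # Drop HTML tags such as <br /> that litter the raw reviews
    return re.sub(r'<[^>]+>', ' ', text)

print(strip_html("Great film.<br /><br />Loved it."))
```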

Next, remove special characters such as /, &, *, (), and > from the dataset.
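One way to do this is a character-class regex that keeps only letters, digits, and whitespace (an assumption about the exact pattern used):

```python
import re

def remove_special_characters(text):
    # Keep letters, digits and spaces; drop /, &, *, (), > and the rest
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

print(remove_special_characters("What a movie!! (truly) & more >"))
```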

Then set the stop words from the English vocabulary, and filter them out of the dataset.
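A sketch of the filtering step. The tiny stop list here is illustrative; in the notebook it comes from nltk.corpus.stopwords.words('english'):

```python
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
# Illustrative stand-in for the full NLTK English stop-word list
stopword_list = {"is", "a", "an", "the", "and", "was"}

def remove_stopwords(text):
    # Tokenize, drop stop words (case-insensitively), and rejoin
    tokens = [tok.strip() for tok in tokenizer.tokenize(text)]
    filtered = [tok for tok in tokens if tok.lower() not in stopword_list]
    return ' '.join(filtered)

print(remove_stopwords("the movie was a complete mess"))
```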


A machine understands only numeric data, but our data is text, so let’s convert it into numeric form using CountVectorizer and TfidfVectorizer.
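A sketch of both vectorizers on a toy corpus. The parameter choices below, including ngram_range=(1, 3), are assumptions — an n-gram range like that would explain the six-million-column feature matrices in the output further down:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

train_reviews = ["good movie good plot", "bad movie terrible acting"]
test_reviews = ["good acting bad plot"]

# Bag of words
cv = CountVectorizer(min_df=0.0, max_df=1.0, binary=False, ngram_range=(1, 3))
cv_train = cv.fit_transform(train_reviews)   # fit on train only
cv_test = cv.transform(test_reviews)         # reuse the same vocabulary

# TF-IDF features
tv = TfidfVectorizer(min_df=0.0, max_df=1.0, use_idf=True, ngram_range=(1, 3))
tv_train = tv.fit_transform(train_reviews)
tv_test = tv.transform(test_reviews)

print('Tfidf_train:', tv_train.shape)
print('Tfidf_test:', tv_test.shape)
```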

[O/P]:Tfidf_train: (40000, 6209089)
Tfidf_test: (10000, 6209089)

Label and transform the sentiment data using LabelBinarizer().
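LabelBinarizer maps the two string labels to a 0/1 column vector; a sketch on four stand-in labels (the full column gives the (50000, 1) shape below):

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
# negative -> 0, positive -> 1 (classes are sorted alphabetically)
sentiment_data = lb.fit_transform(["positive", "negative", "positive", "negative"])
print(sentiment_data.shape)
```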

[O/P]: (50000, 1)

Now our data is almost ready for training, but there is one more step: split the binarized sentiment labels into the same train and test portions.
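The same positional split as before, applied to the label array (sketched with a synthetic stand-in for the binarized column):

```python
import numpy as np

# Stand-in for the (50000, 1) array produced by LabelBinarizer
sentiment_data = np.array([0, 1] * 25000).reshape(-1, 1)

train_sentiments = sentiment_data[:40000]
test_sentiments = sentiment_data[40000:]
print(train_sentiments.shape, test_sentiments.shape)
```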


Now it’s time to train the model. I’ll start with logistic regression, using different hyperparameters. After building the model we check its performance; if the model does not work well, hyperparameter tuning can be used to boost its accuracy.
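A sketch of the bag-of-words model fit on a toy corpus. The hyperparameter values below (penalty, C, max_iter, random_state) are illustrative assumptions, not necessarily the article's exact settings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_reviews = ["good movie", "great film", "bad movie", "terrible film"]
train_sentiments = [1, 1, 0, 0]  # 1 = positive, 0 = negative

cv = CountVectorizer()
cv_train = cv.fit_transform(train_reviews)

# Assumed hyperparameters; tune C / max_iter if the model underperforms
lr = LogisticRegression(penalty='l2', C=1.0, max_iter=500, random_state=42)
lr_bow = lr.fit(cv_train, train_sentiments)
```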


It’s time for prediction: feed the test data to the models. I make two kinds of predictions:
1 — Predicting with the model trained on bag-of-words features
2 — Predicting with the model trained on TF-IDF features
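Both prediction paths on a toy corpus (the vectorizers, model settings, and tiny data are stand-ins for the notebook's 40,000/10,000 split):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_reviews = ["good movie", "great film", "bad movie", "terrible film"]
train_sentiments = [1, 1, 0, 0]
test_reviews = ["great movie", "terrible acting"]

# 1 - bag-of-words prediction
cv = CountVectorizer()
lr_bow = LogisticRegression().fit(cv.fit_transform(train_reviews), train_sentiments)
lr_bow_predict = lr_bow.predict(cv.transform(test_reviews))

# 2 - tf-idf prediction
tv = TfidfVectorizer()
lr_tfidf = LogisticRegression().fit(tv.fit_transform(train_reviews), train_sentiments)
lr_tfidf_predict = lr_tfidf.predict(tv.transform(test_reviews))

print(lr_bow_predict)
print(lr_tfidf_predict)
```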

[O/P] : [0 0 0 ... 0 1 1]
[0 0 0 ... 0 1 1]

Let’s check the accuracy score for both predictions.
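accuracy_score compares the predictions against the true test labels; a sketch on stand-in vectors (the real ones have 10,000 entries):

```python
from sklearn.metrics import accuracy_score

# Stand-in labels and predictions; 3 of 4 correct -> 0.75
test_sentiments = [1, 0, 1, 0]
lr_bow_predict = [1, 0, 0, 0]

lr_bow_score = accuracy_score(test_sentiments, lr_bow_predict)
print('lr_bow_score :', lr_bow_score)
```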

[O/P]: lr_bow_score : 0.7512
lr_tfidf_score : 0.75
[O/P]:
              precision    recall  f1-score   support

    Positive       0.75      0.75      0.75      4993
    Negative       0.75      0.75      0.75      5007

    accuracy                           0.75     10000
   macro avg       0.75      0.75      0.75     10000
weighted avg       0.75      0.75      0.75     10000

              precision    recall  f1-score   support

    Positive       0.74      0.77      0.75      4993
    Negative       0.76      0.73      0.75      5007

    accuracy                           0.75     10000
   macro avg       0.75      0.75      0.75     10000
weighted avg       0.75      0.75      0.75     10000

[O/P]:
[[3768 1239]
 [1249 3744]]
[[3663 1344]
 [1156 3837]]

Now visualize the most frequent positive and negative words using word clouds.

