NLP — SENTIMENT ANALYSIS ON IMDB MOVIE DATASET
Using NLP to perform sentiment analysis on movie reviews
First of all, let’s import all the necessary packages for IMDB sentiment analysis.
NumPy and Pandas are used for data analysis and manipulation. Seaborn, Matplotlib, and WordCloud are used for data visualization.
scikit-learn is used for model training, and NLTK, TextBlob, and spaCy are used for NLP operations, e.g. tokenization and stemming.
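A minimal set of imports covering these libraries might look like the sketch below; which ones you actually need depends on the steps you run:

import numpy as np                       # numerical operations
import pandas as pd                      # data loading and manipulation
import seaborn as sns                    # statistical plots
import matplotlib.pyplot as plt          # general plotting
from wordcloud import WordCloud          # word-cloud visualization

import nltk
from nltk.tokenize.toktok import ToktokTokenizer   # fast word tokenizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix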
The Large Movie Review Dataset contains a collection of 50,000 reviews from IMDB, with an equal number of positive and negative reviews. The authors considered only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included. In its original form the dataset is divided into a training set and a test set of 25,000 labeled reviews each; here we work with all 50,000 reviews as a single CSV.
- Let’s read the dataset using the pd.read_csv method and print the top five rows using the head() method.
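A sketch of loading the data; the file name 'IMDB Dataset.csv' is an assumption (the common Kaggle name for the combined 50,000-review CSV):

imdb_data = pd.read_csv('IMDB Dataset.csv')   # file name is an assumption
print(imdb_data.shape)                        # expect (50000, 2): review text and sentiment label
imdb_data.head()                              # show the first five rows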
Next, perform a few operations on the dataset to analyze it.
The .value_counts() method counts the data points in the "sentiment" column. There are equal numbers of negative and positive reviews, so the dataset is balanced.
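For example:

imdb_data['sentiment'].value_counts()
# Expected: 25000 positive and 25000 negative -> balanced classes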
Let’s split the dataset into train (train_reviews, train_sentiments) and test (test_reviews, test_sentiments) sets: 40 thousand rows are used for model training, and the remaining 10 thousand rows for testing.
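A sketch of the split, assuming a simple positional slice (the article does not say how the rows are divided):

train_reviews = imdb_data.review[:40000]        # first 40,000 reviews for training
train_sentiments = imdb_data.sentiment[:40000]
test_reviews = imdb_data.review[40000:]         # remaining 10,000 for testing
test_sentiments = imdb_data.sentiment[40000:]
print(train_reviews.shape, train_sentiments.shape)
print(test_reviews.shape, test_sentiments.shape)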
[O/P]: (40000,) (40000,)
(10000,) (10000,)
Now split the sentences into tokens using ToktokTokenizer(), and set up the English stop words using nltk.corpus.stopwords.words('english').
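For example:

tokenizer = ToktokTokenizer()                   # word tokenizer
nltk.download('stopwords')                      # fetch the stop-word list once
stopword_list = nltk.corpus.stopwords.words('english')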
Let’s clean the data and remove the unwanted noise. The review column contains unwanted text that may degrade the performance of the model.
The dataset consists of unwanted HTML tags, special symbols, etc.
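One way to strip the HTML tags first; BeautifulSoup is not named in the article, so treat it as an assumed approach (a regex would also work):

from bs4 import BeautifulSoup

def strip_html(text):
    # Parse the review and return only its text content, dropping tags like <br />
    return BeautifulSoup(text, 'html.parser').get_text()

imdb_data['review'] = imdb_data['review'].apply(strip_html)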
Next, remove the special characters [/, &, *, (), >] from the dataset.
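A regex-based sketch; the exact character whitelist is an assumption:

import re

def remove_special_characters(text):
    # Keep only letters, digits, and whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

imdb_data['review'] = imdb_data['review'].apply(remove_special_characters)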
Then, using the English stop-word list set up earlier, filter the stop words out of the dataset.
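A sketch of the filtering, using a hypothetical helper remove_stopwords:

def remove_stopwords(text):
    # Tokenize, drop English stop words, and rejoin into a single string
    tokens = [token.strip() for token in tokenizer.tokenize(text)]
    filtered = [token for token in tokens if token.lower() not in stopword_list]
    return ' '.join(filtered)

imdb_data['review'] = imdb_data['review'].apply(remove_stopwords)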
A machine understands only numeric data, but our data is text, so let’s convert it into numeric form using CountVectorizer (bag of words) and TfidfVectorizer (TF-IDF).
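A sketch of the vectorization; the hyperparameters (in particular ngram_range=(1, 3), which would account for the roughly 6.2 million features in the output below) are assumptions:

norm_train_reviews = imdb_data.review[:40000]   # re-slice after cleaning so the cleaned text is used
norm_test_reviews = imdb_data.review[40000:]

cv = CountVectorizer(min_df=0.0, max_df=1.0, binary=False, ngram_range=(1, 3))
cv_train_reviews = cv.fit_transform(norm_train_reviews)   # learn the vocabulary on the training data
cv_test_reviews = cv.transform(norm_test_reviews)         # reuse the same vocabulary on the test data

tv = TfidfVectorizer(min_df=0.0, max_df=1.0, use_idf=True, ngram_range=(1, 3))
tv_train_reviews = tv.fit_transform(norm_train_reviews)
tv_test_reviews = tv.transform(norm_test_reviews)

print('Tfidf_train:', tv_train_reviews.shape)
print('Tfidf_test:', tv_test_reviews.shape)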
[O/P]:Tfidf_train: (40000, 6209089)
Tfidf_test: (10000, 6209089)
Label and transform the sentiment data using LabelBinarizer().
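For example:

lb = LabelBinarizer()                                # maps negative/positive to 0/1
sentiment_data = lb.fit_transform(imdb_data['sentiment'])
print(sentiment_data.shape)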
[O/P]: (50000, 1)
Now our data is almost ready for training, but one step remains: split the binarized sentiment labels into train and test sets the same way as the reviews.
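Assuming the same positional split as before:

train_sentiments = sentiment_data[:40000]   # labels for the 40,000 training rows
test_sentiments = sentiment_data[40000:]    # labels for the 10,000 test rows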
Now it’s time to train the model. I will start with logistic regression, trying different hyperparameters. After building the model, we check its performance; if the model does not work well, hyperparameter tuning can be used to boost its accuracy.
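A sketch of fitting one model per feature set; the hyperparameter values here are illustrative, not the article’s exact settings:

lr_bow = LogisticRegression(penalty='l2', max_iter=500, C=1.0, random_state=42)
lr_bow.fit(cv_train_reviews, train_sentiments.ravel())    # bag-of-words model; ravel() flattens the (n, 1) labels

lr_tfidf = LogisticRegression(penalty='l2', max_iter=500, C=1.0, random_state=42)
lr_tfidf.fit(tv_train_reviews, train_sentiments.ravel())  # TF-IDF model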
It’s time for prediction. Give the models the test data to make predictions. I make two kinds, shown in the sketch after this list:
1 — Predicting with the bag-of-words features
2 — Predicting with the TF-IDF features
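A sketch:

lr_bow_predict = lr_bow.predict(cv_test_reviews)      # bag-of-words predictions
print(lr_bow_predict)
lr_tfidf_predict = lr_tfidf.predict(tv_test_reviews)  # TF-IDF predictions
print(lr_tfidf_predict)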
[O/P] : [0 0 0 ... 0 1 1]
[0 0 0 ... 0 1 1]
Let’s check the accuracy score for both predictions.
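For example, with sklearn’s accuracy_score:

lr_bow_score = accuracy_score(test_sentiments, lr_bow_predict)
print('lr_bow_score :', lr_bow_score)
lr_tfidf_score = accuracy_score(test_sentiments, lr_tfidf_predict)
print('lr_tfidf_score :', lr_tfidf_score)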
[O/P] :lr_bow_score : 0.7512
lr_tfidf_score : 0.75
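We can also print per-class precision, recall, and F1 with classification_report; a sketch, where target_names are applied to labels 0 and 1 in order, mirroring the original output:

lr_bow_report = classification_report(test_sentiments, lr_bow_predict,
                                      target_names=['Positive', 'Negative'])
print(lr_bow_report)
lr_tfidf_report = classification_report(test_sentiments, lr_tfidf_predict,
                                        target_names=['Positive', 'Negative'])
print(lr_tfidf_report)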
[O/P]: Classification report for the bag-of-words model:

              precision    recall  f1-score   support

    Positive       0.75      0.75      0.75      4993
    Negative       0.75      0.75      0.75      5007

    accuracy                           0.75     10000
   macro avg       0.75      0.75      0.75     10000
weighted avg       0.75      0.75      0.75     10000

Classification report for the TF-IDF model:

              precision    recall  f1-score   support

    Positive       0.74      0.77      0.75      4993
    Negative       0.76      0.73      0.75      5007

    accuracy                           0.75     10000
   macro avg       0.75      0.75      0.75     10000
weighted avg       0.75      0.75      0.75     10000
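Finally for evaluation, the confusion matrices for both models; the labels=[1, 0] ordering is an assumption chosen so the rows match the output below:

cm_bow = confusion_matrix(test_sentiments, lr_bow_predict, labels=[1, 0])
print(cm_bow)
cm_tfidf = confusion_matrix(test_sentiments, lr_tfidf_predict, labels=[1, 0])
print(cm_tfidf)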
[O/P]: Confusion matrix for the bag-of-words model:
[[3768 1239]
 [1249 3744]]

Confusion matrix for the TF-IDF model:
[[3663 1344]
 [1156 3837]]
Now visualize the most frequent positive and negative words using a word cloud.
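A minimal sketch for the positive reviews (parameters like width and max_words are arbitrary choices):

# Join all positive reviews into one string (the sentiment column still holds the original labels)
positive_text = ' '.join(imdb_data.review[imdb_data.sentiment == 'positive'])
wordcloud = WordCloud(width=800, height=400, max_words=150,
                      background_color='white').generate(positive_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')   # render the cloud
plt.axis('off')
plt.show()
# Repeat with sentiment == 'negative' for the negative word cloud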
WRITTEN BY: Archit Choudhary