Sentiment Analysis

Introduction to Natural Language Processing

4 min read

Natural Language Processing is where human expression meets computational intelligence, unlocking the language of possibilities. ~ ChatGPT

Yeah, the above quote was generated by the overrated AI chatbot, ChatGPT. So, how do these chatbots work? How do they understand human language, and how can they assess it? Well, it's termed Natural Language Processing (NLP), which simply means making a machine process our (natural) language.

How does it work?

Machines can only understand zeroes and ones, but let's not stick to 0's and 1's and include other numbers in our discussion too (not imaginary ones, though :p). Let's say we have the sentence "I love Machine Learning!" It has 4 words: I, love, Machine, Learning. Let's add them to our dictionary: 1 for I, 2 for love, 3 for Machine, 4 for Learning. Now consider another sentence, "I also love chocolate." It contains the words I, also, love, chocolate. This time we only add the words also and chocolate to our dictionary, because it already contains I and love.

Once we have such a dictionary, the sentences are mapped against it to create a series of numbers instead of words. For instance, "I love Machine Learning!" becomes 1 2 3 4, and "I also love chocolate" becomes 1 5 2 6. These sequences are then further processed and fed to a model.
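To make this concrete, here's a tiny sketch of that dictionary idea in plain Python (the sentences and indices are just the toy example above; real tokenizers also handle punctuation, casing and more):

sentences = ['I love Machine Learning', 'I also love chocolate']

word_index = {}
for sentence in sentences:
    for word in sentence.split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1  # indices start at 1

sequences = [[word_index[word] for word in sentence.split()] for sentence in sentences]
print(word_index)  # {'I': 1, 'love': 2, 'Machine': 3, 'Learning': 4, 'also': 5, 'chocolate': 6}
print(sequences)   # [[1, 2, 3, 4], [1, 5, 2, 6]]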

Great! That's how NLP works (in a nutshell; there are a lot more processing techniques 🔥). Now let's build a small model that can classify our text as positive sentiment or negative sentiment.

Sentiment Analyzer

We need a dataset (obviously!!). The one I used for this tutorial is the IMDb dataset, which has 50k movie reviews! You can download it here. Let's begin by reading the dataset,

import pandas as pd
import numpy as np
import tensorflow as tf

dataset = pd.read_csv('../Datasets/IMDB_Dataset.csv')
print(dataset.head())

and clean our dataset by replacing the word positive with 1 and negative with 0;

clean_dataset = dataset.replace('positive',1).replace('negative',0)
print(clean_dataset.head())

Next, we split the data into training and validation sets; I chose 80% for training and 20% for validation.

splitValue = 0.8

train = clean_dataset.sample(frac=splitValue)  # random 80% of the rows, keeping their original index
validation = clean_dataset.drop(train.index)   # the remaining 20%
train_labels = np.array(train['sentiment'])
validation_labels = np.array(validation['sentiment'])

Now, there are three important phases to building our model,
1. Tokenizing
2. Padding
3. Building...yay!

Tokenizing

As we discussed before, every text has to be converted into a stream of numbers; this is called Tokenization. First, we create our vocabulary,

tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token='OOV') #Calling the Tokenizer class, passing a OOV token
tokenizer.fit_on_texts(clean_dataset['review']) #fitting the Tokenizer on our reviews, to generate a vocabulary.
💡
OOV is an acronym for out-of-vocabulary, which is a placeholder for the words that aren't in the vocabulary.
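
For example, once the tokenizer is fitted, any word it hasn't seen before falls back to the OOV token's index (in recent TF versions the OOV token is placed at index 1). A quick check might look like this (the misspelled word is just made up):

oov_index = tokenizer.word_index[tokenizer.oov_token]
print(oov_index)  # usually 1, since the OOV token is added to the vocabulary first
print(tokenizer.texts_to_sequences(['this movie was qzxwvy good']))  # 'qzxwvy' is replaced by oov_index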

Once we have the vocabulary ready, we transform all of our sentences into sequences as follows,

train_sequences = tokenizer.texts_to_sequences(train['review'])
validation_sequences = tokenizer.texts_to_sequences(validation['review'])

Padding

Not all sentences have the same length, so when we convert them to sequences of numbers, we get sequences of different lengths. But according to legends, a model must have a uniform shape across its data. So, we pad the sequences by simply adding zeroes at the beginning or the end, which gives us data of a uniform shape.

and it's done as follows,

train_padded = tf.keras.preprocessing.sequence.pad_sequences(train_sequences, maxlen=120, truncating='post')
validation_padded = tf.keras.preprocessing.sequence.pad_sequences(validation_sequences, maxlen=120, truncating='post')

here, maxlen is the maximum length a sequence can have, and truncating tells where sequences longer than maxlen get cut off (pre/post). The position of the padding itself is controlled by a separate padding argument, which defaults to 'pre'.
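
As a quick illustration of what pad_sequences does, here's a toy run with maxlen=5 (made-up numbers, purely for demonstration):

demo = tf.keras.preprocessing.sequence.pad_sequences([[1, 2, 3], [1, 2, 3, 4, 5, 6, 7]], maxlen=5, truncating='post')
print(demo)
# [[0 0 1 2 3]   <- short sequence padded with zeroes at the front (padding defaults to 'pre')
#  [1 2 3 4 5]]  <- long sequence cut off at the end because truncating='post'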

Building the Model

Finally, we build the model as follows:

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(200000, 16, input_length=120),  # maps each token index to a 16-dimensional vector
    tf.keras.layers.Flatten(),                                 # 120 x 16 embeddings -> one flat vector of 1920 values
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),            # single output: probability that the review is positive
])

Embedding is a layer especially used in NLP to convert integers into dense vector representations, which help capture relationships within the data.

Flatten reduces data of multiple dimensions to a single dimension.
eg: [[[1,2,3],[4,5,6]]] --> [1,2,3,4,5,6]

Dense is a layer that takes input from the previous layer, applies weights and a bias, and passes the result to the next layer.

ReLU stands for Rectified Linear Unit; it is an activation function that returns 0 if the input is negative, and returns the input itself if it is zero or positive.

Sigmoid is an activation function that takes any real-valued number as input and squashes it to a range between 0 and 1. Specifically, large positive values are mapped close to 1, large negative values are mapped close to 0, and the value 0 is mapped to exactly 0.5. This property makes sigmoid suitable for binary classification problems.
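
A tiny numeric check of both activations, written with NumPy just to illustrate the formulas (not part of the model itself):

def relu(x):
    return np.maximum(0, x)  # negatives become 0, everything else passes through unchanged

def sigmoid(x):
    return 1 / (1 + np.exp(-x))  # squashes any real number into (0, 1)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))     # [0. 0. 2.]
print(sigmoid(x))  # [0.1192 0.5 0.8808] (approximately)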

Compiling and training the model,

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_padded, train_labels, epochs=10, validation_data=(validation_padded, validation_labels))

Adam is an optimization algorithm and Binary Crossentropy is a loss function for binary classification tasks.
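
Once training finishes, classifying a new review reuses the same tokenizer and padding settings. A minimal sketch (the review text is just an example, and the 0.5 threshold is simply the usual convention for a sigmoid output):

review = ['This movie was absolutely wonderful, I loved every minute of it!']
seq = tokenizer.texts_to_sequences(review)
padded = tf.keras.preprocessing.sequence.pad_sequences(seq, maxlen=120, truncating='post')

prediction = model.predict(padded)[0][0]  # a value between 0 and 1 from the sigmoid output
print('positive' if prediction >= 0.5 else 'negative', prediction)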

Yayyyy! We're done! You can get the notebook file here.

Until next time, Sree Teja Dusi.
