In this project, I used a labeled dataset to build a model for sentiment analysis of input text.
The dataset contains the text of tweets by Donald Trump, the former president of the United States.
Steps Of The Text Sentiment Analysis Project
There are 3 main steps in this project:
- Text Preprocessing
- Word Embedding
- Training The Model
I will explain each step and show the code for each section, but if you want to check out all of the code, the link to the Kaggle notebook is below.
Dataset Link:
https://www.kaggle.com/datasets/albertsavill/all-trumps-twitter-insults-with-sentiment
Notebook Link:
https://www.kaggle.com/code/alishafaghi/trump-s-tweets-sentiment-analysis-using-lstm
Data Exploration
Before anything else, I split the data into train and test sets at a 70%/30% ratio.
To begin, I wanted to look at the data carefully, so I used the Python pandas library to explore it.
First, I wanted to see a few rows from the top.
Code:
import pandas as pd

train = pd.read_csv("../input/trump-tweets/trump-train.csv", encoding='ISO-8859-1')
test = pd.read_csv('../input/trump-tweets/trump-test.csv')
test.head()
This is the result.
Output:
Next, I checked the general info of the data.
Here are the code and the results of what I checked.
Code:
Output:
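A minimal sketch of this kind of check, assuming the train DataFrame loaded earlier, is just pandas' DataFrame.info(), which lists the columns, their dtypes, and the non-null counts:

# quick overview of columns, dtypes, and missing values
train.info()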
Next, I wanted to see the number of tweets in each sentiment category.
Code:
import seaborn as sns

sns.set_style("whitegrid")
sns.set(rc={'figure.figsize': (11, 4)})
sns.countplot(x='sentiment_label', data=train)
Output:
Text Preprocessing
The next step is to preprocess and clean the text.
In this step I removed stop words, links, extra spaces, and punctuation. I also corrected misspellings and tokenized the text.
First, I used the NLTK library to identify the stop words, then I created a function using regular expressions (regex) to remove the things I just mentioned.
Code:
import re
from nltk.corpus import stopwords

# load stop words (requires the NLTK 'stopwords' corpus)
stop_word = stopwords.words('english')

def clean(text):
    # remove urls
    text = re.sub(r'http\S+', ' ', text)
    # remove mentions
    text = re.sub(r'@\w+', ' ', text)
    # remove hashtags
    text = re.sub(r'#\w+', ' ', text)
    # remove digits
    text = re.sub(r'\d+', ' ', text)
    # remove html tags
    text = re.sub(r'<.*?>', ' ', text)
    # remove stop words
    text = text.split()
    text = " ".join([word for word in text if word not in stop_word])
    return text
Then I applied the function to the train and test data separately.
train['tweet'] = train['tweet'].apply(lambda x: clean(x))
test['tweet'] = test['tweet'].apply(lambda x: clean(x))
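As a quick sanity check, you can run the cleaning function on a single made-up tweet (the text below is purely illustrative and not from the dataset):

print(clean("Check this out https://example.com @someone #fake 123 news"))
# expected: 'Check news' (URL, mention, hashtag, digits and stop words removed)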
Next, I used pandas' loc to keep only the columns I need in the train and test DataFrames.
Code:
train = train.loc[:,['tweet','sentiment_label']]
test = test.loc[:,['tweet','sentiment_label']]
Then I used a label encoder to turn each sentiment into a numeric class.
LabelEncoder assigns each class an integer index, so the negative, neutral, and positive labels become 0, 1, and 2 respectively (in sorted order).
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
# encode the labels in the 'sentiment_label' column
train['sentiment_label'] = label_encoder.fit_transform(train['sentiment_label'])
train['sentiment_label'].unique()
Output:
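To see exactly which integer each original class received, you can inspect the encoder's classes_ attribute; this check is a small addition and not part of the original notebook:

# maps each original label to its encoded integer
dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))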
Alright, so let's check again and see if the labels changed in the dataset too.
Code:
train.head()
Output:
Great, it works; as you can see, the sentiment labels are now numeric.
Next, I used the train_test_split function to split the training data into training and validation sets, 70% and 30% respectively.
Code:
from sklearn.model_selection import train_test_split

train_text, val_text, train_label, val_label = train_test_split(
    train.tweet, train.sentiment_label, test_size=0.3, random_state=42)
Then I used lemmatization to reduce each word to its base form, and after that I tokenized the text with a bag-of-words CountVectorizer.
Code:
from nltk.stem import WordNetLemmatizer  # requires the NLTK 'wordnet' corpus
from sklearn.feature_extraction.text import CountVectorizer

class Lemmatizer(object):
    # callable tokenizer that strips urls/punctuation and lemmatizes each word
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
    def __call__(self, sentence):
        sentence = re.sub(r'(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)', ' ', sentence)
        sentence = re.sub('[^0-9a-z]', ' ', sentence)
        return [self.lemmatizer.lemmatize(word) for word in sentence.split() if len(word) > 1]

tokenizer = CountVectorizer(max_features=5000, stop_words='english', lowercase=True, tokenizer=Lemmatizer())
train_x = tokenizer.fit_transform(train_text).toarray()
tokenizer.get_params()
Output:
Next, I collected the feature names and transformed the validation and test data with the same vectorizer and label encoder.
Code:
feature_names = tokenizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
val_x = tokenizer.transform(val_text).toarray()
test_x = test.tweet
test_label = label_encoder.transform(test['sentiment_label'])
test_x_1 = tokenizer.transform(test_x).toarray()
Word Embedding
Next, I started the word-embedding process. Note that the bag-of-words features above are a separate representation; for the LSTM, I used Keras' Tokenizer to turn each tweet into a padded sequence of word indices, and the embedding itself is learned by an Embedding layer inside the model.
Code:
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, GlobalMaxPool1D, Dense

early_stop = EarlyStopping(monitor='val_accuracy', patience=3)
reduceLR = ReduceLROnPlateau(monitor='val_accuracy', patience=2)

# the Tokenizer turns each tweet into a sequence of word indices; '<OOV>' stands in for unseen words
token = Tokenizer(num_words=5000, oov_token='<OOV>')
token.fit_on_texts(train_text)

train_x_2 = token.texts_to_sequences(train_text)
train_x_2 = pad_sequences(train_x_2, maxlen=60, padding='post', truncating='post')
val_x_2 = token.texts_to_sequences(val_text)
val_x_2 = pad_sequences(val_x_2, maxlen=60, padding='post', truncating='post')

embedding_dimension = 32
v = len(token.word_index)

model = Sequential()
model.add(Input(shape=(60,)))
model.add(Embedding(v + 1, embedding_dimension))
model.add(LSTM(64, return_sequences=True))
model.add(GlobalMaxPool1D())
model.add(Dense(64))
model.add(Dense(3, activation='softmax'))
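The network maps each padded sequence of 60 word indices to 32-dimensional embeddings, runs them through a 64-unit LSTM, pools the outputs over the time dimension, and ends with a 3-way softmax over the sentiment classes. A quick way to double-check the layer shapes and parameter counts before training (not shown in the original, but standard Keras) is:

model.summary()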
Model Training
Finally, I trained the model.
Code:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
r = model.fit(train_x_2, train_label, validation_data=(val_x_2, val_label),
              epochs=50, batch_size=64, callbacks=[reduceLR, early_stop])
Output:
Next, I plotted the training and validation loss.
Code:
import matplotlib.pyplot as plt

plt.plot(r.history['loss'], label='train loss')
plt.plot(r.history['val_loss'], label='validation loss')
plt.title('LOSS', fontdict={'size': '22'})
plt.legend()
plt.show()
Output:
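A natural next step, following the same pattern as the loss plot, is to plot the accuracy curves and evaluate the model on the held-out test tweets. The sketch below is not from the original notebook; it reuses the token, test, test_label, and model objects defined above, and test_x_2 is a name introduced here just for this example:

# accuracy curves, same pattern as the loss plot
plt.plot(r.history['accuracy'], label='train accuracy')
plt.plot(r.history['val_accuracy'], label='validation accuracy')
plt.title('ACCURACY', fontdict={'size': '22'})
plt.legend()
plt.show()

# evaluate on the held-out test set (test_x_2 is a new name introduced for this sketch)
test_x_2 = pad_sequences(token.texts_to_sequences(test.tweet), maxlen=60, padding='post', truncating='post')
model.evaluate(test_x_2, test_label)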