# mount gdrive
from google.colab import drive
drive.mount('/content/drive')
Background
Brexit is a term that refers to the withdrawal of the United Kingdom (UK) from the European Union (EU) after more than 40 years of membership. Officially, the UK left on 31 January 2020, making it the first and, so far, only country to leave the EU. The term ‘Brexit’ is a combination of the words ‘Britain’ and ‘exit’. As Brexit has significant implications for the people of the UK, diverging opinions (both positive and negative) arose around the event. Some argue the merits of Brexit, including more control over democracy, borders, and money, which they believe would improve areas such as healthcare, consumer rights, and the environment. Others oppose the idea, seeing the decision as having a negative impact on trade, migration, and investment. This complexity and delicacy are present in social media discussions, such as on Twitter.
This is the first part of the analysis of Brexit polarity tweets, covering the exploratory analysis. The project aims to build a neural network-based classifier to predict whether a tweet was created by a user who supports or opposes Brexit. The analysis leverages data from Kaggle: Brexit Polarity Tweets.
The project’s Github repository can be accessed here.
About the dataset
These datasets were collated as part of a dissertation project. This Twitter dataset covers the January - March 2022 period and comprises tweets relating to Brexit or Europe from Twitter accounts with publicly stated Brexit positions in their bios. It was collected using Boolean searches for both types of users.
The Boolean search for pro-Brexit tweets is:
(bio:“Brexit support” OR bio:“pro-brexit” OR bio:“pro brexit” OR bio:“Pro #Brexit” OR bio:brexiteer OR bio:probrexit) AND (EU OR Brexit OR CUSTOMS OR EUROPEAN OR EUROPE OR #Remain OR *Brexit OR #rejoinEU)
The Boolean search for anti-Brexit tweets is:
(bio:“anti brexit” OR bio:“anti-brexit” OR bio:“antibrexit” OR bio:“Pro remain” OR bio:“pro-remain” OR bio:remainer) AND (EU OR BREXIT OR CUSTOMS OR EUROPEAN OR EUROPE OR #Remain OR *Brexit)
1. Environment Setup
The notebook was run on the Google Colab platform, which provides additional functionality such as Google Drive connectivity and a pre-installed Kaggle API. To set up the analysis, several tasks were performed, including:
- Mounting Google Drive to access the Kaggle API credential
- Downloading the data directly from Kaggle using the Kaggle API
- Downloading the GloVe6B dataset for embedding the language data
- Importing essential libraries (NumPy, Pandas, Scikit-learn, TensorFlow 2, etc.)
- Specifying some constant variables
# download Brexit dataset
!mkdir ~/.kaggle
!cp /content/drive/MyDrive/.credentials/kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download visalakshiiyer/twitter-data-brexit
!unzip -d data/ twitter-data-brexit.zip
# download GloVe6B dataset
!wget 'https://huggingface.co/stanfordnlp/glove/resolve/main/glove.6B.zip'
from shutil import unpack_archive
import os
# unzip file and remove the archive and unused embedding sizes
unpack_archive('glove.6B.zip')
os.remove('glove.6B.300d.txt')
os.remove('glove.6B.200d.txt')
# os.remove('glove.6B.100d.txt')
os.remove('glove.6B.50d.txt')
os.remove('glove.6B.zip')
import numpy as np
import pandas as pd
import tensorflow as tf
import nltk
import re
import string
import pickle
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import (
Embedding, Conv1D, MaxPooling1D, Bidirectional, LSTM, GRU, SimpleRNN,
    Dense, Dropout
)
from IPython.display import clear_output

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
clear_output()

label_encoder = LabelEncoder()
lemmatizer = WordNetLemmatizer()
stopwords = set(nltk.corpus.stopwords.words('english'))
# variables related to dataset and glove data
INPUT_PATH_ANTI = '/content/data/TweetDataset_AntiBrexit_Jan-Mar2022.csv'
INPUT_PATH_PRO = '/content/data/TweetDataset_ProBrexit_Jan-Mar2022.csv'
INPUT_PATH_GLOVE = '/content/glove.6B.100d.txt'

# variables related to model checkpoints
HISTORY_PATH = '/content/drive/MyDrive/Projects/Big Projects/Sentiment Analysis using Deep Learning/history/history.pkl'
CHECKPOINT_PATH = '/content/drive/MyDrive/Projects/Big Projects/Sentiment Analysis using Deep Learning/checkpoint/cp-{epoch:02d}.ckpt'
CHECKPOINT_DIR = os.path.dirname(CHECKPOINT_PATH)

# variables related to modelling process
num_words = 30_000
row_limit = 100_000
embedding_dim = 100
test_split = 0.10
val_split = 0.10 / 0.90
2. Data Preparation
With the setup complete, we can prepare the data for training the model. This is done in several steps, from importing the dataset through cleaning and tokenizing the text to embedding the words. Specifically, preparing the data includes:
- Importing the data (pro- and anti-Brexit tweets)
- Sampling with a specified size (in this case, 100,000 tweets per class)
- Cleaning the data (removing unwanted parts such as emoticons and URLs, lemmatization, etc.)
- Splitting the data into train, test, and validation sets
- Embedding words using the pre-trained GloVe embeddings
# import data
tweets_pro = pd.read_csv(INPUT_PATH_PRO)
tweets_anti = pd.read_csv(INPUT_PATH_ANTI)
# specify sample indices from data
ind_pro = np.random.choice(len(tweets_pro), replace = False, size = row_limit)
ind_anti = np.random.choice(len(tweets_anti), replace = False, size = row_limit)
# check that the pro-brexit and anti-brexit tables match before binding them
assert np.mean(tweets_pro.dtypes == tweets_anti.dtypes) == 1
assert np.mean(tweets_pro.columns == tweets_anti.columns) == 1
# create dataset for modelling
tweets = pd.concat([tweets_pro["Hit Sentence"][ind_pro],
                    tweets_anti["Hit Sentence"][ind_anti]])
tweets = tweets.reset_index(drop = True)
labels = pd.Series(np.concatenate([np.repeat(["Pro"], row_limit),
                                   np.repeat(["Anti"], row_limit)]))
del tweets_pro
del tweets_anti
# Create pre-processing functions
def remove_qt_rt_uname(text):
    qt_rt = re.compile(r'(RT|QT)? ?@[\w]+:?')
    return qt_rt.sub(r'', text)

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

def remove_html(text):
    html = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return html.sub(r'', text)

def remove_punct(text):
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

def remove_emoji(text):
    emoji_pattern = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  # emoticons
        u'\U0001F300-\U0001F5FF'  # symbols & pictographs
        u'\U0001F680-\U0001F6FF'  # transport & map symbols
        u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
        u'\U00002702-\U000027B0'
        u'\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE
    )
    return emoji_pattern.sub(r'', text)

def remove_stopwords(text, stopwords = stopwords):
    return " ".join([w for w in text.split(" ") if w.lower() not in stopwords])

def lemmatize(text, lemmatizer = lemmatizer):
    return " ".join([lemmatizer.lemmatize(w) for w in text.split(" ")])
# Pre-process data
tweets = tweets.apply(lambda tweet: remove_qt_rt_uname(tweet)) \
               .apply(lambda tweet: remove_URL(tweet)) \
               .apply(lambda tweet: remove_emoji(tweet)) \
               .apply(lambda tweet: remove_html(tweet)) \
               .apply(lambda tweet: remove_stopwords(tweet)) \
               .apply(lambda tweet: remove_punct(tweet)) \
               .apply(lambda tweet: lemmatize(tweet))
# cap the test and validation sets at 5,000 tweets each when the sample is large
if len(tweets) >= 50000:
    test_split = 5000 / len(tweets)
    val_split = 5000 / (len(tweets) * (1 - test_split))
# Split data into train, test, and validation
X_train, X_test, y_train, y_test = train_test_split(
    tweets,
    labels,
    stratify = labels,
    test_size = test_split,
    random_state = 321
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train,
    y_train,
    stratify = y_train,
    test_size = val_split,
    random_state = 321
)
# Check Data
print(f"Train Dataset Size = {len(X_train)}")
print(f"Test Dataset Size = {len(X_test)}")
print(f"Val Dataset Size = {len(X_val)}")
Train Dataset Size = 190000
Test Dataset Size = 5000
Val Dataset Size = 5000
# Tokenize words
tokenizer = Tokenizer(num_words = num_words, oov_token = "<<OOV>>")
tokenizer.fit_on_texts(X_train)

# Create sequences
sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_test = tokenizer.texts_to_sequences(X_test)
sequences_val = tokenizer.texts_to_sequences(X_val)

X_train = pad_sequences(sequences_train, maxlen=256, truncating='pre')
X_test = pad_sequences(sequences_test, maxlen=256, truncating='pre')
X_val = pad_sequences(sequences_val, maxlen=256, truncating='pre')
# Encode labels
# fit the encoder on the training labels, then reuse the same mapping for test and validation
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)
y_val = label_encoder.transform(y_val)
# Check data
vocabSize = len(tokenizer.index_word) + 1
print(f"Vocabulary size = {vocabSize}")
print(f"X train shape = {X_train.shape}")
print(f"X val shape = {X_val.shape}")
print(f"X test shape = {X_test.shape}")
print(f"y train shape = {y_train.shape}")
print(f"y val shape = {y_val.shape}")
print(f"y test shape = {y_test.shape}")
Vocabulary size = 76835
X train shape = (190000, 256)
X val shape = (5000, 256)
X test shape = (5000, 256)
y train shape = (190000,)
y val shape = (5000,)
y test shape = (5000,)
embeddings_index = {}
num_tokens = vocabSize

# Read word vectors
with open(INPUT_PATH_GLOVE) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        word = lemmatizer.lemmatize(word)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

# Assign word vectors to our dictionary/vocabulary
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
3. Model Building
The next step is to build the model. This process contains several tasks.
- The first task is to configure the training process. Here I define five callback functions: early stopping, a learning rate scheduler, a learning rate reducer, a model checkpoint, and a terminator that stops training when the loss becomes NaN.
- The second task is to define the model. I wrote the create_model function to generate the model. The architecture combines convolutional and recurrent layers with the following attributes:
  - An embedding layer that converts input sequences into vector representations based on the GloVe embeddings, with weights that are updated during training
  - Two 1-D convolutional layers, each followed by a max-pooling layer
  - An RNN layer
  - A dense layer
  - L2 regularization applied to the kernel and recurrent weights
  - Several dropout layers
- Lastly, the model is trained for a maximum of 30 epochs using the training and validation data.
# define callback functions
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor = 'val_loss',
    min_delta = 0.001,
    patience = 10
)

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch, lr: lr if epoch < 25 else lr * tf.math.exp(-0.01)
)

lr_reducer = tf.keras.callbacks.ReduceLROnPlateau(
    monitor = 'val_loss',
    factor = 0.9,
    patience = 3,
    min_delta = 0.001
)

check_point = tf.keras.callbacks.ModelCheckpoint(
    filepath = CHECKPOINT_PATH,
    verbose = 0,
    save_weights_only = True,
    save_freq = 'epoch'
)

nan_terminator = tf.keras.callbacks.TerminateOnNaN()

callbacks = [
    early_stopping,
    lr_scheduler,
    lr_reducer,
    nan_terminator,
    check_point
]
learning_rate = 1e-4
def create_model(num_tokens, embedding_dim, embedding_matrix):
    regularizer = tf.keras.regularizers.l2(0.0001)
    embeddings = tf.keras.initializers.Constant(embedding_matrix)
    loss = tf.keras.losses.BinaryCrossentropy()
    optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate)
    metrics = ["accuracy"]

    model = tf.keras.Sequential([
        Embedding(num_tokens,
                  embedding_dim,
                  embeddings_initializer = embeddings,
                  trainable = True),
        Conv1D(filters = 64,
               kernel_size = 3,
               padding = "causal",
               activation = "relu"),
        Dropout(0.4),
        MaxPooling1D(pool_size = 2),
        Conv1D(filters = 216,
               kernel_size = 3,
               padding = "causal",
               activation = "relu"),
        Dropout(0.4),
        MaxPooling1D(pool_size = 2),
        SimpleRNN(128,
                  activation = "relu",
                  kernel_regularizer = regularizer,
                  recurrent_regularizer = regularizer),
        Dropout(0.4),
        Dense(1024, activation = 'relu', kernel_regularizer = regularizer),
        Dropout(0.4),
        Dense(1, activation = 'sigmoid')
    ])

    model.compile(loss = loss,
                  optimizer = optimizer,
                  metrics = metrics)

    return model

model = create_model(num_tokens, embedding_dim, embedding_matrix)
model.summary()
Model: "sequential_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_3 (Embedding) (None, None, 100) 7683500
conv1d_6 (Conv1D) (None, None, 64) 19264
dropout_12 (Dropout) (None, None, 64) 0
max_pooling1d_6 (MaxPooling (None, None, 64) 0
1D)
conv1d_7 (Conv1D) (None, None, 216) 41688
dropout_13 (Dropout) (None, None, 216) 0
max_pooling1d_7 (MaxPooling (None, None, 216) 0
1D)
simple_rnn_3 (SimpleRNN) (None, 128) 44160
dropout_14 (Dropout) (None, 128) 0
dense_6 (Dense) (None, 1024) 132096
dropout_15 (Dropout) (None, 1024) 0
dense_7 (Dense) (None, 1) 1025
=================================================================
Total params: 7,921,733
Trainable params: 7,921,733
Non-trainable params: 0
_________________________________________________________________
history = model.fit(
    X_train,
    y_train,
    epochs = 30,
    validation_data = (X_val, y_val),
    verbose = 1,
    callbacks = callbacks
)
Epoch 1/30
5938/5938 [==============================] - 407s 68ms/step - loss: 0.6638 - accuracy: 0.6349 - val_loss: 0.5620 - val_accuracy: 0.7490 - lr: 1.0000e-04
Epoch 2/30
5938/5938 [==============================] - 406s 68ms/step - loss: 0.4963 - accuracy: 0.7835 - val_loss: 0.4800 - val_accuracy: 0.8124 - lr: 1.0000e-04
Epoch 3/30
5938/5938 [==============================] - 402s 68ms/step - loss: 0.4224 - accuracy: 0.8249 - val_loss: 0.4390 - val_accuracy: 0.8328 - lr: 1.0000e-04
Epoch 4/30
5938/5938 [==============================] - 398s 67ms/step - loss: 0.3820 - accuracy: 0.8451 - val_loss: 0.4043 - val_accuracy: 0.8382 - lr: 1.0000e-04
Epoch 5/30
5938/5938 [==============================] - 396s 67ms/step - loss: 0.3552 - accuracy: 0.8584 - val_loss: 0.3784 - val_accuracy: 0.8490 - lr: 1.0000e-04
Epoch 6/30
5938/5938 [==============================] - 399s 67ms/step - loss: 0.3334 - accuracy: 0.8683 - val_loss: 0.3654 - val_accuracy: 0.8558 - lr: 1.0000e-04
Epoch 7/30
5938/5938 [==============================] - 399s 67ms/step - loss: 0.3187 - accuracy: 0.8740 - val_loss: 0.3562 - val_accuracy: 0.8584 - lr: 1.0000e-04
Epoch 8/30
5938/5938 [==============================] - 401s 68ms/step - loss: 0.3070 - accuracy: 0.8798 - val_loss: 0.3542 - val_accuracy: 0.8598 - lr: 1.0000e-04
Epoch 9/30
5938/5938 [==============================] - 403s 68ms/step - loss: 0.2959 - accuracy: 0.8854 - val_loss: 0.3419 - val_accuracy: 0.8642 - lr: 1.0000e-04
Epoch 10/30
5938/5938 [==============================] - 402s 68ms/step - loss: 0.2872 - accuracy: 0.8883 - val_loss: 0.3418 - val_accuracy: 0.8652 - lr: 1.0000e-04
Epoch 11/30
5938/5938 [==============================] - 398s 67ms/step - loss: 0.2781 - accuracy: 0.8931 - val_loss: 0.3390 - val_accuracy: 0.8682 - lr: 1.0000e-04
Epoch 12/30
5938/5938 [==============================] - 400s 67ms/step - loss: 0.2709 - accuracy: 0.8953 - val_loss: 0.3362 - val_accuracy: 0.8690 - lr: 1.0000e-04
Epoch 13/30
5938/5938 [==============================] - 396s 67ms/step - loss: 0.2654 - accuracy: 0.8985 - val_loss: 0.3373 - val_accuracy: 0.8704 - lr: 1.0000e-04
Epoch 14/30
5938/5938 [==============================] - 397s 67ms/step - loss: 0.2594 - accuracy: 0.9012 - val_loss: 0.3358 - val_accuracy: 0.8696 - lr: 1.0000e-04
Epoch 15/30
5938/5938 [==============================] - 396s 67ms/step - loss: 0.2532 - accuracy: 0.9042 - val_loss: 0.3396 - val_accuracy: 0.8668 - lr: 1.0000e-04
Epoch 16/30
5938/5938 [==============================] - 395s 67ms/step - loss: 0.2475 - accuracy: 0.9065 - val_loss: 0.3390 - val_accuracy: 0.8672 - lr: 9.0000e-05
Epoch 17/30
5938/5938 [==============================] - 395s 66ms/step - loss: 0.2432 - accuracy: 0.9082 - val_loss: 0.3370 - val_accuracy: 0.8698 - lr: 9.0000e-05
Epoch 18/30
5938/5938 [==============================] - 400s 67ms/step - loss: 0.2392 - accuracy: 0.9096 - val_loss: 0.3379 - val_accuracy: 0.8682 - lr: 9.0000e-05
Epoch 19/30
5938/5938 [==============================] - 404s 68ms/step - loss: 0.2333 - accuracy: 0.9122 - val_loss: 0.3355 - val_accuracy: 0.8692 - lr: 8.1000e-05
Epoch 20/30
5938/5938 [==============================] - 398s 67ms/step - loss: 0.2299 - accuracy: 0.9144 - val_loss: 0.3375 - val_accuracy: 0.8684 - lr: 8.1000e-05
Epoch 21/30
5938/5938 [==============================] - 396s 67ms/step - loss: 0.2268 - accuracy: 0.9153 - val_loss: 0.3382 - val_accuracy: 0.8696 - lr: 8.1000e-05
Epoch 22/30
5938/5938 [==============================] - 393s 66ms/step - loss: 0.2227 - accuracy: 0.9172 - val_loss: 0.3427 - val_accuracy: 0.8668 - lr: 7.2900e-05
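The notebook imports pickle and defines HISTORY_PATH and CHECKPOINT_DIR without using them in the cells shown above. Below is a minimal sketch of how the training history could be persisted to Google Drive and how weights written by the ModelCheckpoint callback could be restored later; this is an assumption about the intended use, not part of the original notebook.
# persist the Keras training history for later inspection (sketch; assumes HISTORY_PATH is writable)
os.makedirs(os.path.dirname(HISTORY_PATH), exist_ok = True)
with open(HISTORY_PATH, 'wb') as f:
    pickle.dump(history.history, f)

# optionally, reload the most recent weights saved by the ModelCheckpoint callback
# latest = tf.train.latest_checkpoint(CHECKPOINT_DIR)
# model.load_weights(latest)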
4. Model Evaluation
The trained model is evaluated on the test data. The results show that the model can classify Brexit-related tweets with an accuracy of 86.3%.
model.evaluate(X_test, y_test)
157/157 [==============================] - 1s 8ms/step - loss: 0.3450 - accuracy: 0.8632
[0.34497830271720886, 0.8632000088691711]
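As an illustration of how the trained classifier could be applied to unseen text, the sketch below reuses the cleaning functions, the fitted tokenizer, and label_encoder defined earlier. The predict_polarity helper and the sample tweet are hypothetical additions, not part of the original notebook.
# classify a single (hypothetical) tweet with the trained model
def predict_polarity(text):
    # apply the same cleaning pipeline used on the training data
    for clean in (remove_qt_rt_uname, remove_URL, remove_emoji, remove_html,
                  remove_stopwords, remove_punct, lemmatize):
        text = clean(text)
    seq = pad_sequences(tokenizer.texts_to_sequences([text]),
                        maxlen = 256, truncating = 'pre')
    prob = float(model.predict(seq)[0][0])  # sigmoid output
    label = label_encoder.inverse_transform([int(prob >= 0.5)])[0]
    return label, prob

print(predict_polarity("Leaving the EU was the best decision we ever made"))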
5. Conclusion
Based on the whole process, it can be concluded that a deep learning model is able to find patterns that differentiate pro- and anti-Brexit tweets, predicting with about 86 percent accuracy. Furthermore, combining convolutional and recurrent layers proved to work well for this type of data. Other architectures were also tried (e.g., a plain feed-forward neural network, a pure recurrent neural network, a pure convolutional neural network, and networks with LSTM layers), but most were not as good as this architecture in terms of model performance and training speed. The analysis also showed that pre-trained word embeddings can be used when training a deep learning model on natural language data.