Brexit Polarity Tweets: Predictive Models (Part 2/2)

This is the second part of the analysis of Brexit polarity tweets, covering model development. The project aims to build a neural network-based classifier to predict whether a tweet was created by a person who supports or opposes Brexit.
Python
Tensorflow
Deep Learning
Author

Hanzholah S, M Noorosyid S

Published

February 10, 2023

Background

Brexit refers to the withdrawal of the United Kingdom (UK) from the European Union (EU) after more than 40 years of membership. The UK officially left on 31 January 2020, making it the first and, so far, only country ever to leave the EU. The term ‘Brexit’ is a blend of the words Britain and exit. Because Brexit has significant implications for the people of the UK, strongly diverging opinions, both positive and negative, arose around the event. Some argue the merits of Brexit include more control over democracy, borders, and money, which would improve several areas, e.g., healthcare, consumer rights, and the environment. Others oppose the idea because the decision negatively impacts trade, migration, and investment. This complexity and delicacy are present in social media discussions, such as on Twitter.

While the first part of this project covered the exploratory analysis, this second part focuses on model development: building a neural network-based classifier to predict whether a tweet was created by a user who supports or opposes Brexit. The analysis leverages data from Kaggle: Brexit Polarity Tweets.

The project’s Github repository can be accessed here.

About the dataset

These datasets were collated as part of a dissertation project. This Twitter dataset covers the January–March 2022 period and comprises tweets relating to Brexit or Europe from Twitter accounts with publicly stated Brexit positions in their bio. It was collected using a Boolean search for each type of user.

The Boolean search for pro-Brexit tweets is:

(bio:“Brexit support” OR bio:“pro-brexit” OR bio:“pro brexit” OR bio:“Pro #Brexit” OR bio:brexiteer OR bio:probrexit) AND (EU OR Brexit OR CUSTOMS OR EUROPEAN OR EUROPE OR #Remain OR *Brexit OR #rejoinEU)

The Boolean search for anti-Brexit tweets is:

(bio:“anti brexit” OR bio:“anti-brexit” OR bio:“antibrexit” OR bio:“Pro remain” OR bio:“pro-remain” OR bio:remainer) AND (EU OR BREXIT OR CUSTOMS OR EUROPEAN OR EUROPE OR #Remain OR *Brexit)
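As a rough illustration only (the toy DataFrame, column names, and regex patterns below are my own simplification, not the original collection pipeline, which ran the Boolean searches above against account bios at collection time), a similar bio-based filter could be expressed in pandas:

import pandas as pd

# toy data for illustration; the real collection used the Boolean searches above
accounts = pd.DataFrame({
    "bio":   ["Proud Brexiteer", "Anti-Brexit, pro remain", "Football fan"],
    "tweet": ["The EU needs us more than we need them",
              "Rejoin the EU now #rejoinEU",
              "What a match last night"],
})

bio_pro  = r"brexit support|pro[- ]?brexit|pro #brexit|brexiteer|probrexit"
bio_anti = r"anti[- ]?brexit|pro[- ]?remain|remainer"
topic    = r"\bEU\b|brexit|customs|europe|#remain|#rejoineu"

is_pro  = (accounts["bio"].str.contains(bio_pro, case=False)
           & accounts["tweet"].str.contains(topic, case=False))
is_anti = (accounts["bio"].str.contains(bio_anti, case=False)
           & accounts["tweet"].str.contains(topic, case=False))

print(accounts[is_pro])   # rows matching the pro-Brexit search
print(accounts[is_anti])  # rows matching the anti-Brexit search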

1. Environment Setup

The notebook was run on the Google Colab platform, which provides additional functionality such as Google Drive connectivity and a pre-installed Kaggle API. To set up the analysis, several tasks were performed, including:

  • Mounting Google Drive to access the Kaggle API credentials
  • Downloading data directly from Kaggle using the Kaggle API
  • Downloading the GloVe 6B dataset for word embeddings
  • Importing essential libraries (Numpy, Pandas, Scikit-learn, Tensorflow2, etc.)
  • Specifying some constant variables
# mount gdrive
from google.colab import drive
drive.mount('/content/drive')
# download Brexit dataset
!mkdir ~/.kaggle
!cp /content/drive/MyDrive/.credentials/kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download visalakshiiyer/twitter-data-brexit
!unzip -d data/ twitter-data-brexit.zip 

# download GloVe6B dataset
!wget 'https://huggingface.co/stanfordnlp/glove/resolve/main/glove.6B.zip'

from shutil import unpack_archive
import os

# unzip file
unpack_archive('glove.6B.zip')
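# keep glove.6B.100d.txt (the 100-dimensional vectors used below); delete the rest to save disk space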
os.remove('glove.6B.300d.txt')
os.remove('glove.6B.200d.txt')
# os.remove('glove.6B.100d.txt')
os.remove('glove.6B.50d.txt')
os.remove('glove.6B.zip')
import numpy as np
import pandas as pd
import tensorflow as tf
import nltk
import re
import string
import pickle
from IPython.display import clear_output  # used after the nltk downloads below

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import (
    Embedding, Conv1D, MaxPooling1D, Bidirectional, LSTM, GRU, SimpleRNN, 
    Dense, Dropout
) 
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
clear_output()

label_encoder = LabelEncoder()
lemmatizer = WordNetLemmatizer()
stopwords  = set(nltk.corpus.stopwords.words('english'))
# variables related to dataset and glove data
INPUT_PATH_ANTI  = '/content/data/TweetDataset_AntiBrexit_Jan-Mar2022.csv'
INPUT_PATH_PRO   = '/content/data/TweetDataset_ProBrexit_Jan-Mar2022.csv'
INPUT_PATH_GLOVE = '/content/glove.6B.100d.txt'

# variables related to model checkpoints
HISTORY_PATH    = '/content/drive/MyDrive/Projects/Big Projects/Sentiment Analysis using Deep Learning/history/history.pkl'
CHECKPOINT_PATH = '/content/drive/MyDrive/Projects/Big Projects/Sentiment Analysis using Deep Learning/checkpoint/cp-{epoch:02d}.ckpt'
CHECKPOINT_DIR  = os.path.dirname(CHECKPOINT_PATH)

# variables related to modelling process
num_words = 30_000
row_limit = 100_000
embedding_dim = 100
test_split = 0.10
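# val_split is defined relative to the data remaining after the 10% test split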
val_split  = 0.10 / 0.90

2. Data Preparation

With the setup completed, we can prepare the data for training the model. This is done in several steps, from importing the dataset to cleaning and tokenizing the text to embedding words. Specifically, the data preparation process includes:

  • Importing the data (pro- and anti-Brexit tweets)
  • Sampling each class with a specified size (100,000 tweets per class in this case)
  • Cleaning the data (removing unwanted parts such as emojis and URLs, lemmatizing, etc.)
  • Splitting the data into train, validation, and test sets
  • Embedding words using pre-trained GloVe embeddings
# import data
tweets_pro  = pd.read_csv(INPUT_PATH_PRO)
tweets_anti = pd.read_csv(INPUT_PATH_ANTI)
# specify sample indices from data
ind_pro  = np.random.choice(len(tweets_pro), replace = False, size = row_limit)
ind_anti = np.random.choice(len(tweets_anti), replace = False, size = row_limit)
# check that the pro-Brexit and anti-Brexit tables share the same schema before binding them
assert np.mean(tweets_pro.dtypes == tweets_anti.dtypes) == 1
assert np.mean(tweets_pro.columns == tweets_anti.columns) == 1

# create dataset for modelling
tweets = pd.concat([tweets_pro["Hit Sentence"][ind_pro], 
                    tweets_anti["Hit Sentence"][ind_anti]])
tweets = tweets.reset_index(drop = True)
labels = pd.Series(np.concatenate([np.repeat(["Pro"], row_limit),
                                   np.repeat(["Anti"], row_limit)]))

del tweets_pro
del tweets_anti
# Create pre-processing functions
def remove_qt_rt_uname(text):
    qt_rt = re.compile(r'(RT|QT)? ?@[\w]+:?')
    return qt_rt.sub(r'', text)

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

def remove_html(text):
    html = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return html.sub(r'', text)

def remove_punct(text):
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

def remove_emoji(text):
    emoji_pattern = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  # emoticons
        u'\U0001F300-\U0001F5FF'  # symbols & pictographs
        u'\U0001F680-\U0001F6FF'  # transport & map symbols
        u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
        u'\U00002702-\U000027B0'
        u'\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE
    )
    return emoji_pattern.sub(r'', text)

def remove_stopwords(text, stopwords = stopwords):
    return " ".join([w for w in text.split(" ") if w.lower() not in stopwords])

def lemmatize(text, lemmatizer = lemmatizer):
    return " ".join([lemmatizer.lemmatize(w) for w in text.split(" ")])
# Pre-process data
tweets = (tweets
    .apply(remove_qt_rt_uname)
    .apply(remove_URL)
    .apply(remove_emoji)
    .apply(remove_html)
    .apply(remove_stopwords)
    .apply(remove_punct)
    .apply(lemmatize))
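# with a large enough dataset, fix the validation and test sets at 5,000 tweets each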
if len(tweets) >= 50000:
    test_split = 5000 / len(tweets)
    val_split  = 5000 / (len(tweets) * (1 - test_split))
# Split data into train, test, and validation
X_train, X_test, y_train, y_test  = train_test_split(
    tweets, 
    labels, 
    stratify = labels, 
    test_size = test_split, 
    random_state = 321
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train, 
    y_train, 
    stratify = y_train, 
    test_size = val_split, 
    random_state = 321)


# Check Data
print(f"Train Dataset Size = {len(X_train)}")
print(f"Test Dataset Size  = {len(X_test)}")
print(f"Val Dataset Size   = {len(X_val)}")
Train Dataset Size = 190000
Test Dataset Size  = 5000
Val Dataset Size   = 5000
# Tokenize words
tokenizer = Tokenizer(num_words = num_words, oov_token = "<<OOV>>")
tokenizer.fit_on_texts(X_train)


# Create sequences
sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_test  = tokenizer.texts_to_sequences(X_test)
sequences_val   = tokenizer.texts_to_sequences(X_val)

X_train = pad_sequences(sequences_train, maxlen=256, truncating='pre')
X_test  = pad_sequences(sequences_test, maxlen=256, truncating='pre')
X_val   = pad_sequences(sequences_val, maxlen=256, truncating='pre')


# Encode labels (fit on the training labels, reuse the same encoding for val/test)
y_train = label_encoder.fit_transform(y_train)
y_test  = label_encoder.transform(y_test)
y_val   = label_encoder.transform(y_val)


# Check data
vocabSize = len(tokenizer.index_word) + 1
print(f"Vocabulary size = {vocabSize}")
print(f"X train shape   = {X_train.shape}")
print(f"X val shape     = {X_val.shape}")
print(f"X test shape    = {X_test.shape}")
print(f"y train shape   = {y_train.shape}")
print(f"y val shape     = {y_val.shape}")
print(f"y test shape    = {y_test.shape}")
Vocabulary size = 76835
X train shape   = (190000, 256)
X val shape     = (5000, 256)
X test shape    = (5000, 256)
y train shape   = (190000,)
y val shape     = (5000,)
y test shape    = (5000,)
embeddings_index = {}
num_tokens = vocabSize

# Read word vectors
with open(INPUT_PATH_GLOVE) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        word = lemmatizer.lemmatize(word)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs


# Assign word vectors to our dictionary/vocabulary
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
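
As an optional sanity check (a minimal sketch; the hits/misses counting is my addition, not part of the original notebook), we can see how much of the tokenizer vocabulary was matched in GloVe:

# count how many vocabulary words received a pre-trained GloVe vector
hits = sum(1 for word in tokenizer.word_index if word in embeddings_index)
misses = len(tokenizer.word_index) - hits
print(f"Words with a GloVe vector: {hits}; without: {misses}")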

3. Model Building

The next step is to build the model. This process involves several tasks.

  • The first task is to configure training. Here I define five callbacks: early stopping, a learning rate scheduler, a learning rate reducer, model checkpointing, and termination of training when the loss becomes NaN.
  • The second task is to define the model. I wrote the create_model function to generate the model, whose architecture combines convolutional and recurrent layers:
    • An embedding layer that converts input sequences into their vector representations, initialized with GloVe embeddings whose weights are updated during training
    • Two 1-D convolutional layers, each followed by a max-pooling layer
    • An RNN layer
    • A dense layer
    • L2 regularization applied to the kernel and recurrent weights
    • Several dropout layers
  • Lastly, the model is trained for a maximum of 30 epochs using the training and validation data.
# define callback functions
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor = 'val_loss',
    min_delta = 0.001,
    patience = 10
)

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch, lr: lr if epoch < 25 else lr * tf.math.exp(-0.01)
)

lr_reducer = tf.keras.callbacks.ReduceLROnPlateau(
    monitor = 'val_loss',
    factor = 0.9,
    patience = 3,
    min_delta = 0.001,
)

check_point = tf.keras.callbacks.ModelCheckpoint(
    filepath = CHECKPOINT_PATH,
    verbose = 0, 
    save_weights_only = True,
    save_freq='epoch'
)

nan_terminator = tf.keras.callbacks.TerminateOnNaN()

callbacks     = [
    early_stopping, 
    lr_scheduler, 
    lr_reducer, 
    nan_terminator, 
    check_point
]

learning_rate = 1e-4
def create_model(num_tokens, embedding_dim, embedding_matrix):
    regularizer = tf.keras.regularizers.l2(0.0001)
    embeddings  = tf.keras.initializers.Constant(embedding_matrix)
    loss        = tf.keras.losses.BinaryCrossentropy()
    optimizer   = tf.keras.optimizers.Adam(learning_rate = learning_rate)
    metrics     = ["accuracy"]

    model = tf.keras.Sequential([
        Embedding(num_tokens,
                  embedding_dim,
                  embeddings_initializer = embeddings,
                  trainable = True),
        Conv1D(filters = 64,
               kernel_size = 3,
               padding = "causal",
               activation = "relu"),
        Dropout(0.4),
        MaxPooling1D(pool_size = 2),
        Conv1D(filters = 216,
               kernel_size = 3,
               padding = "causal",
               activation = "relu"),
        Dropout(0.4),
        MaxPooling1D(pool_size = 2),
        SimpleRNN(128, 
                  activation = "relu", 
                  kernel_regularizer = regularizer,
                  recurrent_regularizer = regularizer),
        Dropout(0.4),
        Dense(1024, activation='relu', kernel_regularizer = regularizer),
        Dropout(0.4),
        Dense(1, activation='sigmoid')
    ])

    model.compile(loss = loss,
                  optimizer = optimizer,
                  metrics = metrics)

    return model
  
model = create_model(num_tokens, embedding_dim, embedding_matrix)
model.summary()
Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_3 (Embedding)     (None, None, 100)         7683500   
                                                                 
 conv1d_6 (Conv1D)           (None, None, 64)          19264     
                                                                 
 dropout_12 (Dropout)        (None, None, 64)          0         
                                                                 
 max_pooling1d_6 (MaxPooling  (None, None, 64)         0         
 1D)                                                             
                                                                 
 conv1d_7 (Conv1D)           (None, None, 216)         41688     
                                                                 
 dropout_13 (Dropout)        (None, None, 216)         0         
                                                                 
 max_pooling1d_7 (MaxPooling  (None, None, 216)        0         
 1D)                                                             
                                                                 
 simple_rnn_3 (SimpleRNN)    (None, 128)               44160     
                                                                 
 dropout_14 (Dropout)        (None, 128)               0         
                                                                 
 dense_6 (Dense)             (None, 1024)              132096    
                                                                 
 dropout_15 (Dropout)        (None, 1024)              0         
                                                                 
 dense_7 (Dense)             (None, 1)                 1025      
                                                                 
=================================================================
Total params: 7,921,733
Trainable params: 7,921,733
Non-trainable params: 0
_________________________________________________________________
history = model.fit(
    X_train, 
    y_train, 
    epochs = 30, 
    validation_data = (X_val, y_val), 
    verbose = 1,
    callbacks = callbacks
)
Epoch 1/30
5938/5938 [==============================] - 407s 68ms/step - loss: 0.6638 - accuracy: 0.6349 - val_loss: 0.5620 - val_accuracy: 0.7490 - lr: 1.0000e-04
Epoch 2/30
5938/5938 [==============================] - 406s 68ms/step - loss: 0.4963 - accuracy: 0.7835 - val_loss: 0.4800 - val_accuracy: 0.8124 - lr: 1.0000e-04
Epoch 3/30
5938/5938 [==============================] - 402s 68ms/step - loss: 0.4224 - accuracy: 0.8249 - val_loss: 0.4390 - val_accuracy: 0.8328 - lr: 1.0000e-04
Epoch 4/30
5938/5938 [==============================] - 398s 67ms/step - loss: 0.3820 - accuracy: 0.8451 - val_loss: 0.4043 - val_accuracy: 0.8382 - lr: 1.0000e-04
Epoch 5/30
5938/5938 [==============================] - 396s 67ms/step - loss: 0.3552 - accuracy: 0.8584 - val_loss: 0.3784 - val_accuracy: 0.8490 - lr: 1.0000e-04
Epoch 6/30
5938/5938 [==============================] - 399s 67ms/step - loss: 0.3334 - accuracy: 0.8683 - val_loss: 0.3654 - val_accuracy: 0.8558 - lr: 1.0000e-04
Epoch 7/30
5938/5938 [==============================] - 399s 67ms/step - loss: 0.3187 - accuracy: 0.8740 - val_loss: 0.3562 - val_accuracy: 0.8584 - lr: 1.0000e-04
Epoch 8/30
5938/5938 [==============================] - 401s 68ms/step - loss: 0.3070 - accuracy: 0.8798 - val_loss: 0.3542 - val_accuracy: 0.8598 - lr: 1.0000e-04
Epoch 9/30
5938/5938 [==============================] - 403s 68ms/step - loss: 0.2959 - accuracy: 0.8854 - val_loss: 0.3419 - val_accuracy: 0.8642 - lr: 1.0000e-04
Epoch 10/30
5938/5938 [==============================] - 402s 68ms/step - loss: 0.2872 - accuracy: 0.8883 - val_loss: 0.3418 - val_accuracy: 0.8652 - lr: 1.0000e-04
Epoch 11/30
5938/5938 [==============================] - 398s 67ms/step - loss: 0.2781 - accuracy: 0.8931 - val_loss: 0.3390 - val_accuracy: 0.8682 - lr: 1.0000e-04
Epoch 12/30
5938/5938 [==============================] - 400s 67ms/step - loss: 0.2709 - accuracy: 0.8953 - val_loss: 0.3362 - val_accuracy: 0.8690 - lr: 1.0000e-04
Epoch 13/30
5938/5938 [==============================] - 396s 67ms/step - loss: 0.2654 - accuracy: 0.8985 - val_loss: 0.3373 - val_accuracy: 0.8704 - lr: 1.0000e-04
Epoch 14/30
5938/5938 [==============================] - 397s 67ms/step - loss: 0.2594 - accuracy: 0.9012 - val_loss: 0.3358 - val_accuracy: 0.8696 - lr: 1.0000e-04
Epoch 15/30
5938/5938 [==============================] - 396s 67ms/step - loss: 0.2532 - accuracy: 0.9042 - val_loss: 0.3396 - val_accuracy: 0.8668 - lr: 1.0000e-04
Epoch 16/30
5938/5938 [==============================] - 395s 67ms/step - loss: 0.2475 - accuracy: 0.9065 - val_loss: 0.3390 - val_accuracy: 0.8672 - lr: 9.0000e-05
Epoch 17/30
5938/5938 [==============================] - 395s 66ms/step - loss: 0.2432 - accuracy: 0.9082 - val_loss: 0.3370 - val_accuracy: 0.8698 - lr: 9.0000e-05
Epoch 18/30
5938/5938 [==============================] - 400s 67ms/step - loss: 0.2392 - accuracy: 0.9096 - val_loss: 0.3379 - val_accuracy: 0.8682 - lr: 9.0000e-05
Epoch 19/30
5938/5938 [==============================] - 404s 68ms/step - loss: 0.2333 - accuracy: 0.9122 - val_loss: 0.3355 - val_accuracy: 0.8692 - lr: 8.1000e-05
Epoch 20/30
5938/5938 [==============================] - 398s 67ms/step - loss: 0.2299 - accuracy: 0.9144 - val_loss: 0.3375 - val_accuracy: 0.8684 - lr: 8.1000e-05
Epoch 21/30
5938/5938 [==============================] - 396s 67ms/step - loss: 0.2268 - accuracy: 0.9153 - val_loss: 0.3382 - val_accuracy: 0.8696 - lr: 8.1000e-05
Epoch 22/30
5938/5938 [==============================] - 393s 66ms/step - loss: 0.2227 - accuracy: 0.9172 - val_loss: 0.3427 - val_accuracy: 0.8668 - lr: 7.2900e-05
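
The history object returned by model.fit was not saved or plotted in the run above; since HISTORY_PATH is already defined and pickle imported, it could be persisted and the loss curves inspected roughly as follows (a sketch, not part of the original run; matplotlib is an additional assumption):

# save the Keras training history for later inspection (sketch)
with open(HISTORY_PATH, 'wb') as f:
    pickle.dump(history.history, f)

# plot train vs. validation loss per epoch
import matplotlib.pyplot as plt
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()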

4. Model Evaluation

The trained model is evaluated on the test data. The results show that it can classify Brexit-related tweets with an accuracy of 86.3%.

model.evaluate(X_test, y_test)
157/157 [==============================] - 1s 8ms/step - loss: 0.3450 - accuracy: 0.8632
[0.34497830271720886, 0.8632000088691711]
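
Beyond overall accuracy, a per-class breakdown can be derived from the test predictions. A minimal sketch using scikit-learn (the 0.5 threshold and the use of classification_report are my additions, not part of the original notebook):

from sklearn.metrics import classification_report, confusion_matrix

# threshold the sigmoid outputs at 0.5 to obtain hard class predictions
y_pred = (model.predict(X_test) >= 0.5).astype(int).ravel()

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))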

5. Conclusion

Based on the whole process, it can be concluded that a deep learning model can find patterns that differentiate pro- and anti-Brexit tweets, predicting with about 86 percent accuracy. Furthermore, combining convolutional and recurrent layers proved to work well for this type of data. Other architectures were also attempted (e.g., a plain feed-forward network, a pure recurrent network, a pure convolutional network, and networks with LSTM layers), but most were not as good as this architecture in terms of model performance and training speed. The analysis also showed that pre-trained word embeddings can be used to train a deep learning model on natural language data.