Brexit refers to the withdrawal of the United Kingdom (UK) from the European Union (EU) after 47 years of membership. The UK officially left on 31 January 2020, making it the first and only member state to have left the EU. The term ‘Brexit’ is a blend of the words ‘Britain’ and ‘exit’. Because Brexit has significant implications for the people of the UK, diverging opinions, both positive and negative, arose around the event. Supporters argue that Brexit brings more control over democracy, borders, and money, which would improve areas such as healthcare, consumer rights, and the environment. Opponents argue that the decision has a negative impact on trade, migration, and investment. This complexity and sensitivity show up in social media discussions, for example on Twitter.
This is the first part of the analysis of Brexit polarity tweets: the exploratory analysis. The project aims to build a neural network-based classifier that predicts whether a tweet was written by a user who supports or opposes Brexit. The analysis uses data from Kaggle: Brexit Polarity Tweets.
The project’s GitHub repository can be accessed here.
About the dataset
These datasets were collated as part of a dissertation project. The Twitter dataset covers the period from January to March 2022 and comprises tweets relating to Brexit or Europe from Twitter accounts that publicly state a Brexit position in their bio. It was collected using a separate Boolean search for each type of user.
The Boolean search for pro-Brexit tweets is:
(bio:“Brexit support” OR bio:“pro-brexit” OR bio:“pro brexit” OR bio:“Pro #Brexit” OR bio:brexiteer OR bio:probrexit) AND (EU OR Brexit OR CUSTOMS OR EUROPEAN OR EUROPE OR #Remain OR *Brexit OR #rejoinEU)
The Boolean search for anti-Brexit tweets is:
(bio:“anti brexit” OR bio:“anti-brexit” OR bio:“antibrexit” OR bio:“Pro remain” OR bio:“pro-remain” OR bio:remainer) AND (EU OR BREXIT OR CUSTOMS OR EUROPEAN OR EUROPE OR #Remain OR *Brexit)
1. Environment Setup
The notebook was run on the Google Colab platform, which provides additional functionality such as Google Drive connectivity and a pre-installed Kaggle API. Setting up the analysis involved several tasks:
- Mounting Google Drive to access the Kaggle API credentials
- Downloading data directly from Kaggle using the Kaggle API
- Downloading the GloVe 6B dataset for embedding language data
- Importing essential libraries (NumPy, pandas, scikit-learn, TensorFlow 2, etc.)
- Specifying some constant variables
2. Data Preparation
With the setup complete, we can prepare the data for training the model. This is done in several steps, from importing the dataset through cleaning and tokenizing the data to embedding the words. Specifically, preparing the data includes:
- Importing data (pro and anti tweets)
- Sampling with a specified size (in this case the sample size is 100,000)
- Cleaning data (removing unwanted parts such as emoticons and URLs, lemmatization, etc.)
- Splitting data into training, validation, and testing sets
- Embedding words using the pre-trained GloVe embeddings
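The cleaning, splitting, and embedding steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the function names, the regular expressions, the 10% validation and testing fractions, and the `word_index`/`vectors` inputs are all assumptions, and lemmatization is left out.

```python
import random
import re

# Assumed patterns; the original notebook may clean tweets differently.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # drops digits, punctuation, and emoji


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, and non-letter characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def split_dataset(rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle rows and split them into training, validation, and testing lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


def build_embedding_matrix(word_index, vectors, dim):
    """Build one row per token id; words missing from `vectors` stay all-zero.

    `word_index` maps word -> positive integer id (row 0 is kept for padding),
    and `vectors` maps word -> list of floats, e.g. parsed from a GloVe file.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
    return matrix
```

In practice the nested list returned by `build_embedding_matrix` would be converted to a NumPy array before being handed to the embedding layer.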
3. Model Building
The next step is to build the model. This process contains several tasks.
- The first task is to configure the training. Here I define five callback functions: early stopping, a learning rate scheduler, a learning rate reducer, a model checkpoint, and a training terminator that stops on a given loss value.
- The second task is to define the model. I wrote a function to generate the model. The architecture combines convolutional and recurrent layers, with the following components:
  - An embedding layer that converts input sequences into their vector representations based on GloVe embeddings, whose weights are updated during training
  - Two 1-D convolutional layers followed by a max-pooling layer
  - An RNN layer
  - A dense layer
  - L2 regularization, applied through the kernel and recurrent regularizers
  - Several dropout layers
- Lastly, the model is trained for a maximum of 30 epochs using the training and validation data.
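A rough Keras sketch of this setup is shown below. It is an illustration under stated assumptions rather than the notebook's code: the function names `lr_schedule`, `make_callbacks`, and `build_model` are hypothetical, and so are the filter counts, kernel sizes, dropout rates, L2 strength, and patience values. The choice of a GRU for the RNN layer, the pooling after each convolution, and `TerminateOnNaN` as the loss-based terminator are assumptions as well.

```python
def lr_schedule(epoch, lr):
    """Assumed schedule: hold the rate for 5 epochs, then decay it by 10% per epoch."""
    return float(lr) if epoch < 5 else float(lr) * 0.9


def make_callbacks(checkpoint_path="best_model.keras"):
    """Return the five kinds of callbacks named in the text (settings are assumed)."""
    # Imported here so lr_schedule stays usable without TensorFlow installed.
    from tensorflow.keras import callbacks

    return [
        callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                restore_best_weights=True),
        callbacks.LearningRateScheduler(lr_schedule),
        callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
        callbacks.ModelCheckpoint(checkpoint_path, monitor="val_loss",
                                  save_best_only=True),
        callbacks.TerminateOnNaN(),  # assumption: stop when the loss becomes NaN
    ]


def build_model(vocab_size, embedding_dim, max_len, embedding_matrix=None):
    """Build a CNN + RNN binary classifier over padded token-id sequences."""
    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    l2 = regularizers.l2(1e-4)  # assumed regularization strength
    init = (tf.keras.initializers.Constant(embedding_matrix)
            if embedding_matrix is not None else "uniform")
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(max_len,)),
        # Embedding weights start from GloVe (if given) and remain trainable.
        layers.Embedding(vocab_size, embedding_dim,
                         embeddings_initializer=init, trainable=True),
        layers.Conv1D(64, 5, activation="relu", padding="same",
                      kernel_regularizer=l2),
        layers.MaxPooling1D(2),
        layers.Conv1D(64, 5, activation="relu", padding="same",
                      kernel_regularizer=l2),
        layers.MaxPooling1D(2),
        layers.Dropout(0.3),
        layers.GRU(64, kernel_regularizer=l2, recurrent_regularizer=l2),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),  # probability of one class
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training would then look something like `model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=30, callbacks=make_callbacks())`, where the array names are placeholders for the prepared data.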
4. Model Evaluation
The trained model is evaluated using the testing data. The results show that the model can classify Brexit-related tweets with an accuracy of 86.3%.
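For reference, accuracy here is the share of test tweets whose predicted class matches the true class. A minimal sketch, assuming the model outputs a probability per tweet, labels are encoded as 0 and 1, and a 0.5 decision threshold is used (the `accuracy` helper is hypothetical, and none of these details are confirmed by the notebook):

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the 0/1 label."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(int(p >= threshold) == int(t) for t, p in zip(y_true, y_prob))
    return correct / len(y_true)
```

With Keras, the same figure is normally obtained from `model.evaluate` on the testing data when the model is compiled with the accuracy metric.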
5. Conclusion
Based on the whole process, it can be concluded that a deep learning model can find patterns that differentiate pro- and anti-Brexit tweets. The model predicts with about 86 percent accuracy. Furthermore, combining convolutional and recurrent layers worked well for this type of data. Other architectures were also tried (e.g., a pure neural network, a pure recurrent neural network, a pure convolutional neural network, and neural networks with LSTM layers), but most did not match this architecture in model performance and training speed. The analysis also showed that pre-trained word embeddings can be used to train a deep learning model on natural language data.