import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rcParams
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')
sns.set()
Background
Brexit is the term that refers to the withdrawal of the United Kingdom (UK) from the European Union (EU) after more than 40 years of membership. The UK officially left on 31 January 2020, making it the first and only country ever to leave the EU. The term ‘Brexit’ is a blend of the words ‘Britain’ and ‘exit’. Because Brexit has significant implications for the people of the UK, diverging opinions, both positive and negative, arose around the event. Some argue the merits of Brexit, including more control over democracy, borders, and money, which they expect to improve areas such as healthcare, consumer rights, and the environment. Others oppose the idea, arguing that the decision negatively impacts trade, migration, and investment. This complexity and delicacy are present in social media discussions, such as those on Twitter.
This is the first part of the analysis of Brexit polarity tweets: the exploratory analysis. The project aims to build a neural network-based classifier to predict whether a tweet was created by a user who supports or opposes Brexit. The analysis leverages data from Kaggle: Brexit Polarity Tweets.
The project’s Github repository can be accessed here.
About the dataset
These datasets were collated as part of a dissertation project. They cover the January - March 2022 period and comprise tweets relating to Brexit or Europe from Twitter accounts with publicly stated Brexit positions in their bios. The data were collected using Boolean searches for both types of users.
The Boolean search for pro-Brexit tweets is:
(bio:“Brexit support” OR bio:“pro-brexit” OR bio:“pro brexit” OR bio:“Pro #Brexit” OR bio:brexiteer OR bio:probrexit) AND (EU OR Brexit OR CUSTOMS OR EUROPEAN OR EUROPE OR #Remain OR *Brexit OR #rejoinEU)
The Boolean search for anti-Brexit tweets is:
(bio:“anti brexit” OR bio:“anti-brexit” OR bio:“antibrexit” OR bio:“Pro remain” OR bio:“pro-remain” OR bio:remainer) AND (EU OR BREXIT OR CUSTOMS OR EUROPEAN OR EUROPE OR #Remain OR *Brexit)
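Since the class label is derived from these bio searches, it can be useful to sanity-check the collected data against them. The sketch below is illustrative only and not part of the original collection pipeline: the keyword lists are copied from the queries above and matched case-insensitively against the dataset's Twitter Bio column.

import pandas as pd

# keyword lists taken from the Boolean searches above (lower-cased for matching)
pro_terms = ["brexit support", "pro-brexit", "pro brexit", "pro #brexit", "brexiteer", "probrexit"]
anti_terms = ["anti brexit", "anti-brexit", "antibrexit", "pro remain", "pro-remain", "remainer"]

def bio_matches(bios, terms):
    # boolean Series: does each bio contain at least one of the given terms?
    bios = bios.fillna("").str.lower()
    return bios.apply(lambda bio: any(term in bio for term in terms))

# example usage once the pro/anti dataframes below are loaded:
# bio_matches(pro["Twitter Bio"], pro_terms).mean()  # share of pro tweets with a matching bio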
Environment Setup
PATH_ANTI = './data/raw/TweetDataset_AntiBrexit_Jan-Mar2022.csv'
PATH_PRO = './data/raw/TweetDataset_ProBrexit_Jan-Mar2022.csv'
1 Import Data
The first step is to import both datasets (pro and anti), then combine them into a single dataframe.
# import data from file
pro = pd.read_csv(PATH_PRO)
anti = pd.read_csv(PATH_ANTI)
# add column for types of users
"Status"] = "Anti"
anti["Status"] = "Pro" pro[
# ensure that datasets have identical column names & types
assert np.all(pro.dtypes == anti.dtypes)
assert np.all(pro.columns == anti.columns)
# combine data
df = pd.concat([anti, pro], ignore_index = True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358205 entries, 0 to 358204
Data columns (total 39 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 358205 non-null int64
1 Date 358205 non-null object
2 Headline 0 non-null float64
3 URL 358205 non-null object
4 Opening Text 0 non-null float64
5 Hit Sentence 358205 non-null object
6 Source 358205 non-null object
7 Influencer 358205 non-null object
8 Country 358205 non-null object
9 Subregion 0 non-null float64
10 Language 358205 non-null object
11 Reach 358205 non-null int64
12 Desktop Reach 358205 non-null int64
13 Mobile Reach 358205 non-null int64
14 Twitter Social Echo 0 non-null float64
15 Facebook Social Echo 0 non-null float64
16 Reddit Social Echo 0 non-null float64
17 National Viewership 358205 non-null int64
18 Engagement 25643 non-null float64
19 AVE 358205 non-null float64
20 Sentiment 358205 non-null object
21 Key Phrases 313001 non-null object
22 Input Name 358205 non-null object
23 Keywords 358205 non-null object
24 Twitter Authority 357967 non-null float64
25 Tweet Id 358205 non-null object
26 Twitter Id 358205 non-null int64
27 Twitter Client 358205 non-null object
28 Twitter Screen Name 358205 non-null object
29 User Profile Url 358205 non-null object
30 Twitter Bio 358205 non-null object
31 Twitter Followers 358044 non-null float64
32 Twitter Following 358036 non-null float64
33 Alternate Date Format 358205 non-null object
34 Time 358205 non-null object
35 State 199545 non-null object
36 City 140011 non-null object
37 Document Tags 0 non-null float64
38 Status 358205 non-null object
dtypes: float64(12), int64(6), object(21)
memory usage: 106.6+ MB
2 Data Cleaning
The second step is to clean some problematic data in the dataset. Some columns contain irrelevant information or no information at all; those columns are removed from the data. Duplicated tweets are removed based on the Tweet Id column. Another important process is to correctly represent the date and time information.
Removing Irrelevant and Empty Columns
# filter irrelevant columns and columns with no data
irrelevant_cols = ["Unnamed: 0", "Source", "Time", "Alternate Date Format"]
non_null_cols = df.apply(lambda col: np.all(col.isna()))
non_null_cols = np.invert(non_null_cols)

df = df.drop(irrelevant_cols, axis = 1)
df = df.loc[:, non_null_cols]
Removing Duplicated Tweets
# remove duplicated tweets based on ID
df = df.drop_duplicates(subset = "Tweet Id")
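As a quick, illustrative check (not part of the original notebook), one can confirm that no duplicated IDs remain after this step:

# should print 0 once duplicates have been dropped
print(df["Tweet Id"].duplicated().sum())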
Correcting Date and Time
"DateTime"] = pd.to_datetime(df["Date"])
df["Time"] = df["DateTime"].dt.time
df["Date"] = df["DateTime"].dt.date df[
3 Exploratory Analysis
Polarity
Tweets from the Anti category outnumber those from the Pro category by more than 50 thousand, as illustrated in the graph below.
df.groupby(['Status']).agg({'Tweet Id': ['count']}).plot(kind = 'barh', figsize = (5, 3), legend = None)

plt.title('Number of Anti and Pro Status on Twitter')
plt.show()
Geographical Location
Most of the tweets came from the UK. This is expected, as the people of the UK have the greatest interest in the topic. The UK sits at the top of the list of countries, with the USA in second place.
= df.groupby("Country").agg({"Tweet Id": "count"}) \
top_countries "Tweet Id", ascending = False).head(5).index.values
.sort_values(
= df.groupby(["Country", "Status"], as_index = False)
country = country.agg({"Tweet Id": "count"})
country = country.iloc[np.isin(country["Country"], top_countries), :]
country = country.sort_values("Tweet Id", ascending = False)
country
country
plt.figure(figsize = (15, 4))

for i, status in enumerate(["Pro", "Anti"]):
    plt.subplot(1, 2, i+1)
    sns.barplot(x = "Country", y = "Tweet Id", color = "#456456",
                data = country.loc[country["Status"] == status])
    plt.ylim(top = country["Tweet Id"].max() * 1.05)
    plt.title(f"{status} Tweets' Country of Origin")
    plt.ylabel(None)
    plt.xlabel(None)

plt.show()
As can be seen from the graph of the top 10 origin states of tweets below, people from the USA also have a great interest in the matter. Specifically, the number of tweets coming from England is around seven times higher than that of the second place, Scotland.
= df.groupby(["State"]) \
state "Tweet Id": "count"}) \
.agg({"Tweet Id") \
.sort_values(10)
.tail(
= "barh", figsize = (5, 3))
state.plot(kind = "lower right")
plt.legend(loc plt.show()
Users’ Devices
Both the Pro and Anti groups show a similar distribution of devices used. A slight difference can be found in the Twitter Web App, which is more common among Pro users, and the iPhone client, which is more common among Anti users.
= df.groupby("Twitter Client").agg({"Tweet Id": "count"}) \
top_devices "Tweet Id", ascending = False).head(5).index.values
.sort_values(
= df.groupby(["Twitter Client", "Status"], as_index = False)
devices = devices.agg({"Tweet Id": "count"})
devices = devices.iloc[np.isin(devices["Twitter Client"], top_devices), :]
devices = devices.sort_values("Tweet Id", ascending = False)
devices
devices
plt.figure(figsize = (15, 4))

for i, status in enumerate(["Pro", "Anti"]):
    plt.subplot(1, 2, i+1)
    sns.barplot(x = "Twitter Client", y = "Tweet Id", color = "#456456",
                data = devices.loc[devices["Status"] == status])
    plt.ylim(top = devices["Tweet Id"].max() * 1.05)
    plt.title(f"{status} Tweets' Client")
    plt.ylabel(None)
    plt.xlabel(None)

plt.show()
Timing of Tweets
The number of tweets fluctuated over time and reached its peak between the end of February and early March. We can see the distribution of the number of tweets in the two graphs below. The second graph differentiates the distributions of the Pro and Anti tweet categories.
plt.figure(figsize = (15, 4))
sns.kdeplot(x = "DateTime", data = df, fill = True, bw_adjust = 0.2)
plt.show()
plt.figure(figsize = (15, 4))
sns.kdeplot(x = "DateTime", hue = "Status", data = df, fill = True, bw_adjust = 0.2)
plt.show()
It can be seen from the graph below that tweets were mostly posted between 8 a.m. and 10 p.m., with the peak reached at around 9 o’clock in the morning. What might be interesting is that this morning increase is driven by Anti tweets, whose number decreased significantly after reaching the peak; the same pattern is not found in the Pro tweets.
from matplotlib import dates as mdates
"Time"] = "2000-01-01 " + pd.to_datetime(df["DateTime"]).dt.time.astype(str)
df["Time"] = pd.to_datetime(df["Time"])
df[
= (7, 4))
plt.figure(figsize
= sns.kdeplot(x = "Time", data = df, fill = "Time", bw_adjust=0.2)
ax
'2000-01-01 00:00:00'),
ax.set_xlim([pd.to_datetime('2000-01-01 23:59:59')])
pd.to_datetime('%H:%M'))
ax.xaxis.set_major_formatter(mdates.DateFormatter(
plt.show()
from matplotlib import dates as mdates
plt.figure(figsize = (7, 4))

ax = sns.kdeplot(x = "Time", hue = "Status", data = df, fill = True, bw_adjust = 0.2)

ax.set_xlim([pd.to_datetime('2000-01-01 00:00:00'),
             pd.to_datetime('2000-01-01 23:59:59')])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))

sns.move_legend(ax, "upper left")
plt.show()
Users’ Engagement
Engagement can be characterised by the authority score and the average engagement score. Twitter Authority measures the influence a user has on the platform based on metrics like retweet rate, activity, and follower and following counts. Average engagement refers to the number of interactions, which may include likes, replies, clicks, etc.
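Before turning to the boxplots, a quick, illustrative summary (not part of the original notebook) of these engagement-related columns by Status can be produced directly from the combined dataframe:

# per-status descriptive statistics of the engagement-related columns
df.groupby("Status")[["Twitter Authority", "Engagement", "AVE"]].describe()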
From the boxplots below, it can be inferred that the Twitter Authority score follows a roughly normal distribution with a median of 6. On the other hand, the distributions of the numbers of Twitter followers and following are both heavily skewed. This is to be expected, as some users have far more followers/following than the rest.
list_of_columns = ['Twitter Authority', 'Twitter Followers', 'Twitter Following']

plt.figure(figsize = (10, 3))

for i in range(len(list_of_columns)):
    plt.subplot(1, 3, i+1)
    sns.boxplot(x = list_of_columns[i], data = df, color = 'lightblue')
    plt.title('Boxplot of {}'.format(list_of_columns[i]))
plt.tight_layout()
plt.show()
We can also examine the distribution of users’ authority scores by polarity. Both groups have a similar bell-shaped distribution of authority.
= df.groupby(["Twitter Authority", "Status"], as_index = False)
authority = authority.agg({"Tweet Id": "count"})
authority = authority.sort_values("Tweet Id", ascending = False)
authority
authority
plt.figure(figsize = (15, 4))

for i, status in enumerate(["Pro", "Anti"]):
    plt.subplot(1, 2, i+1)
    sns.barplot(x = "Twitter Authority", y = "Tweet Id", color = "#456456",
                data = authority.loc[authority["Status"] == status])
    plt.ylim(top = authority["Tweet Id"].max() * 1.05)
    plt.title(f"{status} Tweets' Authority")
    plt.ylabel(None)
    plt.xlabel(None)

plt.show()
The relationship between Twitter authority and engagement is positive, meaning the effectiveness of a user’s tweets increases as their authority increases. This is illustrated in the graph below.
= df.groupby(["Twitter Authority"]) \
effectiveness "AVE": "mean",
.agg({"Twitter Following": "mean",
"Twitter Followers": "mean"}) \
"AVE": (lambda x: 100 * x / sum(x)),
.agg({"Twitter Following": (lambda x: 100 * x / sum(x)),
"Twitter Followers": (lambda x: 100 * x / sum(x))}) \
.reset_index()
set(rc={'figure.figsize':(7, 4)})
sns.= "Twitter Authority", y = "AVE", data = effectiveness, marker = "o")
sns.lineplot(x
'Relationship between effectiveness of engagement with Twitter Authority Score')
plt.title("Twitter Authority")
plt.xlabel("(%) AVE")
plt.ylabel( plt.show()
The graph below represents the relationship between Twitter authority and the numbers of following and followers. A similar positive relationship can be found in both. However, there is a significant decrease in the average number of accounts followed among users with an authority score of 9 to 10.
plt.figure(figsize = (7, 4))

plt.bar("Twitter Authority", "Twitter Following", data = effectiveness, label = "Following")
plt.plot("Twitter Authority", "Twitter Followers", data = effectiveness, c = "red", marker = "o",
         label = "Followers")

plt.title('Relationship between effectiveness of engagement with numbers of following & followers')
plt.xlabel('Twitter Authority')
plt.ylabel('(%) Number of Followers and Following')

plt.legend()
plt.show()
Conclusion
Based on the analysis above, some conclusions can be drawn. Tweets that oppose the idea of Brexit were more common during the period of data collection. This might indicate that people tend to disagree with the idea of the UK withdrawing its membership from the EU, although the finding is restricted to the period of observation.
Most users are from the UK, but some are from other countries, including the USA, Canada, and France. The discussion on Brexit, as represented by the number of tweets within a given period, intensified between the end of February and early March 2022. On a daily basis, most tweets are typically written between 8:15 and 10 in the morning, with users generally tweeting from Android, iPhone, the Web App, and iPad.
As one might expect, most Anti and Pro tweets come from users with a Twitter Authority score of 6, and the distribution of Twitter Authority for both statuses is approximately normal, which shows that values near the mean occur more frequently than values far from the mean.