# load libraries
library(dplyr)
library(ggplot2)
library(stringr)
# set URL for downloading data
.dataURL <- paste('https://raw.githubusercontent.com',
                  'rfordatascience',
                  'tidytuesday',
                  'master',
                  'data',
                  '2023',
                  '2023-01-17',
                  'artists.csv',
                  sep = "/")
# download data
artists <- readr::read_csv(.dataURL)
Tidy Tuesday is a weekly data project run by the R4DS online community (Mock 2022). It is intended as a platform for R users to improve their skills with the tidyverse ecosystem on data-related tasks such as data manipulation and visualisation. I have known about the community for months and have always wanted to participate. This is my first time actually working with the data provided by the organisers, so I will stick to simple techniques. The main goal here is to get accustomed to the environment and to practise writing and coding quickly.
The topic for this week is art history data from the arthistory data package by Lemus and Stam (2022), which contains the data used for a thesis titled Quantifying Art Historical Narratives by Stam (2022). The package was intended to survey the demographic trends among artists in two of the most popular art history textbooks in America, Janson’s History of Art and Gardner’s Art Through the Ages.
Preparation
The first thing to do is to load the relevant libraries. Here, I used several packages from tidyverse, specifically dplyr for data manipulation, ggplot2 for data visualisation, and stringr for string manipulation. I also used a function from the readr package, read_csv, to download the data from the TidyTuesday GitHub repository.
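As a small offline illustration of the two calls used here, the sketch below assembles the same URL with paste and then parses a tiny inline CSV with readr::read_csv. The two inline rows and the names data_url, toy_csv and toy are invented for illustration and are not taken from the dataset. Passing literal text through I() assumes readr 2.0 or later.

```r
library(readr)

# paste() joins its arguments using the separator given in `sep`
data_url <- paste("https://raw.githubusercontent.com",
                  "rfordatascience", "tidytuesday", "master",
                  "data", "2023", "2023-01-17", "artists.csv",
                  sep = "/")
print(data_url)

# read_csv() accepts a file path, a URL, or literal text wrapped in I().
# These rows are hypothetical stand-ins for the real file.
toy_csv <- I("artist_name,year,book\nArtist A,1926,Gardner\nArtist B,1963,Janson\n")
toy <- read_csv(toy_csv, show_col_types = FALSE)
print(toy)
```

Reading from a URL works the same way as reading from the literal text here, so the parsing step can be tried without a network connection.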
Exploration and Cleaning
The next step is a simple exploration of the data. The most basic function for this is summary from base R.
summary(artists)
 artist_name        edition_number        year      artist_nationality
 Length:3162        Min.   : 1.000   Min.   :1926   Length:3162
 Class :character   1st Qu.: 5.000   1st Qu.:1986   Class :character
 Mode  :character   Median : 8.000   Median :1996   Mode  :character
                    Mean   : 8.223   Mean   :1994
                    3rd Qu.:12.000   3rd Qu.:2009
                    Max.   :16.000   Max.   :2020
 artist_nationality_other artist_gender      artist_race
 Length:3162              Length:3162        Length:3162
 Class :character         Class :character   Class :character
 Mode  :character         Mode  :character   Mode  :character
 artist_ethnicity       book           space_ratio_per_page_total
 Length:3162        Length:3162        Min.   :0.0946
 Class :character   Class :character   1st Qu.:0.3082
 Mode  :character   Mode  :character   Median :0.4093
                                       Mean   :0.5301
                                       3rd Qu.:0.5941
                                       Max.   :3.7967
 artist_unique_id moma_count_to_year whitney_count_to_year artist_race_nwi
 Min.   :  1.0    Min.   : 0.000     Min.   : 0.000        Length:3162
 1st Qu.:108.0    1st Qu.: 0.000     1st Qu.: 0.000        Class :character
 Median :189.0    Median : 1.000     Median : 0.000        Mode  :character
 Mean   :201.8    Mean   : 4.306     Mean   : 1.957
 3rd Qu.:305.8    3rd Qu.: 5.000     3rd Qu.: 0.000
 Max.   :413.0    Max.   :64.000     Max.   :40.000
Furthermore, the function purrr::map is helpful and versatile for exploration. Specifically, I wanted to see the unique values of each variable, which led me to find some missing data represented as character values, e.g. "N/A", "N/A1", "N/A2", etc. Since the output was long, I limited the printout to the artist_race column.
# find unique values for each column
purrr::map(artists, unique)$artist_race
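To show what this kind of inspection returns without downloading anything, here is a minimal sketch on a made-up tibble. The data, the name toy and the extra map_lgl check are my own illustration, not part of the original analysis.

```r
library(tibble)
library(purrr)

# Hypothetical toy data mimicking the "N/A"-style codes described above
toy <- tibble(
  artist_name = c("Artist A", "Artist B", "Artist C", "Artist D"),
  artist_race = c("White", "N/A", "N/A1", "White"),
  year        = c(1926, 1926, 1963, 1963)
)

# map() applies unique() to every column and returns a named list
unique_values <- map(toy, unique)
print(unique_values$artist_race)

# Flag the character columns that contain at least one "N/A"-like code
has_na_code <- map_lgl(toy, function(x) is.character(x) && any(grepl("^N/A", x)))
print(has_na_code)
```

The map_lgl variant returns one TRUE or FALSE per column, which is easier to scan than the full list of unique values when a data set has many columns.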
Another function from the purrr::map family is purrr::map_if, which applies a function only to the columns that satisfy a condition. Here, all missing values represented as character strings were replaced with NA, and the rows containing NA were then dropped.
# replace missing values with `NA`
artists <- artists |>
  purrr::map_if(is.character,
                function(x) {
                  str_replace_all(x, "^N/A.*", "") |>
                    na_if("")
                }) |>
  as_tibble() |>
  na.omit()
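The effect of this cleaning step is easier to see on a few rows. The sketch below runs the same pattern on a made-up tibble (toy is hypothetical) and also shows an alternative using dplyr's mutate with across, which I did not use above but which keeps the result a tibble without the as_tibble call.

```r
library(tibble)
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical toy data with "N/A"-style codes
toy <- tibble(
  artist_name = c("Artist A", "Artist B", "Artist C"),
  artist_race = c("White", "N/A", "N/A2"),
  year        = c(1926, 1963, 1980)
)

# Same approach as above: blank out anything starting with "N/A",
# then turn the empty strings into NA
cleaned <- toy |>
  map_if(is.character, function(x) {
    str_replace_all(x, "^N/A.*", "") |> na_if("")
  }) |>
  as_tibble()
print(cleaned)

# Alternative sketch with across(), which returns a tibble directly
cleaned_across <- toy |>
  mutate(across(where(is.character),
                function(x) na_if(str_replace_all(x, "^N/A.*", ""), "")))

# na.omit() drops every row that has an NA in any column
complete_rows <- na.omit(cleaned)
print(nrow(complete_rows))
```

Note that na.omit removes a row if any of its columns is missing, so on the real data it can discard more rows than expected. Checking the row count before and after is a cheap safeguard.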
Data Visualisation
To summarise the data visually, I tried to replicate some of the visualisations by Stam (2022). The first one gives insight into the number of artists in Gardner’s Art Through the Ages. The steps for creating the graph are:
- filter the work of Gardner’s Art Through the Ages.
- summarise the number of artists in the book, grouped by year, and store it in the count variable.
- create a plot using the ggplot function by assigning year as the x-axis and count as the y-axis.
- use geom_col to add the graphical element of the bar chart.
- use geom_text to add labels for each count.
- for more customization, theme_minimal is used.
- labels are added using the labs function.
# Visualising artist count of Gardner's Art Through the Ages
artists |>
  filter(book == "Gardner") |>
  group_by(year) |>
  summarise(count = n()) |>
  ggplot(aes(x = year, y = count)) +
  geom_col(width = 2, fill = "#43ac65") +
  geom_text(aes(label = count, y = count + 10), size = 2.5) +
  theme_minimal() +
  labs(title = "Overall Count of Artists in Gardner's Art Through the Ages",
       x = "Year of Publication",
       y = "Count")
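The counting step can be checked separately from the plotting step. Below is a minimal sketch on invented data (toy is hypothetical) that builds the same summary table, shows dplyr's count as a shorthand for group_by plus summarise, and constructs the plot object without drawing it.

```r
library(tibble)
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per artist appearance in an edition
toy <- tibble(
  book = c("Gardner", "Gardner", "Gardner", "Gardner", "Gardner", "Janson"),
  year = c(1926, 1926, 1926, 1936, 1936, 1962)
)

# Same steps as above: filter, group by year, count rows per group
counts <- toy |>
  filter(book == "Gardner") |>
  group_by(year) |>
  summarise(count = n())
print(counts)

# count() is a shorthand for group_by() + summarise(n())
counts_short <- toy |>
  filter(book == "Gardner") |>
  count(year, name = "count")

# Build the plot; vjust nudges the labels above the bars
# instead of adding a fixed offset to y
p <- ggplot(counts, aes(x = year, y = count)) +
  geom_col(width = 2, fill = "#43ac65") +
  geom_text(aes(label = count), vjust = -0.5, size = 2.5) +
  theme_minimal()
# print(p) would render the chart
```

Using vjust rather than y = count + 10 is only a design alternative: the fixed offset is in data units and works for the real counts, while vjust stays proportional to the text size whatever the scale of the data.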
The next graph describes the distribution of genders of the artists. This can be achieved through the following steps.
- filter the work of Gardner’s Art Through the Ages.
- create a plot using the ggplot function by assigning year as the x-axis and artist_gender as the fill component of the graph.
- use geom_bar to add the graphical element of the bar chart.
- use geom_hline to add a horizontal line with the male proportion as the y value.
- for more customization, theme_minimal is used.
- labels are added using the labs function.
# calculate the average proportion of male artist
gardner_male_avg <- mean(filter(artists, book == "Gardner")$artist_gender == "Male")
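The average above relies on a base R idiom: comparing a vector to a value gives TRUE/FALSE, and the mean of a logical vector is the share of TRUE values. A minimal sketch with made-up values (all names here are illustrative):

```r
# Hypothetical gender labels
gender <- c("Male", "Male", "Female", "Male")

# The comparison yields a logical vector; TRUE counts as 1, FALSE as 0
is_male <- gender == "Male"
male_share <- mean(is_male)
print(male_share)

# Equivalent explicit calculation
male_share_explicit <- sum(is_male) / length(is_male)

# A missing value makes the mean NA unless it is removed first
share_with_na <- mean(c("Male", NA) == "Male")
share_na_rm   <- mean(c("Male", NA) == "Male", na.rm = TRUE)
```

In this post the rows with missing values were already dropped with na.omit, so the plain mean is sufficient. On uncleaned data, na.rm = TRUE would be needed to get a number back.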
# Visualising gender distribution of Gardner's Art Through the Ages
artists |>
  filter(book == "Gardner") |>
  ggplot(aes(x = year, fill = artist_gender)) +
  geom_bar(position = "fill", width = 2) +
  # linewidth replaces the `size` aesthetic for lines, deprecated in ggplot2 3.4.0
  geom_hline(yintercept = gardner_male_avg, linewidth = 1) +
  scale_fill_manual(values = c("#aa4365", "#4365aa")) +
  theme_minimal() +
  labs(title = "Gender of Artists in Gardner's Art Through the Ages",
       x = "Year of Publication",
       y = "Proportion",
       fill = "Artist Gender")
The same approach can also be used to visualise the Janson’s History of Art data.
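One way to reuse the code for the other book is to wrap it in a function that takes the book name as an argument. The sketch below is only an illustration: book_male_share and plot_gender_share are hypothetical helpers, the toy rows are invented, and I am assuming the second textbook is labelled "Janson" in the book column and that ggplot2 is at version 3.4.0 or later (for linewidth).

```r
library(tibble)
library(dplyr)
library(ggplot2)

# Hypothetical helper: share of male artists for one book
book_male_share <- function(data, book_name) {
  book_data <- filter(data, book == book_name)
  mean(book_data$artist_gender == "Male")
}

# Hypothetical helper: gender distribution plot for one book
plot_gender_share <- function(data, book_name) {
  book_data <- filter(data, book == book_name)
  ggplot(book_data, aes(x = year, fill = artist_gender)) +
    geom_bar(position = "fill", width = 2) +
    geom_hline(yintercept = book_male_share(data, book_name), linewidth = 1) +
    scale_fill_manual(values = c("#aa4365", "#4365aa")) +
    theme_minimal() +
    labs(title = paste("Gender of Artists in", book_name),
         x = "Year of Publication",
         y = "Proportion",
         fill = "Artist Gender")
}

# Invented rows for illustration only
toy <- tibble(
  book          = c("Janson", "Janson", "Janson", "Janson", "Gardner"),
  year          = c(1962, 1962, 1977, 1977, 1926),
  artist_gender = c("Male", "Female", "Male", "Male", "Male")
)

janson_share <- book_male_share(toy, "Janson")
janson_plot  <- plot_gender_share(toy, "Janson")
# print(janson_plot) would render the chart
```

With the cleaned artists data in place of toy, calling plot_gender_share(artists, "Gardner") should reproduce the chart above, apart from the shortened title.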
Wrap up
Using the code above, I could recreate the visualisations by Stam (2022). I mostly used functions from the ggplot2 and dplyr packages for data visualisation and manipulation. The tidyverse provides its users with many functions that can tackle most data science and analysis jobs. Finally, Tidy Tuesday creates a safe and supportive environment for R users to learn and practise using its functionality.