Tidy Tuesday is a weekly data project run by the R4DS online community (Mock 2022). It is intended as a platform for R users to improve their skills with the tidyverse ecosystem on tasks such as data manipulation and visualisation. I have known about the community for months and have wanted to participate ever since. This is my first time actually working with the data provided by the organizers, so I will stick to simple techniques. The main goal here is to get accustomed to the environment and to practise writing and coding quickly.

The topic for this week is art history data from the arthistory data package by Lemus and Stam (2022), which contains the data used for Stam's (2022) thesis, Quantifying Art Historical Narratives. The package was intended to survey demographic trends among artists in two of the most popular art history textbooks in America, Janson’s History of Art and Gardner’s Art Through the Ages.

Preparation

The first thing to do is to load the relevant libraries. Here, I used several packages from the tidyverse: dplyr for data manipulation, ggplot2 for data visualisation, and stringr for string manipulation. I also used a function from the readr package, read_csv, to download the data from the TidyTuesday GitHub repository.

# load libraries
library(dplyr)
library(ggplot2)
library(stringr)

# set URL for downloading data
.dataURL <- paste('https://raw.githubusercontent.com',
                  'rfordatascience',
                  'tidytuesday',
                  'master',
                  'data',
                  '2023',
                  '2023-01-17',
                  'artists.csv',
                  sep = "/")

# download data
artists <- readr::read_csv(.dataURL)

Exploration and Cleaning

The next step is a simple exploration of the data. The most basic function for this is summary from base R.

summary(artists)

 artist_name        edition_number        year      artist_nationality
 Length:3162        Min.   : 1.000   Min.   :1926   Length:3162       
 Class :character   1st Qu.: 5.000   1st Qu.:1986   Class :character  
 Mode  :character   Median : 8.000   Median :1996   Mode  :character  
                    Mean   : 8.223   Mean   :1994                     
                    3rd Qu.:12.000   3rd Qu.:2009                     
                    Max.   :16.000   Max.   :2020                     
 artist_nationality_other artist_gender      artist_race       
 Length:3162              Length:3162        Length:3162       
 Class :character         Class :character   Class :character  
 Mode  :character         Mode  :character   Mode  :character  
                                                               
                                                               
                                                               
 artist_ethnicity       book           space_ratio_per_page_total
 Length:3162        Length:3162        Min.   :0.0946            
 Class :character   Class :character   1st Qu.:0.3082            
 Mode  :character   Mode  :character   Median :0.4093            
                                       Mean   :0.5301            
                                       3rd Qu.:0.5941            
                                       Max.   :3.7967            
 artist_unique_id moma_count_to_year whitney_count_to_year artist_race_nwi   
 Min.   :  1.0    Min.   : 0.000     Min.   : 0.000        Length:3162       
 1st Qu.:108.0    1st Qu.: 0.000     1st Qu.: 0.000        Class :character  
 Median :189.0    Median : 1.000     Median : 0.000        Mode  :character  
 Mean   :201.8    Mean   : 4.306     Mean   : 1.957                          
 3rd Qu.:305.8    3rd Qu.: 5.000     3rd Qu.: 0.000                          
 Max.   :413.0    Max.   :64.000     Max.   :40.000

Furthermore, the function purrr::map is a versatile tool for exploration. Specifically, I wanted to see the unique values of each variable, which led me to find missing data encoded as character values, e.g. "N/A", "N/A1", "N/A2", etc. Since the output was long, I limited the printed result to the artist_race column.

# find unique values for each column
purrr::map(artists, unique)$artist_race
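
As a minimal sketch of this exploration step, the toy tibble below (made-up values, not the real data) shows how purrr::map returns one vector of unique values per column, which makes placeholder strings such as "N/A" easy to spot. The names toy, uniques, and na_like are only illustrative.

```r
library(purrr)
library(tibble)

# toy data standing in for `artists`; the values are hypothetical
toy <- tibble(
  artist_race = c("White", "N/A", "White", "Asian"),
  year        = c(1991, 1991, 2005, 2005)
)

# one vector of unique values per column, returned as a named list
uniques <- map(toy, unique)
uniques$artist_race

# count how many values in each character column start with "N/A"
na_like <- map_int(keep(toy, is.character), \(x) sum(startsWith(x, "N/A")))
na_like
```

Counting the placeholder strings per column, as in the last step, is one way to avoid printing the full list of unique values.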

Another function from the purrr::map family is purrr::map_if, which applies a function only to the columns that satisfy a condition. Here, every missing value encoded as a character string was replaced with NA, and rows that still contained an NA were then dropped with na.omit.

# replace "N/A"-style strings with `NA`, then drop rows with any missing value
artists <- artists |> 
  purrr::map_if(is.character, 
                function(x) {
                  str_replace_all(x, "^N/A.*", "") |> 
                    na_if("")
                }) |> 
  as_tibble() |> 
  na.omit()
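
The same cleaning idea could also be sketched with dplyr::across instead of purrr::map_if. This is an alternative under stated assumptions, not the code used above: the toy data is hypothetical, and unlike the original it leaves empty strings untouched.

```r
library(dplyr)
library(stringr)
library(tibble)

# toy data; the values are hypothetical
toy <- tibble(
  artist_name = c("A", "B", "C"),
  artist_race = c("White", "N/A", "N/A2"),
  year        = c(1991, 2005, 2020)
)

cleaned <- toy |>
  # turn any string starting with "N/A" into a real missing value
  mutate(across(where(is.character),
                \(x) if_else(str_detect(x, "^N/A"), NA_character_, x))) |>
  # drop rows that contain a missing value in any column
  na.omit()

# number of rows removed by the cleaning step
nrow(toy) - nrow(cleaned)
```

Checking how many rows na.omit removes is worthwhile, because it drops a row as soon as any single column is missing.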

Data Visualisation

To summarise the data visually, I tried to replicate some of the visualisations in Stam (2022). The first one gives insight into the number of artists in Gardner’s Art Through the Ages. The steps for creating the graph are:

1. Filter the rows for Gardner’s Art Through the Ages.
2. Summarise the number of artists in the book, grouped by year, and store it in the count variable.
3. Create a plot with ggplot, assigning year to the x-axis and count to the y-axis.
4. Use geom_col to draw the bars.
5. Use geom_text to label each bar with its count, placed slightly above the bar.
6. Apply theme_minimal for a cleaner look.
7. Add the title and axis labels with labs.

# Visualising artist count of Gardner's Art Through the Ages
artists |> 
  filter(book == "Gardner") |> 
  group_by(year) |> 
  summarise(count = n()) |> 
  ggplot(aes(x = year, y = count)) +
  geom_col(width = 2, fill = "#43ac65") +
  geom_text(aes(label = count, y = count + 10),  size = 2.5) +
  theme_minimal() +
  labs(title = "Overall Count of Artists in Gardner's Art Through the Ages",
       x = "Year of Publication",
       y = "Count")
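
To make the summarising step concrete, here is a small sketch on hypothetical toy data. It also shows dplyr::count, which should give the same table as the group_by and summarise pair. The names toy, counts, and counts_short are illustrative only.

```r
library(dplyr)
library(tibble)

# toy data; the values are hypothetical
toy <- tibble(
  book = c("Gardner", "Gardner", "Gardner", "Janson", "Gardner"),
  year = c(1926, 1926, 1936, 1963, 1936)
)

# same pipeline as above: one row per year with the number of entries
counts <- toy |>
  filter(book == "Gardner") |>
  group_by(year) |>
  summarise(count = n())

# count() is a shorthand for the group_by() + summarise(n()) pair
counts_short <- toy |>
  filter(book == "Gardner") |>
  count(year, name = "count")

counts
```

Either form can be piped straight into ggplot as in the code above.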

The next graph describes the gender distribution of the artists. It can be created with the following steps:

1. Filter the rows for Gardner’s Art Through the Ages.
2. Create a plot with ggplot, assigning year to the x-axis and artist_gender to the fill.
3. Use geom_bar with position = "fill" to draw stacked bars scaled to proportions.
4. Use geom_hline to add a horizontal line at the overall proportion of male artists.
5. Use scale_fill_manual to set the fill colours.
6. Apply theme_minimal for a cleaner look.
7. Add the title, axis labels, and legend title with labs.

# calculate the overall proportion of male artists
gardner_male_avg <- mean(filter(artists, book == "Gardner")$artist_gender == "Male")

# Visualising gender distribution of Gardner's Art Through the Ages
artists |> 
  filter(book == "Gardner") |> 
  ggplot(aes(x = year, fill = artist_gender)) +
  geom_bar(position = "fill", width = 2) +
  geom_hline(yintercept = gardner_male_avg, linewidth = 1) +
  scale_fill_manual(values = c("#aa4365", "#4365aa")) +
  theme_minimal() +
  labs(title = "Gender of Artists in Gardner's Art Through the Ages",
       x = "Year of Publication",
       y = "Proportion",
       fill = "Artist Gender")

Using the same methods, the code could also be used to create visualisation for the Janson’s History of Art data.
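
One way to reuse the code for both books is to wrap it in a function. The sketch below is a hypothetical helper, not part of the original analysis. It assumes the book column labels the other textbook "Janson", that the gender column holds only two values after cleaning, and that ggplot2 3.4.0 or later is installed (for the linewidth argument). The toy data is made up.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# hypothetical helper: gender distribution plot for one book
plot_gender_by_book <- function(data, book_name) {
  book_data <- filter(data, book == book_name)
  # the mean of a logical vector is the share of TRUE values
  male_share <- mean(book_data$artist_gender == "Male")

  ggplot(book_data, aes(x = year, fill = artist_gender)) +
    geom_bar(position = "fill", width = 2) +
    geom_hline(yintercept = male_share, linewidth = 1) +
    scale_fill_manual(values = c("#aa4365", "#4365aa")) +
    theme_minimal() +
    labs(title = paste("Gender of Artists in", book_name),
         x = "Year of Publication",
         y = "Proportion",
         fill = "Artist Gender")
}

# toy data; the values are hypothetical
toy <- tibble(
  book          = c("Janson", "Janson", "Janson", "Janson", "Gardner"),
  year          = c(1963, 1963, 1977, 1977, 1926),
  artist_gender = c("Male", "Male", "Male", "Female", "Male")
)

p <- plot_gender_by_book(toy, "Janson")
```

With the real data, the call would presumably be plot_gender_by_book(artists, "Janson"); printing p draws the plot.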

Wrap up

Using the code above, I was able to recreate visualisations from Stam (2022). I mostly used functions from the ggplot2 and dplyr packages for data visualisation and manipulation. The tidyverse provides many functions that can handle most data science and analysis tasks. Finally, Tidy Tuesday creates a safe and supportive environment for R users to learn and practise.

References

Lemus, Sara, and Holland Stam. 2022. Arthistory: Art History Textbook Data.

Mock, Thomas. 2022. “Tidy Tuesday: A Weekly Data Project Aimed at the R Ecosystem.” https://github.com/rfordatascience/tidytuesday.

Stam, Holland. 2022. “Quantifying Art Historical Narratives.”