Setup and Introduction
# for dataframe wrangling
import pandas as pd
import numpy as np
mta_art = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-22/mta_art.csv')
station_lines = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-22/station_lines.csv')
mta_art.head()
agency | station_name | line | artist | art_title | art_date | art_material | art_description | art_image_link | |
---|---|---|---|---|---|---|---|---|---|
0 | NYCT | Clark St | 2,3 | Ray Ring | Clark Street Passage | 1987 | Terrazzo floor tile | The first model that Brooklyn-born artist Ray ... | https://new.mta.info/agency/arts-design/collec... |
1 | NYCT | 125 St | 4,5,6 | Houston Conwill | The Open Secret | 1986 | Bronze - polychromed | The Open Secret, in the 125th Street and Lexin... | https://new.mta.info/agency/arts-design/collec... |
2 | NYCT | Astor Pl | 6 | Milton Glaser | Untitled | 1986 | Porcelain enamel murals | Milton Glaser, best known for his work in grap... | https://new.mta.info/agency/arts-design/collec... |
3 | NYCT | Kings Hwy | B,Q | Rhoda Andors | Kings Highway Hieroglyphs | 1987 | Porcelain Enamel Murals on Steel | The artist discusses her work: “If public art... | https://new.mta.info/agency/arts-design/collec... |
4 | NYCT | Newkirk Av | B,Q | David Wilson | Transit Skylight | 1988 | Zinc-glazed Apolycarbonate skylight | The artist recalls, “About the same time that ... | https://new.mta.info/agency/arts-design/collec... |
station_lines.head()
agency | station_name | line | |
---|---|---|---|
0 | NYCT | Clark St | 2 |
1 | NYCT | Clark St | 3 |
2 | NYCT | 125 St | 4 |
3 | NYCT | 125 St | 5 |
4 | NYCT | 125 St | 6 |
This week's Tidy Tuesday dataset documents the art displayed within the MTA subway system in New York City. The primary dataset contains each art piece, the artist who completed it, the year it was completed, the material/medium, a description of the art, and a URL to an image of the art, as well as details about where within the subway system it is displayed. The second dataset spells out additional details about the MTA system in a tidier fashion. (For my purposes today, I will not utilize the second dataset.)
Looking through this very interesting data, I'm intrigued by the idea of gender representation within the art displayed by the MTA. The arts were historically dominated by men, with a more recent shift towards female and gender-diverse artists. I want to investigate whether the art displayed by the MTA tends to be dominated by male artists or is more or less evenly split between the genders.
Step 1: Guess Gender from Artists' Names
#!pip install gender-guesser
import gender_guesser.detector as gg
To investigate the artists' genders, I first need to determine (or guess!) what their genders actually are. The dataset contains only their names, but these can provide a reasonable guess as to what their genders are. To accomplish this, I have installed the gender-guesser package.
d = gg.Detector()
# iterate through df, split artist name and extract first name only, then guess gender based on first name
mta_art['gender'] = [d.get_gender(name.split()[0]) for name in mta_art['artist']]
mta_art[['artist','gender']].head(20)
artist | gender | |
---|---|---|
0 | Ray Ring | mostly_male |
1 | Houston Conwill | male |
2 | Milton Glaser | male |
3 | Rhoda Andors | female |
4 | David Wilson | male |
5 | Steve Wood | male |
6 | Valerie Jaudon | female |
7 | Matt Mullican | male |
8 | Nitza Tufiño (in collaboration with Grosvenor ... | unknown |
9 | Arthur Gonzalez | male |
10 | Arthur Gonzalez | male |
11 | Dan Sinclair | male |
12 | Harry Roseman | male |
13 | Kathleen McCarthy | female |
14 | Kathleen McCarthy | female |
15 | Kathleen McCarthy | female |
16 | Nitza Tufiño | unknown |
17 | Alison Saar | female |
18 | Martha Jackson-Jarvis | female |
19 | Michele Oka Doner | female |
It looks like the gender-guesser package has done a reasonably good job guessing the artists' genders, though it is unsure about certain less common names, such as "Nitza." For simplicity's sake, I'd like to replace "mostly_female" and "mostly_male" entries with just "female" and "male." Additionally, the gender-guesser package uses the label "andy" for androgynous names, and I'd like to replace that with "unknown."
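As a small sketch of this relabeling, the three substitutions can be expressed as a single mapping passed to `Series.replace`. The `raw` Series and the `collapse` dictionary below are toy names made up for illustration; they are not part of the notebook's data.

```python
import pandas as pd

# Toy Series containing the six labels mentioned in this walkthrough.
raw = pd.Series(["male", "mostly_male", "female", "mostly_female", "andy", "unknown"])

# Map the finer-grained labels onto three; labels missing from the dict pass through unchanged.
collapse = {"mostly_male": "male", "mostly_female": "female", "andy": "unknown"}
simplified = raw.replace(collapse)

print(simplified.value_counts())
```

Keeping the mapping in one dictionary makes it easy to see, in one place, which guesses are being treated as equivalent.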
mta_art['gender'] = mta_art['gender'].replace('mostly_female', 'female')
mta_art['gender'] = mta_art['gender'].replace('mostly_male', 'male')
mta_art['gender'] = mta_art['gender'].replace('andy', 'unknown')
mta_art['gender'].value_counts()
gender | count |
---|---|
male | 178 |
female | 153 |
unknown | 50 |
So 178 rows in the dataset are attributed to artists guessed to be male and 153 to artists guessed to be female. (These are counts of art pieces, so an artist with several pieces is counted more than once.) Given that another 50 rows have an unknown gender, it's hard to say anything definitive about gender representation based on the analysis so far.
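To make the uncertainty concrete, here is a rough back-of-the-envelope sketch using only the three counts shown above. It computes the female share among rows with a guessed gender, plus the lowest and highest overall female share that would result if every unknown row turned out to be male or female, respectively. The variable names are my own.

```python
# Counts taken from the value_counts() output above (rows are art pieces).
male, female, unknown = 178, 153, 50
total = male + female + unknown

# Female share among rows where a gender was guessed.
female_share_known = female / (male + female)

# Bounds on the overall female share, depending on how the unknown rows would resolve.
female_share_low = female / total               # every unknown row is male
female_share_high = (female + unknown) / total  # every unknown row is female

print(f"female share among known rows: {female_share_known:.1%}")
print(f"possible overall range: {female_share_low:.1%} to {female_share_high:.1%}")
```

Because the possible range straddles 50%, the unknown rows alone could tip the overall balance in either direction.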
Step 2: Visualize Gender Distribution Over Time
import matplotlib.pyplot as plt
import matplotlib as mpl
from cycler import cycler
# convert gender to category type variable
mta_art['gender'] = mta_art['gender'].astype("category")
gender_by_year = mta_art.groupby(mta_art['art_date'])['gender'].value_counts()
gender_by_year = gender_by_year.reset_index()
gender_by_year.head(9)
art_date | gender | count | |
---|---|---|---|
0 | 1980 | male | 1 |
1 | 1980 | female | 0 |
2 | 1980 | unknown | 0 |
3 | 1986 | male | 3 |
4 | 1986 | female | 0 |
5 | 1986 | unknown | 0 |
6 | 1987 | female | 2 |
7 | 1987 | male | 1 |
8 | 1987 | unknown | 0 |
I would now like to visualize the artists' gender distribution by year. I have a hunch that male artists will be overrepresented in the earlier years of the MTA program (e.g. the 1980s) but this may shift in favor of female artists over time. To create this visualization, I have first aggregated the gender counts by year.
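For comparison, the same kind of year-by-gender aggregation can be sketched with `pd.crosstab`, which returns a wide table with one column per gender and zeros for combinations that never occur. The `toy` frame below is invented for illustration and only borrows the notebook's column names.

```python
import pandas as pd

# Invented rows that mimic the two columns used in the aggregation.
toy = pd.DataFrame({
    "art_date": [1986, 1986, 1987, 1987, 1987, 1990],
    "gender": ["male", "male", "female", "female", "male", "unknown"],
})

# One row per year, one column per gender, cell values are row counts.
counts = pd.crosstab(toy["art_date"], toy["gender"])
print(counts)
```

A wide table like this can be handed to `counts.plot()` to draw one line per gender, which is an alternative to filtering the long table once per gender.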
# adjust colors of plots
mpl.rcParams['axes.prop_cycle'] = cycler(color=['pink', 'lightblue', 'lightgray'])
plt.plot(gender_by_year.loc[gender_by_year['gender']=='female', 'art_date'],
         gender_by_year.loc[gender_by_year['gender']=='female', 'count'],
         label='Female Artists')
plt.plot(gender_by_year.loc[gender_by_year['gender']=='male', 'art_date'],
         gender_by_year.loc[gender_by_year['gender']=='male', 'count'],
         label='Male Artists')
plt.plot(gender_by_year.loc[gender_by_year['gender']=='unknown', 'art_date'],
         gender_by_year.loc[gender_by_year['gender']=='unknown', 'count'],
         label='Gender Unknown')
plt.legend()
plt.title("MTA Artists' Gender Distribution Over Time")
The over-time plot is noisy but informative. It is true that in the very early years of the MTA program, male artists dominate. However, as early as 1990 we can see a much more equitable distribution of both male and female artists, including a few years where there are actually more female artists put on display. There also appears to be a short period in the early 2000s when male artists again pretty noticeably dominate the MTA displays.
# start with an object-dtype column so the string labels below can be assigned without a dtype warning
mta_art['art_decade'] = None
mta_art.loc[(mta_art['art_date']>=1980) & (mta_art['art_date']<1990), 'art_decade'] = "1980s"
mta_art.loc[(mta_art['art_date']>=1990) & (mta_art['art_date']<2000), 'art_decade'] = "1990s"
mta_art.loc[(mta_art['art_date']>=2000) & (mta_art['art_date']<2010), 'art_decade'] = "2000s"
mta_art.loc[(mta_art['art_date']>=2010) & (mta_art['art_date']<2020), 'art_decade'] = "2010s"
mta_art.loc[(mta_art['art_date']>=2020) & (mta_art['art_date']<2030), 'art_decade'] = "2020s"
gender_by_decade = mta_art.groupby(mta_art['art_decade'])['gender'].value_counts()
gender_by_decade = gender_by_decade.reset_index()
I'm curious to see if my conclusions change if I visualize this data slightly differently. Instead of by year, I have now aggregated the data by decade, and I will utilize a bar graph instead of a line graph.
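As an aside, decade labels of this kind can also be derived with integer arithmetic instead of one condition per decade. The sketch below assumes the years are stored as whole integers with no missing values; the `years` Series is a toy example, not the notebook's column.

```python
import pandas as pd

# Toy years standing in for the art_date column (assumed to be integers).
years = pd.Series([1980, 1986, 1999, 2000, 2015, 2023], name="art_date")

# Floor each year to the start of its decade, then format it as a label like "1990s".
decade_labels = (years // 10 * 10).astype(str) + "s"
print(decade_labels.tolist())
```

The arithmetic version needs no changes when a new decade appears in the data, though it would need extra handling if any year were missing.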
# Code taken and modified from:
# https://python-graph-gallery.com/grouped-barplot-with-the-total-of-each-group-represented-as-a-grey-rectangle/
pivot_df = gender_by_decade.pivot(index='art_decade', columns='gender', values='count')
fig, ax = plt.subplots(figsize=(7, 5))
bar_width = 0.25
x = np.arange(len(pivot_df.index))
for i, sub_cat in enumerate(pivot_df.columns):
    ax.bar(x + i * bar_width, pivot_df[sub_cat],
           width=bar_width, label=sub_cat)
ax.set_xlabel("Decade")
ax.set_ylabel("Count")
ax.set_title("Gender of MTA Artists by Decade", loc="left")
# center each tick under its group of bars (one bar per gender)
ax.set_xticks(x + bar_width * (len(pivot_df.columns) - 1) / 2)
ax.set_xticklabels(pivot_df.index)
ax.legend(title="Gender of Artist")
plt.show()
The bar graph is slightly clearer, in my opinion, than the line graph. It also tells a less optimistic story, as it shows male artists dominating the MTA displays through the 1990s and 2000s. A slight shift toward female artists occurs in the 2010s and 2020s, but I would hesitate to say that female artists "dominate" the MTA displays in those years; instead it is simply a more equitable distribution.
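Since the number of pieces differs from decade to decade, it can help to compare shares rather than raw counts. The sketch below reuses the groupby/value_counts pattern with `normalize=True` on an invented `toy` frame; the numbers are made up and say nothing about the real MTA data.

```python
import pandas as pd

# Invented rows: four pieces in each of two decades.
toy = pd.DataFrame({
    "art_decade": ["1990s"] * 4 + ["2010s"] * 4,
    "gender": ["male", "male", "male", "female",
               "male", "female", "female", "unknown"],
})

# Within each decade, the fraction of rows falling under each gender label.
shares = toy.groupby("art_decade")["gender"].value_counts(normalize=True)
print(shares)
```

With shares, a decade with many installations and a decade with few can be read on the same scale.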
Step 3: Visualize Gender Distribution Over Space
Next I am curious to see if particular subway stations contain more male or female artists on display. I would first like to exclude subway stations that don't have very much art on display. This is partly because there are a LOT of subway stations, and it will be hard to include all of them on a readable visualization. Therefore I will calculate the average number of art pieces per subway station and will use this to decide on a cut point for which stations I want to include in my visualization.
station_totals = mta_art['station_name'].value_counts()
station_totals = station_totals.reset_index()
print("The average number of art pieces per station is", station_totals['count'].mean())
The average number of art pieces per station is 1.233009708737864
station_totals.head(20)
station_name | count | |
---|---|---|
0 | Times Sq-42 St | 7 |
1 | 86 St | 7 |
2 | 34 St-Herald Sq | 4 |
3 | 34 St-Penn Station | 4 |
4 | 125 St | 4 |
5 | Grand Central-42 St | 4 |
6 | Grand Central Terminal | 3 |
7 | 72 St | 3 |
8 | Bay Pkwy | 3 |
9 | 23 St | 3 |
10 | 50 St | 3 |
11 | 18 Av | 3 |
12 | Avenue U | 3 |
13 | Harlem-125 St | 3 |
14 | 96 St | 3 |
15 | Bellmore | 2 |
16 | 5 Av/53 St | 2 |
17 | 33 St | 2 |
18 | Broadway | 2 |
19 | Canal St | 2 |
stations_to_drop = list(station_totals.loc[station_totals['count']<3, "station_name"])
mta_art = mta_art.set_index('station_name')
mta_art_subset = mta_art.drop(stations_to_drop)
mta_art_subset = mta_art_subset.reset_index()
Based on the average of 1.2 pieces of art per station and a brief peek at the aggregated data, I have decided to exclude any stations with fewer than 3 pieces of art. I will now aggregate the data by station and generate a new bar graph.
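The same kind of threshold filter can be sketched without changing the frame's index, by attaching each row's station total with a grouped `transform` and filtering on it. The `toy` frame and the `min_pieces` name are assumptions made for illustration.

```python
import pandas as pd

# Invented rows: station A has 3 pieces, B has 1, C has 2.
toy = pd.DataFrame({
    "station_name": ["A", "A", "A", "B", "C", "C"],
    "artist": ["a1", "a2", "a3", "b1", "c1", "c2"],
})

min_pieces = 3  # hypothetical cut point, matching the threshold described above

# For every row, the number of pieces at that row's station.
pieces_at_station = toy.groupby("station_name")["station_name"].transform("count")

# Keep only rows whose station meets the threshold.
subset = toy[pieces_at_station >= min_pieces].reset_index(drop=True)
print(subset)
```

This keeps the original frame intact, which can be convenient if it is reused later with a different cut point.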
gender_by_station = mta_art_subset.groupby(mta_art_subset['station_name'])['gender'].value_counts()
gender_by_station = gender_by_station.reset_index()
# Code taken and modified from:
# https://python-graph-gallery.com/grouped-barplot-with-the-total-of-each-group-represented-as-a-grey-rectangle/
pivot_df = gender_by_station.pivot(index='station_name', columns='gender', values='count')
fig, ax = plt.subplots(figsize=(7, 5))
bar_width = 0.25
x = np.arange(len(pivot_df.index))
for i, sub_cat in enumerate(pivot_df.columns):
    ax.barh(x + i * bar_width, pivot_df[sub_cat],
            height=bar_width, label=sub_cat)
ax.set_xlabel("Count")
ax.set_ylabel("Subway Station")
ax.set_title("Gender of MTA Artists by Subway Station", loc="left")
# center each tick beside its group of bars (one bar per gender)
ax.set_yticks(x + bar_width * (len(pivot_df.columns) - 1) / 2)
ax.set_yticklabels(pivot_df.index)
#ax.tick_params(labelsize=8)
ax.legend(title="Gender of Artist")
plt.show()
This final data visualization shows an additional aspect of gender representation within the MTA art displays. Of the stations that have the most art, 7 are dominated by male artists, 6 are dominated by female artists, and 2 have an equal number of male and female artists. This seems like a fairly equitable distribution! However, it is of note that the Times Sq-42 St station is heavily dominated by male artists. I suspect that this is one of the MTA's most heavily utilized stations, and a quick look at the MTA's ridership data confirms this suspicion. This suggests that MTA passengers look at art created by men somewhat more frequently than at art created by women, and this may be especially true for tourists and others spending most of their time in the Manhattan area.
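A tally like the one above (stations led by male artists, led by female artists, or tied) can be computed rather than read off the chart. The sketch below does this on an invented `toy` frame; the station names and counts are made up, and "leads" is defined here simply as having more pieces than the other gender, ignoring unknowns.

```python
import pandas as pd

# Invented rows for three stations.
toy = pd.DataFrame({
    "station_name": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "gender": ["male", "male", "female",
               "female", "female", "unknown",
               "male", "female", "unknown", "unknown"],
})

# Wide table of counts: one row per station, one column per gender, zeros where absent.
counts = toy.groupby("station_name")["gender"].value_counts().unstack(fill_value=0)

# Positive difference means more male-attributed pieces, negative means more female.
diff = counts["male"] - counts["female"]
leader = diff.apply(lambda d: "male" if d > 0 else "female" if d < 0 else "tie")

tally = leader.value_counts()
print(tally)
```

Computing the tally this way also makes it easy to rerun with a different station cut point.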
Conclusion
Gender representation among artists displayed within the MTA system is a nuanced topic. Though male artists dominated the early years of the program, the MTA has course-corrected over time with more and more female artists included. Male artists dominate certain subway stations, while female artists dominate others. However, it is worth noting that the top 3 most utilized subway stations (Times Sq-42 St, Grand Central-42 St, and 34 St-Herald Sq) are all dominated by male artists, indicating that the average MTA rider is still slightly more likely to be exposed to art created by men than by women.