Tidy Tuesday for July 8th, 2025

The XKCD Color Survey

Irene Morse

Setup and Introduction

In [24]:
# for dataframe wrangling
import pandas as pd
In [25]:
answers = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-08/answers.csv')
color_ranks = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-08/color_ranks.csv')
users = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-08/users.csv')
In [26]:
users['colorblind'].value_counts()
Out[26]:
count
colorblind
0.0 136965
1.0 5588

For this week's Tidy Tuesday dataset, I'm inspired by the idea of colorblindness. The dataset contains about 5,600 users who self-identify as colorblind, with the remaining 137,000 or so saying they are not colorblind. I am curious about the ways in which colorblind users perceive colors differently from non-colorblind users. More specifically, do colorblind users perceive a wider range of hex codes as a particular named color, as compared to non-colorblind users? For example, does the concept of "green" reflect a wider range of color codes for colorblind users than for non-colorblind users?

Step 1: Merge Datasets and Remove Spam

In [27]:
answers_w_colors = pd.merge(answers, color_ranks.drop('hex', axis=1), on='rank', how="inner")
In [28]:
users_answers = pd.merge(answers_w_colors, users, on='user_id')
In [29]:
users_answers.head()
Out[29]:
user_id hex rank color monitor y_chromosome colorblind spam_prob
0 1 #8240EA 1 purple LCD 1.0 0.0 0.002088
1 2 #4B31EA 3 blue LCD 1.0 0.0 0.074577
2 2 #584601 5 brown LCD 1.0 0.0 0.074577
3 2 #DA239C 4 pink LCD 1.0 0.0 0.074577
4 2 #B343E5 1 purple LCD 1.0 0.0 0.074577

The merged dataset contains data at the user-color level. Each row shows how a particular user labeled a unique color (HEX code). It also indicates whether or not the user self-identified as colorblind, the probability that the user response was spam, and other info. For example, the top row of the dataset indicates that user 1 was shown color #8240EA and labeled it as "purple." This user does not identify as colorblind.
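The two merges above can silently drop or duplicate rows if the keys do not line up as expected. As a minimal sketch on made-up toy tables (the values below are illustrative stand-ins, not the real survey data), pandas' `validate` and `indicator` options offer one way to check this. Note that the sketch uses a left join for the second merge purely to expose unmatched rows, whereas the notebook relies on the default inner join.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are made up for illustration.
answers = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "hex": ["#8240EA", "#4B31EA", "#584601", "#DA239C"],
    "rank": [1, 3, 5, 4],
})
color_ranks = pd.DataFrame({
    "rank": [1, 3, 4, 5],
    "color": ["purple", "blue", "pink", "brown"],
})
users = pd.DataFrame({
    "user_id": [1, 2],
    "colorblind": [0.0, 0.0],
})

# validate="many_to_one" raises MergeError if the right-hand keys are not unique,
# which would otherwise silently duplicate answer rows.
with_colors = answers.merge(color_ranks, on="rank", how="inner",
                            validate="many_to_one")

# indicator=True adds a "_merge" column recording where each row came from.
checked = with_colors.merge(users, on="user_id", how="left",
                            validate="many_to_one", indicator=True)

# Answers whose user is missing from the users table (user 3 in this toy data).
unmatched = checked[checked["_merge"] == "left_only"]
print(len(with_colors), len(unmatched))
```

If `unmatched` is empty on the real data, the inner join used in the notebook loses no answers.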

In [30]:
users_answers['spam_prob'].describe()
Out[30]:
spam_prob
count 1.058211e+06
mean 1.259867e-01
std 2.704004e-01
min 7.416590e-05
25% 3.713345e-02
50% 9.042199e-02
75% 1.634402e-01
max 8.428621e+00

I would like to remove user responses that have a high probability of being spam. A closer look at the "spam_prob" variable shows that it ranges from about 0.000074 to 8.43. Because some values exceed 1, the variable cannot be a probability in the strict sense, and it is unclear how it has been scaled; therefore, I will keep only the responses whose "spam_prob" value is below the mean.

In [31]:
# .copy() makes the filtered frame independent of users_answers,
# so the column assignments in later cells are unambiguous
users_answers_clean = users_answers[users_answers['spam_prob'] < users_answers['spam_prob'].mean()].copy()
In [32]:
# each row is one user response, so these counts are rows rather than distinct users
print("This reduces the dataset from ", len(users_answers), "rows to ", len(users_answers_clean), "rows.")
This reduces the dataset from  1058211 rows to  672374 rows.
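How many rows a mean-based cutoff keeps depends on how skewed the scores are. In the summary above, the mean (about 0.126) sits between the median (0.090) and the 75th percentile (0.163), which is consistent with roughly 64% of rows surviving the filter. The sketch below uses made-up, right-skewed scores (not the real data) and a hypothetical helper, `share_kept`, to compare a mean cutoff with a median cutoff.

```python
import statistics

# Made-up, right-skewed scores that loosely mimic the shape of spam_prob.
scores = [0.01, 0.02, 0.04, 0.05, 0.09, 0.10, 0.15, 0.20, 0.90, 8.40]

mean_cutoff = statistics.mean(scores)
median_cutoff = statistics.median(scores)

def share_kept(values, cutoff):
    """Fraction of values strictly below the cutoff (hypothetical helper)."""
    return sum(v < cutoff for v in values) / len(values)

print(f"mean cutoff   {mean_cutoff:.3f} keeps {share_kept(scores, mean_cutoff):.0%}")
print(f"median cutoff {median_cutoff:.3f} keeps {share_kept(scores, median_cutoff):.0%}")
```

A few very large scores pull the mean upward, so a mean cutoff is more lenient than a median cutoff on data shaped like this.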

Step 2: Convert HEX Color Codes to RGB Color Codes for Better Spatial Mapping

In [33]:
# for HEX to RGB
from PIL import ImageColor
# for cleaner appearance
import warnings
warnings.filterwarnings('ignore')
In [34]:
# convert HEX to RGB
users_answers_clean['rgb'] = users_answers_clean['hex'].apply(ImageColor.getcolor, mode="RGB")
In [35]:
# split the RGB tuples into separate integer columns in one vectorized step
rgb_columns = pd.DataFrame(users_answers_clean['rgb'].tolist(),
                           index=users_answers_clean.index,
                           columns=['red', 'green', 'blue'])
users_answers_clean[['red', 'green', 'blue']] = rgb_columns
In [36]:
# convert colorblind to category type variable
users_answers_clean['colorblind'] = users_answers_clean['colorblind'].astype("category")
In [37]:
users_answers_clean.head()
Out[37]:
user_id hex rank color monitor y_chromosome colorblind spam_prob rgb red green blue
0 1 #8240EA 1 purple LCD 1.0 0.0 0.002088 (130, 64, 234) 130 64 234
1 2 #4B31EA 3 blue LCD 1.0 0.0 0.074577 (75, 49, 234) 75 49 234
2 2 #584601 5 brown LCD 1.0 0.0 0.074577 (88, 70, 1) 88 70 1
3 2 #DA239C 4 pink LCD 1.0 0.0 0.074577 (218, 35, 156) 218 35 156
4 2 #B343E5 1 purple LCD 1.0 0.0 0.074577 (179, 67, 229) 179 67 229
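The conversion itself is simple: a HEX code packs the three channels as two hexadecimal digits each. As a minimal sketch of what the conversion amounts to (the function name `hex_to_rgb` is my own, and this is not how `ImageColor.getcolor` is implemented internally), the same result can be obtained with the standard library alone:

```python
def hex_to_rgb(hex_code):
    """Convert a '#RRGGBB' string into an (R, G, B) tuple of ints in 0-255."""
    digits = hex_code.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected six hex digits, got {hex_code!r}")
    # Each channel is two hexadecimal digits: RR, GG, BB.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

print(hex_to_rgb("#8240EA"))  # (130, 64, 234), matching the first row of the table above
```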

Step 3: Visualize Perceptual Differences Between Colorblind Users and Non-Colorblind Users

In [38]:
# for ternary plots
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook_connected"

I will use ternary plots to help visualize the differences in color labeling between colorblind and non-colorblind users. Ternary plots provide a way to graph three variables in proportion to one another. This works well for the RGB color model, which represents each color as a combination of red (R), green (G), and blue (B) intensities. Because a ternary plot shows only the relative shares of the three channels, colors that differ only in overall brightness, such as (100, 50, 50) and (200, 100, 100), land on the same point. Below I provide an example of how 5 randomly selected RGB colors from the dataset can be depicted on a ternary plot.

In [39]:
five_colors = users_answers_clean.sample(n=5, random_state=12)
# add a label variable to allow color mapping in plotly
five_colors["label"]=["color1","color2","color3","color4","color5"]
In [40]:
# map each label to its own RGB value so every point is drawn in the color it represents
point_colors = {label: "rgba({}, {}, {}, 1.0)".format(*rgb)
                for label, rgb in zip(five_colors["label"], five_colors["rgb"])}
fig = px.scatter_ternary(five_colors, a="red", b="blue", c="green", color="label",
                         color_discrete_map=point_colors,
                         title="Ternary Plot of Five Random RGB Colors")
fig.update_layout(showlegend=False)
fig.show()
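To make the plotted positions concrete, the sketch below computes the quantity a ternary axis displays: each channel's share of the row total. The helper name `ternary_proportions` and its handling of pure black are my own assumptions for illustration; I have not checked how plotly treats a row whose three values sum to zero.

```python
def ternary_proportions(red, green, blue):
    """Return each channel's share of the total, i.e. the position on a ternary plot."""
    total = red + green + blue
    if total == 0:
        # Pure black has no defined proportions (assumed handling for this sketch).
        return None
    return (red / total, green / total, blue / total)

dark = ternary_proportions(100, 50, 50)
light = ternary_proportions(200, 100, 100)
print(dark, light, dark == light)
```

The two colors differ only in brightness, so they receive identical coordinates. The plots in this section therefore compare hue balance rather than lightness.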
In [41]:
users_answers_clean['color'].value_counts()
Out[41]:
count
color
green 183775
blue 177761
purple 172333
pink 87738
brown 50767

Users labeled their HEX codes using five overarching color labels: green, blue, purple, pink, and brown. Let's take a look at each color individually to see how colorblind users labeled the HEX codes differently (or perhaps not so differently) from non-colorblind users.

RGB Colors Labeled as Green

In [42]:
greens = users_answers_clean[users_answers_clean['color']=="green"]
In [43]:
fig = px.scatter_ternary(greens, a="red", b="blue", c="green", color="colorblind",
                         color_discrete_map = {1:"rgba(130, 200, 50, 0.5)", 0:"rgba(50, 90, 50, 1.0)"},
                         title="RGB Colors Labeled as Green by Colorblind and Non-Colorblind Users")
fig.show()

The range of RGB colors labeled as "green" by colorblind and non-colorblind users is broadly similar. Colorblind users appear slightly more likely to label redder greens as "green" than non-colorblind users, but the differences between the groups are marginal.

RGB Colors Labeled as Blue

In [44]:
blues = users_answers_clean[users_answers_clean['color']=="blue"]
In [45]:
fig = px.scatter_ternary(blues, a="red", b="blue", c="green", color="colorblind",
                         color_discrete_map = {1:"rgba(90, 230, 230, 0.5)", 0:"rgba(50, 50, 90, 1.0)"},
                         title="RGB Colors Labeled as Blue by Colorblind and Non-Colorblind Users")
fig.show()

For RGB colors labeled as "blue," we can see some more systematic differences between colorblind users and non-colorblind users. Colorblind users are more likely to label redder blues as "blue," while non-colorblind users are more likely to label greener blues as "blue."

RGB Colors Labeled as Purple

In [46]:
purples = users_answers_clean[users_answers_clean['color']=="purple"]
In [47]:
fig = px.scatter_ternary(purples, a="red", b="blue", c="green", color="colorblind",
                         color_discrete_map = {1:"rgba(220, 170, 250, 0.5)", 0:"rgba(120, 10, 150, 1.0)"},
                         title="RGB Colors Labeled as Purple by Colorblind and Non-Colorblind Users")
fig.show()

Non-colorblind users are likely to label a wider range of RGB colors as "purple" when compared to colorblind users. Colorblind users also appear to be slightly more likely to label bluer-greener purples as "purple."

RGB Colors Labeled as Pink

In [48]:
pinks = users_answers_clean[users_answers_clean['color']=="pink"]
In [49]:
fig = px.scatter_ternary(pinks, a="red", b="blue", c="green", color="colorblind",
                         color_discrete_map = {1:"rgba(255, 195, 240, 0.5)", 0:"rgba(255, 70, 210, 1.0)"},
                         title="RGB Colors Labeled as Pink by Colorblind and Non-Colorblind Users")
fig.show()

Pink displays very similar trends to purple, with non-colorblind users labeling a wider range of RGB colors as "pink" and colorblind users slightly more likely to label bluer-greener pinks as "pink."

RGB Colors Labeled as Brown

In [50]:
browns = users_answers_clean[users_answers_clean['color']=="brown"]
In [51]:
fig = px.scatter_ternary(browns, a="red", b="blue", c="green", color="colorblind",
                         color_discrete_map = {1:"rgba(180, 125, 40, 0.5)", 0:"rgba(100, 65, 10, 1.0)"},
                         title="RGB Colors Labeled as Brown by Colorblind and Non-Colorblind Users")
fig.show()

Similar to blue, the plot for brown shows some systematic differences between colorblind users and non-colorblind users. Non-colorblind users are much more likely to identify redder browns as "brown." They also see a wider range of RGB colors as "brown." In contrast, colorblind users are more likely to identify greener browns as "brown."
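The comparisons above are visual, and the non-colorblind group is far larger (about 137,000 versus 5,600 users before spam filtering), so its point cloud may look wider partly because it simply contains more points. One possible way to complement the plots is to compare a numeric measure of spread per group. The sketch below is an illustration on made-up rows (not the real data) and uses the standard deviation of each channel's share; the column names ending in `_share` are my own.

```python
import pandas as pd

# Made-up rows standing in for users_answers_clean (illustrative values only).
toy = pd.DataFrame({
    "color": ["green"] * 6,
    "colorblind": [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    "red":   [20, 60, 100, 40, 50, 60],
    "green": [200, 180, 160, 190, 180, 170],
    "blue":  [30, 60, 40, 20, 30, 40],
})

# Normalize each row so the three channels sum to 1, as a ternary plot does.
channels = toy[["red", "green", "blue"]]
shares = channels.div(channels.sum(axis=1), axis=0).add_suffix("_share")

# Standard deviation of each share within every (label, group) pair.
spread = (
    pd.concat([toy[["color", "colorblind"]], shares], axis=1)
    .groupby(["color", "colorblind"])
    .std()
)
print(spread)
```

On the real data, a larger standard deviation for one group would support the impression that it applies the label to a wider range of colors, independent of how many points are drawn.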

Conclusion

The differences between colorblind and non-colorblind users are not as straightforward as I originally thought. The plots reveal that colorblind users do not necessarily see a wider range of unique colors as a particular named color. If anything, the opposite is often the case; colorblind users appear to be more conservative in applying the color labels than non-colorblind users, especially for purple and pink. The differences between the two groups are much more nuanced and are most evident for the blue and brown labels.