Software: Python
In this project, we survey the artists represented at the Museum of Contemporary Art (MCA) Chicago. The dataset is a JSON file containing over 10,000 artists whose works are displayed at the museum. The goal is to generate Sankey visualizations with the Plotly Python library that show the connections between an artist's nationality, the decade in which they were born, and their gender. Which countries, decades, and genders are most represented?
The data is in a JSON file format, listing each artist by name, bio, nationality, gender, birth year, death year, “constituent ID”, “Wiki QID”, and “ULAN”.
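As a rough sketch of how records like these could be loaded and given a birth-decade column, consider the snippet below. The field names (`DisplayName`, `Nationality`, `Gender`, `BeginDate`) and the sample records are assumptions made for illustration and may not match the keys in the real artists.json.

```python
import json

import pandas as pd

# Hypothetical records; the field names are assumptions, not taken from the real file.
SAMPLE_JSON = """
[
  {"DisplayName": "Artist A", "Nationality": "American", "Gender": "Male", "BeginDate": 1930},
  {"DisplayName": "Artist B", "Nationality": "French", "Gender": "Female", "BeginDate": 1947},
  {"DisplayName": "Artist C", "Nationality": "American", "Gender": null, "BeginDate": 0}
]
"""


def load_artists(json_text):
    """Parse JSON text into a DataFrame and add a Decade column."""
    df = pd.DataFrame(json.loads(json_text))
    # Floor division maps 1947 to 1940; an unknown birth year stored as 0 stays 0
    df["Decade"] = (df["BeginDate"] // 10) * 10
    return df


artists = load_artists(SAMPLE_JSON)
print(artists[["DisplayName", "Nationality", "Gender", "Decade"]])
```

For the real file, the same function could be fed the text read from artists.json instead of the inline sample.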
artists.json sample
To do this, I created two files: sankey.py and main.py.
sankey.py: A wrapper library for Plotly Sankey visualizations, with helper functions that prepare data for them
Functions include:
aggregate_data – Aggregate data from df grouping by col1 and col2, adding a column val_col with size of groups. Return this data as a dataframe.
clean_data – Clean the given dataframe by filtering out rows where the decade is 0 or where data is missing (NaN), then filter out rows where 'ArtistCount' is below count_threshold. Return the resulting dataframe.
code_mapping – Map labels in src and targ columns to integers. Return dataframe of src, targ, and vals columns as well as list of labels.
make_sankey – Create and return a sankey figure using parameters:
df - Dataframe
src - Source node column
targ - Target node column
cols - Optional additional columns to be used for multi-layered sankey diagrams
vals - Link values (ex: thickness, pad)
count_threshold - Number of artists to filter data by for multi-layer diagrams; default is 20
main.py: Execute functions in sankey.py in main() to create sankey diagrams
main.py code sample
Sankey diagram mapping artist nationality to decade of birth
Sankey diagram mapping artist nationality to gender
Sankey diagram mapping artist gender to decade of birth
Multi-layered sankey diagram mapping artist nationality to gender to decade of birth
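One common way to prepare data for a multi-layered diagram like the last one is to build a source/target table for each consecutive pair of columns and stack the tables. The sketch below illustrates that idea on made-up data. The `stack_layers` helper and the column names are hypothetical and not necessarily how the `cols` parameter is handled in sankey.py.

```python
import pandas as pd


def stack_layers(df, cols, val_col="ArtistCount"):
    """Build one src/targ/value table from each consecutive pair of columns."""
    pairs = []
    for left, right in zip(cols, cols[1:]):
        counts = df.groupby([left, right]).size().reset_index(name=val_col)
        # Use shared column names so the per-pair tables can be concatenated
        counts.columns = ["src", "targ", val_col]
        pairs.append(counts)
    return pd.concat(pairs, ignore_index=True)


# Made-up rows for illustration only
demo = pd.DataFrame({
    "Nationality": ["American", "American", "French"],
    "Gender": ["Male", "Female", "Female"],
    "Decade": [1930, 1940, 1940],
})
stacked = stack_layers(demo, ["Nationality", "Gender", "Decade"])
print(stacked)
```

The stacked table can then be label-encoded and passed to a Sankey trace in the same way as a two-column table, since each row is still a single source-to-target link.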
Overall, these observations point to a lack of diversity, equity, and inclusion in the art world. The vast majority of artists in the data are white men, and women only begin to appear 70 years after the first male artists in the data. Even when women are represented, their presence pales in comparison to that of male artists, who appear in larger numbers and across more of the years covered by the dataset. This holds even for the "later" years in the dataset, when we would expect more emphasis to be placed on female artists (in 1970, there were about half as many female artists as male artists).
From a nationality perspective, as mentioned, most artists were American, followed by artists from European countries, with a few from South America and Asia. In other words, most of the artists were from predominantly white countries. With all this in mind, there is a significant bias in the art world for the years represented in this dataset (and for the years after) that needs to be addressed so that artists from a wide variety of racial, ethnic, and gender backgrounds have an equal chance to display their work to the public.
Through this assignment, I refined my ability to extract meaningful narratives from data and present them visually. Creating the Sankey diagrams required cleaning and aggregating data, debugging Python code, and fine-tuning visual elements for clarity and impact. Interpreting the results deepened my understanding of how data science can illuminate broader social and cultural issues, aligning technical insights with ethical considerations. This experience reinforced the value of data visualization as a tool for driving conversations about diversity and inclusion. By working on this project, I not only improved my technical skills but also gained a richer perspective on how data science can contribute to equity-focused initiatives in various fields.