Software: Python
As we progress into advanced programming, our focus shifts toward reusability, code maintainability, and extensibility. In simple projects, a one-off analysis of a single dataset may be all that is required. But what if we want to build a framework that can handle a variety of datasets? In this project, we will build a reusable framework for comparative text analysis and demonstrate its reusability by comparing a set of documents.
Natural language processing (NLP) library requirements:
The library should be implemented as the object instance of a class.
You should be able to load or register up to 10 related text files. You may choose the text files, but they should be related in some way.
Each text file that you load should be pre-processed so that the pre-processor cleans the data, removing unnecessary whitespace, punctuation, and capitalization. You may want the pre-processor to gather some statistics such as average sentence or word length, readability scores, sentiment, and so on.
Implement a generic parser and pre-processor for handling simple unstructured text files. Also implement the ability to specify a custom domain-specific parser. Then, when registering a file, you can specify a custom parsing function that will carry out the parsing and pre-processing of your unique files (see the sketch after this list).
Parsing exceptions should be handled with a framework-specific exception class (inherited from Exception).
The library will support three visualizations: one is very specific, while the other two are left to your discretion.
(Specific) Text-to-Word Sankey diagram. Given the loaded texts and either a set of user-defined words OR the set of words drawn from the k most common words of each text file, generate a Sankey diagram from text name to word, where the thickness of the connection represents the wordcount of that word in the specified text.
(Flexible) Any type of visualization containing sub-plots, one sub-plot for each text file. For example, an array of word clouds, one per text file, would satisfy this requirement, but be sure to write your code so that the subplot array dimensions can handle 2 to 10 submitted documents. If you want, the dimensions of the subplot array can be overridable by the user, but you should define reasonable defaults.
(Flexible) Any type of comparative visualization that overlays information from each text file onto a single visualization.
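Before getting into my implementation, here is a minimal sketch of how the parser-override and exception requirements could fit together; all names here are illustrative, not prescribed by the assignment:

    # Minimal sketch of the parser-override and exception pattern (all names
    # are illustrative). A default parser handles plain text; a custom parsing
    # function can be supplied when a file is registered.
    from collections import Counter


    class ParsingError(Exception):
        """Framework-specific parsing exception, inherited from Exception."""


    class TextFramework:
        def __init__(self):
            self.data = {}  # statistic name --> {file label --> value}

        def default_parser(self, filename):
            """Generic parser/pre-processor for simple unstructured text."""
            with open(filename, encoding="utf-8") as f:
                words = f.read().lower().split()
            return {"wordcount": Counter(words), "numwords": len(words)}

        def load_text(self, filename, label=None, parser=None):
            """Register a file, optionally with a custom parsing function."""
            parse = parser if parser is not None else self.default_parser
            try:
                results = parse(filename)
            except Exception as err:
                raise ParsingError(f"could not parse {filename!r}") from err
            for key, value in results.items():
                self.data.setdefault(key, {})[label or filename] = value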
The data I chose to use for this project was song lyrics by The Beatles from AZLyrics.com. I chose the lyrics for the first song from their albums “Please Please Me”, “With the Beatles”, “A Hard Day’s Night”, “Beatles for Sale”, “Help!”, “Rubber Soul”, “Revolver”, “Sgt. Pepper’s Lonely Hearts Club Band”, “The Beatles (White Album)”, and “Abbey Road” to examine song lyrics across albums released between 1963 and 1969. In doing so, I aim to identify and characterize the band’s stylistic changes to better understand The Beatles’ musical evolution. The analysis was capped at 10 “documents” (10 sets of lyrics), which is why some other Beatles albums were not included.
The stopwords.txt file was also used to filter out stopwords from the song lyrics to ensure the lyrical analysis yielded meaningful results.
To do this, I created three files: lyricool.py, lyricool_app.py, and lyricool_parsers.py.
lyricool.py: A reusable library for lyric analysis and comparison. In theory, the framework should support any lyrics of interest, although a custom parser may be useful when pulling from a website (such as AZLyrics.com).
Within this file, I created a Lyricool class, which included methods such as the following (a structural skeleton follows the list):
__init__ – Constructor; initializes the core state as a nested mapping (e.g., datakey --> (filelabel --> datavalue))
load_stop_words – Load the stopwords file and return a list of stopwords for later filtering
default_preprocessor – Pre-process standard text file of lyrics for future parsing and visualization. Filter stopwords from stopwords file, if provided.
default_parser – Parse standard text file of lyrics and produce extracted data results in the form of a dictionary
load_text – Register a document (lyrics) with the framework. Extract and store data to be used later by the visualizations.
wordcount_sankey – Map each text (lyrics) to words using a Sankey diagram, where the thickness of a line is the number of times that word occurs in the text. Users can specify a particular set of words, or the words can be the union of the k most common words across the text files (excluding stop words).
most_frequent_words_subplot – Determine the n most frequent words in each text file (lyrics) and create one sub-plot of those words per file in selected_labels
polarity_subjectivity_scatterplot – Create a scatterplot comparing polarity vs. subjectivity of each text file (lyrics) in selected_labels
This file also had a LyricoolParsingError class for handling errors in parsing lyrics.
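The skeleton below sketches how these pieces fit together; the bodies are elided, and the exact signatures in my implementation differ slightly:

    # Structural skeleton of lyricool.py (bodies elided; a sketch, not the
    # full implementation).
    from collections import defaultdict


    class LyricoolParsingError(Exception):
        """Raised when a set of lyrics cannot be parsed."""


    class Lyricool:
        def __init__(self):
            # Core state: datakey --> (filelabel --> datavalue)
            self.data = defaultdict(dict)

        @staticmethod
        def load_stop_words(stopfile):
            """Return the stopwords file as a list of lowercase words."""
            with open(stopfile, encoding="utf-8") as f:
                return [line.strip().lower() for line in f if line.strip()]

        def load_text(self, filename, label=None, parser=None):
            """Register a document; fall back to default_parser when no
            custom parser is given, and store each extracted statistic
            under self.data[datakey][label]."""

        def wordcount_sankey(self, word_list=None, k=5):
            """Text-to-word Sankey diagram; see the sketch further below."""

        def most_frequent_words_subplot(self, n=10, selected_labels=None):
            """One sub-plot of the n most frequent words per text."""

        def polarity_subjectivity_scatterplot(self, selected_labels=None):
            """Scatterplot of polarity vs. subjectivity per text."""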
lyricool_app.py: Applies the text analysis library to multiple sets of lyrics
This file had only a main() function, in which an object of the Lyricool class was instantiated, lyrics data was loaded for the desired songs, the three visualizations were generated, and parsing-specific errors were handled using the LyricoolParsingError class.
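In outline, the driver looked something like the sketch below (the song URLs and exact keyword arguments are illustrative):

    # Sketch of lyricool_app.py (URLs and keyword arguments are illustrative).
    from lyricool import Lyricool, LyricoolParsingError
    from lyricool_parsers import az_lyrics_parser


    def main():
        lyricool = Lyricool()

        # One song per album, up to the 10-document cap.
        songs = {
            "I Saw Her Standing There": "https://www.azlyrics.com/...",
            "It Won't Be Long": "https://www.azlyrics.com/...",
            # ... remaining albums ...
        }

        try:
            for label, url in songs.items():
                lyricool.load_text(url, label=label, parser=az_lyrics_parser)
        except LyricoolParsingError as err:
            print(f"Parsing failed: {err}")
            return

        lyricool.wordcount_sankey(k=5)
        lyricool.most_frequent_words_subplot(n=10)
        lyricool.polarity_subjectivity_scatterplot()


    if __name__ == "__main__":
        main()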
lyricool_parsers.py: A custom parser to pre-process and parse song lyrics from AZLyrics.com. The final output of the parser is a dictionary containing wordcount, numwords, avg_word_length, unique_word_count, type_token_ratio, sentiment, polarity, subjectivity, and emotions for the given lyrics.
This file has two functions:
az_lyrics_preprocessor – Pre-process lyrics from an AZLyrics URL for future parsing and visualization. Filter stopwords from stopwords file, if provided.
az_lyrics_parser – Parse song lyrics from an AZLyrics URL and produce extracted data results in the form of a dictionary
Errors were still handled using the LyricoolParsingError class from lyricool.py.
lyricool_parsers.py code sample
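In spirit, the parser file looks something like the following sketch (assuming requests, BeautifulSoup, and TextBlob; the page-structure selector is an assumption, and the real file also computes sentiment and emotion statistics):

    # Sketch of lyricool_parsers.py (the AZLyrics page selector is an
    # assumption and may need adjusting; the real file also computes
    # sentiment and emotions).
    import re
    from collections import Counter

    import requests
    from bs4 import BeautifulSoup
    from textblob import TextBlob

    from lyricool import LyricoolParsingError


    def az_lyrics_preprocessor(url, stopfile=None):
        """Fetch lyrics from an AZLyrics URL; return a cleaned word list."""
        html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
        soup = BeautifulSoup(html, "html.parser")
        # On AZLyrics pages the lyrics sit in an unclassed <div>.
        lyrics_div = soup.find("div", class_=None, id=None)
        if lyrics_div is None:
            raise LyricoolParsingError(f"no lyrics found at {url}")
        words = re.findall(r"[a-z']+", lyrics_div.get_text(" ").lower())
        if stopfile:
            with open(stopfile, encoding="utf-8") as f:
                stops = {line.strip().lower() for line in f if line.strip()}
            words = [w for w in words if w not in stops]
        return words


    def az_lyrics_parser(url, stopfile=None):
        """Parse an AZLyrics URL into the statistics dictionary above."""
        words = az_lyrics_preprocessor(url, stopfile)
        blob = TextBlob(" ".join(words))
        n = len(words)
        return {
            "wordcount": Counter(words),
            "numwords": n,
            "avg_word_length": sum(map(len, words)) / n if n else 0,
            "unique_word_count": len(set(words)),
            "type_token_ratio": len(set(words)) / n if n else 0,
            "polarity": blob.sentiment.polarity,
            "subjectivity": blob.sentiment.subjectivity,
        }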
Text-to-word sankey diagram
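A sketch of how such a Sankey can be assembled with plotly, assuming one wordcount Counter per text (the function and parameter names here are illustrative):

    # Sketch: text-to-word Sankey from per-text wordcounts using plotly.
    import plotly.graph_objects as go


    def wordcount_sankey(wordcounts, word_list=None, k=5):
        """wordcounts: dict of text label --> collections.Counter."""
        if word_list is None:
            # Union of the k most common words from each text.
            word_list = sorted({w for wc in wordcounts.values()
                                for w, _ in wc.most_common(k)})
        labels = list(wordcounts) + word_list
        src, tgt, val = [], [], []
        for i, wc in enumerate(wordcounts.values()):
            for j, word in enumerate(word_list):
                if wc[word] > 0:
                    src.append(i)                    # text node
                    tgt.append(len(wordcounts) + j)  # word node
                    val.append(wc[word])             # link thickness
        fig = go.Figure(go.Sankey(
            node=dict(label=labels, pad=15),
            link=dict(source=src, target=tgt, value=val)))
        fig.show()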
Subplots of most frequent words in selected songs
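The grid-sizing logic for 2 to 10 songs can be sketched as follows (names illustrative; it defaults to a near-square grid that the user can override):

    # Sketch: pick subplot grid dimensions for 2-10 texts, with an override.
    import math

    import matplotlib.pyplot as plt


    def make_subplot_grid(n_texts, rows=None, cols=None):
        """Return (figure, list of n_texts axes); near-square by default."""
        if cols is None:
            cols = math.ceil(math.sqrt(n_texts))
        if rows is None:
            rows = math.ceil(n_texts / cols)
        fig, axes = plt.subplots(rows, cols, figsize=(4 * cols, 3 * rows))
        axes = list(axes.flat) if rows * cols > 1 else [axes]
        for ax in axes[n_texts:]:   # hide any unused grid cells
            ax.set_visible(False)
        return fig, axes[:n_texts]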
Scatterplot of polarity vs. subjectivity across selected songs
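In the library, the polarity and subjectivity values come from the stored parser output; they are recomputed inline with TextBlob here to keep the sketch self-contained:

    # Sketch: polarity vs. subjectivity scatterplot, one point per song.
    import matplotlib.pyplot as plt
    from textblob import TextBlob


    def polarity_subjectivity_scatterplot(texts):
        """texts: dict of song label --> raw lyrics string."""
        for label, text in texts.items():
            sent = TextBlob(text).sentiment
            plt.scatter(sent.polarity, sent.subjectivity)
            plt.annotate(label, (sent.polarity, sent.subjectivity))
        plt.xlabel("Polarity (-1 negative to +1 positive)")
        plt.ylabel("Subjectivity (0 objective to 1 subjective)")
        plt.title("Polarity vs. subjectivity by song")
        plt.show()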
The analysis of Beatles lyrics reveals thematic shifts across their discography. Early albums (“Please Please Me” to “Beatles for Sale”) center on love and fleeting heartbreak, reflecting the band’s pop origins. Mid-phase albums (“Help!” to “Revolver”) explore deeper themes of self-doubt, political critique (e.g., “Taxman”), and existential questioning. Late albums (“Sgt. Pepper's Lonely Hearts Club Band” to “Abbey Road”) showcase experimental approaches influenced by psychedelics and innovative production techniques, signaling their artistic evolution. The song-to-word Sankey diagram highlights word overlaps across albums, with “It Won't Be Long” and “Sgt. Pepper's Lonely Hearts Club Band” (from the second and eighth albums, respectively) sharing the most words with the other songs. Among the most repeated terms, “home” (5 occurrences) and “night” (4) in particular show how The Beatles evolved from romantic metaphors in early albums to literal references in later ones, reflecting a shift in tone and maturity.
The polarity-subjectivity scatterplot shows most songs as sentimentally neutral (polarity near 0) and balanced in subjectivity (near 0.5). Notable outliers like “Taxman” (negative and subjective), “Drive My Car” (more subjective), “Help!”, and even “Back in the USSR” (more objective) reveal thematic and emotional diversity over time, as these songs come from different phases, although the earlier albums remain consistently balanced in both polarity and subjectivity. These findings provide a data-driven perspective on The Beatles' lyrical evolution, linking their themes to broader societal shifts of the 1960s, including changing views on love, identity, and politics. Given more time, I would like to analyze every song on each album to examine how tone shifts within a single album. I would also like to add more albums to the analysis, such as “Magical Mystery Tour” or “Let It Be”, and compare Beatles lyrics to those of other 1960s artists to contextualize The Beatles’ audience appeal at their peak.
This project was an extremely valuable learning experience in building a reusable and extensible framework for comparative text analysis, emphasizing clean code, maintainability, and adaptability. While many of my previous coding projects could only take in a specific file or file path for data, this project pushed me to think about an accessible and flexible user experience wherein a user can analyze any set of song lyrics (on AZLyrics.com). As a result, I was encouraged to think not only about how to write the code, but also about how data is accessed at a fundamental level, something that is important to account for as I continue to advance my programming skills. In fact, one challenge I faced during this project was getting temporarily blocked from AZLyrics.com: I was using my custom parser so frequently that the website flagged me as a bot. With this in mind, I realized that I shouldn’t take data access for granted, and that many unexpected obstacles can arise in the course of data extraction and parsing.
In addition to expanding my natural language processing knowledge, this project allowed me to improve my data visualization skills. In particular, the second visualization, subplots of the most frequent words by song, challenged me to be intentional about graph size, font size, layout, and so on, since there was so much information to present. This translated into very specific lines of code for subplot sizing, color, and more, so that the final result was user-friendly and readable.
Finally, like many of my other projects, this project allowed me to think about code structure from a broader perspective as I was structuring the Lyricool library. By splitting my code into multiple files and giving each file a distinct purpose, I was able to make my code more digestible and ensure each function contributed to the overall goals of the project. This approach made me a more attentive programmer, avoiding the redundancy and bloat that can occur when all the code is lumped into one file.