Software: Python
Primary goals:
1. Understand traffic patterns in Arizona, Nevada, and California
Busiest times to travel, periods with the most delays, and flight status statistics by location
Purpose: inform consumers of the best times and locations for travel
2. Analyze individual airline performance
Time periods with the greatest delays and ability of an airline to have flights arrive on time
Purpose: help consumers choose the best airline based on the fewest delays
With information to predict delays, it may be possible to mitigate the financial burden of flight delays for passengers, airports, and airlines simultaneously.
Approach:
1. Import necessary modules (pandas, matplotlib, datetime)
2. Load data from CSV and preview
3. Determine data types of each column
4. Convert columns to suitable data types as needed
5. Find null values in dataset and handle accordingly
6. Explore adjusted dataset
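The steps above might look like the following minimal sketch. The inline CSV sample and its values are hypothetical, and the target types (datetime for FL_DATE, category for the code columns, timedelta for the delay and duration columns) are assumptions for illustration; DEP_TIME and ARR_TIME are simply read as strings here.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
csv_text = """FL_DATE,CARRIER_CODE,TAIL_NUM,FL_NUM,ORIGIN,ORIGIN_ST,DEST,DEST_ST,DEP_TIME,DEP_DELAY,ARR_TIME,ARR_DELAY,ELAPSED_TIME,DISTANCE
2019-01-01,WN,N7875A,100,LAS,NV,LAX,CA,0830,5,0945,-2,75,236
2019-01-01,AA,N130AN,200,LAX,CA,PHX,AZ,1215,20,1440,18,85,370
2019-01-02,WN,N7875A,101,PHX,AZ,LAS,NV,,,,,,255
"""

# Step 2: load and preview. Reading the clock times as strings keeps leading zeros.
df = pd.read_csv(io.StringIO(csv_text), dtype={"DEP_TIME": str, "ARR_TIME": str})
print(df.head())

# Step 3: inspect the inferred data types.
print(df.dtypes)

# Step 4: convert columns to more suitable types (assumed choices).
df["FL_DATE"] = pd.to_datetime(df["FL_DATE"])
for col in ["CARRIER_CODE", "ORIGIN", "ORIGIN_ST", "DEST", "DEST_ST"]:
    df[col] = df[col].astype("category")
for col in ["DEP_DELAY", "ARR_DELAY", "ELAPSED_TIME"]:
    df[col] = pd.to_timedelta(df[col], unit="min")

# Step 5: count null values per column.
null_counts = df.isna().sum()
print(null_counts)
```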
Summary:
Columns that must be changed are FL_DATE, DEP_TIME, ARR_TIME, CARRIER_CODE, ORIGIN, ORIGIN_ST, DEST, DEST_ST, DEP_DELAY, ARR_DELAY, and ELAPSED_TIME. The columns that may remain the same are TAIL_NUM, FL_NUM, and DISTANCE.
~1.5% of all data has missing values, with TAIL_NUM having the fewest. NaN values are not spread independently across columns: if there is a NaN value in a column, there will also be a NaN value in ARR_DELAY and ARR_TIME (likely due to cancelled or diverted flights).
We will drop the rows in which both DEP_TIME and ARR_TIME are null, as these rows do not tell us enough about the departure and arrival status of the aircraft.
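A minimal sketch of that drop, using a hypothetical four-row frame. The key detail is `how="all"`, which removes a row only when every column listed in `subset` is null, so a row missing just ARR_TIME is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 0 complete, row 1 missing arrival data only,
# rows 2 and 3 missing both departure and arrival times.
df = pd.DataFrame({
    "TAIL_NUM": ["N1", "N2", "N3", None],
    "DEP_TIME": ["0830", "1215", None, None],
    "ARR_TIME": ["0945", None, None, None],
    "ARR_DELAY": [-2.0, np.nan, np.nan, np.nan],
})

# Share of rows with at least one missing value.
missing_share = df.isna().any(axis=1).mean()

# Drop rows only when BOTH DEP_TIME and ARR_TIME are null.
cleaned = df.dropna(subset=["DEP_TIME", "ARR_TIME"], how="all")
print(cleaned)
```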
The data can allow us to find anomalies or trends within the aviation industry. For example, the data allow us to see the dates with the highest delay times, and with the proper context, we can gain insight into why.
The data also allow us to test expected trends. For example, we would expect flights with longer departure delays to have longer arrival delays, and a scatter plot of the two columns can show whether this holds.
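Both ideas can be checked numerically before plotting. The sketch below uses hypothetical values and assumes delays are stored as plain minutes; it finds the date with the highest mean departure delay and the correlation between departure and arrival delays.

```python
import pandas as pd

# Hypothetical sample with delays in minutes.
flights = pd.DataFrame({
    "FL_DATE": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"],
    "DEP_DELAY": [0, 10, 20, 30],
    "ARR_DELAY": [-5, 8, 22, 25],
})

# Date with the highest average departure delay.
mean_delay_by_date = flights.groupby("FL_DATE")["DEP_DELAY"].mean()
worst_date = mean_delay_by_date.idxmax()

# Pearson correlation between departure and arrival delays.
delay_corr = flights["DEP_DELAY"].corr(flights["ARR_DELAY"])
print(worst_date, round(delay_corr, 3))
```

With matplotlib installed, `flights.plot.scatter(x="DEP_DELAY", y="ARR_DELAY")` would draw the corresponding scatter plot.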
Finally, we can explore a large amount of data throughout the entire country, displaying the total number of flights originating from each state to show air traffic, allowing for further insights into the best and worst states to travel to based on delays.
Approach:
Compare number of flights in each region using .groupby(), .size(), and .sort_values()
Plot dataframe as a bar chart with departure state on x-axis and number of flights on the y-axis
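A minimal sketch of the counting step, on a hypothetical sample. The `regions` list and the filtering with `.isin()` are assumptions about how the three states were selected.

```python
import pandas as pd

# Hypothetical sample of departure states.
flights = pd.DataFrame({"ORIGIN_ST": ["CA", "CA", "CA", "AZ", "NV", "NV", "TX"]})

regions = ["AZ", "CA", "NV"]
in_region = flights[flights["ORIGIN_ST"].isin(regions)]

# Count flights per departure state, largest first.
flights_by_state = (
    in_region.groupby("ORIGIN_ST").size().sort_values(ascending=False)
)
print(flights_by_state)
```

With matplotlib installed, `flights_by_state.plot(kind="bar")` puts the departure state on the x-axis and the number of flights on the y-axis.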
For flight departures, California has the most air traffic of the three regions. With almost 800,000 flights, CA has nearly 4 times the flight volume of Arizona, and roughly 4.5 times that of Nevada.
Approach:
Define function to print the number of flights meeting parameters for ORIGIN_ST/DEST_ST, ORIGIN/DEST, state abbreviation, and flight type (Departure/Arrival) and plot values
Run function for CA, AZ, NV departures and arrivals
Examine total number of flights departing from and arriving at region airports using .value_counts()
Analyze results
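One way such a function could be written is sketched below. The name `top_airports`, its signature, and the sample rows are hypothetical; the plotting step is left out so the example only needs pandas.

```python
import pandas as pd

def top_airports(df, state_col, airport_col, state, flight_type, n=5):
    """Print and return the n most common airports for flights matching `state`.

    state_col   -- "ORIGIN_ST" or "DEST_ST"
    airport_col -- "ORIGIN" or "DEST"
    flight_type -- label used in the printout ("Departure" or "Arrival")
    """
    subset = df[df[state_col] == state]
    counts = subset[airport_col].value_counts().head(n)
    print(f"{flight_type} airports for {state} ({len(subset)} flights)")
    print(counts)
    return counts

# Hypothetical sample flights.
flights = pd.DataFrame({
    "ORIGIN_ST": ["CA", "CA", "CA", "AZ", "NV"],
    "ORIGIN":    ["LAX", "LAX", "SFO", "PHX", "LAS"],
    "DEST_ST":   ["AZ", "NV", "AZ", "CA", "CA"],
    "DEST":      ["PHX", "LAS", "PHX", "LAX", "LAX"],
})

# Departure airports within California.
ca_departures = top_airports(flights, "ORIGIN_ST", "ORIGIN", "CA", "Departure")
# Arrival airports for flights that departed from California.
ca_arrivals = top_airports(flights, "ORIGIN_ST", "DEST", "CA", "Arrival")
```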
Los Angeles (LAX) was the most common departure airport with around 216,000 departing flights. This was about 30% more flights than the second most common departure airport in California, San Francisco (SFO). For arrivals of flights departing from California, Phoenix (PHX) was the most common airport, with around 50,000 arriving flights. Las Vegas (LAS) and San Francisco (SFO) were a close second and third.
The most common departure airport in Arizona was Phoenix Sky Harbor (PHX). With 172,578 departing flights in 2019, PHX had about 9 times the flight departure volume of the second most common Arizona airport, Tucson (TUS). As for the arrivals, for flights departing from Arizona, no additional AZ airport made the most common arrival list. Instead, Denver (DEN) led the arrivals from Arizona with 10,197 flights, and LAX was a close second with around 9,600.
Nevada had only three departure airports in this dataset, the most common being Las Vegas (LAS), with 161,620 departing flights in 2019. The second most common departure airport was Reno (RNO) with about 20,000 departing flights. As for arriving flights that departed from Nevada, the most common airport was Los Angeles (LAX) with 13,834 flights, while San Francisco (SFO), Denver (DEN), Phoenix (PHX), and Seattle-Tacoma (SEA) followed in that order, ranging from about 9,200 down to 7,100 flights.
In this initial analysis of the data, a few airports stand out as the most common. These include LAX and SFO for California, PHX for Arizona, LAS for Nevada, SEA for Washington, and DEN for Colorado.
Approach:
Calculate proportion of flights per operator using .groupby(), .size(), and .sort_values()
Plot proportions of top 10 operators using a bar chart
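A minimal sketch of the proportion calculation, using a hypothetical ten-flight sample so the percentages are easy to check by hand.

```python
import pandas as pd

# Hypothetical sample: 4 WN, 2 AA, 2 OO, 1 DL, 1 UA flights.
flights = pd.DataFrame({
    "CARRIER_CODE": ["WN"] * 4 + ["AA"] * 2 + ["OO"] * 2 + ["DL", "UA"],
})

# Percentage of all flights flown by each operator, largest first.
carrier_share = (
    flights.groupby("CARRIER_CODE").size() * 100 / len(flights)
).sort_values(ascending=False)

top10 = carrier_share.head(10)
print(top10)
```

With matplotlib installed, `top10.plot(kind="bar")` draws the bar chart of proportions.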
Southwest (WN) made up 29.15% of all flights to and from these three regions in 2019, nearly doubling the number of flights by second-place operator American Airlines (AA), which made up 15.07% of all flights. SkyWest operates regional flights for American Eagle, Delta Connection, United Express, and Alaska Airlines, so the effective shares of those brands may be higher than their own carrier codes suggest. While American Airlines, United Airlines, Delta Air Lines, and JetBlue Airways are still considered some of the larger airlines in the United States, they are not the main carriers in the West, or at least not in the three regions in this dataset.
Approach:
Create variable for top 10 carriers using .groupby(), .size(), and .sort_values()
Determine both average departure delays and average arrival delays by airline using .mean() and pd.Timedelta(), and plot on bar charts
Determine both percentage of departure delays and percentage of arrival delays by airline using .groupby() and pd.Timedelta(), and plot on bar charts
Filter data by most common airports and plot percentage of delayed departures and delayed arrivals by operator for each airport
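The two delay metrics can be sketched as follows. Two assumptions are made here that the write-up does not state: a flight counts as delayed when its delay is greater than zero, and the average delay is taken over delayed flights only. The sample values are hypothetical and chosen so that one carrier has a high share of short delays while another has a low share of long ones.

```python
import pandas as pd

# Hypothetical sample; delays given in minutes, then converted to timedeltas.
flights = pd.DataFrame({
    "CARRIER_CODE": ["F9", "F9", "HA", "HA", "HA", "YX", "YX", "YX", "YX"],
    "DEP_DELAY_MIN": [30, -5, 4, 6, -1, 40, -3, -2, 0],
})
flights["DEP_DELAY"] = pd.to_timedelta(flights["DEP_DELAY_MIN"], unit="min")

# Assumption: "delayed" means a delay strictly greater than zero.
is_delayed = flights["DEP_DELAY"] > pd.Timedelta(0)

# Percentage of each carrier's flights that were delayed.
pct_delayed = is_delayed.groupby(flights["CARRIER_CODE"]).mean() * 100

# Assumption: average delay is computed over delayed flights only.
avg_delay = flights.loc[is_delayed].groupby("CARRIER_CODE")["DEP_DELAY"].mean()

print(pct_delayed.round(2))
print(avg_delay)
```

In this sample HA has the largest share of delayed departures but the shortest average delay, while YX shows the opposite pattern, which is why the two metrics are worth reporting separately.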
For departures, Frontier Airlines (F9) has the highest average delay at almost 20 minutes, followed by Republic Airways (YX) with 18.5 minutes. JetBlue Airways (B6) is third with an average delay of 16:28. Alaska Airlines (AS) has the shortest average departure delay at 10:43, followed by Delta Air Lines (DL) at 10:48 and Hawaiian Airlines (HA) at 11:02.
For arrivals, Frontier Airlines (F9) remains the operator with the worst record, with an average delay of 19:00. Republic Airways (YX) and JetBlue Airways (B6) once more follow in order, with average arrival delays of 18:22 and 16:03 respectively. The operators with the lowest average arrival delays are Envoy Air (MQ) at 10:34, Delta Air Lines (DL) at 10:30, and Southwest Airlines (WN) at 9:36.
While Hawaiian Airlines (HA) has some of the lowest average departure and arrival delays, they have the greatest percentage of delayed departures and arrivals (48.04% and 50.33% respectively). Meanwhile, Republic Airways (YX) has some of the greatest average departure and arrival delays, but the smallest percentage of departure and arrival delays (25.32% and 27.27% respectively). Frontier (F9) has both large average delays and a large percentage of delayed flights (43.26% of departures and 42.02% of arrivals). Most operators seem to have delay percentages that fall in the range of 30-40 percent.
Approach:
Find top 10 operators with best record for on-time departures using pd.Timedelta(), .groupby(), and .size() and print results
Plot top 10 operators on bar chart with operator on x-axis and % of on-time departures on y-axis
Repeat process to find top 10 operators with best record for on-time arrivals
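A sketch of the on-time calculation that can be reused for departures and arrivals. The function name `on_time_pct`, the sample data, and the definition of on-time as a delay of zero or less are assumptions for illustration.

```python
import pandas as pd

def on_time_pct(df, delay_col, n=10):
    """Return the n operators with the highest percentage of on-time flights."""
    # Assumption: a flight is on time when its delay is zero or negative.
    on_time = df[delay_col] <= pd.Timedelta(0)
    total = df.groupby("CARRIER_CODE").size()
    punctual = df[on_time].groupby("CARRIER_CODE").size()
    # Operators with no on-time flights would otherwise be missing.
    punctual = punctual.reindex(total.index, fill_value=0)
    return (punctual * 100 / total).sort_values(ascending=False).head(n)

# Hypothetical sample; delays in minutes converted to timedeltas.
flights = pd.DataFrame({
    "CARRIER_CODE": ["YX"] * 4 + ["OO"] * 4,
    "DEP_DELAY": pd.to_timedelta([-1, 0, 5, -2, 0, 10, -4, 12], unit="min"),
    "ARR_DELAY": pd.to_timedelta([-3, 2, 8, -1, 0, 15, -2, 9], unit="min"),
})

dep_on_time = on_time_pct(flights, "DEP_DELAY")
arr_on_time = on_time_pct(flights, "ARR_DELAY")

# Change from departures to arrivals, in percentage points.
change = arr_on_time - dep_on_time
print(dep_on_time, arr_on_time, change, sep="\n")
```

Subtracting the two series aligns them by operator code, so the result is the drop (or gain) in on-time rate between departure and arrival for each operator.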
The top three operators for on-time departures are also the top three operators for on-time arrivals, but this pattern breaks for the rest of the top 10 most punctual operators. The operator with the greatest percentage of on-time flights is Republic Airways (YX), which operates flights for Delta Connection, United Express, and American Eagle, with 74.68% on-time departures and 72.73% on-time arrivals (a drop of 1.95 percentage points). The second most punctual operator is SkyWest Airlines (OO), with 74.40% on-time departures and 69.70% on-time arrivals (a drop of 4.70 percentage points). SkyWest operates flights for American Eagle, Delta Connection, United Express, and Alaska Airlines.
On-time departures do not necessarily translate into on-time arrivals, which may be due to longer taxiing times, airport congestion once a plane departs the gate, or air traffic control delays.
The top three most punctual operators are all operators for major airlines and not the major airlines themselves. These operators may focus only on executing flights while main airlines handle other operations.
Approach:
Find airlines with most overall flight time using .sum() and .groupby()
Calculate total flight hours per month of top 10 operators using .dt.month, .isin(), pd.Timedelta(), and .sum()
Find total flight hours for each month by airline using .dt.month, pd.Timedelta(), .astype() and .sum()
Plot monthly flight hours by airline using matplotlib with month on x-axis and flight hours on y-axis
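The monthly aggregation might be sketched as below. The sample rows are hypothetical, the top 2 operators stand in for the top 10, and dividing by `pd.Timedelta(hours=1)` is one assumed way to turn a timedelta column into decimal hours.

```python
import pandas as pd

# Hypothetical sample; ELAPSED_TIME given in minutes.
flights = pd.DataFrame({
    "FL_DATE": pd.to_datetime(
        ["2019-01-05", "2019-01-20", "2019-02-03", "2019-02-10", "2019-02-11"]
    ),
    "CARRIER_CODE": ["WN", "AA", "WN", "AA", "HA"],
    "ELAPSED_TIME": pd.to_timedelta([120, 60, 90, 180, 30], unit="min"),
})

# Operators with the most overall flight time (top 2 here instead of top 10).
top_carriers = (
    flights.groupby("CARRIER_CODE")["ELAPSED_TIME"]
    .sum()
    .sort_values(ascending=False)
    .head(2)
    .index
)
top = flights[flights["CARRIER_CODE"].isin(top_carriers)]

# Convert timedeltas to decimal hours, then total by month and operator.
hours = top["ELAPSED_TIME"] / pd.Timedelta(hours=1)
monthly_hours = (
    hours.groupby([top["FL_DATE"].dt.month, top["CARRIER_CODE"]])
    .sum()
    .unstack(fill_value=0)
)
print(monthly_hours)
```

With matplotlib installed, `monthly_hours.plot()` draws one line per airline with the month on the x-axis and flight hours on the y-axis.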
The operators tend to keep the same order from month to month: an operator that logs more flight hours than another in one month generally does so in every other month as well. This pattern is only broken by SkyWest Airlines and Alaska Airlines in December. The lowest number of flight hours occurs in February (382,000 hours), with dips also occurring in September and November. There are upward spikes in March, October, and December, presumably reflecting travel for spring break, winter holidays, and other vacation times. The month with the greatest number of flight hours among these 10 operators is August, with 470,000 hours.
Approach:
Find top three aircraft from top three operators and assign variables to each aircraft registration using .groupby(), .size(), .sort_values(), and .index
Define function to convert decimal time to M:SS format
Define function to return statistics about aircraft with given registration, dataframe, and location type (DEP or ARR)
Define function to return departures/arrivals statistics with given registration and airline, and plot bar chart with most common airports for the registration
Run functions for all three aircraft and compare results
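A compact sketch of the helper functions described above. The names `to_m_ss` and `aircraft_stats`, the sample rows, and the choices to store delays as decimal minutes and to average over delayed flights only are assumptions for illustration. The time formatter assumes non-negative input.

```python
import pandas as pd

def to_m_ss(decimal_minutes):
    """Convert non-negative decimal minutes (e.g. 11.5) to an M:SS string."""
    total_seconds = int(round(float(decimal_minutes) * 60))
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"

def aircraft_stats(df, tail_num, delay_col):
    """Summarise delays for one registration; delay_col holds minutes."""
    flights = df[df["TAIL_NUM"] == tail_num]
    delayed = flights[flights[delay_col] > 0]
    return {
        "flights": len(flights),
        "delayed": len(delayed),
        "pct_delayed": round(len(delayed) * 100 / len(flights), 2),
        "avg_delay": to_m_ss(delayed[delay_col].mean()) if len(delayed) else "0:00",
    }

# Hypothetical sample flights.
flights = pd.DataFrame({
    "CARRIER_CODE": ["WN", "WN", "WN", "WN", "AA"],
    "TAIL_NUM": ["N1", "N1", "N1", "N2", "N3"],
    "DEP_DELAY": [10.0, 13.0, -2.0, 0.0, 5.0],
})

# Most frequently flown aircraft for one operator.
wn = flights[flights["CARRIER_CODE"] == "WN"]
top_tail = wn.groupby("TAIL_NUM").size().sort_values(ascending=False).index[0]

stats = aircraft_stats(flights, top_tail, "DEP_DELAY")
print(top_tail, stats)
```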
Aircraft #1:
The aircraft (N7875A) flew 1067 flights to or from locations in CA, NV, or AZ, an average of 2.92 daily, and had departure delays on 499 of them (46.77% of all flights), with an average departure delay of 11.5 minutes. However, it had only 381 arrival delays (35.71%), with an average arrival delay of 9:41. N7875A most often traveled to Las Vegas (LAS), with 132 flights. Its longest departure delays were at New Orleans (MSY), at 112.5 minutes on average, and its worst arrival airport was Birmingham (BHM), at 52.5 minutes. The best departure airports were Orlando (MCO), Midland (MAF), and Oklahoma City (OKC), with 0-minute delays. The best arrival airports were Baltimore (BWI), Detroit (DTW), and Little Rock (LIT), also with 0-minute delays.
Aircraft #2:
The aircraft (N130AN) flew 957 flights, or about 2.62 daily, of which 348 departures (36.36%) and 308 arrivals (32.18%) were delayed. The average departure delay was 12 minutes, while the average arrival delay was 11:46. N130AN was centered on Los Angeles (LAX), flying there 471 times, nearly 5 times as often as to the second most common airport, Honolulu (HNL). N130AN had its worst departure delays from Washington, D.C. (DCA), with a 15.83-minute average. Sacramento (SMF) was the worst arrival airport, with an average arrival delay of 216 minutes. Sacramento and Austin (AUS) had the best average departure delay, 0 minutes. Washington, D.C. and Philadelphia (PHL) had the shortest average arrival delays, both 0 minutes as well.
Aircraft #3:
This aircraft flew 799 flights to or from CA, AZ, or NV, for an average of 2.13 daily. Of those, 191 (23.9%) were delayed at departure, while 232 (29.04%) were delayed upon arrival, making this the only aircraft of the three to have more arrival delays than departure delays. The average departure delay was 10:18, while the average arrival delay was 11:02. Its most common airport was San Francisco (SFO), where it flew 213 times, followed by Los Angeles (LAX). Seattle-Tacoma (SEA) had the worst average departure and arrival delays, at 28.08 minutes and 33.6 minutes respectively. The best departure airports were Orlando (MCO), Ontario (ONT), and Kansas City (MCI), and the best arrival airports were Salt Lake City (SLC), Nashville (BNA), and Baltimore (BWI), all with 0-minute delays.
This project was the first group project we completed during an asynchronous course over the summer. Because the course was virtual, it was a little more difficult to coordinate team efforts and meeting times; however, I think our group’s ability to clearly divide tasks between members helped us complete the project by the deadline with a high-quality result.
As a programmer, this project helped build my exploratory data analysis, data cleaning, and visualization skills. Being tasked with creating so many unique visualizations with different focuses in the dataset required us to be intentional with our visualization approach and data to include. I also gained practice in writing data visualization descriptions and interpretations, an invaluable skill in real-world data practices. With this in mind, I am proud of our group’s final product and efforts and believe this project strengthened my ability to clean, filter, and interpret data.