Air travel has become a mainstream mode of transportation. Demand for business and leisure travel is high, so flight delays and cancellations frustrate travelers and are costly to both airlines and passengers (Ball et al., 2010). Over the past ten years, more than 20% of the commercial flights operated by the 12 biggest U.S. airlines (e.g., United Airlines and American Airlines) were delayed. Punctuality is an issue for all major carriers, though some have more serious delays than others: through 2012, Frontier Airlines flights were on time just 75% of the time, and last year more than a quarter of its flights were delayed (popularmechanics.com).
Flights are delayed for various reasons. For example, about 30% of all tardy flights between June 2015 and June 2016 can be blamed directly on inclement weather (businessinsider.com). In many other cases, historical data suggest that flights are more likely to be delayed because of the carrier, and, as noted above, the airline itself is an apparent predictor of whether a flight is delayed. Beyond these reasons, many other factors allow us to gauge the likelihood and pattern of delays. The aim of this project is to use relatively large historical datasets from 2012 to 2016 to explore the factors and patterns associated with U.S. flight delays and to compare them from different perspectives.
Large amounts of detailed industry and government data are now published on the internet, allowing us to investigate airline statistics. With the data in hand, we can look at the raw flight records and explore which factors are associated with changes in flight delay between 2012 and 2016. We assess the data from three aspects: major causes of flight delay, time pattern of flight delay, and the average delay time per delayed flight of U.S. domestic airlines.
For the major causes of flight delay, we want to evaluate which causes occur most frequently over the five years. For the time pattern, we want to observe how flight delay varies over time, and then compare the average delay time per delayed flight across U.S. airlines. Additionally, we want to explore whether there is a significant difference in the mean delay minutes caused by late aircraft delay between 2012 and 2016.
We used the “Airline On-Time Performance Data” database from the Bureau of Transportation Statistics (BTS). The database can be accessed from here. Data can be downloaded by clicking “Download” under “Data tools”.
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.
On the database download page, we checked the boxes for the variables we were interested in, made selections from the “Filter Year” and “Filter Period” drop-down lists, and downloaded the data month by month from January 2012 to December 2016 (60 csv files in total). After downloading, unzip each zip file and place all unzipped files in the same folder under the project repository. We merged the 60 csv files by column name (using the function `rbind()`).
Please note that there wasn’t an API to speed this process up or make it reproducible.
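
Although the download itself was manual, the merge step can be scripted. The following is a minimal base-R sketch rather than the project's exact code: the helper name `merge_monthly_csvs()` and the toy file names are hypothetical, and it assumes every monthly file has the same column names.

```r
# Hypothetical helper: read every CSV in a folder and stack the tables by column name.
merge_monthly_csvs <- function(path) {
  files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)
  tables <- lapply(files, read.csv, stringsAsFactors = FALSE)
  # Fail early if any monthly file has a different set of columns
  same_cols <- vapply(tables,
                      function(x) identical(names(x), names(tables[[1]])),
                      logical(1))
  stopifnot(all(same_cols))
  do.call(rbind, tables)
}

# Toy demonstration: two tiny "monthly" files in a fresh temporary folder
tmp <- tempfile("delay_demo")
dir.create(tmp)
write.csv(data.frame(YEAR = 2012, MONTH = 1, ARR_DELAY = c(5, 20)),
          file.path(tmp, "2012_01.csv"), row.names = FALSE)
write.csv(data.frame(YEAR = 2012, MONTH = 2, ARR_DELAY = c(-3, 40)),
          file.path(tmp, "2012_02.csv"), row.names = FALSE)

merged <- merge_monthly_csvs(tmp)
nrow(merged)  # 4 rows: two from each file
```

Checking the column names before binding guards against a month that was downloaded with a different set of variable boxes ticked.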
We selected 23 of the 109 available variables that relate to our initial questions, covering January 2012 to December 2016. The variables we are interested in fall into three categories based on variable type:
In indicator variables, we have:
* `dep_del15`: Departure Delay Indicator, 15 Minutes or More (1=Yes)
* `arr_del15`: Arrival Delay Indicator, 15 Minutes or More (1=Yes)
* `cancelled`: Cancelled Flight Indicator (1=Yes)
* `diverted`: Diverted Flight Indicator (1=Yes)
In continuous variables, we have:
* `dep_time`: Actual Departure Time (local time: hhmm).
* `dep_delay`: Difference in minutes between scheduled and actual departure time. Early departures show negative numbers.
* `arr_time`: Actual Arrival Time (local time: hhmm).
* `arr_delay`: Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers.
* `distance`: Distance between airports (miles).
* `carrier_delay`: Carrier Delay, in Minutes. Carrier Delay such as: Aircraft cleaning, Maintenance, Late crew.
* `weather_delay`: Weather Delay, in Minutes. Weather Delay such as: Below minimum conditions, Thunder Storm, Tornado.
* `nas_delay`: National Air System Delay, in Minutes. National Air System Delay such as: Air Traffic Control (ATC), Bird strikes, Closed Runways.
* `security_delay`: Security Delay, in Minutes. Security Delay such as: Lines at screening area that exceed standard time, Bomb threat, Inoperative screening equipment - TSA.
* `late_aircraft_delay`: Late Arriving Aircraft Delay, in Minutes. Late Arriving Aircraft means a previous flight with the same aircraft arrived late which caused the present flight to depart late.
In categorical variables, we have:
* `year`: Year
* `month`: Month
* `day_of_month`: Day of Month
* `day_of_week`: Day of Week
* `unique_carrier`: Unique Carrier Code (Airline). (Carrier Code for Airline can be found here)
* `fl_num`: Flight Number
* `origin`: Origin Airport
* `dest`: Destination Airport
* `cancellation_code`: Specifies The Reason For Cancellation. (Cancellation Code Description can be found here)
A flight is counted as “on time” if it operated less than 15 minutes later than the scheduled time shown in the carriers’ Computerized Reservations Systems (CRS). Arrival performance is based on arrival at the gate. Departure performance is based on departure from the gate. In all of our analysis, we apply `arr_del15` = 1 as the indication of a flight delay. More information can be found here.
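
As a small illustration of the on-time rule, the sketch below derives a 15-minute indicator from arrival delay minutes. The values are made up, and the rule (delayed when the arrival delay is 15 minutes or more) is our reading of the variable description; the real `arr_del15` column comes directly from BTS.

```r
# Made-up arrival delays in minutes; negative values are early arrivals
arr_delay <- c(-12, 0, 14, 15, 47, NA)

# 1 = arrived 15 minutes or more late, 0 = on time, NA = no arrival recorded
arr_del15 <- as.integer(arr_delay >= 15)
arr_del15  # 0 0 0 1 1 NA
```

A flight that arrives 14 minutes late still counts as on time, while one that arrives exactly 15 minutes late counts as delayed.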
Furthermore, since we only select the flights that were delayed, all calculations and analysis performed in this study are among the data of delayed flights.
library(data.table)
library(janitor)
library(tidyr)
library(ggplot2)
library(plotly)
library(stringr)
library(forcats)
library(tidyverse)
library(DT)
load_data = function(path){
  files = dir(path, pattern = "\\.csv$", full.names = TRUE) ## list every csv file in the data folder
  tables = lapply(files, fread) ## read each monthly file
  do.call(rbind, tables) ## combine the 60 csv files by column name using rbind()
}
delay_data <- load_data("/srv/data/cumc/flightdelay/delay_data_5yrs") %>% ## To reproduce this project, change the path to "./delay_data_5yrs"
  clean_names() %>%
  mutate(month = factor(month, labels = month.name), ## change month from number to month.name
         dep_time = as.character(dep_time),
         dep_time = substr(dep_time, 1, 2)) ## keep only the hour, so every minute of an hour maps to one data point, e.g., 1:00-1:59 -> 1:00
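
The hour-binning step above relies on `dep_time` being stored as a four-character hhmm string, which the `str()` output (for example `"08"`) suggests is the case here. The sketch below uses made-up values to show the behaviour, plus a zero-padding step that would be needed under the assumption that the column had been read as an integer instead.

```r
# Made-up departure times stored as hhmm strings
dep_time <- c("0800", "1735", "2400", "0005")
dep_hour <- substr(dep_time, 1, 2)
dep_hour  # "08" "17" "24" "00"

# If the same column were read as integers, the leading zeros would be lost,
# so pad back to four digits before taking the first two characters.
dep_time_int <- c(800L, 1735L, 5L)
dep_hour_int <- substr(sprintf("%04d", dep_time_int), 1, 2)
dep_hour_int  # "08" "17" "00"
```

Without the padding, `substr("800", 1, 2)` would return `"80"`, which is not a valid hour.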
There are 29722792 observations of 23 variables in the `delay_data` dataset (the `str()` output below also lists `v24`, an extra column that contains only NAs). Each observation represents one flight. Domestic flights operated by large air carriers from January 2012 to December 2016 are included. The variables fall into three categories based on variable type (continuous, categorical, and indicator) and include summary information on flight performance and descriptive information about flights, such as `arr_delay`, `day_of_month`, etc.
Descriptive statistics for the continuous variables are in the table below.
str(delay_data)
## 'data.frame': 29722792 obs. of 24 variables:
## $ year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ month : Factor w/ 12 levels "January","February",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ day_of_month : int 28 28 28 28 28 28 28 28 28 28 ...
## $ day_of_week : int 6 6 6 6 6 6 6 6 6 6 ...
## $ unique_carrier : chr "DL" "DL" "DL" "DL" ...
## $ fl_num : chr "115" "116" "117" "120" ...
## $ origin : chr "ATL" "ORF" "ATL" "JFK" ...
## $ dest : chr "LAX" "ATL" "ORF" "LAX" ...
## $ dep_time : chr "17" "14" "17" "08" ...
## $ dep_delay : num -1 -5 0 -6 0 0 6 -1 -4 14 ...
## $ dep_del15 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ arr_time : chr "1935" "1623" "1859" "1219" ...
## $ arr_delay : num -25 -12 -15 -13 -19 -18 -11 -2 -23 25 ...
## $ arr_del15 : num 0 0 0 0 0 0 0 0 0 1 ...
## $ cancelled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cancellation_code : chr "" "" "" "" ...
## $ diverted : num 0 0 0 0 0 0 0 0 0 0 ...
## $ distance : num 1947 516 516 2475 2475 ...
## $ carrier_delay : num NA NA NA NA NA NA NA NA NA 14 ...
## $ weather_delay : num NA NA NA NA NA NA NA NA NA 0 ...
## $ nas_delay : num NA NA NA NA NA NA NA NA NA 11 ...
## $ security_delay : num NA NA NA NA NA NA NA NA NA 0 ...
## $ late_aircraft_delay: num NA NA NA NA NA NA NA NA NA 0 ...
## $ v24 : logi NA NA NA NA NA NA ...
delay_data[,c(9, 10, 12, 13, 18, 19:23)] %>% ## descriptive analysis
summary()
## dep_time dep_delay arr_time arr_delay
## Length:29722792 Min. :-251.0 Length:29722792 Min. :-411.0
## Class :character 1st Qu.: -5.0 Class :character 1st Qu.: -13.0
## Mode :character Median : -2.0 Mode :character Median : -4.0
## Mean : 9.3 Mean : 4.9
## 3rd Qu.: 7.0 3rd Qu.: 8.0
## Max. :2402.0 Max. :2444.0
## NA's :439755 NA's :527570
## distance carrier_delay weather_delay nas_delay
## Min. : 17.0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 361.0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 628.0 Median : 1 Median : 0 Median : 2
## Mean : 799.5 Mean : 18 Mean : 3 Mean : 14
## 3rd Qu.:1033.0 3rd Qu.: 18 3rd Qu.: 0 3rd Qu.: 18
## Max. :4983.0 Max. :2402 Max. :1615 Max. :1446
## NA's :24170151 NA's :24170151 NA's :24170151
## security_delay late_aircraft_delay
## Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0
## Median : 0 Median : 5
## Mean : 0 Mean : 24
## 3rd Qu.: 0 3rd Qu.: 30
## Max. :717 Max. :1484
## NA's :24170151 NA's :24170151
The dataset has no unexplained missing values; every NA is meaningful. For cancelled flights (`cancelled` = 1), no delay performance can be recorded, so NAs appear in `dep_delay`, `dep_del15`, `arr_delay`, `arr_del15`, `carrier_delay`, `weather_delay`, `nas_delay`, `security_delay`, and `late_aircraft_delay`. There are 457603 cancelled flights in the dataset. For flights that arrived on time (`arr_del15` = 0), there is no delay to attribute to a cause, so NAs appear in `carrier_delay`, `weather_delay`, `nas_delay`, `security_delay`, and `late_aircraft_delay`. For diverted flights (`diverted` = 1), which were routed from their original destination to a new one, NAs appear in `arr_delay`, `arr_del15`, `carrier_delay`, `weather_delay`, `nas_delay`, `security_delay`, and `late_aircraft_delay`. There are 69967 diverted flights in the dataset. Together, cancelled and diverted flights account for the 527570 NAs in `arr_delay` (457603 + 69967).
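
The NA structure described above can be checked mechanically. The sketch below uses a four-row toy data frame (one on-time, one delayed, one cancelled, one diverted flight; all values are made up) and verifies that a cause-of-delay value is recorded exactly when the flight counts as delayed.

```r
# Toy flights: on time, delayed, cancelled, diverted (made-up values)
flights <- data.frame(
  cancelled     = c(0, 0, 1, 0),
  diverted      = c(0, 0, 0, 1),
  arr_del15     = c(0, 1, NA, NA),
  carrier_delay = c(NA, 14, NA, NA),
  nas_delay     = c(NA, 11, NA, NA)
)

# A cause is recorded only when the flight arrived 15 minutes or more late
cause_recorded <- !is.na(flights$carrier_delay)
is_delayed <- !is.na(flights$arr_del15) & flights$arr_del15 == 1

all(cause_recorded == is_delayed)  # TRUE for this toy data
```

Running the same comparison on the full dataset would be one way to confirm that no cause-of-delay NA is unexplained.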
The following analysis addresses three aspects: major causes of flight delay, time pattern of flight delay, and the average delay time per delayed flight of the U.S. domestic airlines.
Major Causes of Flight Delay
general_bargraph_data = delay_data %>%
filter(arr_del15 == 1) %>% # keep flights that arrived 15 minutes or more late (our criterion for a delayed flight)
select(carrier_delay, weather_delay, nas_delay, security_delay, late_aircraft_delay)
# We want to count delayed flights for each combination of causes, not total delay minutes, so we recode the delay minutes of each cause to 1 ("Yes", the cause contributed) or 0 ("No"). The output table illustrates the result.
general_bargraph_data$carrier_delay[general_bargraph_data$carrier_delay != 0] = 1
general_bargraph_data$weather_delay[general_bargraph_data$weather_delay != 0] = 1
general_bargraph_data$nas_delay[general_bargraph_data$nas_delay != 0] = 1
general_bargraph_data$security_delay[general_bargraph_data$security_delay != 0] = 1
general_bargraph_data$late_aircraft_delay[general_bargraph_data$late_aircraft_delay != 0] = 1
general_bargraph_data = setDT(general_bargraph_data)[,list(Count = .N) ,names(general_bargraph_data)]
# This function builds the label for row i of general_bargraph_data: it joins the names of every delay cause flagged 1 ("Yes") with "/".
print_xaxis_1 <- function(i) {
  cause_labels <- c(carrier_delay = "carrier delay",
                    weather_delay = "weather delay",
                    nas_delay = "national aviation system delay",
                    security_delay = "security delay",
                    late_aircraft_delay = "late aircraft delay")
  # TRUE for each cause that is flagged in row i
  flagged <- vapply(names(cause_labels),
                    function(col) general_bargraph_data[[col]][i] == 1,
                    logical(1))
  # collapsing avoids the leading "/" that appeared when carrier delay was 0
  paste(cause_labels[flagged], collapse = "/")
}
# The dataset "general_bargraph_data" has 31 rows, one per combination of delay causes. We build a vector containing the name of each combination.
loop <- seq_len(nrow(general_bargraph_data))
xname <- map_chr(loop, print_xaxis_1)
# combine the names of the combination of delay causes to the combination data frame that created above by the observation numbers.
general_bargraph_data <- general_bargraph_data %>%
mutate(cause_of_delay = xname)
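# plot the ten most frequent combinations (sort by Count first so that head(10) keeps the largest ones)
cause_bar_graph <- general_bargraph_data %>%
arrange(desc(Count)) %>%
mutate(cause_of_delay = fct_reorder(cause_of_delay, Count, .desc = TRUE)) %>%
head(10) %>%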
ggplot(aes(x = cause_of_delay, y = Count, fill = cause_of_delay)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "The Total Count of Airline Delay with Different Causes",
x = "cause(s) of delay", y = "counts") +
theme(legend.background = element_rect(fill=alpha('white', 0.4)))
ggplotly(cause_bar_graph)
# The table lists all 31 combinations of major causes of flight delay from 2012 to 2016.
table_general_bargraph = general_bargraph_data %>%
datatable(class = "display")
table_general_bargraph
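
With five possible causes, a delayed flight can show any non-empty subset of them, which gives 2^5 - 1 = 31 combinations. The base-R sketch below counts combinations on a five-row toy data frame; the values are made up and the approach is an illustration of the idea, not the exact code used for the plot.

```r
# Toy delay minutes for five delayed flights (made-up values)
delays <- data.frame(
  carrier_delay       = c(14, 0, 0, 20, 0),
  weather_delay       = c(0, 0, 30, 0, 0),
  nas_delay           = c(11, 25, 0, 0, 18),
  security_delay      = c(0, 0, 0, 0, 0),
  late_aircraft_delay = c(0, 0, 0, 15, 0)
)
labels <- c("carrier delay", "weather delay", "national aviation system delay",
            "security delay", "late aircraft delay")

# Logical matrix: TRUE where a cause contributed any minutes
flags <- delays != 0

# Join the names of the contributing causes for each flight
combo <- apply(flags, 1, function(row) paste(labels[row], collapse = "/"))

# Count flights per combination, most frequent first
counts <- sort(table(combo), decreasing = TRUE)
counts
```

In this toy data the combination "national aviation system delay" appears twice and every other combination appears once.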
season_pattern <- delay_data %>%
select(c(1:5, 10, 11, 13, 14, 18:23)) %>%
filter(arr_del15 == 1)
agg1 = aggregate(carrier_delay ~ year + month, data = season_pattern, mean)
agg2 = aggregate(weather_delay ~ year + month, data = season_pattern, mean)
agg3 = aggregate(nas_delay ~ year + month, data = season_pattern, mean)
agg4 = aggregate(security_delay ~ year + month, data = season_pattern, mean)
agg5 = aggregate(late_aircraft_delay ~ year + month, data = season_pattern, mean)
agg = cbind(agg1, agg2[,3], agg3[,3], agg4[,3], agg5[,3]) # Combine the subsets
colnames(agg)[-c(1:3)] = colnames(season_pattern)[12:15] # rename the columns
#combine all the data from 2012-2016
agg12 = agg[seq(1,56,5),]
agg13 = agg[seq(2,57,5),]
agg14 = agg[seq(3,58,5),]
agg15 = agg[seq(4,59,5),]
agg16 = agg[seq(5,60,5),]
agg_sum = rbind(agg12,agg13,agg14,agg15,agg16)
head(agg_sum)
## year month carrier_delay weather_delay nas_delay security_delay
## 1 2012 January 16.80669 3.038543 13.88296 0.05226491
## 6 2012 February 17.89974 1.977148 11.60144 0.10147999
## 11 2012 March 17.74794 2.306027 12.00334 0.04572367
## 16 2012 April 18.67759 2.275727 10.13331 0.07437263
## 21 2012 May 17.75015 2.925335 13.00150 0.07726250
## 26 2012 June 18.95689 2.305828 12.42898 0.07533480
## late_aircraft_delay
## 1 20.52124
## 6 19.15220
## 11 22.75658
## 16 20.85180
## 21 22.01178
## 26 24.59832
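
The `seq(1, 56, 5)` indexing relies on how `aggregate()` orders its output: the first grouping variable (`year`) varies fastest, so every fifth row belongs to the same year. The toy sketch below (two years, two months, made-up delays) shows that ordering and a value-based alternative that does not depend on row positions.

```r
# Toy data: two years, two months, rows deliberately shuffled
toy <- data.frame(
  year  = c(2013, 2012, 2013, 2012, 2012, 2013),
  month = factor(c("February", "January", "January", "February", "January", "February"),
                 levels = c("January", "February")),
  late_aircraft_delay = c(40, 10, 20, 30, 14, 44)
)

agg_toy <- aggregate(late_aircraft_delay ~ year + month, data = toy, mean)
agg_toy  # rows run 2012 Jan, 2013 Jan, 2012 Feb, 2013 Feb

# Position-based selection (every 2nd row here, every 5th row in the report)
by_position <- agg_toy[seq(1, nrow(agg_toy), by = 2), ]

# Value-based selection gives the same rows without relying on the ordering
by_value <- agg_toy[agg_toy$year == 2012, ]
```

Selecting by value keeps working if a year or month is missing from the data, whereas the position-based version silently picks the wrong rows in that case.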
# use gather() to reshape into a long dataset: one row per year, month, and cause of delay
agg_sum2 = gather(agg_sum, key = "cause_of_delay", value = "delay", -year, -month)
agg_sum2$month = factor(agg_sum2$month)
plot_1 = ggplot(agg_sum2, aes(month, delay, color = cause_of_delay )) +
geom_line(aes(group = cause_of_delay)) + geom_point() +
labs(title = "Month of Year View: Delay Minutes of Different Causes of Delay",
x = "",
y = "average delay minute per delayed flight")
plot_1= plot_1 + facet_wrap(~year, nrow = 5)
ggplotly(plot_1)
agg12 = agg[seq(1,56,5),]
agg16 = agg[seq(5,60,5),]
agg12_16 = rbind(agg12, agg16) %>% select(year, late_aircraft_delay)
agg_boxplot =
ggplot(agg_sum, aes(year, late_aircraft_delay, fill = year))+
geom_boxplot(aes(group = year))
agg_boxplot
descrip = with(agg12_16, tapply(late_aircraft_delay, year, summary))
df_12_16 = rbind(descrip[[1]], descrip[[2]])
rownames(df_12_16) = c("2012", "2016")
df_12_16
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2012 19.15 20.87 22.18 22.75 24.61 27.56
## 2016 20.43 21.45 22.93 23.75 26.43 29.01
t.test(late_aircraft_delay ~ year, data = agg12_16, var.equal = T)
##
## Two Sample t-test
##
## data: late_aircraft_delay by year
## t = -0.93727, df = 22, p-value = 0.3588
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.217516 1.214495
## sample estimates:
## mean in group 2012 mean in group 2016
## 22.75268 23.75419
Comment: The side-by-side boxplot shows that the spreads of the boxes for 2012 and 2016 are similar, indicating that the variances are about the same, so the equal-variance assumption is reasonable for the test.
The basic statistics also show that the average late aircraft delay in 2016 is only slightly longer than in 2012. We state the formal hypotheses as follows. Null hypothesis: the mean delay caused by late aircraft in 2012 is not different from that in 2016. Alternative hypothesis: the mean delay caused by late aircraft in 2012 is different from that in 2016.
The test compares the 12 monthly averages of each year (df = 22). Since the p-value of the t-test is 0.3588, which is greater than 0.05, there is not enough evidence to reject the null hypothesis. We cannot conclude that the mean delay caused by late aircraft differs between 2012 and 2016.
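
To make the equal-variance test concrete, the sketch below computes the pooled two-sample t statistic by hand and compares it with `t.test(..., var.equal = TRUE)`. The two vectors are made-up stand-ins for monthly averages, not the report's data; in the report each group has 12 values, which is where df = 12 + 12 - 2 = 22 comes from.

```r
# Made-up stand-ins for monthly average late aircraft delays in two years
x <- c(19.2, 20.9, 22.2, 24.6, 27.6)
y <- c(20.4, 21.5, 22.9, 26.4, 29.0)
n1 <- length(x)
n2 <- length(y)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)

# t statistic, degrees of freedom, and two-sided p-value
t_manual <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))
df <- n1 + n2 - 2
p_manual <- 2 * pt(-abs(t_manual), df)

# The built-in test should agree with the manual calculation
fit <- t.test(x, y, var.equal = TRUE)
all.equal(t_manual, unname(fit$statistic))
all.equal(p_manual, fit$p.value)
```

Pooling is only appropriate when the two variances are similar, which is why the boxplot comparison precedes the test.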
month_delay <- delay_data %>%
select(year, month, day_of_month, arr_delay, arr_del15) %>% ## select needed variable for this plot
filter(arr_del15 == 1) %>% ## filter late arrival flights
group_by(month, day_of_month) %>%
summarize(mean_delay = (mean(arr_delay))) %>% ## summarize mean of delay (delay minute per flight)
ggplot(aes(x = day_of_month, y = mean_delay, color = month)) +
geom_point() +
geom_line() +
labs(title = "Day of Month View: Average delay",
x = "day of month",
y = "average delay minute per delayed flight")
ggplotly(month_delay)
The five yearly panels show that the patterns in 2012-2016 are alike across months and types of delay cause. Each year, the average delay time tends to increase from January to June and peaks at the end of July or the beginning of August; it then decreases before reaching another, smaller peak at the end of the year.
week_delay <- delay_data %>%
mutate(day_of_week = factor(day_of_week, labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))) %>% ## change day_of_week from number to the name each day of a week
select(year, day_of_week, arr_delay, arr_del15, carrier_delay, weather_delay, nas_delay, security_delay, late_aircraft_delay) %>% ## select needed variable for this plot
filter(arr_del15 == 1) %>% ## filter late arrival flights
group_by(day_of_week) %>%
summarize(mean_carrier_delay = mean(carrier_delay),
mean_weather_delay = mean(weather_delay),
mean_nas_delay = mean(nas_delay),
mean_security_delay = mean(security_delay),
mean_late_aircraft_delay = mean(late_aircraft_delay)) %>% ## summarize mean of delay for each cause(delay minute per flight)
plot_ly(x = ~day_of_week, y = ~mean_carrier_delay, type = 'bar', name = 'carrier delay', alpha = 0.8) %>%
add_trace(y = ~mean_weather_delay, name = 'weather delay') %>%
add_trace(y = ~mean_nas_delay, name = 'national aviation system delay') %>%
add_trace(y = ~mean_security_delay, name = 'security delay') %>%
add_trace(y = ~mean_late_aircraft_delay, name = 'late aircraft delay') %>%
layout(title = "Weekly view: Delay Minutes of Different Causes of Delay",
yaxis = list(title = "average delay minute per delayed flight"),
barmode = 'stack')
week_delay
hour_delay = delay_data %>%
select(year, dep_time, arr_delay, arr_del15) %>%
mutate(dep_time = recode(dep_time, "24" = "00")) %>% ## change 24:00 to 00:00
group_by(dep_time) %>%
summarize(prop = sum(arr_del15, na.rm = T)/ n(), ## summarize the proportion of flights delay depends on departure time
mean_delay = mean(arr_delay[arr_del15 == 1], ## summarize mean of delay for each hour of a day for delayed flights
na.rm = T)) %>%
mutate(text_label = str_c("delay minute:", round(mean_delay, 2), "min")) %>%
plot_ly(x = ~dep_time,
y = ~prop,
type = 'scatter',
mode = 'markers',
size = ~mean_delay, ##buble size = mean_delay
sizes = c(10, 60), ## bubble size range
marker = list(opacity = 0.5, sizemode = 'diameter'),
text = ~text_label) %>%
layout(title = "24H View: Proportion of Delay & Delay Time in Minute",
xaxis = list(title = "departure time across 24 Hours"),
yaxis = list(title = "delay proportion"))
hour_delay
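
The two quantities in the 24-hour view can be illustrated on a toy table. The base-R sketch below uses made-up flights and mirrors our reading of the chunk above: the delay proportion divides the number of delayed flights by all flights in that hour (including flights with no recorded arrival, such as diverted ones), while the mean delay uses delayed flights only.

```r
# Toy flights (made-up values); the row with NA stands for a diverted flight
toy <- data.frame(
  dep_hour  = c("00", "00", "05", "05", "05", "05", "17"),
  arr_delay = c(90, -5, 3, 20, -10, NA, 45),
  arr_del15 = c(1, 0, 0, 1, 0, NA, 1)
)

# Proportion of flights delayed per departure hour; NA rows stay in the denominator
prop <- tapply(toy$arr_del15, toy$dep_hour,
               function(x) sum(x, na.rm = TRUE) / length(x))

# Mean arrival delay per departure hour, delayed flights only
delayed <- which(toy$arr_del15 == 1)
mean_delay <- tapply(toy$arr_delay[delayed], toy$dep_hour[delayed], mean)

prop        # 00: 0.50, 05: 0.25, 17: 1.00
mean_delay  # 00: 90,   05: 20,   17: 45
```

Because flights without an arrival record remain in the denominator, hours with many cancelled or diverted flights show a somewhat lower delay proportion than they would if those flights were excluded.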
The Average Delay Time per Delayed Flights of the U.S. Domestic Airlines
general_boxplot = delay_data %>%
filter(arr_del15 == 1) %>% # filter for choosing the flight that delayed 15 mins and above(our criterion for delay flights)
select(unique_carrier, carrier_delay, weather_delay, nas_delay, security_delay, late_aircraft_delay) %>%
group_by(unique_carrier) %>%
# get the summary data: the mean of delay minutes per delayed flights in different causes. The sum_mean_delay is used for reordering the factors of the following plot.
summarize(mean_carrier_delay = mean(carrier_delay[carrier_delay != 0]),
mean_weather_delay = mean(weather_delay[weather_delay != 0]),
mean_nas_delay = mean(nas_delay[nas_delay != 0]),
mean_security_delay = mean(security_delay[security_delay != 0]),
mean_late_aircraft_delay = mean(late_aircraft_delay[late_aircraft_delay!= 0]),
sum_mean_delay = sum(mean_carrier_delay,
mean_weather_delay,
mean_nas_delay,
mean_security_delay,
mean_late_aircraft_delay)) %>%
# since some airline do no have the record for security delay, we made it to make the mean of delay minute in security delay is 0. Thus, the security delay minutes of these airlines will not shown the graph below.)
mutate(mean_security_delay = ifelse(is.nan(mean_security_delay), 0, mean_security_delay)) %>%
mutate(unique_carrier = reorder(unique_carrier, sum_mean_delay)) %>%
plot_ly(y = ~unique_carrier,
x = ~mean_carrier_delay,
type = 'bar',
orientation = 'h',
name = 'carrier delay',
alpha = 0.8) %>%
add_trace(x = ~mean_weather_delay, name = 'weather delay') %>%
add_trace(x = ~mean_nas_delay, name = 'national aviation system delay') %>%
add_trace(x = ~mean_security_delay, name = 'security delay') %>%
add_trace(x = ~mean_late_aircraft_delay, name = 'late aircraft delay') %>%
layout(title = "Average Delay Minutes (per delayed flights) with Different Causes in Domestic Airlines",
yaxis = list(title = "domestic airlines"),
xaxis = list(title = "delay minute per delayed flight"), barmode = 'stack')
general_boxplot
* `FL` = AirTran Airways Corporation
* `F9` = Frontier Airlines Inc.
* `DL` = Delta Air Lines Inc.
* `YV` = Mesa Airlines
* `EV` = ExpressJet Airlines Inc.
* `OO` = SkyWest Airlines Inc.
* `9E` = Endeavor Air Inc.
* `UA` = United Air Lines Inc.
* `NK` = Spirit Air Lines
* `VX` = Virgin America
* `B6` = JetBlue Airways
* `AS` = Alaska Airlines Inc.
* `AA` = American Airlines Inc.
* `MQ` = Envoy Air
* `US` = US Airways Inc.
* `WN` = Southwest Airlines
* `HA` = Hawaiian Airlines Inc.
The graph displays the average delay minutes per delayed flight, by major cause, for the U.S. domestic airlines. Among the 17 airlines in the data, Delta Air Lines has the highest average delay per delayed flight (200.4493 minutes, summed over all causes). Mesa Airlines, ExpressJet Airlines Inc., SkyWest Airlines Inc., and Endeavor Air Inc. also have high average delay minutes per delayed flight.
Additionally, the contribution of each of the five causes varies by airline. Two airlines, AirTran Airways Corporation and Frontier Airlines Inc., have no delayed flights caused by security delay. Delta Air Lines has the highest average delay minutes per delayed flight caused by security delay. For NAS delay, late aircraft delay, and carrier delay, Virgin America, Spirit Air Lines, and Mesa Airlines have the highest average delay minutes per delayed flight, respectively. Weather delay does not differ greatly among airlines: Mesa Airlines has the largest average weather delay per delayed flight, while weather has the least effect on Virgin America.
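
The missing security bars for two airlines follow from how the per-cause mean is computed: it averages only the non-zero delay minutes, and the mean of an empty vector is `NaN`. The sketch below uses made-up minutes to show that behaviour and the replacement of `NaN` with 0 that the plotting code applies.

```r
# A carrier whose delayed flights never had security delay minutes (made-up)
no_security <- c(0, 0, 0)
m_none <- mean(no_security[no_security != 0])
m_none  # NaN, because there is nothing to average

# A carrier with some security delay minutes (made-up)
some_security <- c(0, 12, 0, 30)
m_some <- mean(some_security[some_security != 0])
m_some  # 21, the mean of 12 and 30

# Replace NaN with 0 so the bar is simply absent from the stacked plot
m_plot <- ifelse(is.nan(m_none), 0, m_none)
m_plot  # 0
```

Averaging only the non-zero minutes also means each bar segment is the mean delay among flights affected by that cause, not among all delayed flights.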
The preceding analysis informed our understanding of the major causes of flight delay, time pattern of flight delay, and the average delay time per delayed flights of U.S. domestic airlines.
Major Causes of Flight Delay
The results show 31 combinations of the major causes of flight delay. The graph and table show how many delayed flights were caused by each combination. We did not anticipate that NAS delay would be the largest cause. After further reading, we found that our result parallels the article ‘An Answer to Flight Delays?’ by Tracy Samantha Schmidt: a significant reason for flight delays is breakdowns within the National Aviation System (NAS), which includes airports and the air traffic control centers. Because NAS delay covers issues at the air traffic control centers, the flow control program, and other significant and complex factors, a single issue can affect and delay a number of flights.
Time Pattern of Flight Delay
Across 2012-2016, the patterns are alike across months and types of delay cause: each year, the average delay time tends to rise from January to June, peak around the end of July or the beginning of August, decrease, and then reach another, smaller peak at the end of the year.
Two possible reasons could explain why security delay on weekends is almost twice the delay on weekdays. First, passenger volume may be higher on weekends than on weekdays, resulting in a shortage of security staff; second, some staff have days off on weekends, leading to longer security times. However, although the weekend delay is twice the weekday delay, the difference is only about 0.5 minutes per delayed flight, which is very small in absolute terms. On a relative scale, the twofold difference is still worth assessing. In general, there is no difference in delay time across the week.
From the 24-hour graph, we can tell that for flights departing from 00 to 03, almost half are delayed, and the delays are very long. At 04, the likelihood of delay decreases dramatically; however, if a flight is delayed, the delay is long. The delay proportion and delay time are both small at 05 and gradually increase until 23. It would be better to avoid flights departing from 00 to 04.
The Average Delay Time per Delayed Flights of the U.S. Domestic Airlines
As the results show, Delta Air Lines has the highest average delay minutes per delayed flight (200.4493 minutes, summed over all causes), especially for security delay. 200.4493 minutes is a very large delay. It might be caused by outliers with extremely large delay minutes, and the reason needs further investigation. It could be a data-entry error; if not, it would be interesting to explore what causes such long delays. For example, one news story reports that “A Delta Flight Was Delayed for 2 Hours Because a Pilot and Flight Attendant Had an Argument” (www.Fortune.com). In terms of security delay, we noticed that AirTran Airways Corporation and Frontier Airlines Inc. show no security delay in the graph or in the dataset. It might be true that these two airlines barely experienced security delays within the five years, but that seems unlikely in the real world. One possible explanation is that the data were not recorded in the dataset. We did not predict that Delta Air Lines, Virgin America, Spirit Air Lines, and Mesa Airlines would each have the highest delay minutes for a particular cause. It would be interesting to study further how the delay causes affect delayed flights for these airlines.
Overall, our study has some limitations, mainly because some factors and information are not recorded in the dataset. In the analysis of major causes, NAS delay has the highest frequency over the five years, but the dataset offers little information to explain this result, which limits our ability to identify the factors that contribute to it. In the time pattern analysis, passenger flow might confound the relationship between the average delay minutes per delayed flight and the holiday seasons. With information on passenger flow, we could perform further formal statistical analysis. Lastly, for the average delay time per delayed flight of different U.S. airlines, more information about the outliers in each delay cause would help us find clearer trends in delay time and explain the outliers.
Ball, M., Barnhart, C., Dresner, M., Hansen, M., Neels, K., Odoni, A., Peterson, E., Sherry, L., Trani, A., & Zou, B. (2010). Total Delay Impact Study: A Comprehensive Assessment of the Costs and Impacts of Flight Delay in the United States. Berkeley, Calif.: Institute of Transportation Studies, University of California, Berkeley.
Bennett, J. (2017, November 14). The 12 Biggest U.S. Airlines Ranked by Percentage of Delayed Flights. Retrieved from http://www.popularmechanics.com/flight/g2303/airlines-with-most-delays/
Borenstein, S., & Koenig, D. (n.d.). Science Says: Why some airplanes don’t fly in high heat. Retrieved December 06, 2017, from https://phys.org/news/2017-06-science-airplanes-dont-high.html
Bureau of Transportation Statistics. TranStats Database. http://www.transtats.bts.gov/
Hopewell, D. (2017, November 14). Best and Worst Airlines for Flight Delays. Retrieved from http://www.travelandleisure.com/slideshows/best-and-worst-airlines-for-delays-2014#no 14-frontier
Reilly. (n.d.). A Delta Flight Was Delayed for 2 Hours Because a Pilot and Flight Attendant Had an Argument. Retrieved from http://fortune.com/2017/07/25/delta-flight-delay-pilot-argument/
Reuters, J. (n.d.). Delta Passengers react to Flight Delays Due to Power Outage | Money. Retrieved from http://time.com/money/4443670/delta-passengers-flight-delays/
Schmidt, T. S. (2007, August 15). An Answer to Flight Delays? Retrieved December 06, 2017, from http://content.time.com/time/nation/article/0,8599,1653304,00.html
Zhang, B. (n.d.). Here are the 3 most common reasons why your flight is delayed. Retrieved from http://www.businessinsider.com/why-your-flight-delayed-2016-12
Our project was performed on high performance VPS (Special thanks to Wei Hao)