Airline Delays and Weather

Building a data lake and distributed pipeline to predict flight delays using weather data

Description

The goal of this project is to understand flight departure delays given information about the flight and the weather conditions at the origin and destination. Flight delays are common, but they rarely have a single cause; they result from many interacting factors. The key is to identify which features matter and to build a model that can accurately predict whether or not a flight will be delayed by at least 15 minutes. At a high level, this project involved developing multiple models, hyperparameter tuning, and cross-validation to arrive at the final result.

The project used more than five years of flight and weather data from the United States, so much of the focus was on developing a data lake and pipeline that could support the data analysis and modeling needed to predict flight delays accurately at that scale.
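The prediction target described above, a departure delay of at least 15 minutes, can be illustrated with a minimal sketch. It is written in plain Python so that it runs without a Spark cluster; in the project itself this step would presumably be a column expression in Spark. The `Flight` class and the `dep_delay_minutes` field name are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass

# Threshold taken from the project description: "delayed by at least 15 minutes".
DELAY_THRESHOLD_MINUTES = 15


@dataclass
class Flight:
    # Hypothetical record layout, used only for illustration.
    flight_id: str
    origin: str
    dep_delay_minutes: float  # negative values mean an early departure


def is_delayed(flight: Flight, threshold: float = DELAY_THRESHOLD_MINUTES) -> int:
    """Return 1 if the departure delay is at least `threshold` minutes, else 0."""
    return int(flight.dep_delay_minutes >= threshold)


flights = [
    Flight("F1", "SFO", 3.0),
    Flight("F2", "ORD", 15.0),   # exactly at the threshold counts as delayed
    Flight("F3", "ATL", 42.0),
    Flight("F4", "JFK", -5.0),
]

labels = [is_delayed(f) for f in flights]
```

Framing the target as a binary label like this is what makes the problem a classification task rather than a regression on delay minutes.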

Techniques

  • data engineering
  • data lakes
  • distributed computing
  • feature engineering
  • machine learning

Tools

  • Spark
  • Databricks
  • Delta Lake
  • AWS

Outcome

Developed a scalable pipeline for predicting flight delays from flight and weather data. Streamed and joined terabytes of flight and weather records into a single dataset.
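One common way to join flights to weather is an "as-of" join: each flight is matched to the most recent weather observation at its origin that was available some time before departure. The sketch below shows that idea in plain Python on a few in-memory rows. It is an illustration under stated assumptions, not the project's actual join logic: the tuple layouts, the function names, and the two-hour lead time are all hypothetical, and at terabyte scale the same logic would be expressed as a distributed Spark join.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime, timedelta


def build_weather_index(observations):
    """Group (station, observed_at, reading) rows by station, sorted by time."""
    by_station = defaultdict(list)
    for station, observed_at, reading in observations:
        by_station[station].append((observed_at, reading))
    index = {}
    for station, rows in by_station.items():
        rows.sort(key=lambda row: row[0])  # sort on timestamp only
        index[station] = ([t for t, _ in rows], [r for _, r in rows])
    return index


def latest_weather_before(index, station, cutoff):
    """Return the last reading at `station` observed at or before `cutoff`."""
    if station not in index:
        return None
    times, readings = index[station]
    pos = bisect_right(times, cutoff)
    return readings[pos - 1] if pos else None


def join_flights_with_weather(flights, observations, lead=timedelta(hours=2)):
    """Attach to each flight the weather known `lead` before scheduled departure.

    The two-hour default is an assumption made for this sketch.
    """
    index = build_weather_index(observations)
    joined = []
    for flight_id, origin, scheduled_dep in flights:
        weather = latest_weather_before(index, origin, scheduled_dep - lead)
        joined.append((flight_id, origin, scheduled_dep, weather))
    return joined


observations = [
    ("SFO", datetime(2019, 1, 1, 8, 0), {"wind_kts": 12}),
    ("SFO", datetime(2019, 1, 1, 6, 0), {"wind_kts": 5}),
    ("ORD", datetime(2019, 1, 1, 7, 0), {"wind_kts": 20}),
]
flights = [
    ("F1", "SFO", datetime(2019, 1, 1, 10, 30)),
    ("F2", "ORD", datetime(2019, 1, 1, 8, 0)),
    ("F3", "ATL", datetime(2019, 1, 1, 9, 0)),
]

joined = join_flights_with_weather(flights, observations)
```

Using only observations from before the cutoff keeps information that would not be known at prediction time out of the features. Flights with no qualifying observation get `None` and would need to be dropped or imputed downstream.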

More Information

More information can be found at the following link:

GitHub Repository: https://github.com/nsylva/w261_final_project