csc316-project/DataSets

DataSets

Compilation of datasets we will use in our project.


File Name: us_flights_2025

File Source: us_flights_2025.csv

Dataset Description: On-time performance data that carriers report, either because they are required to or voluntarily, for the non-stop domestic flights they operate. Records are broken down by month and year, by carrier, and by origin and destination airport. The data includes scheduled and actual departure and arrival times, canceled and diverted flights, taxi-out and taxi-in times, causes of delay and cancellation, air time, and non-stop distance.

  • FL_DATE: Date of the flight (yyyymmdd).
  • OP_UNIQUE_CARRIER: Unique carrier code assigned by the DOT for the operating airline.
  • OP_CARRIER_FL_NUM: Operating carrier flight number.
  • ORIGIN_AIRPORT_ID: Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport.
  • ORIGIN_AIRPORT_SEQ_ID: Sequential ID for the origin airport (used internally by DOT).
  • ORIGIN_CITY_MARKET_ID: Origin Airport, City Market ID. City Market ID is an identification number assigned by US DOT to identify a city market.
  • ORIGIN: Origin airport FAA code.
  • DEST_AIRPORT_ID: Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport.
  • DEST_AIRPORT_SEQ_ID: Sequential ID for the destination airport (used internally by DOT).
  • DEST_CITY_MARKET_ID: City market identifier for the destination airport.
  • DEST: Destination airport FAA code.
  • CRS_DEP_TIME: CRS Departure Time (local time: hhmm)
  • DEP_TIME: Actual Departure Time (local time: hhmm)
  • DEP_DELAY: Difference in minutes between scheduled and actual departure time. Early departures show negative numbers.
  • DEP_DELAY_NEW: Difference in minutes between scheduled and actual departure time. Early departures set to 0.
  • CRS_ARR_TIME: CRS Arrival Time (local time: hhmm)
  • ARR_TIME: Actual Arrival Time (local time: hhmm)
  • ARR_DELAY: Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers.
  • ARR_DELAY_NEW: Difference in minutes between scheduled and actual arrival time. Early arrivals set to 0.
  • CANCELLED: Cancelled Flight Indicator (1=Yes)
  • CANCELLATION_CODE: Specifies The Reason For Cancellation
    • A = Carrier
    • B = Weather
    • C = National Airspace System (NAS)
    • D = Security
  • CRS_ELAPSED_TIME: CRS Elapsed Time of Flight, in Minutes
  • ACTUAL_ELAPSED_TIME: Elapsed Time of Flight, in Minutes
  • AIR_TIME: Flight Time, in Minutes
  • DISTANCE: Distance between airports (miles)
  • CARRIER_DELAY: Delay in minutes due to the carrier.
  • WEATHER_DELAY: Delay in minutes due to weather.
  • NAS_DELAY: Delay in minutes due to the National Airspace System (NAS).
  • SECURITY_DELAY: Delay in minutes due to security issues.
  • LATE_AIRCRAFT_DELAY: Delay in minutes due to late arrival of the aircraft.
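
The time and delay columns above can be awkward to work with directly, so here is a minimal sketch of two helpers. It assumes the hhmm fields arrive as strings that may have lost their leading zero or gained a trailing `.0` when the CSV was exported; the helper names `parse_hhmm` and `clamp_early` are hypothetical and not part of the dataset.

```python
from typing import Optional


def parse_hhmm(value: str) -> Optional[int]:
    """Convert an hhmm string such as '0705' or '705' to minutes after midnight."""
    text = value.strip()
    if not text:
        # Assumption: an empty cell means the time was not recorded.
        return None
    # Assumption: some exports store the value as a float, e.g. '1430.0'.
    text = text.split(".")[0].zfill(4)
    hours, minutes = int(text[:2]), int(text[2:])
    return hours * 60 + minutes


def clamp_early(delay_minutes: float) -> float:
    """Mirror DEP_DELAY_NEW / ARR_DELAY_NEW: early flights are set to 0."""
    return max(delay_minutes, 0.0)
```

For example, `clamp_early` applied to a DEP_DELAY of -5 gives the value that DEP_DELAY_NEW is described as holding for an early departure.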

File Name: StatsCanAirport.csv

File Source: StatsCan

Dataset Description: This dataset contains annual air passenger traffic counts for major Canadian airports and for Canada as a whole. The data measures the total number of passengers enplaned and deplaned at each airport, with further breakdowns by flight sector (domestic, transborder, and other international). The dataset covers the years 2020 through 2024 and reflects changes in air travel demand over time, including the impact of COVID-19 recovery.

  • Geography: String representing the geographical region and/or airport name.
  • Air Passenger Traffic: String representing the category of passenger traffic.
    • Total Domestic Sector
    • Transborder Sector
    • Other International Sector
  • Year: Integer representing the year the data is from.
    • 2020
    • 2021
    • 2022
    • 2023
    • 2024
  • Other: Values marked with an 'x' are suppressed to meet confidentiality requirements of the Statistics Act.
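
Because suppressed cells contain an 'x' rather than a number, they need handling before any arithmetic. The sketch below shows one possible approach; it assumes counts may include thousands separators, and the function names are hypothetical.

```python
from typing import Iterable, Optional


def parse_passenger_count(cell: str) -> Optional[int]:
    """Return the count as an int, or None for suppressed or empty cells."""
    text = cell.strip()
    if text == "" or text.lower() == "x":
        return None  # suppressed under the Statistics Act, or missing
    # Assumption: counts may be formatted like '1,234'.
    return int(text.replace(",", ""))


def total_reported(cells: Iterable[str]) -> int:
    """Sum only the cells that hold a real number."""
    counts = (parse_passenger_count(cell) for cell in cells)
    return sum(count for count in counts if count is not None)
```

Note that a total computed this way is a lower bound whenever any cell was suppressed.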

File Name: flights-review-data

File Source: Kaggle - Airlines Customer Sentiment Dataset

Dataset Description: This dataset consists of 5 tables: customers, bookings, flights, reviews, and safety log.

  • The customers table stores information about customers, such as their name and contact details.

  • The bookings table stores airline bookings, with fields such as booking_date, number of bookings, and from_airport.

  • The flights table stores airline flight data.

  • The reviews table stores the sentiments of users after taking a particular flight.

  • The safety_log table stores information about the incidents that a particular airline or flight faces.

  • bookings1.csv: Contains booking details including fare and duration.

    • booking_id: Unique identifier for the booking.
    • from_city, to_city: Origin and destination.
    • fare: Ticket price.
    • duration: Flight duration in hours.
  • customers1.csv: Customer demographic information.

    • customer_id: Unique identifier.
    • customer_name, date_of_birth, city, contact.
  • flights1.csv: Operational details of flights.

    • flight_id: Unique flight identifier.
    • airline: Name of the airline.
    • class: Flight class (Economy, Business, Premium).
    • dep_time, arr_time: Departure and arrival timestamps.
  • reviews1.csv: Customer feedback and ratings.

    • overall_rating: Rating out of 10.
    • seat_comfort, cabin_staff_service, food_and_beverages: Specific aspect ratings.
    • Links booking_id and flight_id.
  • safety_log1.csv: Records of safety incidents.

    • incident_type, severity_level, passenger_impact.
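
Since reviews carry a flight_id and flights carry the airline name, the two tables can be joined to compare ratings across airlines. The following is a minimal sketch using only the standard library; the inline CSV rows are made-up sample data for illustration, and it assumes `flight_id` is the shared key between reviews1.csv and flights1.csv.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample rows, NOT taken from the real files.
FLIGHTS_CSV = """flight_id,airline
F1,Airline A
F2,Airline B
"""
REVIEWS_CSV = """booking_id,flight_id,overall_rating
B1,F1,8
B2,F1,6
B3,F2,9
"""


def mean_rating_by_airline(flights_file, reviews_file):
    """Join reviews to flights on flight_id and average overall_rating."""
    airline_by_flight = {
        row["flight_id"]: row["airline"] for row in csv.DictReader(flights_file)
    }
    ratings = defaultdict(list)
    for row in csv.DictReader(reviews_file):
        airline = airline_by_flight.get(row["flight_id"])
        if airline is not None:  # skip reviews with no matching flight
            ratings[airline].append(float(row["overall_rating"]))
    return {airline: mean(values) for airline, values in ratings.items()}


result = mean_rating_by_airline(io.StringIO(FLIGHTS_CSV), io.StringIO(REVIEWS_CSV))
```

With the real files, the `io.StringIO` objects would be replaced by opened file handles.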

Visualizations Generated:

  1. Airline Ratings: Boxplot comparing overall ratings across airlines.
  2. Fare vs Comfort: Scatter plot analyzing the relationship between ticket price and seat comfort.
  3. Duration vs Satisfaction: Heatmap showing the distribution of ratings across different flight durations.

File Name: airline_delays

File Source: Kaggle - Flight Delay Data

Dataset Description: This dataset provides detailed information on flight arrivals and delays for U.S. airports, categorized by carriers. The data includes metrics such as the number of arriving flights, delays over 15 minutes, cancellation and diversion counts, and the breakdown of delays attributed to carriers, weather, NAS (National Airspace System), security, and late aircraft arrivals. Explore and analyze the performance of different carriers at various airports during this period. Use this dataset to gain insights into the factors contributing to delays in the aviation industry.

  • year: The year of the data.
  • month: The month of the data.
  • carrier: Carrier code.
  • carrier_name: Carrier name.
  • airport: Airport code.
  • airport_name: Airport name.
  • arr_flights: Number of arriving flights.
  • arr_del15: Number of flights delayed by 15 minutes or more.
  • carrier_ct: Carrier count (delay due to the carrier).
  • weather_ct: Weather count (delay due to weather).
  • nas_ct: NAS (National Airspace System) count (delay due to the NAS).
  • security_ct: Security count (delay due to security).
  • late_aircraft_ct: Late aircraft count (delay due to late aircraft arrival).
  • arr_cancelled: Number of flights canceled.
  • arr_diverted: Number of flights diverted.
  • arr_delay: Total arrival delay.
  • carrier_delay: Delay attributed to the carrier.
  • weather_delay: Delay attributed to weather.
  • nas_delay: Delay attributed to the NAS.
  • security_delay: Delay attributed to security.
  • late_aircraft_delay: Delay attributed to late aircraft arrival.
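
Two derived measures follow naturally from these columns: the share of arriving flights that were delayed, and how the delay counts split across causes. This is a rough sketch under the assumption that each row has already been read into a dict keyed by the column names above; the function names are hypothetical.

```python
from typing import Dict, Optional

CAUSE_COLUMNS = ["carrier_ct", "weather_ct", "nas_ct", "security_ct", "late_aircraft_ct"]


def delay_rate(arr_del15: float, arr_flights: float) -> Optional[float]:
    """Fraction of arriving flights delayed by 15 minutes or more."""
    if arr_flights <= 0:
        return None  # no arrivals, so the rate is undefined
    return arr_del15 / arr_flights


def cause_shares(row: Dict[str, str]) -> Dict[str, float]:
    """Share of the delay counts attributed to each cause in one row."""
    counts = {column: float(row[column]) for column in CAUSE_COLUMNS}
    total = sum(counts.values())
    if total == 0:
        return {column: 0.0 for column in CAUSE_COLUMNS}
    return {column: count / total for column, count in counts.items()}
```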

File Name: most_luggage_lost_by_airline_country.csv

File Source: LuggageLosers

Dataset Description: Reports estimated live rankings (as of Jan. 21, 2025) of countries by how much luggage their airlines lose. Because airlines do not publish live lost-luggage data, the estimates are based on social media posts from people reporting lost luggage.

  • ranking: The position of the country in terms of luggage loss/delay.
  • country: The name of the country.
  • probability_of_loss_or_delay: The estimated probability of losing your luggage.
  • luggage_score: Scored out of 5 based on its AirlineList.com rating.
  • complaints: The number of complaints in the last 30 days.
  • lost_bags: The number of lost bags in the last 30 days.

File Name: canada_aircraft_movements_2019-2024.csv

File Source: Statistics Canada (StatsCan)

Dataset Description: This dataset contains detailed records of aircraft movements at Canadian airports from 2019 to 2024. Aircraft movements include both takeoffs and landings, and are categorized by type of operation, airport, and time period. The dataset provides insight into aviation activity levels across Canada, including trends before, during, and after the COVID-19 pandemic.

The data distinguishes between itinerant movements (commercial, charter, and cross-airport flights) and local movements (training flights and circuits near an airport), enabling analysis of how airport usage and operational focus vary by location and over time.

This dataset supports analysis of airport traffic patterns, seasonal variation, and operational differences between major commercial hubs and smaller regional airports.

  • REF_DATE: Time period of the observation (monthly or yearly format depending on aggregation).

  • GEO: Geographic region or airport name (e.g., Toronto / Lester B. Pearson International Airport).

  • Airports: Name of the airport where aircraft movements were recorded.

  • Class of operation: Type of aircraft movement, one of:

    • Itinerant movements
    • Local movements
    • Total, itinerant and local movements
  • VALUE: Number of aircraft movements recorded for the given airport, time period, and operation class.

  • UOM: Unit of measure (number of movements).

  • STATUS: Indicates data quality or confidentiality adjustments.

  • SYMBOL: Statistical symbols used by Statistics Canada (e.g., “x” indicates suppressed values).

Notes:

  • Values marked with “x” are suppressed to meet confidentiality requirements under the Statistics Act.
  • Each aircraft takeoff or landing counts as one movement.
  • The dataset captures significant disruptions in aviation activity during 2020–2021 and recovery trends in subsequent years.
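
To compare activity across years, movements can be summed per year while skipping suppressed rows. The sketch below assumes REF_DATE begins with a four-digit year (for example `2019-01`) and that rows are dicts keyed by the column names above; it filters on a single class of operation so that the total rows and their itinerant/local components are not counted together.

```python
from collections import defaultdict
from typing import Dict, Iterable

TOTAL_CLASS = "Total, itinerant and local movements"


def yearly_movements(rows: Iterable[Dict[str, str]],
                     operation_class: str = TOTAL_CLASS) -> Dict[str, float]:
    """Sum VALUE per year for one class of operation, skipping suppressed rows."""
    totals = defaultdict(float)
    for row in rows:
        if row["Class of operation"] != operation_class:
            continue
        value = row["VALUE"].strip()
        suppressed = row.get("SYMBOL", "").strip().lower() == "x"
        if suppressed or not value:
            continue
        year = row["REF_DATE"][:4]  # assumption: REF_DATE starts with YYYY
        totals[year] += float(value)
    return dict(totals)
```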

File Name: plane_crashes.csv

File Source: Bureau of Aircraft Accidents Archives

Kaggle Source: https://www.kaggle.com/datasets/abeperez/historical-plane-crash-data

Dataset Description: This dataset contains a comprehensive historical record of aviation accidents globally from 1918 to 2022. Extracted via web scraping and preprocessed for organization, the data provides a longitudinal view of aviation safety. It includes details on crash locations, aircraft operators, passenger counts, and fatalities. This dataset is ideal for analyzing long-term trends in aviation safety, the impact of mitigation efforts over the decades, and forecasting future safety benchmarks.

  • Date: The date the accident occurred.
  • Time: The local time of the crash (where recorded).
  • Location: The geographic location, city, or country of the incident.
  • Operator: The airline, military branch, or private entity operating the aircraft.
  • Flight #: The specific flight number assigned to the aircraft.
  • Route: The intended flight path (e.g., origin to destination).
  • Type: The make and model of the aircraft involved.
  • Registration: The aircraft's unique registration identifier.
  • Aboard: Total number of people on the aircraft, including passengers and crew.
  • Fatalities: Total number of lives lost in the crash.
  • Ground: Number of deaths occurring on the ground as a result of the crash.
  • Summary: A descriptive narrative of the circumstances and causes of the accident.

Notes:

  • Format Note: This dataset is often handled in Excel format rather than standard CSV to preserve encoding for complex location strings and detailed summary text.
  • Historical Context: The data captures over a century of records, reflecting the evolution of flight technology and safety standards from the early 20th century to the modern era.
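
For long-term trend analysis, two simple derived values are the share of people aboard who died and the decade an accident falls in. This is a minimal sketch that assumes the year has already been extracted from the Date column (whose exact format is not documented here) and that Aboard may be missing; both function names are hypothetical.

```python
from typing import Optional


def fatality_rate(aboard: Optional[int], fatalities: int) -> Optional[float]:
    """Fatalities divided by Aboard, or None when Aboard is missing or zero."""
    if not aboard:
        return None
    return fatalities / aboard


def decade_of(year: int) -> int:
    """Map a year to the first year of its decade, e.g. 1918 -> 1910."""
    return year - year % 10
```

Grouping records by `decade_of` gives a coarse view of how accident outcomes changed over the period covered.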
