ctnguynn/data-mini-3

Data-Mini-3

Welcome to Data Mini #3! You and your teammates will have from 11 am to 5:30 pm on Saturday, April 4th, to explore and submit a presentation about your data analysis.

This Git repository contains this README file and the dataset in CSV format. If you are a Duke student, we recommend using Duke's Container Manager to access your preferred IDE.

How to Participate

  1. Fork the repository emailed to you by trisha.iyer@duke.edu
  2. Load the data from Injury_Illness_Summary_-_Casualty_Data.zip: Click on the file, click “View Raw”, and download.
  3. Come up with a question to answer with data analysis. Analyze the dataset and prepare a presentation of your findings.
  4. Submit your files to your team’s Google Drive folder. Submission materials should include: your code notebook file, presentation media, citations to all outside resources, and a video file/link to the presentation.
  5. Present! You must fit your presentation within a 5-minute timeframe. Please do not attempt to push changes to this repository - it is read-only for participants.
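Step 2 can also be done in code once the zip file is downloaded. The following is a minimal Python sketch using only the standard library; the function name `load_rows_from_zip` is hypothetical, and it assumes the archive contains at least one UTF-8 encoded `.csv` file with a header row.

```python
import csv
import io
import zipfile


def load_rows_from_zip(zip_path):
    """Return the rows of the first CSV inside the zip as a list of dicts."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive holds at least one .csv member.
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in archive")
        with archive.open(csv_names[0]) as raw:
            # utf-8-sig drops a byte-order mark if the export includes one.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text))


# Example call (path taken from step 2; adjust to where you saved the file):
# rows = load_rows_from_zip("Injury_Illness_Summary_-_Casualty_Data.zip")
```

Each row comes back as a dictionary keyed by the CSV header, so check the actual header spelling before indexing columns. If your environment has pandas, `pandas.read_csv` can usually read a zip holding a single CSV directly.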

The analysis period will run from ~11:30 am to 4:30 pm. You and your team may stay in the room, Wilkinson 126, or leave during the analysis period (to get food, analyze in another room, etc.). You and your team must be back in Wilkinson 126 by 4:30 pm to submit your presentation materials.

Data Dictionary

About

The dataset you will be working with is the Injury/Illness Summary - Casualty Data dataset, which the Department of Transportation (DoT) updates and releases every month. Each observation records one railroad-related casualty from the last 6 years; the DoT defines casualties as "reportable fatalities, or illnesses arising from the operation of a railroad." Each observation includes, among other details, the railroad name and code, date, time, state, and type of person. The dataset contains 37,600 records across 70 columns and uniquely identifies observations by the Incident Number column.

Data Source: Department of Transportation (DoT)

Last Updated: April 1, 2026

Update Frequency: Monthly

Link: https://data.transportation.gov/Railroads/Injury-Illness-Summary-Casualty-Data-Form-55a-6-Ye/bx7m-yn3v/about_data

Data Variables

Each row represents a single casualty.

| variable | data_type | description |
| --- | --- | --- |
| railroad_code | text | Unique code identifying the reporting railroad |
| railroad_name | text | Full name of the reporting railroad |
| pdf_report | text | Link or reference to the associated PDF incident report |
| incident_number | text | Unique number assigned to the casualty incident |
| incident_year | numeric | Year in which the casualty incident occurred |
| incident_month | numeric | Month in which the casualty incident occurred |
| incident_day | numeric | Day of the month in which the casualty incident occurred |
| date | timestamp | Full date of the casualty incident |
| time | text | Time of day the casualty incident occurred |
| county_code | text | Code identifying the county where the incident occurred |
| county_name | text | Name of the county where the incident occurred |
| state_code | text | Code identifying the state where the incident occurred |
| state_name | text | Name of the state where the incident occurred |
| type_of_person_code | text | Code identifying the classification of the person involved (Classes A–J) |
| type_of_person | text | Description of the person classification (e.g., Worker on Duty–Railroad Employee, Trespasser, Passenger) |
| employee_job_code | text | Code identifying the job type of the railroad employee involved, if applicable |
| employee_job_description | text | Description of the employee's job role, if applicable |
| age_of_person | numeric | Age of the person involved in the casualty incident |
| positive_alcohol_tests | numeric | Number of positive alcohol tests recorded in connection with the incident |
| positive_drug_tests | numeric | Number of positive drug tests recorded in connection with the incident |
| injury_illness_code | text | Code identifying the type of injury or illness sustained |
| nature_of_injury | text | Description of the nature of the injury or illness sustained |
| location_of_injury_on_body | text | Body part or region where the injury was sustained |
| specific_location | text | More granular description of the injury location on the body |
| injury_illness | text | General classification of whether the casualty is an injury or illness |
| physical_act_circumstances_code | text | Code describing the physical act or circumstance contributing to the injury |
| physical_act_circumstances | text | Description of the physical act or circumstance contributing to the injury |
| general_location_of_person_code | text | Code identifying the general physical location of the person at time of incident |
| general_location_of_person | text | Description of the general physical location of the person at time of incident |
| on_track_equipment_code | text | Code identifying the type of on-track equipment involved, if applicable |
| on_track_equipment | text | Description of the on-track equipment involved, if applicable |
| specific_location_of_person_code | text | Code identifying the specific location of the person relative to railroad operations at time of incident |
| specific_location_of_person | text | Description of the specific location of the person at time of incident |
| event_code | text | Code identifying the event or activity the person was engaged in at time of incident |
| event | text | Description of the event or activity the person was engaged in at time of incident |
| tools_code | text | Code identifying any tools involved in the incident, if applicable |
| tools | text | Description of any tools involved in the incident, if applicable |
| injury_cause_code | text | Code identifying the cause of the injury or illness |
| injury_cause | text | Description of the cause of the injury or illness |
| days_away_from_work | numeric | Number of days the injured/ill person was away from work as a result of the incident |
| days_restricted_activity | numeric | Number of days the person had restricted work activity as a result of the incident |
| hazmat_exposure | text | Indicator of whether the casualty involved hazardous material exposure |
| covered_data_code | text | Code indicating whether certain data fields are withheld from public release |
| covered_data_reason | text | Explanation for why certain data fields are withheld from public release |
| latitude | numeric | Latitude coordinate of the incident location |
| longitude | numeric | Longitude coordinate of the incident location |
| narrative | text | Free-text description of the circumstances surrounding the casualty incident |
| employee_suspension | text | Indicator of whether the involved employee was suspended in connection with the incident |
| district | text | Railroad operating district where the incident occurred |
| fatality | text | Indicator of whether the casualty resulted in a fatality |
| form_57_filed | text | Indicator of whether a highway-rail grade crossing accident report (Form 6180.57) was filed for this incident |
| form_54_filed | text | Indicator of whether a rail equipment accident report (Form 6180.54) was filed for this incident |
| class_code | text | Code identifying the classification of the person involved (corresponds to Classes A–J) |
| class | text | Description of the person classification corresponding to the class code |
| casualty_occurrence_code | text | Code describing the circumstances under which the casualty occurred |
| equipment_movement_code | text | Code describing the movement status of on-track equipment at the time of the incident |
| report_key | text | Unique key identifying the associated accident or incident report |
| reporting_railroad_smt_grouping | text | SMT (Safety Management Team) grouping assigned to the reporting railroad by the FRA |
| reporting_parent_railroad_code | text | Code identifying the parent railroad of the reporting railroad |
| reporting_parent_railroad_name | text | Name of the parent railroad of the reporting railroad |
| reporting_railroad_holding_company | text | Name of the holding company that owns the reporting railroad |
| geocode | text | Geographic code associated with the incident location |
| incident_key | text | Unique identifier for the incident, constructed from railroad code, incident number, incident year, and incident month |
| reporting_railroad_individual_class | text | FRA size classification of the reporting railroad as an individual entity |
| reporting_railroad_passenger | text | Indicator of whether the reporting railroad operates passenger service |
| reporting_railroad_commuter | text | Indicator of whether the reporting railroad operates commuter service |
| reporting_railroad_switching_terminal | text | Indicator of whether the reporting railroad operates as a switching and terminal railroad |
| reporting_railroad_tourist | text | Indicator of whether the reporting railroad operates tourist service |
| reporting_railroad_freight | text | Indicator of whether the reporting railroad operates freight service |
| reporting_railroad_short_line | text | Indicator of whether the reporting railroad is classified as a short line railroad |
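The dictionary describes incident_key as built from railroad code, incident number, incident year, and incident month, but it does not give the exact format. A quick sanity check before analysis is to rebuild a comparable key and look for duplicates. The sketch below is illustrative only: the helper names, the `|` separator, and the snake_case field names are assumptions, and the real CSV headers may be spelled differently.

```python
from collections import Counter

# Fields the data dictionary lists as the ingredients of incident_key.
KEY_FIELDS = ("railroad_code", "incident_number", "incident_year", "incident_month")


def composite_key(row, fields=KEY_FIELDS):
    """Join the chosen fields into one string (separator is an assumption)."""
    return "|".join(str(row[f]).strip() for f in fields)


def duplicate_keys(rows, key_func):
    """Return {key: count} for every key that appears more than once."""
    counts = Counter(key_func(r) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}
```

An empty result from `duplicate_keys(rows, composite_key)` would suggest the four fields together identify each row; a non-empty one shows which records to inspect before counting casualties.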

Competition Policies

Code of Conduct

AI and Online Sources

  • You may not put this data into an LLM.
  • Analysis in this competition should reflect your own work; although we cannot prohibit AI, you should not depend on it in your analysis. Obvious or extreme usage of LLMs will be considered in judging.
  • Your team must include citations for forums (Stack Overflow), textbooks, and other references.
  • You must provide a link to any AI chat conversations used.
  • You must list all citations (including the original study’s citation found at the bottom of this README) at the end of your code notebook. Failure to do so will result in a large point deduction during judging.
  • You may reuse code from your own research or class projects.
  • Citations should be listed in the code notebook file, presentation media, or a separate document.

Analysis

  • You must answer a question that your team chooses.
  • All types of data modeling are allowed.
  • All coding languages are allowed.
  • If previous research has been done on this topic, you may not copy the study.

Submission Requirements

You and your team should submit the following to your Google Drive folder:

  • Your code notebook file(s) (QMD, R script, Jupyter, etc.)
  • Presentation media (slides, PDF, link to an HTML page, visualizations, etc.)
  • Citations to all outside resources: please read the Code of Conduct for more information
  • A link to your forked GitHub repository

Your team’s Google Drive folder contents will not be accessible to view, comment on, or edit by any other team.

Judging

  • Winners will be announced by April 11th.
  • As stated earlier, you must come up with a question that your team will use to guide your statistical analysis.
  • Submissions will be evaluated by the Statistical Science Majors Union's Competitions and Opportunities Committee, which is responsible for ensuring a fair evaluation.
  • The judging criteria are as follows:
  1. Communication: Clarity of presentation, logical structure, and ability to explain findings to a broad audience. - 20%
  2. Creativity & Insight: Depth of analysis, creativity in approach, and ability to identify meaningful patterns. - 40%
  3. Reproducibility: Well-organized code, clear documentation, and transparent workflows. - 5%
  4. Visualization & Design: Effective use of charts and visuals to support your conclusions. - 25%
  5. Technical Soundness: Appropriate use of statistical methods, reasonable assumptions, and valid interpretations. - 10%
  • These criteria will be prioritized over the sophistication of the statistical methods your team uses, to account for differing experience levels across competitors.
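To see how the percentages above interact, here is a small Python sketch of a weighted total. Only the five weights come from this README; the 0-10 scale per criterion, the linear combination, and the names `WEIGHTS` and `weighted_score` are assumptions for illustration, not the committee's actual scoring procedure.

```python
# Weights taken from the judging criteria listed above (they sum to 100%).
WEIGHTS = {
    "communication": 0.20,
    "creativity_insight": 0.40,
    "reproducibility": 0.05,
    "visualization_design": 0.25,
    "technical_soundness": 0.10,
}


def weighted_score(scores):
    """Combine per-criterion scores into one total (assumed linear weighting)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Under these assumptions, a one-point gain in Creativity & Insight moves the total eight times as much as a one-point gain in Reproducibility (0.40 versus 0.05).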

Dataset Acknowledgment

This project uses the Injury/Illness Summary - Casualty Data dataset, maintained and publicly released by the Department of Transportation through data.transportation.gov. We are grateful to the Department of Transportation for making this valuable public resource freely available for research and analysis.

Citation

Federal Railroad Administration. (2026). Injury/Illness Summary - Casualty Data (Form 55a) - 6 Year View [Data set]. U.S. Department of Transportation. Retrieved from https://data.transportation.gov/dataset/Injury-Illness-Summary-Casualty-Data-Form-55a-6-Ye/rash-pd2d/about_data

This citation must be included at the end of your code notebook.

About This Event

Data Mini #3 was organized by Duke's Statistical Science Majors Union - Competitions and Opportunities Committee. Contributing members of the committee and the SSMU executive team are listed below:

  • Hanna Chee, ‘29
  • Trisha Iyer, ‘28
  • Hyunjin Lee, ‘27
  • Susan Li, ‘29
  • Phillip Lin, ‘27
  • Liane Ma, ‘27
  • Chelsea Nguyen, ‘28
  • Cooper Ruffing, ‘27
  • Derek Wang, ‘28
  • Amy Xu, ‘27
  • Allison Yang, ‘27
  • And an additional thank-you goes out to the professors who helped advertise the event.
