Welcome to Data Mini #3! You and your teammates will have from 11 am to 5:30 pm on Saturday, April 4th, to explore and submit a presentation about your data analysis.
This Git repository contains this README file and the dataset in CSV format. If you are a Duke student, we recommend you use Duke's Container Manager to access your preferred IDE.
- Fork the repository emailed to you by trisha.iyer@duke.edu
- Load the data from Injury_Illness_Summary_-_Casualty_Data.zip: Click on the file, click “View Raw”, and download.
- Come up with a question to answer with data analysis. Analyze the dataset and prepare a presentation of your findings.
- Submit your files to your team’s Google Drive folder. Submission materials should include: your code notebook file, presentation media, citations to all outside resources, and a video file/link to the presentation.
- Present! You must fit your presentation within a 5-minute timeframe.

Please do not attempt to push changes to this repository - it is read-only for participants.
The analysis period will run from ~11:30 am to 4:30 pm. You and your team may stay in the room, Wilkinson 126, or leave during the analysis period (to get food, analyze in another room, etc.). You and your team must be back in Wilkinson 126 by 4:30 pm to submit your presentation materials.
The dataset you will be working with is an Injury/Illness Summary - Casualty Data. The dataset is updated every month and released by the Department of Transportation. Every observation records one railroad-related casualty from the last 6 years; the DoT defines casualties as "reportable fatalities, or illnesses arising from the operation of a railroad." Each observation includes, among other details, information about the railroad name and code, date, time, state, and type of person. The dataset includes 37,600 records across 70 columns and uniquely identifies observations by the Incident Number column.
Data Source: Department of Transportation (DoT)
Last Updated: April 1, 2026
Update Frequency: Monthly
Each row represents a single casualty.
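
The description above gives a record count, a column count, and a uniquely identifying column, all of which are worth confirming after download rather than assuming. Below is a minimal sketch using only the Python standard library. The function names `read_rows` and `summarize` are hypothetical, and it assumes the archive contains a single UTF-8 CSV file; the exact spelling of the header (for example `incident_number` versus `Incident Number`) is also an assumption to verify against the file itself.

```python
import csv
import io
import zipfile
from collections import Counter


def read_rows(zip_path, member=None):
    """Read one CSV file from a zip archive into a list of dicts (one per row)."""
    with zipfile.ZipFile(zip_path) as archive:
        if member is None:
            # Assumption: the archive holds a single CSV; take the first one found.
            member = next(n for n in archive.namelist() if n.lower().endswith(".csv"))
        with archive.open(member) as raw:
            # utf-8-sig strips a byte-order mark if the export includes one.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text))


def summarize(rows, id_column):
    """Return (row count, column count, {id: count} for ids seen more than once)."""
    counts = Counter(row[id_column] for row in rows)
    duplicates = {key: n for key, n in counts.items() if n > 1}
    n_columns = len(rows[0]) if rows else 0
    return len(rows), n_columns, duplicates
```

A possible use is `rows = read_rows("Injury_Illness_Summary_-_Casualty_Data.zip")` followed by `summarize(rows, "incident_number")`. If the returned dictionary of duplicates is not empty, the identifier column alone does not distinguish every row, and a combination of columns may be needed instead.
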
| variable | data_type | description |
|---|---|---|
| railroad_code | text | Unique code identifying the reporting railroad |
| railroad_name | text | Full name of the reporting railroad |
| pdf_report | text | Link or reference to the associated PDF incident report |
| incident_number | text | Unique number assigned to the casualty incident |
| incident_year | numeric | Year in which the casualty incident occurred |
| incident_month | numeric | Month in which the casualty incident occurred |
| incident_day | numeric | Day of the month in which the casualty incident occurred |
| date | timestamp | Full date of the casualty incident |
| time | text | Time of day the casualty incident occurred |
| county_code | text | Code identifying the county where the incident occurred |
| county_name | text | Name of the county where the incident occurred |
| state_code | text | Code identifying the state where the incident occurred |
| state_name | text | Name of the state where the incident occurred |
| type_of_person_code | text | Code identifying the classification of the person involved (Classes A–J) |
| type_of_person | text | Description of the person classification (e.g., Worker on Duty–Railroad Employee, Trespasser, Passenger) |
| employee_job_code | text | Code identifying the job type of the railroad employee involved, if applicable |
| employee_job_description | text | Description of the employee's job role, if applicable |
| age_of_person | numeric | Age of the person involved in the casualty incident |
| positive_alcohol_tests | numeric | Number of positive alcohol tests recorded in connection with the incident |
| positive_drug_tests | numeric | Number of positive drug tests recorded in connection with the incident |
| injury_illness_code | text | Code identifying the type of injury or illness sustained |
| nature_of_injury | text | Description of the nature of the injury or illness sustained |
| location_of_injury_on_body | text | Body part or region where the injury was sustained |
| specific_location | text | More granular description of the injury location on the body |
| injury_illness | text | General classification of whether the casualty is an injury or illness |
| physical_act_circumstances_code | text | Code describing the physical act or circumstance contributing to the injury |
| physical_act_circumstances | text | Description of the physical act or circumstance contributing to the injury |
| general_location_of_person_code | text | Code identifying the general physical location of the person at time of incident |
| general_location_of_person | text | Description of the general physical location of the person at time of incident |
| on_track_equipment_code | text | Code identifying the type of on-track equipment involved, if applicable |
| on_track_equipment | text | Description of the on-track equipment involved, if applicable |
| specific_location_of_person_code | text | Code identifying the specific location of the person relative to railroad operations at time of incident |
| specific_location_of_person | text | Description of the specific location of the person at time of incident |
| event_code | text | Code identifying the event or activity the person was engaged in at time of incident |
| event | text | Description of the event or activity the person was engaged in at time of incident |
| tools_code | text | Code identifying any tools involved in the incident, if applicable |
| tools | text | Description of any tools involved in the incident, if applicable |
| injury_cause_code | text | Code identifying the cause of the injury or illness |
| injury_cause | text | Description of the cause of the injury or illness |
| days_away_from_work | numeric | Number of days the injured/ill person was away from work as a result of the incident |
| days_restricted_activity | numeric | Number of days the person had restricted work activity as a result of the incident |
| hazmat_exposure | text | Indicator of whether the casualty involved hazardous material exposure |
| covered_data_code | text | Code indicating whether certain data fields are withheld from public release |
| covered_data_reason | text | Explanation for why certain data fields are withheld from public release |
| latitude | numeric | Latitude coordinate of the incident location |
| longitude | numeric | Longitude coordinate of the incident location |
| narrative | text | Free-text description of the circumstances surrounding the casualty incident |
| employee_suspension | text | Indicator of whether the involved employee was suspended in connection with the incident |
| district | text | Railroad operating district where the incident occurred |
| fatality | text | Indicator of whether the casualty resulted in a fatality |
| form_57_filed | text | Indicator of whether a highway-rail grade crossing accident report (Form 6180.57) was filed for this incident |
| form_54_filed | text | Indicator of whether a rail equipment accident report (Form 6180.54) was filed for this incident |
| class_code | text | Code identifying the classification of the person involved (corresponds to Classes A–J) |
| class | text | Description of the person classification corresponding to the class code |
| casualty_occurrence_code | text | Code describing the circumstances under which the casualty occurred |
| equipment_movement_code | text | Code describing the movement status of on-track equipment at the time of the incident |
| report_key | text | Unique key identifying the associated accident or incident report |
| reporting_railroad_smt_grouping | text | SMT (Safety Management Team) grouping assigned to the reporting railroad by the FRA |
| reporting_parent_railroad_code | text | Code identifying the parent railroad of the reporting railroad |
| reporting_parent_railroad_name | text | Name of the parent railroad of the reporting railroad |
| reporting_railroad_holding_company | text | Name of the holding company that owns the reporting railroad |
| geocode | text | Geographic code associated with the incident location |
| incident_key | text | Unique identifier for the incident, constructed from railroad code, incident number, incident year, and incident month |
| reporting_railroad_individual_class | text | FRA size classification of the reporting railroad as an individual entity |
| reporting_railroad_passenger | text | Indicator of whether the reporting railroad operates passenger service |
| reporting_railroad_commuter | text | Indicator of whether the reporting railroad operates commuter service |
| reporting_railroad_switching_terminal | text | Indicator of whether the reporting railroad operates as a switching and terminal railroad |
| reporting_railroad_tourist | text | Indicator of whether the reporting railroad operates tourist service |
| reporting_railroad_freight | text | Indicator of whether the reporting railroad operates freight service |
| reporting_railroad_short_line | text | Indicator of whether the reporting railroad is classified as a short line railroad |
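
Because a CSV reader returns every field as text, the columns marked `numeric` in the table above need converting before any arithmetic. The following is a rough sketch of one way to do that and to tabulate a text column, again with the standard library only. The helper names (`to_number`, `coerce_numeric`, `count_by`) and the chosen subset of columns are illustrative assumptions, and how blanks or placeholder values actually appear in the file is not documented here, so the sketch simply treats anything non-numeric as missing.

```python
from collections import Counter

# A subset of the columns the data dictionary lists as numeric (for illustration).
NUMERIC_COLUMNS = ("incident_year", "incident_month", "age_of_person", "days_away_from_work")


def to_number(value):
    """Convert a CSV string to float; return None for blanks or non-numeric text."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def coerce_numeric(row, columns=NUMERIC_COLUMNS):
    """Return a copy of the row with the listed columns converted to numbers."""
    typed = dict(row)
    for name in columns:
        typed[name] = to_number(row.get(name))
    return typed


def count_by(rows, column):
    """Count rows per distinct value of a column, most common first."""
    return Counter(row.get(column) or "(missing)" for row in rows).most_common()
```

For instance, `count_by(rows, "type_of_person")` gives a quick view of how casualties are distributed across person classifications, which can help when narrowing down a question.
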
- You may not put this data into an LLM.
- Analysis in this competition should reflect your own work; although we cannot prohibit AI, you should not depend on it in your analysis. Obvious or extreme usage of LLMs will be considered in judging.
- Your team must include citations for forums (Stack Overflow), textbooks, and other references.
- You must provide a link to any AI chat conversations used.
- You must list all citations (including the original study’s citation found at the bottom of this README) at the end of your code notebook. Failure to do so will result in a large point deduction during judging.
- You may reuse code from your own research or class projects.
- Citations should be listed in the code notebook file, presentation media, or a separate document.
- You must answer a question that your team chooses.
- All types of data modeling are allowed.
- All coding languages are allowed.
- If previous research has been done on this topic, you may not copy the study.
- You and your team should submit the following to your Google Drive folder:
  - Your code notebook file(s) (QMD, R script, Jupyter, etc.)
  - Presentation media (slides, PDF, link to an HTML, visualizations, etc.)
  - Citations to all outside resources: please read the Code of Conduct for more information
  - A link to your forked GitHub repository
- Your team’s Google Drive folder contents will not be accessible to view, comment on, or edit by any other team.
- Winners will be announced by April 11th.
- As stated earlier, you must come up with a question that your team will use to guide your statistical analysis.
- Submissions will be evaluated fairly by the Statistical Science Majors Union's Competitions and Opportunities Committee.
- The judging criteria are as follows:
  - Communication: Clarity of presentation, logical structure, and ability to explain findings to a broad audience. - 20%
  - Creativity & Insight: Depth of analysis, creativity in approach, and ability to identify meaningful patterns. - 40%
  - Reproducibility: Well-organized code, clear documentation, and transparent workflows. - 5%
  - Visualization & Design: Effective use of charts and visuals to support your conclusions. - 25%
  - Technical Soundness: Appropriate use of statistical methods, reasonable assumptions, and valid interpretations. - 10%
- This will be prioritized more than the experience level of the statistical methods your team members used. This will account for differing experience levels across competitors.
This project utilizes the Injury/Illness Summary - Casualty Data dataset, maintained and publicly released by the Department of Transportation through <data.transportation.gov>. We are grateful to the Department of Transportation for making this valuable public resource freely available for research and analysis.
Federal Railroad Administration. (2026). Injury/Illness Summary - Casualty Data (Form 55a) - 6 Year View [Data set]. U.S. Department of Transportation. Retrieved from https://data.transportation.gov/dataset/Injury-Illness-Summary-Casualty-Data-Form-55a-6-Ye/rash-pd2d/about_data
This citation must be included at the end of your code notebook.
Data Mini #3 was organized by Duke's Statistical Science Majors Union - Competitions and Opportunities Committee. Contributing members of the committee and the SSMU executive team are listed below:
- Hanna Chee, ’29
- Trisha Iyer, ’28
- Hyunjin Lee, ’27
- Susan Li, ’29
- Phillip Lin, ’27
- Liane Ma, ’27
- Chelsea Nguyen, ’28
- Cooper Ruffing, ’27
- Derek Wang, ’28
- Amy Xu, ’27
- Allison Yang, ’27
- And an additional thank-you goes out to the professors who helped advertise the event.