[Dataset] MathDial Corpus (math word problem tutoring dialogs)
Dataset name
MathDial (ConvoKit format)
Brief description
A collection of teacher–student tutoring dialogues focused on solving math word problems.
This dataset is adapted from the original MathDial corpus and converted into the ConvoKit format.
It contains both the original 4-way tutor intents and fine-grained 11-way tutor intents (for teacher turns).
Dataset details
Speaker-level
Speakers are either teachers or students. Some conversations include named students (e.g., Cody, Mariana), while others use the generic “Student.”
Each speaker is represented consistently within a conversation.
- id: <Name>_<conversation_id> (e.g., Teacher_conv12, DeAndre_conv284)
- meta.role: normalized role, either "teacher" or "student"
- meta.role_raw: the original value from the TSV (e.g., "Teacher", "Student", "DeAndre")
- meta.conversation_id: conversation this speaker belongs to (e.g., conv284)
- meta.split: dataset split (train, val, test)
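Because speaker IDs follow the `<Name>_<conversation_id>` pattern, the name and the conversation can be recovered from the ID alone. The helper below is a minimal sketch: `split_speaker_id` is a hypothetical name, not part of ConvoKit, and it assumes conversation IDs such as `conv284` never contain an underscore.

```python
from typing import Tuple


def split_speaker_id(speaker_id: str) -> Tuple[str, str]:
    """Split '<Name>_<conversation_id>' into (name, conversation_id).

    Splits on the LAST underscore. Assumption: conversation IDs
    (e.g. 'conv284') contain no underscore, while names might.
    """
    name, sep, conversation_id = speaker_id.rpartition("_")
    if not sep or not name or not conversation_id:
        raise ValueError(f"not a '<Name>_<conversation_id>' ID: {speaker_id!r}")
    return name, conversation_id


print(split_speaker_id("DeAndre_conv284"))  # → ('DeAndre', 'conv284')
```

The same information is also stored in meta.role_raw and meta.conversation_id, so parsing the ID is mainly useful when only the ID string is at hand.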
Utterance-level
Each conversational turn is an utterance.
- id: global utterance identifier
- speaker: speaker who produced the utterance
- conversation_id: conversation identifier (e.g., conv0, conv1, …)
- reply_to: previous utterance in the thread (None if start)
- timestamp: not available
- text: textual content of the utterance
Metadata for each utterance includes:
- intent_4: coarse 4-way intent label
- intent_11: fine 11-way intent label (teacher turns only)
- qid: problem identifier
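The utterance fields above are enough to rebuild a readable transcript. The sketch below follows the reply_to links and prints each turn with its role and fine-grained intent. `in_reply_order` and `format_turn` are hypothetical helpers, and the code assumes each conversation is a single linear thread and that turns without a fine-grained label store None or omit the field. The demo objects are stand-ins with placeholder text, not real corpus data.

```python
from types import SimpleNamespace
from typing import Iterable, List


def in_reply_order(utterances: Iterable) -> List:
    """Order one conversation's utterances by following reply_to links.

    Assumption: a single linear thread, i.e. one utterance has
    reply_to == None and every other one replies to a distinct predecessor.
    """
    by_parent = {utt.reply_to: utt for utt in utterances}
    ordered = []
    current = by_parent.get(None)  # the thread start
    while current is not None:
        ordered.append(current)
        current = by_parent.get(current.id)  # the turn replying to this one
    return ordered


def format_turn(utt) -> str:
    """Render one turn as '[role | intent_11] text'."""
    role = utt.speaker.meta.get("role", "?")
    intent = utt.meta.get("intent_11") or "-"  # "-" when no label is present
    return f"[{role} | {intent}] {utt.text}"


# Stand-in objects that mimic the attributes described above.
teacher = SimpleNamespace(meta={"role": "teacher"})
student = SimpleNamespace(meta={"role": "student"})
demo = [
    SimpleNamespace(id="u1", reply_to="u0", speaker=student,
                    text="(student turn)", meta={"intent_11": None}),
    SimpleNamespace(id="u0", reply_to=None, speaker=teacher,
                    text="(teacher turn)", meta={"intent_11": "Seek Strategy"}),
]
for utt in in_reply_order(demo):
    print(format_turn(utt))
```

With a loaded corpus, the same helpers should accept the utterances of one conversation, for example `corpus.get_conversation("conv0").iter_utterances()`.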
Conversation-level
Metadata for each conversation includes:
- conversation_id: global string identifier (conv0, conv1, …)
- split: which split (train, val, test)
- qid: problem ID
- scenario: description of the problem context
- question: math problem text
- ground_truth: correct solution
- student_incorrect_solution: incorrect solution given by the student
- student_profile: information about the student (if available)
- teacher_described_confusion, self-correctness, self-typical-confusion, self-typical-interactions: additional annotation fields
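Since every conversation carries a split field, the train, val, and test partitions can be recovered by grouping on it. The sketch below uses a hypothetical helper, `conversation_ids_by_split`, and assumes each conversation object exposes `.id` and a dict-like `.meta`, as ConvoKit conversations do. The demo objects are stand-ins, not real corpus entries.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Dict, Iterable, List


def conversation_ids_by_split(conversations: Iterable) -> Dict[str, List[str]]:
    """Group conversation IDs by their meta['split'] value."""
    groups = defaultdict(list)
    for conv in conversations:
        groups[conv.meta["split"]].append(conv.id)
    return dict(groups)


# Stand-in conversations; the split assignments here are made up.
demo_conversations = [
    SimpleNamespace(id="conv0", meta={"split": "train"}),
    SimpleNamespace(id="conv1", meta={"split": "test"}),
    SimpleNamespace(id="conv2", meta={"split": "train"}),
]
print(conversation_ids_by_split(demo_conversations))
```

With a loaded corpus, passing `corpus.iter_conversations()` should give the real partition.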
Corpus-level
Licensing information
The MathDial dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
Citation
Petukhova, Kseniia, and Ekaterina Kochmar. "Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation." arXiv preprint arXiv:2506.07626, 2025.
Contact
Jadon Geathers — jag569@cornell.edu
Access
Here is a link to the zipped corpus:
mathdial.zip
Statistics
- Conversations: 521
- Utterances: 8,466
- Speakers: 1,043
Top fine-grained (11-way) tutor intents:
| Intent | Count |
|---|---|
| Revealing Strategy | 1141 |
| Revealing Answer | 895 |
| Guiding Student Focus | 687 |
| Seek Strategy | 658 |
| Asking for Explanation | 653 |
| Seeking Self Correction | 643 |
| Seeking World Knowledge | 257 |
| Greeting/Farewell | 217 |
| Recall Relevant Information | 93 |
| Perturbing the Question | 89 |
| General Inquiry | 40 |
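A tally like the one above can be recomputed from the utterance metadata. The sketch below counts intent_11 labels with a hypothetical helper, `tally_intents`. It assumes unlabeled turns (for example, student turns) store None, an empty value, or no intent_11 field at all. The demo utterances are stand-ins.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Iterable


def tally_intents(utterances: Iterable, key: str = "intent_11") -> Counter:
    """Count intent labels stored under utterance.meta[key]."""
    counts = Counter()
    for utt in utterances:
        label = utt.meta.get(key)
        if label:  # skip missing, None, and empty labels
            counts[label] += 1
    return counts


# Stand-in utterances: two labeled teacher turns and one unlabeled turn.
demo_utterances = [
    SimpleNamespace(meta={"intent_11": "Revealing Strategy"}),
    SimpleNamespace(meta={"intent_11": None}),
    SimpleNamespace(meta={"intent_11": "Revealing Strategy"}),
    SimpleNamespace(meta={"intent_11": "Seek Strategy"}),
]
for label, count in tally_intents(demo_utterances).most_common():
    print(f"{label}: {count}")
```

With a loaded corpus, `tally_intents(corpus.iter_utterances())` should yield counts comparable to the table, and passing `key="intent_4"` would tally the coarse labels instead.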
Example usage
from convokit import Corpus
corpus = Corpus("PATH_TO/mathdial")
corpus.print_summary_stats()