Description
I'm running into a snag understanding how score_quality.py is expected to be used. Specifically, the following section seems to be incorrectly constructed for certain metrics:
simulstream/simulstream/metrics/score_quality.py
Lines 77 to 95 in 93c51b4
    if transcripts_reader is not None:
        transcript_dictionary = transcripts_reader.get_reference_texts()
        audio_files_to_score = transcript_dictionary.keys()
    if reference_reader is not None:
        reference_dictionary = reference_reader.get_reference_texts()
        audio_files_to_score = reference_dictionary.keys()

    scoring_samples = []
    for audio_name in audio_files_to_score:
        transcript = None
        if transcript_dictionary is not None:
            transcript = transcript_dictionary[audio_name]
        reference = None
        if reference_dictionary is not None:
            reference = reference_dictionary[audio_name]
        if transcript is not None and reference is not None:
            assert len(reference) == len(transcript), \
                f"Reference ({audio_name}) has mismatched number of target ({len(reference)}) " \
                f"and source lines ({len(transcript)})"
When using a metric that relies on both a reference and a transcript (e.g. COMET with references), this seems to work only when the provided reference and transcript files are named almost identically (or at least when Path(reference).stem, as used by ReferencesReader, is the same for both files). Comparing that against the in-file documentation in cli_main(), there appears to be a mismatch:
simulstream/simulstream/metrics/score_quality.py
Lines 122 to 127 in 93c51b4
    $ python -m simulstream.metrics.score_quality \\
        --eval-config config/speech-processor.yaml \\
        --log-file metrics.jsonl \\
        --references ref.en \\
        --transcripts src.it \\
        --scorer sacrebleu
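For reference, here is a quick check of the stems that the documented file names produce. It only uses `pathlib`; that the readers key their dictionaries by file stem is my reading of the code, not something I have confirmed:

```python
from pathlib import Path

# File names taken from the usage example in cli_main().
reference_file = "ref.en"
transcript_file = "src.it"

# Assumption: each reader uses Path(...).stem as the dictionary key.
reference_key = Path(reference_file).stem
transcript_key = Path(transcript_file).stem

print(reference_key, transcript_key)    # ref src
print(reference_key == transcript_key)  # False
```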
In practice, what seems to happen is that audio_files_to_score only holds the keys from reference_dictionary, because the assignment on L82 overwrites the one made from transcript_dictionary. If the file stems of the reference and transcript are not identical, the lookup on L88 then fails with something like:
    transcript = transcript_dictionary[audio_name]
    KeyError: 'ref'
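A minimal sketch that reproduces the same failure without simulstream. The two plain dictionaries are hypothetical stand-ins for what the readers return, assuming they are keyed by file stem; the sample sentences are made up:

```python
# Hypothetical stand-ins for the readers' get_reference_texts() results,
# assuming each dictionary is keyed by the input file's stem.
transcript_dictionary = {"src": ["riga uno", "riga due"]}
reference_dictionary = {"ref": ["line one", "line two"]}

# Mirrors L77-L82: the second assignment replaces the first.
audio_files_to_score = transcript_dictionary.keys()
audio_files_to_score = reference_dictionary.keys()

missing_key = None
for audio_name in audio_files_to_score:
    try:
        transcript = transcript_dictionary[audio_name]  # mirrors L88
    except KeyError as error:
        missing_key = error.args[0]
        print(f"KeyError: {missing_key!r}")
```

With these inputs the loop only ever sees the key `ref`, so the transcript lookup fails before the length assertion is reached.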
Have I misunderstood how this scorer is intended to be used? Does the documentation perhaps need to be updated or do L77-L95 need to be revised?