A bit of confusion surrounding assumptions for score_quality.py #23

I've run into a snag understanding what `score_quality.py` expects. Specifically, the following section seems to be constructed incorrectly for certain metrics:

https://github.com/hlt-mt/simulstream/blob/93c51b4dcdc54fd5bf1d3991a37ea18a7c57f5b1/simulstream/metrics/score_quality.py#L77-L95

When using a metric that relies on both the references and the transcripts (e.g. COMET with references), this seems to work only when the provided reference and transcript files are named almost identically, or at least when their `Path(...).stem` values match, since that appears to be what [`ReferencesReader`](https://github.com/hlt-mt/simulstream/blob/main/simulstream/metrics/readers.py#L277) uses as the key. Compared against the in-file documentation in `cli_main()`, there appears to be a mismatch:

https://github.com/hlt-mt/simulstream/blob/93c51b4dcdc54fd5bf1d3991a37ea18a7c57f5b1/simulstream/metrics/score_quality.py#L122-L127
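
To make the mismatch concrete, here is a minimal sketch that applies `Path(...).stem` to the two file names from that documented example (`ref.en` and `src.it`). It assumes the readers key their dictionaries on the file stem; `stem_key` is a hypothetical helper for illustration, not part of simulstream.

```python
from pathlib import Path


def stem_key(file_name: str) -> str:
    """Hypothetical helper: the key a reader would use if it keys on the file stem."""
    return Path(file_name).stem


# File names taken from the documented example command.
reference_key = stem_key("ref.en")
transcript_key = stem_key("src.it")

# Prints: ref src False
print(reference_key, transcript_key, reference_key == transcript_key)
```

Under that assumption, the documented file names can never share a key, so a lookup by the reference key would miss the transcript entry.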

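In practice, what seems to happen is that `audio_files_to_score` ends up holding only the keys from `reference_dictionary`, because the assignment on L82 overwrites the transcript keys whenever a reference is provided. If the file stems of the reference and the transcript aren't identical, the lookup on L88 then fails with something like: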

```
transcript = transcript_dictionary[audio_name]
KeyError: 'ref'
```
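
The failure can be reproduced without simulstream at all. The sketch below uses plain dictionaries as stand-ins for what the two readers return, assuming the keys are file stems; the sentences are made-up placeholders, and only the overwrite-then-lookup pattern mirrors the linked section.

```python
# Stand-ins for the readers' output, assuming keys are file stems (placeholder text).
transcript_dictionary = {"src": ["ciao mondo"]}
reference_dictionary = {"ref": ["hello world"]}

# Same order as the linked section: the reference keys replace the transcript keys.
audio_files_to_score = transcript_dictionary.keys()
audio_files_to_score = reference_dictionary.keys()

missing_keys = []
for audio_name in audio_files_to_score:
    try:
        transcript = transcript_dictionary[audio_name]
    except KeyError as error:
        # Record the key that the transcript dictionary does not contain.
        missing_keys.append(error.args[0])

# Prints: ['ref']
print(missing_keys)
```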

Have I misunderstood how this scorer is intended to be used? Does the documentation perhaps need to be updated or do L77-L95 need to be revised?
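
If a revision turns out to be the right direction, one possible shape is sketched below. This is only an illustration under the assumption that both dictionaries are meant to share keys; `select_audio_names` is a hypothetical helper, not existing simulstream code, and the maintainers may prefer a different approach (for example, pairing files by position instead of by name).

```python
from typing import Dict, List, Optional


def select_audio_names(
    transcript_dictionary: Optional[Dict[str, List[str]]],
    reference_dictionary: Optional[Dict[str, List[str]]],
) -> List[str]:
    """Hypothetical helper: choose the keys to score, failing early on a mismatch."""
    if transcript_dictionary is not None and reference_dictionary is not None:
        transcript_keys = set(transcript_dictionary)
        reference_keys = set(reference_dictionary)
        if transcript_keys != reference_keys:
            # Report both sides so the naming problem is visible immediately.
            raise ValueError(
                "References and transcripts must share the same names; "
                f"only in references: {sorted(reference_keys - transcript_keys)}, "
                f"only in transcripts: {sorted(transcript_keys - reference_keys)}"
            )
        return list(reference_dictionary)
    if reference_dictionary is not None:
        return list(reference_dictionary)
    if transcript_dictionary is not None:
        return list(transcript_dictionary)
    return []


try:
    select_audio_names({"src": ["ciao mondo"]}, {"ref": ["hello world"]})
except ValueError as error:
    print(error)
```

With a check like this, mismatched names would surface as an explicit error message rather than a bare `KeyError`.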




    if transcripts_reader is not None:
        transcript_dictionary = transcripts_reader.get_reference_texts()
        audio_files_to_score = transcript_dictionary.keys()
    if reference_reader is not None:
        reference_dictionary = reference_reader.get_reference_texts()
        audio_files_to_score = reference_dictionary.keys()

    scoring_samples = []
    for audio_name in audio_files_to_score:
        transcript = None
        if transcript_dictionary is not None:
            transcript = transcript_dictionary[audio_name]
        reference = None
        if reference_dictionary is not None:
            reference = reference_dictionary[audio_name]
        if transcript is not None and reference is not None:
            assert len(reference) == len(transcript), \
                f"Reference ({audio_name}) has mismatched number of target ({len(reference)}) " \
                f"and source lines ({len(transcript)})"

    $ python -m simulstream.metrics.score_quality \\
        --eval-config config/speech-processor.yaml \\
        --log-file metrics.jsonl \\
        --references ref.en \\
        --transcripts src.it \\
        --scorer sacrebleu
