Description
As mentioned below, we wish to implement a generic evaluation method: "for one dataset, run all metrics, do a correlation."
Developing this for ASL Citizen would be good. We could then use it to run both the SignCLIP embeddings metric and the distance metrics in #4.
So, what this file does is get distances (specifically SignCLIP) for ASL Citizen?
That is a start, but it would be best if we could do the following: given a directory of poses in various classes, for example
poses/class/X, iterate over all of the metrics and run them to calculate the k-nearest neighbors for each sample, for classification.
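The per-metric classification step could be sketched roughly like this - a leave-one-out k-NN accuracy for a single distance metric. This is a minimal illustration, not code from the repo; `knn_accuracy` and the toy Euclidean metric below are hypothetical names, and real inputs would be pose objects rather than numbers:

```python
from collections import Counter


def knn_accuracy(samples, labels, distance_fn, k=1):
    """Leave-one-out k-NN classification accuracy for one distance metric."""
    correct = 0
    for i, query in enumerate(samples):
        # Distances from the held-out sample to every other sample
        neighbors = sorted(
            (distance_fn(query, other), labels[j])
            for j, other in enumerate(samples)
            if j != i
        )
        # Majority vote among the k nearest neighbors
        votes = Counter(label for _, label in neighbors[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(samples)


# Toy usage: two well-separated "classes" of 1-D samples
samples = [0.0, 0.1, 1.0, 1.1]
labels = ["a", "a", "b", "b"]
accuracy = knn_accuracy(samples, labels, lambda x, y: abs(x - y), k=1)
```

Running the loop over a dict of `{metric_name: distance_fn}` would then give one accuracy per metric, which is exactly the leaderboard-style comparison described above.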
Then, once we have around 8 metrics, we can run them all and see which one gives the best classification score (that one would be considered the best metric for form-based comparison of single signs) - see https://github.com/sign-language-processing/signwriting-evaluation/blob/main/signwriting_evaluation/evaluation/closest_matches.py#L93-L107
Another example would be to have a directory of poses poses-dgs/ where each pose has a .txt file associated with it. Let's assume 1000 sentences in German Sign Language and German.
Then, we can perform an all-to-all similarity between the poses, and an all-to-all similarity between the texts (using xCOMET, for example), and perform a correlation study. Whichever metric correlates best with xCOMET is the best metric for semantic sentence comparison. What I am trying to say is: we develop a generic evaluation that says something like "for one dataset type, run all metrics, correlate with something",
and then we can perform this on many datasets and inform the reader about the best metrics.
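The correlation part could be sketched as follows: flatten the off-diagonal entries of the two all-to-all similarity matrices and compute a Pearson correlation between them. The matrices here are toy inputs; in practice one would come from a pose metric and the other from a text metric such as xCOMET, and `metric_correlation` is a hypothetical helper name:

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def metric_correlation(pose_sims, text_sims):
    """Correlate the off-diagonal entries of two n x n similarity matrices."""
    n = len(pose_sims)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    xs = [pose_sims[i][j] for i, j in pairs]
    ys = [text_sims[i][j] for i, j in pairs]
    return pearson(xs, ys)
```

The metric whose similarity matrix yields the highest correlation against the text-side matrix would be reported as the best semantic metric for that dataset.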
Then, when someone comes and says "I developed a new metric", they run it on everything, like GLUE basically, and we can see the upsides and downsides.
Originally posted by @AmitMY in #5 (comment)