uncle-hedgehog edited this page Apr 9, 2011 · 2 revisions

Welcome to the stringdistances wiki!

We can use the main page for all things until we feel the need to clean up :)

Suggestions from Johannes

  • In the R code, we should make sure that the functions are always called with the shorter string before the longer string.

  • The arguments and their conversion to the form expected by the C distance functions have to be handled with care. They can easily become the bottleneck.

  • Re-think the conversion to wide strings. Maybe it's better to work with multi-byte data. Have a look at the Levenshtein implementation done for the PostgreSQL database: http://doxygen.postgresql.org/levenshtein_8c.html. They store the length of each multi-byte character, and if there are no multi-byte characters they just run over the chars, without transformation. Also, stopping criteria such as one string having zero length can be checked before any transformation is applied.
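
To make the suggestions above a bit more concrete, here is a rough C sketch of how they could fit together. It is not the package's code and not a copy of the PostgreSQL implementation. The names `utf8_char_len`, `utf8_count` and `lev_utf8` are made up for illustration, and the sketch assumes NUL-terminated, valid UTF-8 input. It checks the zero-length cases first, puts the shorter string first so the work rows stay small, and compares multi-byte characters in place using stored byte offsets instead of converting to wide strings.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Byte length of the UTF-8 character whose lead byte is c.
 * Assumption: input is valid UTF-8; stray bytes count as one char. */
static size_t utf8_char_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Number of characters in the first nbytes bytes of s. */
static size_t utf8_count(const char *s, size_t nbytes)
{
    size_t i = 0, n = 0;
    while (i < nbytes) {
        i += utf8_char_len((unsigned char)s[i]);
        n++;
    }
    return n;
}

/* Hypothetical helper: Levenshtein distance counted in characters.
 * Returns -1 if memory allocation fails. */
int lev_utf8(const char *a, const char *b)
{
    size_t abytes = strlen(a), bbytes = strlen(b);

    /* Stopping criteria, checked before any per-character setup. */
    if (abytes == 0) return (int)utf8_count(b, bbytes);
    if (bbytes == 0) return (int)utf8_count(a, abytes);

    size_t na = utf8_count(a, abytes), nb = utf8_count(b, bbytes);

    /* Shorter string first, so the rows have min(na, nb) + 1 entries. */
    if (na > nb) {
        const char *ts = a; a = b; b = ts;
        size_t tn = na; na = nb; nb = tn;
    }

    size_t *off = malloc((na + 1) * sizeof *off);
    int *prev = malloc((na + 1) * sizeof *prev);
    int *cur = malloc((na + 1) * sizeof *cur);
    if (!off || !prev || !cur) {
        free(off); free(prev); free(cur);
        return -1;
    }

    /* Store the byte offset of every character of a, once. */
    off[0] = 0;
    for (size_t i = 0; i < na; i++)
        off[i + 1] = off[i] + utf8_char_len((unsigned char)a[off[i]]);
    for (size_t i = 0; i <= na; i++)
        prev[i] = (int)i;

    size_t bpos = 0;
    for (size_t j = 1; j <= nb; j++) {
        size_t blen = utf8_char_len((unsigned char)b[bpos]);
        cur[0] = (int)j;
        for (size_t i = 1; i <= na; i++) {
            size_t alen = off[i] - off[i - 1];
            /* Characters match if they have the same bytes. */
            int same = alen == blen &&
                       memcmp(a + off[i - 1], b + bpos, blen) == 0;
            int best = prev[i - 1] + (same ? 0 : 1); /* substitute */
            if (prev[i] + 1 < best) best = prev[i] + 1;       /* delete */
            if (cur[i - 1] + 1 < best) best = cur[i - 1] + 1; /* insert */
            cur[i] = best;
        }
        int *tr = prev; prev = cur; cur = tr;
        bpos += blen;
    }

    int d = prev[na];
    free(off); free(prev); free(cur);
    return d;
}
```

With this sketch, `lev_utf8("kitten", "sitting")` gives 3, and the order of the arguments does not change the result because the swap happens inside. Whether the swap and the zero-length checks are better done here or already on the R side is an open question. Doing them in R would skip the call into C entirely for the trivial cases.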
