In particular, the following fields are not available in the new format:
- full_name_rg

The following fields have been renamed:
- full_name_ro -> full_name
- lcd -> lang_cd
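The field changes listed above can be sketched as a small record conversion. This is only an illustration, not GeoTest's actual code: the helper name `to_new_format` and the dictionary-based record are assumptions, and only the three field names come from this comment.

```python
# Field names are taken from the list above; the helper itself is hypothetical.
RENAMED_FIELDS = {"full_name_ro": "full_name", "lcd": "lang_cd"}
REMOVED_FIELDS = {"full_name_rg"}


def to_new_format(record):
    """Return a copy of an old-format record keyed by the new field names."""
    converted = {}
    for key, value in record.items():
        if key in REMOVED_FIELDS:
            continue  # no counterpart in the new format
        converted[RENAMED_FIELDS.get(key, key)] = value
    return converted
```

Fields that are neither renamed nor removed pass through unchanged.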
I am uploading a partial result, since the computation is still running (transliteration system detection is an expensive operation) and I want you to have the results to look at as soon as possible. I will update it when it finishes. This dataset contains directories named after the countries, each containing two files. One is "result.txt", which contains the output summary of GeoTest. The second is "errors.tsv", which lists the individual issues (as described in the post above). The temporary dataset is missing "errors.tsv" for the following countries:
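Given the layout described above (one directory per country, each holding "result.txt" and "errors.tsv"), the countries that still lack an error file can also be found programmatically. A minimal sketch, assuming the dataset has been unpacked into a local directory; the function name and the `dataset_root` parameter are hypothetical.

```python
from pathlib import Path


def countries_missing_errors(dataset_root):
    """Return the names of country directories that contain no errors.tsv."""
    root = Path(dataset_root)
    missing = []
    # Each immediate subdirectory is assumed to be named after a country.
    for country_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if not (country_dir / "errors.tsv").is_file():
            missing.append(country_dir.name)
    return missing
```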
The computation has completed. Unfortunately, the output doesn't fit within GitHub's 25 MB attachment size limit, and GitHub attachments only support GZ compression, so I will split it into four files: _output2.tar.aa.gz

To extract this, use the command:

To create a single

Ultimately, this took almost 2 days to compute on a 16-core Ryzen 9 5950X, with each country processed in parallel (a single process used at most 20 GiB of RAM, but usually much less). I looked into potential optimizations but didn't find anything spectacular. The real bottleneck is the transliteration that has to be done for each transliteration system detection: if no hint is provided, we have to try every transliteration map.
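For the split attachment described above, reassembly could look roughly like the following. This is a sketch under an assumption that is not confirmed in this thread: each part is a separately gzipped slice of one tar archive, so the parts must be decompressed in order and concatenated. The function and parameter names are hypothetical, and the part file names are left to the caller.

```python
import gzip
import shutil
import tarfile


def extract_split_archive(part_paths, combined_tar, dest_dir):
    """Gunzip each part in order, concatenate into one tar file, then extract it."""
    with open(combined_tar, "wb") as out:
        for part in part_paths:
            # Assumption: every part is an independent gzip stream.
            with gzip.open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, out)
    with tarfile.open(combined_tar) as archive:
        archive.extractall(dest_dir)
```

The parts have to be passed in the order in which the archive was split, otherwise the combined tar will be corrupt.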
See also: #1, interscript/geonames-transliteration-data#12
Do note that GeoTest doesn't depend on interscript/geonames-transliteration-data, so interscript/geonames-transliteration-data#12 is NOT fixed by this pull request.
The first commit makes GeoTest compatible with the new format.
The second commit introduces a way to generate an error file. An error file is a TSV file that is described in https://github.com/hmdne/geotest/blob/hmdne/new-format-error-file/errors_documentation.md . In particular, GeoTest gained the ability to infer a transliteration map if either the transliteration is not correct (except for punctuation, spacing and casing errors, which are just displayed as errors) or transl_cd is empty.

I will provide a result of this computation as soon as it completes.
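The inference rule described above can be illustrated with a short sketch. It is not GeoTest's implementation: the function names are hypothetical, the candidate maps are plain Python callables standing in for real transliteration maps, and `normalize` only handles ASCII punctuation.

```python
import string


def normalize(text):
    """Drop ASCII punctuation and whitespace and fold case before comparing."""
    drop = set(string.punctuation) | set(string.whitespace)
    return "".join(ch for ch in text if ch not in drop).casefold()


def infer_maps(source, expected, candidate_maps):
    """Return the names of all candidate maps whose output equals `expected`."""
    return [name for name, transliterate in candidate_maps.items()
            if transliterate(source) == expected]


def check(source, expected, transl_cd, candidate_maps):
    """Classify a record as 'ok', 'form_error', or ('inferred', [map names])."""
    if transl_cd and transl_cd in candidate_maps:
        actual = candidate_maps[transl_cd](source)
        if actual == expected:
            return "ok"
        # Punctuation, spacing and casing differences are only reported.
        if normalize(actual) == normalize(expected):
            return "form_error"
    # Wrong transliteration or empty transl_cd: try every map.
    return ("inferred", infer_maps(source, expected, candidate_maps))
```

Because the fallback runs every candidate map over the name, the cost grows with the number of maps.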