Possible issue with UTF-8 encoding under Windows while reading dataset #80

@sstiene

I am trying to load a dataset with agml on Windows 11 with Python 3.12, but I get an error.

How to reproduce it:

import agml

dataset_name = 'carrot_weeds_germany'
loader = agml.data.AgMLDataLoader(dataset_name)
dataset_path = loader.dataset_root

print(f"Dataset downloaded to: {dataset_path}")

The error is:

Traceback (most recent call last):
  File "c:\Users\stefa\HSOS\Vorbereitung Einführung in die KI (BAT) - General\Praktika\label_studio_51\agml_test.py", line 4, in <module>
    loader = agml.data.AgMLDataLoader(dataset_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\stefa\cv-env\Lib\site-packages\agml\data\loader.py", line 149, in __init__
    self._info = make_metadata(dataset, kwargs.get('meta', None))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\stefa\cv-env\Lib\site-packages\agml\data\metadata.py", line 49, in make_metadata
    return DatasetMetadata(name)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\stefa\cv-env\Lib\site-packages\agml\data\metadata.py", line 105, in __init__
    self._load_source_info(name)
  File "C:\Users\stefa\cv-env\Lib\site-packages\agml\data\metadata.py", line 148, in _load_source_info
    **load_citation_sources()[name], dataset = name)
      ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\stefa\cv-env\Lib\site-packages\agml\utils\data.py", line 38, in load_citation_sources
    return json.load(f)
           ^^^^^^^^^^^^
  File "C:\Users\stefa\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
                 ^^^^^^^^^
  File "C:\Users\stefa\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 22968: character maps to <undefined>

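So it seems that there is an encoding issue when reading the citation sources under Windows: the traceback shows the file being decoded with cp1252 (the Windows default) instead of UTF-8. I tested it with different datasets and got the same error.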
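
A minimal sketch of what I suspect is going on, and of a possible fix. It does not use agml itself: the file name `citation_sources.json`, the sample content, and the helper `load_json_utf8` are all hypothetical stand-ins. The idea is that a UTF-8 file containing a right curly quote (U+201D, whose UTF-8 encoding ends in byte 0x9d) cannot be decoded as cp1252, while passing `encoding="utf-8"` to `open()` reads it fine on any platform.

```python
import json
import os
import tempfile


def load_json_utf8(path):
    """Load a JSON file, decoding it as UTF-8 regardless of the OS locale.

    Hypothetical helper; it only illustrates passing an explicit encoding.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical stand-in for the citation data. U+201D encodes to the
# bytes E2 80 9D in UTF-8, and 0x9d has no mapping in cp1252.
sample = {"demo_dataset": {"citation": "A \u201cquoted\u201d title"}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "citation_sources.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)

    # Reproduce the failure by forcing the Windows default codec.
    try:
        with open(path, encoding="cp1252") as f:
            json.load(f)
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Reading with an explicit UTF-8 encoding succeeds.
    loaded = load_json_utf8(path)

print("cp1252 read failed:", cp1252_failed)
print("utf-8 read matches:", loaded == sample)
```

If this is indeed the cause, opening the file with `encoding="utf-8"` inside `load_citation_sources` would probably resolve it. Until then, running Python in UTF-8 mode (setting the environment variable `PYTHONUTF8=1` or starting with `python -X utf8`) might serve as a workaround, since it changes the default encoding used by `open()`.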
