Loading dediacritic tool fails due to Emoji library dependency, and tokenizer model_max_length seems incorrect #5

@ghost

Description

Hello,

  1. The dediacritic tool fails to load in Google Colab with Python 3.7. I tried manually modifying the Emoji library, but that did not help.
import os  # needed for os.environ below

from google.colab import output
output.enable_custom_widget_manager()

from google.colab import drive
drive.mount('/content/drive/')

!pip install camel-tools==1.4.1 -f https://download.pytorch.org/whl/torch_stable.html
# Point CAMeL Tools at the data directory before downloading the data
os.environ['CAMELTOOLS_DATA'] = '/content/drive/MyDrive/SAAL/EnAr/CAMeL'
!camel_data -i all

from camel_tools.utils.dediac import dediac_ar

[Image: screenshot attached to the original issue]

  2. After loading the tokenizer with Hugging Face's AutoTokenizer, I have to set the tokenizer's model_max_length manually to 512; otherwise the value is an extremely large integer (> 1e10).
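
For reference, the manual workaround from the second point could look roughly like the sketch below. The helper name cap_model_max_length and the 1e10 threshold are my own choices for illustration, not part of transformers or CAMeL Tools, and the SimpleNamespace object only stands in for a real tokenizer so that the snippet runs without downloading a model.

```python
from types import SimpleNamespace


def cap_model_max_length(tokenizer, limit=512, threshold=int(1e10)):
    """Reset an implausibly large model_max_length to a usable limit.

    Hypothetical helper: `limit` and `threshold` are assumptions taken
    from the values mentioned in this issue, not library defaults.
    """
    if tokenizer.model_max_length > threshold:
        tokenizer.model_max_length = limit
    return tokenizer


# Stand-in for a tokenizer whose config does not define a max length.
fake_tokenizer = SimpleNamespace(model_max_length=int(1e30))
cap_model_max_length(fake_tokenizer)
print(fake_tokenizer.model_max_length)

# A tokenizer that already has a sensible limit is left untouched.
ok_tokenizer = SimpleNamespace(model_max_length=256)
cap_model_max_length(ok_tokenizer)
print(ok_tokenizer.model_max_length)
```

With a real tokenizer, the same helper would be applied to the object returned by AutoTokenizer.from_pretrained. As far as I know, passing model_max_length=512 directly to from_pretrained should have the same effect, but I have not verified that against this model.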
