DobbiKov/sci-trans-git

Document translation tool

This is the main repository of the translation tool. Visit the CLI repo if you want to translate your document from the command line, and the LIB repo if you are interested in using the translation library.

  • Paper about the project: link
  • Poster about the project: link

TL;DR

We explore the specific challenges of authoring and maintaining multilingual computational scientific narratives, like course notes, textbooks, or reference manuals, and the design space for leveraging adaptive machine translation to assist authors.

Motivation and concept

With the advancement of automated translation, there now exists a plethora of tools and services for translating documents. These tools are well suited for one-shot translations: author in one language; machine translate; proofread and post-edit. Consider now a large document that evolves over a long period of time, say course notes that one wants to provide in, e.g., French and English, and maybe some other language. The above workflow is no longer suitable:

  1. The high-value human effort of proofreading -- in particular in terms of choice of style and terminology -- is lost at each iteration.
  2. The authors may want to improve the document alternately in one language or the other, and propagate the improvements to the other languages.

Instead, one wants workflows where changes in one language can be propagated to the other languages, not only leaving the rest of the text unchanged, but also exploiting it as a source of aligned chunks of translated and proofread text to guide the style and terminology of the translation (Adaptive Machine Translation). Seamless integration into the authoring environment and workflow is also desirable.

With the advent and large-scale availability of (adaptive) machine translation, LLMs, few-shot learning, and RAG, the time should be ripe to leverage these technologies to build open-source, sovereign, and privacy-preserving tools that support such workflows in the authors' own authoring environment, for example for course notes written in a markup language like MyST/Markdown/Jupyter or LaTeX and collaboratively authored using a software forge. This could be done either by adopting and deploying existing systems or by building a lightweight one from existing bricks.

📚 Citation

If you use this software in your research, please cite it as follows:

@software{korotenko-sci-trans-git,
    author = {Yehor Korotenko},
    title = {sci-trans-git},
    year = {2025},
    publisher = {GitHub},
    version = {0.2.0-alpha},
    url = {https://github.com/DobbiKov/sci-trans-git},
    doi = {10.5281/zenodo.15775111}
}

The state of the project

This section describes the current state of the project and lists the reports written as of the last edit of this README.

State of the project

Currently, the main goal is to improve the two-way MyST parsing, so that a MyST document can be parsed into XML tags and then reconstructed from them.

In parallel, we are exploring ways to provide the model with the context of a document.

Library

For now, the library translates Jupyter notebooks by using the jupytext module to extract the contents of the cells, pass them to the model, and extract the translation. It translates LaTeX documents by using pylatexenc to construct an AST, divide the document into chunks, and translate those chunks.
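The cell-by-cell idea can be illustrated with the standard library alone. This is only a stand-in sketch for `.ipynb` files and does not use jupytext: the function name `translate_notebook` and the `translate` callable (a placeholder for the model call) are hypothetical, and only Markdown cells are touched.

```python
import json


def translate_notebook(nb_json, translate):
    """Translate the Markdown cells of an .ipynb document given as a JSON string.

    `translate` is a placeholder for the model call (an assumption of this sketch).
    """
    nb = json.loads(nb_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue  # code cells are left untouched
        source = cell["source"]
        # The notebook format allows either a string or a list of strings here.
        text = "".join(source) if isinstance(source, list) else source
        cell["source"] = translate(text)
    return json.dumps(nb, ensure_ascii=False, indent=1)


notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Intro\n", "Hello"]},
        {"cell_type": "code", "metadata": {}, "source": "print(1)",
         "outputs": [], "execution_count": None},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})
result = json.loads(translate_notebook(notebook, str.upper))
```

Text-based notebook formats such as MyST notebooks are not covered by this sketch; handling them is what jupytext is used for.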

The library also translates MyST and LaTeX files, following the same approach:

  1. Parse the source.
  2. Identify the syntax parts and separate them from the human-text parts.
  3. Construct the XML representation.
  4. Translate via an LLM.
  5. Reconstruct the document.
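The five steps above can be sketched for a toy Markdown subset. This is a minimal illustration under stated assumptions, not the library's actual parser: the function names (`to_xml`, `translate_xml`, `from_xml`), the tag names (`<syntax>`, `<text>`), and the regular expression are hypothetical, and the LLM is replaced by an arbitrary callable.

```python
import re
from xml.sax.saxutils import escape, unescape

# Assumption: in this toy subset, heading markers, inline code and inline math
# count as syntax; everything else is human text.
SYNTAX = re.compile(r"^#+ |`[^`]*`|\$[^$]*\$", re.MULTILINE)
TEXT_ELEMENT = re.compile(r"<text>(.*?)</text>", re.DOTALL)
TAG = re.compile(r"</?(?:syntax|text)>")


def to_xml(source):
    """Steps 1-3: split the source into syntax and text parts and tag each one."""
    parts, pos = [], 0
    for match in SYNTAX.finditer(source):
        if match.start() > pos:
            parts.append(("text", source[pos:match.start()]))
        parts.append(("syntax", match.group()))
        pos = match.end()
    if pos < len(source):
        parts.append(("text", source[pos:]))
    return "".join(f"<{tag}>{escape(body)}</{tag}>" for tag, body in parts)


def translate_xml(xml, translate):
    """Step 4: translate only the <text> elements (the LLM is a plain callable here)."""
    def replace(match):
        return f"<text>{escape(translate(unescape(match.group(1))))}</text>"
    return TEXT_ELEMENT.sub(replace, xml)


def from_xml(xml):
    """Step 5: drop the tags and undo the XML escaping to recover the document."""
    return unescape(TAG.sub("", xml))


source = "# Title\nLet $x$ be `int`.\n"
# str.upper stands in for the model so the effect on text parts is visible.
translated = from_xml(translate_xml(to_xml(source), str.upper))
```

With the upper-casing stand-in, only the human text changes, while the heading marker, the math, and the inline code come back unchanged.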

Translations are stored in a translation database, so that chunks that have already been translated are retrieved rather than translated again.
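A lookup-before-translate store of this kind could look roughly as follows. The schema, the class name `TranslationStore`, and the hashing scheme are assumptions made for this sketch; the README does not describe how the actual database is organised.

```python
import hashlib
import sqlite3


class TranslationStore:
    """Hypothetical chunk-level cache keyed by language pair and source text."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS translations ("
            "key TEXT PRIMARY KEY, source TEXT, target TEXT)"
        )

    @staticmethod
    def _key(chunk, src_lang, tgt_lang):
        payload = f"{src_lang}\x00{tgt_lang}\x00{chunk}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_translate(self, chunk, src_lang, tgt_lang, translate):
        """Return the stored translation, calling `translate` only on a miss."""
        key = self._key(chunk, src_lang, tgt_lang)
        row = self.db.execute(
            "SELECT target FROM translations WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]
        target = translate(chunk)
        self.db.execute(
            "INSERT INTO translations VALUES (?, ?, ?)", (key, chunk, target)
        )
        self.db.commit()
        return target


calls = []


def fake_translate(chunk):
    """Stand-in for the model call; records how often it is invoked."""
    calls.append(chunk)
    return chunk.upper()


store = TranslationStore()
first = store.get_or_translate("bonjour", "fr", "en", fake_translate)
second = store.get_or_translate("bonjour", "fr", "en", fake_translate)
```

The second call is served from the store, so the stand-in translator runs only once.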

The library also makes it possible to correct a translation, i.e., to rewrite or fix the file translated by the model and save the corrected translation in the database so that it is not overwritten in the future.
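One way to protect corrections from later machine output is to flag them in the store. The sketch below is an assumption about how this might be done, with hypothetical names (`Entry`, `CorrectionAwareStore`); it is not taken from the library.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    target: str
    corrected: bool = False  # True once a human correction has been saved


class CorrectionAwareStore:
    """Hypothetical in-memory store in which corrections win over machine output."""

    def __init__(self):
        self._entries = {}

    def save_machine(self, source, lang, target):
        """Store a machine translation unless a correction already exists."""
        entry = self._entries.get((source, lang))
        if entry is not None and entry.corrected:
            return entry.target  # keep the correction, ignore the new output
        self._entries[(source, lang)] = Entry(target)
        return target

    def save_correction(self, source, lang, target):
        """Store a corrected translation and mark it as protected."""
        self._entries[(source, lang)] = Entry(target, corrected=True)

    def lookup(self, source, lang):
        entry = self._entries.get((source, lang))
        return None if entry is None else entry.target


store = CorrectionAwareStore()
store.save_machine("Un corps fini", "en", "A finished body")
store.save_correction("Un corps fini", "en", "A finite field")
kept = store.save_machine("Un corps fini", "en", "A finished body")
```

After the correction is saved, a later machine translation of the same chunk no longer replaces it.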

To improve translation quality and avoid ambiguity, the library provides a vocabulary feature. The translation command accepts an optional vocabulary (a set of translation pairs) that helps the model use the appropriate words and phrases.
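As an illustration, a vocabulary could be injected into the prompt along these lines. The prompt wording, the function name `build_prompt`, and the choice to include only the pairs whose source term occurs in the chunk are assumptions of this sketch, not the library's actual behaviour.

```python
def build_prompt(chunk, src_lang, tgt_lang, vocabulary=None):
    """Build a translation prompt, optionally constrained by a vocabulary.

    `vocabulary` maps source terms to the required target terms.
    """
    lines = [f"Translate the following text from {src_lang} to {tgt_lang}."]
    lowered = chunk.lower()
    # Design choice of this sketch: only mention terms that occur in the chunk.
    relevant = {
        term: translation
        for term, translation in (vocabulary or {}).items()
        if term.lower() in lowered
    }
    if relevant:
        lines.append("Use these term translations:")
        lines.extend(f"- {term} -> {translation}" for term, translation in relevant.items())
    lines.append("")
    lines.append(chunk)
    return "\n".join(lines)


vocabulary = {"corps": "field", "anneau": "ring"}
prompt = build_prompt("Soit K un corps fini.", "French", "English", vocabulary)
```

Restricting the list to terms that occur in the chunk keeps the prompt short when the vocabulary is large.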

Stored chunks that differ only slightly from the chunk being translated are given to the LLM as examples, so that the model takes earlier post-edits into account and follows the author's writing style.
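Finding such a near-duplicate chunk can be sketched with `difflib` from the standard library. The similarity measure, the 0.8 threshold, the prompt wording, and the function names are assumptions for illustration; the README does not say how the library selects its examples.

```python
from difflib import SequenceMatcher


def closest_example(chunk, stored, threshold=0.8):
    """Return the (source, target) pair most similar to `chunk`, or None.

    `stored` maps previously translated source chunks to their translations.
    Exact matches are skipped because they would be plain cache hits.
    """
    best_pair, best_ratio = None, threshold
    for source, target in stored.items():
        if source == chunk:
            continue
        ratio = SequenceMatcher(None, chunk, source).ratio()
        if ratio >= best_ratio:
            best_pair, best_ratio = (source, target), ratio
    return best_pair


def one_shot_prompt(chunk, example):
    """Prepend the retrieved pair as a single worked example."""
    lines = []
    if example is not None:
        source, target = example
        lines += ["Example source:", source, "Example translation:", target, ""]
    lines += ["Translate in the same style:", chunk]
    return "\n".join(lines)


stored = {
    "The matrix A is invertible.": "La matrice A est inversible.",
    "Exercises": "Exercices",
}
example = closest_example("The matrix B is invertible.", stored)
prompt = one_shot_prompt("The matrix B is invertible.", example)
```

A chunk with no sufficiently similar stored neighbour yields `None`, in which case the prompt falls back to a plain translation request.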

CLI

To simplify testing and demonstrating the library, a CLI application that exposes the library's functionality has been developed.

Reports

  1. First report = presents the first prototype, written in the Rust programming language, which demonstrates the idea of the tool. The report also provides useful information about the energy consumption of existing high-performance models.
  2. Second report = presents the Python version of the library and the tool, together with the first version-control prototype.
  3. Aristote evaluation report = compares the translations produced by the Gemini and Llama-3.3 models.
  4. Translation evaluation report = evaluates the translation quality of different models (Gemini, Llama, and Gemma) from and to the following languages:
     • English
     • French
     • German
     • Ukrainian
  5. Translation evaluation tool report = presents the new tool for automatic translation evaluation using popular metrics.
  6. LaTeX chunking report = presents the results of the work on dividing LaTeX documents into chunks in order to simplify translation.
  7. New ways of passing text to LLMs report = presents the results of exploring a new way of passing text to LLMs in order to preserve document structure more reliably and reduce the number of tasks the models must handle simultaneously.
  8. One-shot translation to preserve writing style report = presents the exploration of, and results on, preserving the writing style using the one-shot prompting technique.

Resources presented in the repository

Current development direction

  • Improve MyST parsing.
  • Explore ways to use the translation database and to tell the model how, and in what style, it should write the translation.