Document translation tool

This is the main repository of the translation tool. Visit the CLI repo if you want to translate your documents from the command line, or the LIB repo if you are interested in using the translation library.

Paper about the project: link
Poster about the project: link

TL;DR

We explore the specific challenges of authoring and maintaining multilingual computational scientific narratives, like course notes, textbooks, or reference manuals, and the design space for leveraging adaptive machine translation to assist authors.

Motivation and concept

With the advancement of automated translation, there now exists a plethora of tools and services for translating documents. These tools are well suited for one-shot translations: author in one language, machine translate, proofread, and post-edit. Consider now a large document that evolves over a long period of time, say course notes that one wants to provide in, e.g., French and English, and maybe some other languages. The above workflow is no longer suitable:

The high value human effort of proofreading -- in particular in terms of choice of style and terminology -- is lost at each iteration.
The authors may want to alternatively improve the document in one or the other language, and propagate the improvements to the other languages.

Instead, one wants workflows where changes in one language can be propagated to the other languages, not only leaving the rest of the text unchanged, but also exploiting it as a source of aligned chunks of translated and proofread text to guide the style and terminology of the translation (Adaptive Machine Translation). A seamless integration into the authoring environment and workflow is also desirable.

With the advent and large-scale availability of (adaptive) machine translation, LLMs, few-shot learning, and RAG, the time should be ripe to leverage that technology for open source, sovereign, and privacy-preserving tools that support such workflows in the authors' own authoring environment, e.g., for course notes written in a markup language like MyST/Markdown/Jupyter or LaTeX and collaboratively authored using a software forge. This could be done either by adopting and deploying existing systems or by building a lightweight one from existing bricks.

📚 Citation

If you use this software in your research, please cite it as follows:

@software{korotenko-sci-trans-git,
    author = {Yehor Korotenko},
    title = {sci-trans-git},
    year = {2025},
    publisher = {GitHub},
    version = {0.2.0-alpha},
    url = {https://github.com/DobbiKov/sci-trans-git},
    doi = {10.5281/zenodo.15775111}
}

The state of the project

This section describes the current state of the project and lists the reports written as of the last README edit.

State of the project

Currently, the main goal is to improve two-way MyST parsing, so that a MyST document can be parsed into XML tags and then reconstructed back.

In parallel, we are exploring ways to provide the model with the context of a document.

Library

For now, the library translates Jupyter notebooks and LaTeX documents. For notebooks, it uses the jupytext module to extract the contents of cells, passes them to the models, and extracts the translations. For LaTeX documents, it uses pylatexenc to construct an AST, divides the document into chunks, and then translates those chunks.
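The cell-extraction step for notebooks can be illustrated with a minimal sketch. It is not the library's code: the library relies on jupytext, whereas this example reads the notebook JSON directly with the standard library to stay dependency-free. The function name `markdown_cells` and the sample notebook are hypothetical.

```python
from __future__ import annotations

import json


def markdown_cells(notebook_json: str) -> list[tuple[int, str]]:
    """Return (cell index, text) for every Markdown cell of a notebook.

    Hypothetical helper: only Markdown cells are returned, since they
    hold the human-written text; code cells are skipped.
    """
    notebook = json.loads(notebook_json)
    cells = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # The notebook format stores a cell's source either as one
        # string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        cells.append((index, source))
    return cells


# A tiny made-up notebook with one Markdown cell and one code cell.
sample = json.dumps(
    {
        "cells": [
            {"cell_type": "markdown", "source": ["# Title\n", "Some text."]},
            {"cell_type": "code", "source": "print(1)"},
        ]
    }
)
extracted = markdown_cells(sample)
```

Keeping the cell index alongside the text makes it possible to write each translation back into the same position afterwards.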

The library also translates MyST and LaTeX files, following the same approach:

Parse the code.
Identify and separate the syntax parts from the human-text parts.
Construct the XML code.
Translate via an LLM.
Reconstruct the document.
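The steps above can be sketched in a few lines. This is only an illustration under strong assumptions, not the library's parser: the only syntax it recognizes is inline code and inline math, the `<ph id="…"/>` tag name is made up for this example, and `translate_stub` stands in for the LLM call.

```python
from __future__ import annotations

import re

# Syntax spans to protect: inline code (`...`) and inline math ($...$).
SYNTAX = re.compile(r"`[^`]*`|\$[^$]*\$")


def to_tagged(source: str) -> tuple[str, list[str]]:
    """Replace each syntax span with a numbered placeholder tag."""
    protected: list[str] = []

    def stash(match: re.Match) -> str:
        protected.append(match.group(0))
        return f'<ph id="{len(protected) - 1}"/>'

    return SYNTAX.sub(stash, source), protected


def from_tagged(tagged: str, protected: list[str]) -> str:
    """Put the original syntax spans back in place of the placeholders."""
    return re.sub(
        r'<ph id="(\d+)"/>', lambda m: protected[int(m.group(1))], tagged
    )


def translate_stub(tagged: str) -> str:
    """Stand-in for the LLM call (hypothetical): a real model would
    translate the text while copying the <ph/> tags unchanged."""
    return tagged.replace("The function", "La fonction").replace(
        "returns", "renvoie"
    )


source = "The function `f(x)` returns $x^2$."
tagged, protected = to_tagged(source)
result = from_tagged(translate_stub(tagged), protected)
```

Because the model only ever sees placeholders, it cannot accidentally alter the code or the formulas; the trade-off is that it must be instructed to keep every tag exactly once.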

Translations are stored in the translation database, so that chunks that have already been translated are retrieved from it rather than translated again.
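A minimal sketch of such a lookup, assuming a SQLite table keyed by a hash of the chunk and the language pair. The schema, the class name `TranslationCache`, and the `translate` callback are hypothetical; the library's actual database layout may differ.

```python
import hashlib
import sqlite3


def chunk_key(chunk: str, source_lang: str, target_lang: str) -> str:
    """Hash the chunk together with the language pair."""
    payload = f"{source_lang}\x00{target_lang}\x00{chunk}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TranslationCache:
    """Hypothetical cache: translate a chunk once, then reuse the result."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS translations "
            "(key TEXT PRIMARY KEY, target TEXT)"
        )

    def get_or_translate(self, chunk, source_lang, target_lang, translate):
        key = chunk_key(chunk, source_lang, target_lang)
        row = self.db.execute(
            "SELECT target FROM translations WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # already translated: no model call
        target = translate(chunk)
        self.db.execute("INSERT INTO translations VALUES (?, ?)", (key, target))
        self.db.commit()
        return target


calls = []


def fake_translate(chunk: str) -> str:
    """Stand-in for the model call; records how often it is invoked."""
    calls.append(chunk)
    return chunk.upper()


cache = TranslationCache()
first = cache.get_or_translate("hello", "en", "fr", fake_translate)
second = cache.get_or_translate("hello", "en", "fr", fake_translate)
```

The second call returns the stored value without invoking the model, which is what keeps proofread chunks from being retranslated.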

The library also makes it possible to correct a translation, i.e. to rewrite or fix the file translated by the model and save the corrected translation in the database so that it is not overwritten in the future.

To improve translation quality and avoid ambiguity, the library provides a vocabulary feature. The translation command accepts an optional vocabulary parameter (a set of translation pairs) that helps the model use the appropriate words and phrases.
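One way a vocabulary could be passed to the model is sketched below. The prompt wording, the function name `build_prompt`, and the choice to include only the terms that occur in the chunk are assumptions made for this example, not the library's actual prompt.

```python
def build_prompt(chunk, source_lang, target_lang, vocabulary=None):
    """Build a translation prompt, optionally constrained by a vocabulary.

    `vocabulary` maps source terms to their required translations.
    Only terms that occur in the chunk are listed (an assumption of
    this sketch), which keeps the prompt short.
    """
    lines = [f"Translate the following text from {source_lang} to {target_lang}."]
    relevant = {
        src: tgt
        for src, tgt in (vocabulary or {}).items()
        if src.lower() in chunk.lower()
    }
    if relevant:
        lines.append("Use these translations for the listed terms:")
        lines.extend(f"- {src} -> {tgt}" for src, tgt in relevant.items())
    lines.append("")
    lines.append(chunk)
    return "\n".join(lines)


vocabulary = {"ring": "anneau", "field": "corps", "sheaf": "faisceau"}
prompt = build_prompt("Every field is a ring.", "English", "French", vocabulary)
```

In this example "sheaf" is left out of the prompt because it does not appear in the chunk.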

When a chunk has changed only slightly, its earlier version stored in the database is provided to the LLM, together with its translation, as an example, so that the model takes the post-edits into account and follows the author's style of writing.
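A minimal sketch of how such an example could be selected, assuming similarity is measured with `difflib.SequenceMatcher` from the standard library. The similarity measure, the 0.8 threshold, and the function name `closest_example` are illustrative choices, not the library's documented behavior.

```python
from difflib import SequenceMatcher


def closest_example(chunk, stored, threshold=0.8):
    """Return the stored (source, translation) pair most similar to `chunk`.

    `stored` maps previously translated source chunks to their proofread
    translations. Returns None if nothing is similar enough.
    """
    best = None
    best_ratio = threshold
    for source, target in stored.items():
        ratio = SequenceMatcher(None, chunk, source).ratio()
        if ratio >= best_ratio:
            best, best_ratio = (source, target), ratio
    return best


stored = {
    "The ring of integers is denoted Z.": "L'anneau des entiers est noté Z.",
    "Install the package with pip.": "Installez le paquet avec pip.",
}
example = closest_example("The ring of integers is denoted by Z.", stored)
no_example = closest_example("Graphs are drawn with vertices and edges.", stored)
```

The returned pair can then be placed in the prompt as a one-shot example, so that the wording chosen during post-editing carries over to the updated chunk.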

CLI

In order to simplify the library testing and presentation, a CLI application that implements library functionality has been developed.

Reports

First report = presents the first prototype, written in the Rust programming language, which demonstrates the idea of the tool. The report also provides useful information about the energy consumption of existing high-performance models.
Second report = presents the Python version of the library and the tool, along with the first version control prototype.
Aristote evaluation report = compares the translations produced by the gemini and llama-3.3 models.
Translation evaluation report = evaluates the translation quality of different models, such as gemini, llama, and gemma, from and to the following languages:

English
French
German
Ukrainian

Translation evaluation tool report = presents the new tool for automatic translation evaluation using popular metrics.
LaTeX chunking report = presents the results of the work on dividing LaTeX documents into chunks in order to simplify translation.
New ways of passing text to LLMs report = presents the results of exploring a new way of passing text to LLMs in order to preserve document structure more reliably and to reduce the number of tasks that the models must handle simultaneously.
One-shot translation to preserve writing style report = presents the explorations and results on preserving the writing style using the one-shot prompting technique.

Resources presented in the repository

The library itself.
The CLI implementation of the library.
The translation evaluation tool, for automatic translation evaluation using reference translations.

Current development direction

Improve MyST parsing.
Explore ways to use the translation database to show the model the manner and style in which it should write the translation.
