This is the main repository of the translation tool. Visit the CLI repo if you want to translate your documents using the CLI, and the LIB repo if you are interested in using the translation library.
We explore the specific challenges of authoring and maintaining multilingual computational scientific narratives, such as course notes, textbooks, or reference manuals, and the design space for leveraging adaptive machine translation to assist authors.
With the advancement of automated translation, there now exists a plethora of tools and services for translating documents. These tools are well suited to one-shot translation: author in one language; machine-translate; proofread and post-edit. Now consider a large document that evolves over a long period of time, say course notes that one wants to provide in, e.g., French and English, and possibly other languages. The above workflow is no longer suitable:
- The high-value human effort of proofreading -- in particular the choices of style and terminology -- is lost at each iteration.
- The authors may want to alternately improve the document in one language or another, and propagate the improvements to the other languages.
Instead, one wants workflows where changes in one language can be propagated to the other languages, not only leaving the rest of the text unchanged, but also exploiting it as a source of aligned chunks of translated and proofread text to guide the style and terminology of the translation (Adaptive Machine Translation). Seamless integration into the authoring environment and workflow is also desirable.
With the advent and large-scale availability of (adaptive) machine translation, LLMs, few-shot learning, and RAG, the time is ripe to leverage this technology to build open-source, sovereign, and privacy-preserving tools supporting such workflows in the authors' own authoring environment -- e.g., for course notes written in a markup language such as MyST/Markdown/Jupyter or LaTeX and collaboratively authored on a software forge -- either by adopting and deploying existing systems, or by building a lightweight one from existing bricks.
If you use this software in your research, please cite it as follows:
@software{korotenko-sci-trans-git,
author = {Yehor Korotenko},
title = {sci-trans-git},
year = {2025},
publisher = {GitHub},
version = {0.2.0-alpha},
url = {https://github.com/DobbiKov/sci-trans-git},
doi = {10.5281/zenodo.15775111}
}

This section describes the current state of the project, as well as the reports completed at the time of this README edit.
Currently, the main goal is to improve the MyST two-way parsing, so that a document can be parsed into XML tags and reconstructed back.
In parallel, we are exploring ways to provide the model with the context of a document.
For now, the library translates Jupyter notebooks by using the jupytext module to extract cell contents, pass them to the models, and extract the translation; it translates LaTeX documents by using pylatexenc to construct an AST, divide the document into chunks, and translate those chunks.
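The notebook cell-extraction step can be sketched as follows. This is an illustrative, stdlib-only sketch that reads the notebook's JSON directly (an `.ipynb` file is plain JSON); the library itself relies on jupytext, and the function name below is hypothetical, not the library's actual API.

```python
import json

def extract_translatable_cells(notebook_json: str) -> list[str]:
    """Return the source text of every markdown cell in a notebook."""
    nb = json.loads(notebook_json)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "markdown":
            # In the .ipynb format, "source" is a list of lines or a string.
            src = cell["source"]
            chunks.append("".join(src) if isinstance(src, list) else src)
    return chunks

demo = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Some prose."]},
        {"cell_type": "code", "source": ["print('kept as-is')"]},
    ],
    "nbformat": 4, "nbformat_minor": 5,
})
cells = extract_translatable_cells(demo)
# cells now holds only the markdown sources, ready to be passed to a model
```

Code cells are deliberately left untouched; only human-authored markdown is sent for translation.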
The library also provides functionality to translate MyST and LaTeX files following the same approach:
- Parse the code.
- Identify and differentiate `syntax` parts from `human-text` parts.
- Construct `XML` code.
- Translate via LLM.
- Reconstruct the document.
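The steps above can be sketched on a tiny LaTeX-like input. This is a minimal, hedged illustration: the regex-based "parser" and the `<seg>` tag name are assumptions made for the example, whereas the library uses a real AST (pylatexenc) rather than regexes.

```python
import re

# Illustrative syntax matcher: LaTeX commands and special characters.
SYNTAX = re.compile(r"(\\[a-zA-Z]+\*?|[{}$%&])")

def wrap_human_text(source: str) -> tuple[str, list[str]]:
    """Wrap human-text segments in numbered <seg> tags; keep syntax as-is."""
    pieces = SYNTAX.split(source)  # odd indices are syntax tokens
    segments, out = [], []
    for idx, piece in enumerate(pieces):
        if idx % 2 == 0 and piece.strip():  # human text
            segments.append(piece)
            out.append(f'<seg id="{len(segments)}">{piece}</seg>')
        else:
            out.append(piece)
    return "".join(out), segments

def reconstruct(tagged: str, translations: dict[int, str]) -> str:
    """Replace each numbered <seg> with its translation."""
    return re.sub(
        r'<seg id="(\d+)">.*?</seg>',
        lambda m: translations[int(m.group(1))],
        tagged,
        flags=re.S,
    )

tagged, segs = wrap_human_text(r"\section{Intro} Hello world.")
# A hypothetical LLM call would translate each numbered segment, e.g.:
restored = reconstruct(tagged, {1: "Intro", 2: " Bonjour le monde."})
```

The syntax tokens never pass through the model, which is what makes structure preservation reliable.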
The translation is stored in the translation database, so that already-translated chunks are not retranslated but simply retrieved.
The library also provides the ability to correct a translation (i.e., have the model rewrite or fix the translated file and save the corrected translation in the database so it is not overwritten in the future).
To improve translation quality and avoid ambiguity, the library provides a vocabulary feature. The translation command accepts an optional vocabulary parameter (a set of translation pairs) that helps the model use the appropriate words and phrases from the vocabulary.
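One way to inject such a vocabulary is to list the pairs in the prompt itself. The prompt wording and format below are assumptions for illustration, not the library's actual prompt.

```python
def build_prompt(chunk: str, vocabulary: dict[str, str],
                 src: str = "English", tgt: str = "French") -> str:
    """Build a translation prompt that pins down terminology."""
    vocab_lines = "\n".join(f"- {s} => {t}" for s, t in vocabulary.items())
    return (
        f"Translate the following {src} text into {tgt}.\n"
        f"Use exactly these term translations where they occur:\n"
        f"{vocab_lines}\n\n"
        f"Text:\n{chunk}"
    )

prompt = build_prompt(
    "A monoid is a set with an associative operation.",
    {"monoid": "monoïde", "set": "ensemble"},
)
```

For mathematical texts this matters: "set" could otherwise drift between "ensemble" and other renderings across chunks.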
Slightly changed chunks that are already stored in the database are provided to the LLM as examples, so that post-edits are taken into account and the author's writing style is followed.
To simplify testing and presentation of the library, a CLI application implementing its functionality has been developed.
- First report: presents the first prototype, written in the Rust programming language, introducing the idea of the tool. The report also provides useful information about the energy consumption of existing high-performance models.
- Second report: presents the Python version of the library and the tool, along with the first version-control prototype.
- Aristote evaluation report: compares the translations produced by the `gemini` and `llama-3.3` models.
- Translation evaluation report: evaluates the translation quality of different models (`gemini`, `llama`, and `gemma`) from and to languages such as:
- English
- French
- German
- Ukrainian
- Translation evaluation tool report: presents a new tool for automatic translation evaluation using popular metrics.
- LaTeX chunking report: presents the results of work on dividing LaTeX documents into chunks in order to simplify translation.
- New ways of passing text to LLMs report: presents the results of exploring a new way of passing text to LLMs, in order to make structure preservation more reliable and to reduce the number of tasks the models must handle simultaneously.
- One-shot translation to preserve writing style report: presents the explorations and results on preserving the writing style using a one-shot prompting technique.
- the library itself
- the CLI implementation of the library
- a tool for automatic translation evaluation using reference translations
- Improve `myst` parsing.
- Explore ways to use the translation database and to guide the model on the manner and style in which it should write the translation.