We use uv to manage dependencies and the project environment.
Clone the GitHub repository:
git clone https://github.com/MDverse/mdverse_entity_norm.git
cd mdverse_entity_norm
Sync dependencies:
uv sync
This project implements the normalization step for molecular dynamics entities. Currently, we have implemented normalization for temperature and simulation time, and grounding for molecules. The normalization and grounding processes are performed by the scripts located in the src/mdverse_entity_norm/scripts directory. Each script handles a specific type of entity and can be executed independently. The results are saved in the results directory, which is created if it does not already exist. The output files are in TSV format and contain the original entities and their corresponding normalized or grounded values, along with any relevant metadata such as confidence scores or error codes.
To normalize temperature entities, run:
uv run src/mdverse_entity_norm/scripts/normalize_temperature.py
This command generates a file named normalized_temperature.tsv in the results directory, containing the normalized temperature entities. The file has two columns: original_value, the original temperature entity, and normalized_value, the normalized temperature in Celsius.
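The two-column output described above can be inspected with Python's standard csv module. This is a minimal sketch, not part of the project: the function name read_temperature_rows is hypothetical, and it assumes the TSV starts with a header row naming the original_value and normalized_value columns.

```python
import csv
from pathlib import Path


def read_temperature_rows(lines):
    """Parse tab-separated lines into (original_value, normalized_value) pairs.

    `lines` can be an open text file or any iterable of strings.
    Assumption: the first line is a header naming both columns.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [(row["original_value"], row["normalized_value"]) for row in reader]


# Path taken from the description above; adjust it if your output lives elsewhere.
tsv_path = Path("results/normalized_temperature.tsv")
if tsv_path.exists():  # skip quietly when the script has not been run yet
    with tsv_path.open(newline="", encoding="utf-8") as handle:
        for original, normalized in read_temperature_rows(handle):
            print(f"{original} -> {normalized} (Celsius)")
```

Values are kept as strings so that nothing is assumed about how the script formats numbers.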
The logic behind the grounding of molecule entities is described in the image below:

To ground molecule entities, run:
uv run src/mdverse_entity_norm/scripts/ground_molecule.py --mol_filepath data/MOL.txt --grounded_mol_filepath results/grounded_molecules.tsv --non_grounded_mol_filepath results/non_grounded_molecules.tsv
This command generates two files in the results directory: grounded_molecules.tsv and non_grounded_molecules.tsv. The grounded_molecules.tsv file contains the grounded molecule entities with their corresponding identifiers, while the non_grounded_molecules.tsv file contains the molecule entities that could not be grounded.
The grounded_molecules.tsv file has six columns:
Entity_name: the original molecule name,
Database: the database name,
ID: the molecule ID,
Score: the confidence score,
Name: the full name of the molecule,
nb_res: the number of results found.
The non_grounded_molecules.tsv file has two columns:
Entity_name: the original molecule name that could not be grounded,
error: the error code obtained during the grounding process.
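To get a quick overview of both output files, the columns listed above can be tallied with the standard library. The sketch below is illustrative only: count_by_column and summarize are hypothetical helper names, and the code assumes each TSV begins with a header row using exactly the column names given above.

```python
import csv
from collections import Counter
from pathlib import Path


def count_by_column(lines, column):
    """Count how often each value appears in `column` of a tab-separated table."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row[column] for row in reader)


def summarize(tsv_path, column):
    """Print value counts for one column; do nothing if the file is missing."""
    path = Path(tsv_path)
    if not path.exists():
        return
    with path.open(newline="", encoding="utf-8") as handle:
        for value, count in count_by_column(handle, column).most_common():
            print(f"{path.name}\t{column}={value}\t{count}")


# Which databases the grounded molecules came from.
summarize("results/grounded_molecules.tsv", "Database")
# Which error codes caused grounding to fail.
summarize("results/non_grounded_molecules.tsv", "error")
```

Counting by Database and error avoids any assumption about the numeric format of the Score column.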
To normalize simulation time entities, run:
uv run src/mdverse_entity_norm/scripts/normalize_simulation_time.py --raw_simu_times_file data/STIME.txt --normalized_simulation_time results/norm_simu_times/normalized_simulation_time_gpt.json --ground_truth_file data/STIME_ground_truth.json
This command generates a JSON file containing the normalized simulation time entities, written to the path given by --normalized_simulation_time (results/norm_simu_times/normalized_simulation_time_gpt.json in the command above). Each key is a simulation time entity, and its value is the normalized simulation time with its time unit in a standard format. If a simulation time entity could not be normalized, its value is null.
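Since entities that could not be normalized are stored as null, it can be useful to separate them from the rest after a run. The following is a minimal sketch under stated assumptions: split_normalized is a hypothetical helper, the file is assumed to be a single JSON object, and the path is the one passed to --normalized_simulation_time in the command above. Values are treated as opaque, because their exact format is not specified here.

```python
import json
from pathlib import Path


def split_normalized(results):
    """Separate normalized entries from those whose value is null (None in Python)."""
    normalized = {entity: value for entity, value in results.items() if value is not None}
    failed = [entity for entity, value in results.items() if value is None]
    return normalized, failed


# Assumed output path, copied from the example command; adjust as needed.
json_path = Path("results/norm_simu_times/normalized_simulation_time_gpt.json")
if json_path.exists():  # skip quietly when the script has not been run yet
    results = json.loads(json_path.read_text(encoding="utf-8"))
    normalized, failed = split_normalized(results)
    print(f"{len(normalized)} normalized, {len(failed)} not normalized")
    for entity in failed:
        print(f"not normalized: {entity}")
```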