Create functionality to extract LOINC Delta updates to generate additional embeddings #532
BradySkylight wants to merge 13 commits into
… full extraction and updates to medical terminologies
… dates to get delta and embedding candidates
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main     #532      +/-   ##
==========================================
- Coverage   95.81%   95.78%   -0.04%
==========================================
  Files          46       46
  Lines        2344     2372      +28
==========================================
+ Hits         2246     2272      +26
- Misses         98      100       +2
robertandremitchell left a comment
Looks good! My concerns are mostly about documentation and aligning the code with the other packages' modules.
# Value Set Directories
SNOINC_DIRECTORY = "./data/snoinc_extracts"
More general feedback that applies to several of these directories/folders: we should use an OS-agnostic initialization akin to what the validation work does here: https://github.com/CDCgov/dibbs-text-to-code/blob/main/packages/validation/src/validation/main.py#L7-L21
I can try to implement something like that. I do like/prefer the use of os vs. other packages though. I'll address this in the next PR, as it's in flight now for the embeddings.
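For reference, a minimal sketch of what an OS-agnostic setup could look like while staying with `os`. This is not the code from the linked validation module. The helper name `resolve_data_dir` and the choice of the current working directory as the default base are assumptions made for illustration.

```python
import os
from typing import Optional


def resolve_data_dir(*parts: str, base_dir: Optional[str] = None) -> str:
    """Join path parts onto a base directory using the platform's separator.

    `resolve_data_dir` is a hypothetical helper, not an existing function in
    this repository. When no base is given it falls back to the current
    working directory (an assumption; a module-relative base would also work).
    """
    base = base_dir if base_dir is not None else os.getcwd()
    return os.path.normpath(os.path.join(base, *parts))


# Value Set Directories, built without hard-coded "/" separators.
SNOINC_DIRECTORY = resolve_data_dir("data", "snoinc_extracts")
```

Building the path from parts keeps the separator out of the string literal, so the same constant works on Windows and POSIX systems.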
UMLS_API_KEY = os.environ.get("UMLS_API_KEY")
def clean_text_string(value: str) -> str:
Also more general, but we should have docstrings on these, imo.
Can add those to all the functions for sure.
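As a rough illustration of the docstring format, here is a sketch for `clean_text_string`. Only the signature comes from the diff. The function body (collapsing whitespace) is an assumption standing in for the real implementation, which is not shown in this thread.

```python
def clean_text_string(value: str) -> str:
    """Normalize a raw text value from a terminology extract.

    Args:
        value: The raw string to clean.

    Returns:
        The cleaned string.
    """
    # Assumed behavior, for illustration only: collapse runs of whitespace
    # (including newlines and tabs) and trim both ends.
    return " ".join(value.split())
```

The same Args/Returns layout could be repeated on the other helper functions so the modules stay consistent.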
hl7_rows = []
if hl7_response.status_code != 200:
    print(
If we do envision this being pulled into the index-lambda or another service, we should use the logging library instead of printing.
Agreed. This can be tackled in the next PR(s) as that is when we will start to wrap it in a lambda. For now, it's nice to have a way to see progress when running locally. But I'll add TODOs for each 'print' to convert to a logging statement instead.
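A minimal sketch of what the `print` to `logging` conversion could look like for the status check. The helper `check_response_status` and its parameters are hypothetical names used for illustration. Only the `status_code != 200` condition comes from the diff.

```python
import logging

# Module-level logger; a Lambda runtime or any other host can attach its own handlers.
logger = logging.getLogger(__name__)

# For local runs, a basic configuration keeps progress visible in the terminal.
logging.basicConfig(level=logging.INFO)


def check_response_status(status_code: int, url: str) -> bool:
    """Log and report whether a response succeeded (hypothetical helper)."""
    if status_code != 200:
        # Replaces the print call: same message, but with a level and lazy formatting.
        logger.error("Request to %s failed with status code %s", url, status_code)
        return False
    logger.info("Request to %s succeeded", url)
    return True
```

With `basicConfig` set to `INFO`, local runs still show progress, so the switch does not have to cost the visibility that `print` gives today.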
Description
This PR makes it possible to update the medical terminology value sets used by TTC, starting with LOINC for now. Below is a list of the changes you will see:
… util python script.
… loinc lab names, as this is the only valueset loaded in our model at this point.
Related Issues
Closes #452
Additional Notes
I haven't added tests yet, but I have tested along the way and have included some print statements for now just to see the end result. The output will be passed to the next step, creating the embeddings, which will then be stored as files somewhere and uploaded into OpenSearch.
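To make the "delta" idea concrete, here is a rough sketch of date-based selection of embedding candidates. It is not the code in this PR. The record shape, the `changed_on` field, and the `select_delta` name are all assumptions, and the codes in the usage example are placeholders rather than real LOINC codes.

```python
from datetime import date
from typing import Dict, Iterable, List


def select_delta(records: Iterable[Dict[str, str]], last_run: date) -> List[Dict[str, str]]:
    """Return records changed strictly after `last_run`.

    Assumes each record carries an ISO-formatted `changed_on` date; that
    field name is hypothetical.
    """
    return [
        record
        for record in records
        if date.fromisoformat(record["changed_on"]) > last_run
    ]


# Placeholder records: only the second one changed after the last run.
records = [
    {"code": "A-1", "changed_on": "2024-01-15"},
    {"code": "B-2", "changed_on": "2025-03-01"},
]
candidates = select_delta(records, last_run=date(2024, 6, 1))
```

Only the returned candidates would need new embeddings. Everything already loaded stays as it is.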
Checklist for Reviewers
Please review and complete the following checklist during the review process: