Feature/lmdb optimization #6

Open

Alexanderlacuna wants to merge 30 commits into main from feature/lmdb-optimization

Conversation

@Alexanderlacuna (Owner)

Test feature for reading from LMDB.

Add optional 'strains' field to JsonData struct to support LMDB correlation.
This field provides strain names that correspond to x_vals indices,
enabling Rust to extract the correct columns from the LMDB matrix.

- Remove 'use_lmdb' flag (auto-detect via directory check instead)
- Add strains: Option<Vec<String>> with serde(default)
- Update test to use new field structure
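A minimal JSON input sketch using the new field (only the fields named in these commits are shown; a real input file also carries the other required fields):

```json
{
  "x_vals": [6.5, 7.2, 5.9],
  "strains": ["BXD1", "BXD2", "BXD5"],
  "_comment": "strains[i] must correspond to x_vals[i]"
}
```

Because `strains` is declared with `serde(default)`, omitting the field deserializes to `None` and preserves the old behavior.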

Add strains field to Compute struct and update compute_from_lmdb()
to extract specific columns based on strain names.

- Add strains: Option<&[String]> field to Compute struct
- Add new_with_strains() constructor for LMDB mode
- Update compute_from_lmdb() to:
  * Build strain -> column index mapping from LMDB metadata
  * Extract only specified columns when strains provided
  * Validate all requested strains exist in dataset
  * Fall back to all columns when strains is None (backward compat)
- Update existing tests to include strains field
- Update input_lmdb.json example with strains array

This enables correlation with any subset of strains while maintaining
backward compatibility with existing CSV mode.
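The mapping-and-extraction steps above can be sketched roughly as follows. The function names `build_strain_index` and `resolve_columns` are illustrative, not the PR's actual identifiers:

```rust
use std::collections::HashMap;

/// Map each dataset strain name to its column index in the LMDB matrix.
/// `dataset_strains` stands in for the strain list read from LMDB metadata.
fn build_strain_index(dataset_strains: &[String]) -> HashMap<&str, usize> {
    dataset_strains
        .iter()
        .enumerate()
        .map(|(i, s)| (s.as_str(), i))
        .collect()
}

/// Resolve requested strains to column indices, erroring on any miss.
/// When `requested` is None, fall back to all columns (backward compat).
fn resolve_columns(
    dataset_strains: &[String],
    requested: Option<&[String]>,
) -> Result<Vec<usize>, String> {
    match requested {
        None => Ok((0..dataset_strains.len()).collect()),
        Some(names) => {
            let index = build_strain_index(dataset_strains);
            names
                .iter()
                .map(|n| {
                    index
                        .get(n.as_str())
                        .copied()
                        .ok_or_else(|| format!("Strain mismatch: '{}' not in dataset", n))
                })
                .collect()
        }
    }
}
```

Validating every requested strain up front (rather than silently skipping misses) is what produces the "Strain mismatch" errors mentioned in the later commits.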

Add unit tests for the new strains functionality:

- test_compute_with_strains(): Verify new_with_strains() constructor
- test_compute_without_strains(): Verify backward compatibility (None strains)
- test_lmdb_correlation_with_strains(): Integration test with subset of strains
- test_lmdb_correlation_all_strains(): Integration test with all strains (legacy)

Integration tests marked with #[ignore] since they require:
- GSL system library
- Actual LMDB dataset at /home/kabui/lmdb_data/GTEx_Liver_0414/

Run ignored tests with:
  cargo test -- --ignored

Basic unit tests run without external dependencies.

Update Analysis::compute() to detect and use strains from JSON input.

- Check if 'strains' field is present in JSON
- If yes: Use new_with_strains() for LMDB column extraction
- If no: Use standard new() for backward compatibility
- Add tests for JSON parsing with and without strains

This completes the pipeline: JSON -> Analysis -> Compute (with strains)

Add comprehensive documentation for the LMDB correlation feature:

- LMDB_USAGE.md: Complete usage guide with:
  * JSON input formats (with/without strains)
  * Auto-detection behavior (directory vs file)
  * Step-by-step usage examples
  * Error handling guide
  * Performance comparison table
  * Architecture diagram

- input_lmdb_full.json: Complete working example with:
  * Real strain names from GTEx_Liver_0414 dataset
  * x_vals aligned to strains
  * All required fields

This documentation enables GN2/GN3 developers to integrate
LMDB correlation support.

Add lmdb_exists() helper and early return in all LMDB tests
to gracefully skip when test data is not present.

Fixes test failures on systems without /home/kabui/lmdb_data/GTEx_Liver_0414/
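A plausible shape for such a helper (the signature is an assumption; LMDB datasets are directories, so a plain `is_dir()` check is enough for a graceful skip):

```rust
use std::path::Path;

/// Return true when the LMDB dataset directory is present.
/// Tests call this first and return early when it is false,
/// instead of failing on machines without the test data.
fn lmdb_exists(path: &str) -> bool {
    Path::new(path).is_dir()
}
```

A test would then start with something like `if !lmdb_exists(path) { eprintln!("skipping: no LMDB data"); return; }`.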

Update all hardcoded LMDB paths from GTEx_Liver_0414 to HC_M2_0606_P
to match the current dataset location.

Files updated:
- src/lmdb_reader.rs: LMDB_PATH constant
- src/correlations.rs: Test paths
- LMDB_USAGE.md: Documentation example
- tests/data/input_lmdb*.json: Example files

Allow custom LMDB paths for testing via LMDB_TEST_PATH env var.

Changes:
- src/lmdb_reader.rs: Add get_lmdb_path() helper that checks env var
- src/correlations.rs: Update integration tests to use env var
- LMDB_USAGE.md: Document testing with custom paths

Default path: /home/kabui/lmdb_data/HC_M2_0606_P/
Override: LMDB_TEST_PATH=/custom/path cargo test -- --ignored

This makes tests portable across different environments without
code changes.
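One way get_lmdb_path() could look. The split into a pure helper, `lmdb_path_from`, is purely for illustration; the PR's actual code may read the env var inline:

```rust
const DEFAULT_LMDB_PATH: &str = "/home/kabui/lmdb_data/HC_M2_0606_P/";

/// Pure core: pick the override when present, else the default.
fn lmdb_path_from(env_value: Option<String>) -> String {
    env_value.unwrap_or_else(|| DEFAULT_LMDB_PATH.to_string())
}

/// Read LMDB_TEST_PATH from the environment, falling back to the
/// compiled-in default path when the variable is unset.
fn get_lmdb_path() -> String {
    lmdb_path_from(std::env::var("LMDB_TEST_PATH").ok())
}
```

This matches the documented usage: `LMDB_TEST_PATH=/custom/path cargo test -- --ignored`.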

Fix two issues introduced in previous commit:

1. lmdb_reader.rs: Fix mangled '0026' prefixes from sed commands
   - Replace all '0026get_lmdb_path()' with proper '&get_lmdb_path()'
   - Rewrite file to ensure clean syntax

2. correlations.rs: Add '&' to borrow String as &str
   - '&lmdb_path' instead of 'lmdb_path' for new_with_strains()
   - Same fix for new() in second test

The LMDB_TEST_PATH feature now compiles correctly.

Fix two issues in LMDB integration tests:

1. Wrong strain names: Tests were using GTEx strains ('1_100_15')
   but HC_M2_0606_P has BXD strains. Now reads actual strains from
   LMDB metadata and uses first 6 for subset test.

2. LMDB lock conflict: Tests were failing with MDB_BAD_RSLOT because
   LmdbReader wasn't dropped before Compute opened new connection.
   Now explicitly drops reader and adds 100ms sleep between opens.

Also improved test diagnostics with eprintln output showing:
- Which strains are being used
- How many traits were correlated
- Output file validation

Tests are marked #[ignore] and require GSL + LMDB to run.

Add comprehensive documentation for running with cargo:

- Quick Start section with common cargo run commands
- Option B in Step 3: Run with Cargo (development)
- Environment variables (LMDB_TEST_PATH, RUST_LOG)
- Python one-liner to check available strains in LMDB

This helps developers test changes without recompiling
the full binary each time.

Replace GTEx strains (1_100_15, etc.) with proper BXD strain names:
- BXD1, BXD2, BXD5, BXD6, BXD8, BXD9
- BXD11, BXD12, BXD13, BXD14, BXD15, BXD16

HC_M2_0606_P is a Hippocampus Consortium M430v2 dataset
which uses BXD recombinant inbred mouse strains, not GTEx human samples.

Fixes 'Strain mismatch' error when running with actual dataset.

Add critical validation to ensure strains and x_vals arrays match:

1. correlations.rs: Check that strains.len() == x_vals.len()
   - Returns clear error if mismatch detected
   - Error message explains the requirement

2. input_lmdb.json: Add _comment explaining alignment requirement
   - strains[i] must correspond to x_vals[i]
   - Both arrays must have same length and order

This prevents silent errors from mismatched input data.
Example error:
  'Array length mismatch: 10 strains but 12 x_vals'

Total commits: 13
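The check described above might look like this (function name hypothetical; the error string follows the example in the commit message):

```rust
/// strains[i] must pair with x_vals[i], so the two arrays must have the
/// same length. Catching a mismatch here prevents silently correlating
/// misaligned data downstream.
fn check_alignment(strains: &[String], x_vals: &[f64]) -> Result<(), String> {
    if strains.len() != x_vals.len() {
        return Err(format!(
            "Array length mismatch: {} strains but {} x_vals",
            strains.len(),
            x_vals.len()
        ));
    }
    Ok(())
}
```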

JSON strings cannot use backslash for line continuation.
Put _comment on single line to fix parse error.

Error was: 'invalid escape' at line 3

Add debugging output to strain mismatch error to show
first 10 available strains from LMDB metadata.

This helps diagnose which strains are actually in the dataset.

The actual strains in HC_M2_0606_P are simple numbers like '1', '2', '5',
NOT 'BXD1', 'BXD2'. This dataset uses numeric strain identifiers.

Valid strains (first 12): '1', '2', '5', '6', '8', '9', '11', '12', '14', '15', '17', '18'
Missing from original: '13', '16' do not exist in this dataset

Fixes 'Strain mismatch: requested 12 strains, found 10' error.

HC_M2_0606_P uses numeric strain IDs ('1', '2', '5'), not BXD names.
Update all documentation and examples to reflect this.

Files updated:
- tests/data/input_lmdb_full.json
- LMDB_USAGE.md (3 occurrences)

Note: Different datasets may use different strain naming conventions.
Always check actual strains with: guix shell python python-lmdb -- python3 -c ...

Update all examples to use descriptive strain names instead of IDs:

- input_lmdb.json: Use 'C57BL/6J', 'DBA/2J', 'BXD1', etc.
- input_lmdb_full.json: Same update with explanatory comment
- LMDB_USAGE.md: Update all strain examples to use names
- check_strains.py: New helper script to inspect LMDB strains

The _comment field in JSON explains that users should check
actual strain names in their LMDB after re-dumping with the
updated batch_lmdb_metadata.py script.

Example strain names are typical for mouse datasets:
- C57BL/6J (parental strain)
- DBA/2J (parental strain)
- BXD1, BXD2, etc. (recombinant inbred strains)

The LMDB was re-dumped with the updated batch_lmdb_metadata.py
and now contains descriptive strain names instead of IDs.

Updated strains from error message:
- BXD11, BXD12, BXD45, BXD48, BXD50, BXD51
- BXD55, B6D2F1, D2B6F1, PWK/PhJ, C57BL/6J, DBA/2J

Fixes 'Strain mismatch: requested 12 strains, found 11' error.

Add parallel processing support using rayon crate:

1. Cargo.toml: Add rayon = "1.7" dependency

2. parser.rs: Add 'parallel' field to JsonData (default: true)

3. correlations.rs:
   - Add parallel: bool field to Compute struct
   - Add new_with_strains_parallel() constructor
   - Implement both parallel (par_iter) and sequential versions
   - Collect all traits first, then process in parallel
   - Results are identical between modes

4. analysis.rs: Pass parallel flag from JSON to Compute

5. JSON examples: Add 'parallel': true flag

Testing:
- Sequential: 497ms
- Parallel: 502ms
- Output files: IDENTICAL (diff produces no output)

Note: Speedup depends on dataset size and CPU cores.
For 45K traits, overhead may offset gains on some systems.

Total commits: 14
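The PR itself uses rayon's par_iter; as a stdlib-only illustration of the same idea (not the PR's code), scoped threads show why the parallel and sequential outputs can be byte-identical: traits are split into chunks, each chunk is processed on its own thread, and results are reassembled in spawn order:

```rust
use std::thread;

/// Process every trait with `f`, in parallel, preserving input order.
/// With rayon this collapses to `traits.par_iter().map(f).collect()`.
fn process_parallel<F>(traits: &[f64], f: F, n_threads: usize) -> Vec<f64>
where
    F: Fn(f64) -> f64 + Sync,
{
    let n = n_threads.max(1);
    // Ceiling division so every trait lands in exactly one chunk.
    let chunk = ((traits.len() + n - 1) / n).max(1);
    let f = &f;
    thread::scope(|s| {
        let handles: Vec<_> = traits
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|&t| f(t)).collect::<Vec<f64>>()))
            .collect();
        // Joining in spawn order preserves the original trait order,
        // so `diff` against the sequential output produces nothing.
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    })
}
```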

Tested with 500K traits in release mode:

Parallel mode:   1.18s
Sequential mode: 1.74s
Speedup:         1.47x (47% faster)

Results identical between modes (verified with diff).

Release build is 3-5x faster than debug build.
Total speedup from debug sequential to release parallel: 4.8x

Tested with 2,000,000 traits in release mode:

Parallel:    5.32s
Sequential:  7.65s
Speedup:     1.44x (44% faster)

Output: 116MB file with 2 million correlations
Results verified identical between modes.

Performance scales well with dataset size:
- 500K traits: ~1.2s
- 2M traits:   ~5.3s (4.5x scaling)

- Add file_type field to JsonData with default 'csv'
- Update Analysis::compute to use file_type instead of auto-detection
- Default is 'csv' for backward compatibility
- Set to 'lmdb' for LMDB mode with strain extraction

- Gracefully drop strains not in dataset and continue
- Add LMDB_TEST_LOCK mutex for parallel test safety
- Use filtered_x_vals aligned with valid strains
- Warn about filtered strains to stderr

- input_self_correlation.json: validate r=1.0 for self-correlation
- input_missing_strains_*.json: test strain filtering behavior
- input_lmdb_all_strains.json: test with all 99 strains
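The drop-and-continue behavior might be sketched as follows (names hypothetical):

```rust
use std::collections::HashSet;

/// Keep only the requested strains that exist in the dataset, filter
/// x_vals at the same positions so the two arrays stay aligned, and
/// warn on stderr about anything dropped.
fn filter_missing_strains(
    strains: &[String],
    x_vals: &[f64],
    dataset: &HashSet<String>,
) -> (Vec<String>, Vec<f64>) {
    let mut kept_strains = Vec::new();
    let mut kept_x = Vec::new();
    for (s, &x) in strains.iter().zip(x_vals) {
        if dataset.contains(s) {
            kept_strains.push(s.clone());
            kept_x.push(x);
        } else {
            eprintln!("Warning: strain '{}' not found in dataset, skipping", s);
        }
    }
    (kept_strains, kept_x)
}
```

Filtering both arrays in one pass is what keeps `filtered_x_vals` aligned with the surviving strains.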

- Filter NaN from x_vals at start (once, not 45k times)
- Validate sufficient samples after NaN filtering
- Clear error messages for insufficient valid data
- Remove redundant x-NaN checks from inner loops
- Warn user about filtered samples

Improves performance and correctness: ~50% fewer NaN checks
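The upfront filtering could look like this hypothetical sketch (here pairing x_vals with their column indices so later trait lookups stay aligned; the real code works on trait rows read from LMDB):

```rust
/// Drop NaN x_vals (and their columns) once, up front, then require
/// enough remaining samples before any correlation is attempted.
/// Doing this once replaces the per-trait NaN checks in the inner loop.
fn filter_nan_x_vals(
    x_vals: &[f64],
    cols: &[usize],
    min_samples: usize,
) -> Result<(Vec<f64>, Vec<usize>), String> {
    let mut xs = Vec::new();
    let mut kept_cols = Vec::new();
    for (&x, &c) in x_vals.iter().zip(cols) {
        if !x.is_nan() {
            xs.push(x);
            kept_cols.push(c);
        }
    }
    if xs.len() < min_samples {
        return Err(format!(
            "Insufficient valid samples after NaN filtering: {} (need at least {})",
            xs.len(),
            min_samples
        ));
    }
    Ok((xs, kept_cols))
}
```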