Feature/lmdb optimization #6

Open

Alexanderlacuna wants to merge 30 commits into main from feature/lmdb-optimization

Conversation

@Alexanderlacuna (Owner)

Test feature for reading from LMDB.

Add optional 'strains' field to JsonData struct to support LMDB correlation.
This field provides strain names that correspond to x_vals indices,
enabling Rust to extract the correct columns from the LMDB matrix.

- Remove 'use_lmdb' flag (auto-detect via directory check instead)
- Add strains: Option<Vec<String>> with serde(default)
- Update test to use new field structure
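A minimal JSON input sketch using the new field (only the fields named in these commits are shown; a real input file also carries the other required fields):

```json
{
  "x_vals": [6.5, 7.2, 5.9],
  "strains": ["BXD1", "BXD2", "BXD5"],
  "_comment": "strains[i] must correspond to x_vals[i]"
}
```

Because `strains` is declared with `serde(default)`, omitting the field deserializes to `None` and preserves the old behavior.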

Add strains field to Compute struct and update compute_from_lmdb()
to extract specific columns based on strain names.

- Add strains: Option<&[String]> field to Compute struct
- Add new_with_strains() constructor for LMDB mode
- Update compute_from_lmdb() to:
  * Build strain -> column index mapping from LMDB metadata
  * Extract only specified columns when strains provided
  * Validate all requested strains exist in dataset
  * Fall back to all columns when strains is None (backward compat)
- Update existing tests to include strains field
- Update input_lmdb.json example with strains array

This enables correlation with any subset of strains while maintaining
backward compatibility with existing CSV mode.
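The mapping-and-extraction steps above can be sketched roughly as follows. The function names `build_strain_index` and `resolve_columns` are illustrative, not the PR's actual identifiers:

```rust
use std::collections::HashMap;

/// Map each dataset strain name to its column index in the LMDB matrix.
/// `dataset_strains` stands in for the strain list read from LMDB metadata.
fn build_strain_index(dataset_strains: &[String]) -> HashMap<&str, usize> {
    dataset_strains
        .iter()
        .enumerate()
        .map(|(i, s)| (s.as_str(), i))
        .collect()
}

/// Resolve requested strains to column indices, erroring on any miss.
/// When `requested` is None, fall back to all columns (backward compat).
fn resolve_columns(
    dataset_strains: &[String],
    requested: Option<&[String]>,
) -> Result<Vec<usize>, String> {
    match requested {
        None => Ok((0..dataset_strains.len()).collect()),
        Some(names) => {
            let index = build_strain_index(dataset_strains);
            names
                .iter()
                .map(|n| {
                    index
                        .get(n.as_str())
                        .copied()
                        .ok_or_else(|| format!("Strain mismatch: '{}' not in dataset", n))
                })
                .collect()
        }
    }
}
```

Validating every requested strain up front (rather than silently skipping misses) is what produces the "Strain mismatch" errors mentioned in the later commits.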

Add unit tests for the new strains functionality:

- test_compute_with_strains(): Verify new_with_strains() constructor
- test_compute_without_strains(): Verify backward compatibility (None strains)
- test_lmdb_correlation_with_strains(): Integration test with subset of strains
- test_lmdb_correlation_all_strains(): Integration test with all strains (legacy)

Integration tests marked with #[ignore] since they require:
- GSL system library
- Actual LMDB dataset at /home/kabui/lmdb_data/GTEx_Liver_0414/

Run ignored tests with:
  cargo test -- --ignored

Basic unit tests run without external dependencies.

Update Analysis::compute() to detect and use strains from JSON input.

- Check if 'strains' field is present in JSON
- If yes: Use new_with_strains() for LMDB column extraction
- If no: Use standard new() for backward compatibility
- Add tests for JSON parsing with and without strains

This completes the pipeline: JSON -> Analysis -> Compute (with strains)

Add comprehensive documentation for the LMDB correlation feature:

- LMDB_USAGE.md: Complete usage guide with:
  * JSON input formats (with/without strains)
  * Auto-detection behavior (directory vs file)
  * Step-by-step usage examples
  * Error handling guide
  * Performance comparison table
  * Architecture diagram

- input_lmdb_full.json: Complete working example with:
  * Real strain names from GTEx_Liver_0414 dataset
  * x_vals aligned to strains
  * All required fields

This documentation enables GN2/GN3 developers to integrate
LMDB correlation support.

Add lmdb_exists() helper and early return in all LMDB tests
to gracefully skip when test data is not present.

Fixes test failures on systems without /home/kabui/lmdb_data/GTEx_Liver_0414/
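A plausible shape for such a helper (the signature is an assumption; LMDB datasets are directories, so a plain `is_dir()` check is enough for a graceful skip):

```rust
use std::path::Path;

/// Return true when the LMDB dataset directory is present.
/// Tests call this first and return early when it is false,
/// instead of failing on machines without the test data.
fn lmdb_exists(path: &str) -> bool {
    Path::new(path).is_dir()
}
```

A test would then start with something like `if !lmdb_exists(path) { eprintln!("skipping: no LMDB data"); return; }`.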

Update all hardcoded LMDB paths from GTEx_Liver_0414 to HC_M2_0606_P
to match the current dataset location.

Files updated:
- src/lmdb_reader.rs: LMDB_PATH constant
- src/correlations.rs: Test paths
- LMDB_USAGE.md: Documentation example
- tests/data/input_lmdb*.json: Example files

Allow custom LMDB paths for testing via LMDB_TEST_PATH env var.

Changes:
- src/lmdb_reader.rs: Add get_lmdb_path() helper that checks env var
- src/correlations.rs: Update integration tests to use env var
- LMDB_USAGE.md: Document testing with custom paths

Default path: /home/kabui/lmdb_data/HC_M2_0606_P/
Override: LMDB_TEST_PATH=/custom/path cargo test -- --ignored

This makes tests portable across different environments without
code changes.
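One way get_lmdb_path() could look. The split into a pure helper, `lmdb_path_from`, is purely for illustration; the PR's actual code may read the env var inline:

```rust
const DEFAULT_LMDB_PATH: &str = "/home/kabui/lmdb_data/HC_M2_0606_P/";

/// Pure core: pick the override when present, else the default.
fn lmdb_path_from(env_value: Option<String>) -> String {
    env_value.unwrap_or_else(|| DEFAULT_LMDB_PATH.to_string())
}

/// Read LMDB_TEST_PATH from the environment, falling back to the
/// compiled-in default path when the variable is unset.
fn get_lmdb_path() -> String {
    lmdb_path_from(std::env::var("LMDB_TEST_PATH").ok())
}
```

This matches the documented usage: `LMDB_TEST_PATH=/custom/path cargo test -- --ignored`.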

Fix two issues introduced in previous commit:

1. lmdb_reader.rs: Fix mangled '0026' prefixes from sed commands
   - Replace all '0026get_lmdb_path()' with proper '&get_lmdb_path()'
   - Rewrite file to ensure clean syntax

2. correlations.rs: Add '&' to borrow String as &str
   - '&lmdb_path' instead of 'lmdb_path' for new_with_strains()
   - Same fix for new() in second test

The LMDB_TEST_PATH feature now compiles correctly.

Fix two issues in LMDB integration tests:

1. Wrong strain names: Tests were using GTEx strains ('1_100_15')
   but HC_M2_0606_P has BXD strains. Now reads actual strains from
   LMDB metadata and uses first 6 for subset test.

2. LMDB lock conflict: Tests were failing with MDB_BAD_RSLOT because
   LmdbReader wasn't dropped before Compute opened new connection.
   Now explicitly drops reader and adds 100ms sleep between opens.

Also improved test diagnostics with eprintln output showing:
- Which strains are being used
- How many traits were correlated
- Output file validation

Tests are marked #[ignore] and require GSL + LMDB to run.

Add comprehensive documentation for running with cargo:

- Quick Start section with common cargo run commands
- Option B in Step 3: Run with Cargo (development)
- Environment variables (LMDB_TEST_PATH, RUST_LOG)
- Python one-liner to check available strains in LMDB

This helps developers test changes without recompiling
the full binary each time.

Replace GTEx strains (1_100_15, etc.) with proper BXD strain names:
- BXD1, BXD2, BXD5, BXD6, BXD8, BXD9
- BXD11, BXD12, BXD13, BXD14, BXD15, BXD16

HC_M2_0606_P is a Hippocampus Consortium M430v2 dataset
which uses BXD recombinant inbred mouse strains, not GTEx human samples.

Fixes 'Strain mismatch' error when running with actual dataset.

Add critical validation to ensure strains and x_vals arrays match:

1. correlations.rs: Check that strains.len() == x_vals.len()
   - Returns clear error if mismatch detected
   - Error message explains the requirement

2. input_lmdb.json: Add _comment explaining alignment requirement
   - strains[i] must correspond to x_vals[i]
   - Both arrays must have same length and order

This prevents silent errors from mismatched input data.
Example error:
  'Array length mismatch: 10 strains but 12 x_vals'

Total commits: 13
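The check described above might look like this (function name hypothetical; the error string follows the example in the commit message):

```rust
/// strains[i] must pair with x_vals[i], so the two arrays must have the
/// same length. Catching a mismatch here prevents silently correlating
/// misaligned data downstream.
fn check_alignment(strains: &[String], x_vals: &[f64]) -> Result<(), String> {
    if strains.len() != x_vals.len() {
        return Err(format!(
            "Array length mismatch: {} strains but {} x_vals",
            strains.len(),
            x_vals.len()
        ));
    }
    Ok(())
}
```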

JSON strings cannot use backslash for line continuation.
Put _comment on single line to fix parse error.

Error was: 'invalid escape' at line 3

Add debugging output to strain mismatch error to show
first 10 available strains from LMDB metadata.

This helps diagnose which strains are actually in the dataset.

The actual strains in HC_M2_0606_P are simple numbers like '1', '2', '5',
NOT 'BXD1', 'BXD2'. This dataset uses numeric strain identifiers.

Valid strains (first 12): '1', '2', '5', '6', '8', '9', '11', '12', '14', '15', '17', '18'
Missing from original: '13', '16' do not exist in this dataset

Fixes 'Strain mismatch: requested 12 strains, found 10' error.

HC_M2_0606_P uses numeric strain IDs ('1', '2', '5'), not BXD names.
Update all documentation and examples to reflect this.

Files updated:
- tests/data/input_lmdb_full.json
- LMDB_USAGE.md (3 occurrences)

Note: Different datasets may use different strain naming conventions.
Always check actual strains with: guix shell python python-lmdb -- python3 -c ...

Update all examples to use descriptive strain names instead of IDs:

- input_lmdb.json: Use 'C57BL/6J', 'DBA/2J', 'BXD1', etc.
- input_lmdb_full.json: Same update with explanatory comment
- LMDB_USAGE.md: Update all strain examples to use names
- check_strains.py: New helper script to inspect LMDB strains

The _comment field in JSON explains that users should check
actual strain names in their LMDB after re-dumping with the
updated batch_lmdb_metadata.py script.

Example strain names are typical for mouse datasets:
- C57BL/6J (parental strain)
- DBA/2J (parental strain)
- BXD1, BXD2, etc. (recombinant inbred strains)

The LMDB was re-dumped with the updated batch_lmdb_metadata.py
and now contains descriptive strain names instead of IDs.

Updated strains from error message:
- BXD11, BXD12, BXD45, BXD48, BXD50, BXD51
- BXD55, B6D2F1, D2B6F1, PWK/PhJ, C57BL/6J, DBA/2J

Fixes 'Strain mismatch: requested 12 strains, found 11' error.

Add parallel processing support using rayon crate:

1. Cargo.toml: Add rayon = "1.7" dependency

2. parser.rs: Add 'parallel' field to JsonData (default: true)

3. correlations.rs:
   - Add parallel: bool field to Compute struct
   - Add new_with_strains_parallel() constructor
   - Implement both parallel (par_iter) and sequential versions
   - Collect all traits first, then process in parallel
   - Results are identical between modes

4. analysis.rs: Pass parallel flag from JSON to Compute

5. JSON examples: Add 'parallel': true flag

Testing:
- Sequential: 497ms
- Parallel: 502ms
- Output files: IDENTICAL (diff produces no output)

Note: Speedup depends on dataset size and CPU cores.
For 45K traits, overhead may offset gains on some systems.

Total commits: 14
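The PR itself uses rayon's par_iter; as a stdlib-only illustration of the same idea (not the PR's code), scoped threads show why the parallel and sequential outputs can be byte-identical: traits are split into chunks, each chunk is processed on its own thread, and results are reassembled in spawn order:

```rust
use std::thread;

/// Process every trait with `f`, in parallel, preserving input order.
/// With rayon this collapses to `traits.par_iter().map(f).collect()`.
fn process_parallel<F>(traits: &[f64], f: F, n_threads: usize) -> Vec<f64>
where
    F: Fn(f64) -> f64 + Sync,
{
    let n = n_threads.max(1);
    // Ceiling division so every trait lands in exactly one chunk.
    let chunk = ((traits.len() + n - 1) / n).max(1);
    let f = &f;
    thread::scope(|s| {
        let handles: Vec<_> = traits
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|&t| f(t)).collect::<Vec<f64>>()))
            .collect();
        // Joining in spawn order preserves the original trait order,
        // so `diff` against the sequential output produces nothing.
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    })
}
```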

Tested with 500K traits in release mode:

Parallel mode:   1.18s
Sequential mode: 1.74s
Speedup:         1.47x (47% faster)

Results identical between modes (verified with diff).

Release build is 3-5x faster than debug build.
Total speedup from debug sequential to release parallel: 4.8x

Tested with 2,000,000 traits in release mode:

Parallel:    5.32s
Sequential:  7.65s
Speedup:     1.44x (44% faster)

Output: 116MB file with 2 million correlations
Results verified identical between modes.

Performance scales well with dataset size:
- 500K traits: ~1.2s
- 2M traits:   ~5.3s (4.5x scaling)

- Add file_type field to JsonData with default 'csv'
- Update Analysis::compute to use file_type instead of auto-detection
- Default is 'csv' for backward compatibility
- Set to 'lmdb' for LMDB mode with strain extraction

- Gracefully drop strains not in dataset and continue
- Add LMDB_TEST_LOCK mutex for parallel test safety
- Use filtered_x_vals aligned with valid strains
- Warn about filtered strains to stderr

- input_self_correlation.json: validate r=1.0 for self-correlation
- input_missing_strains_*.json: test strain filtering behavior
- input_lmdb_all_strains.json: test with all 99 strains
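The drop-and-continue behavior might be sketched as follows (names hypothetical):

```rust
use std::collections::HashSet;

/// Keep only the requested strains that exist in the dataset, filter
/// x_vals at the same positions so the two arrays stay aligned, and
/// warn on stderr about anything dropped.
fn filter_missing_strains(
    strains: &[String],
    x_vals: &[f64],
    dataset: &HashSet<String>,
) -> (Vec<String>, Vec<f64>) {
    let mut kept_strains = Vec::new();
    let mut kept_x = Vec::new();
    for (s, &x) in strains.iter().zip(x_vals) {
        if dataset.contains(s) {
            kept_strains.push(s.clone());
            kept_x.push(x);
        } else {
            eprintln!("Warning: strain '{}' not found in dataset, skipping", s);
        }
    }
    (kept_strains, kept_x)
}
```

Filtering both arrays in one pass is what keeps `filtered_x_vals` aligned with the surviving strains.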

- Filter NaN from x_vals at start (once, not 45k times)
- Validate sufficient samples after NaN filtering
- Clear error messages for insufficient valid data
- Remove redundant x-NaN checks from inner loops
- Warn user about filtered samples

Improves performance and correctness: ~50% fewer NaN checks
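The upfront filtering could look like this hypothetical sketch (here pairing x_vals with their column indices so later trait lookups stay aligned; the real code works on trait rows read from LMDB):

```rust
/// Drop NaN x_vals (and their columns) once, up front, then require
/// enough remaining samples before any correlation is attempted.
/// Doing this once replaces the per-trait NaN checks in the inner loop.
fn filter_nan_x_vals(
    x_vals: &[f64],
    cols: &[usize],
    min_samples: usize,
) -> Result<(Vec<f64>, Vec<usize>), String> {
    let mut xs = Vec::new();
    let mut kept_cols = Vec::new();
    for (&x, &c) in x_vals.iter().zip(cols) {
        if !x.is_nan() {
            xs.push(x);
            kept_cols.push(c);
        }
    }
    if xs.len() < min_samples {
        return Err(format!(
            "Insufficient valid samples after NaN filtering: {} (need at least {})",
            xs.len(),
            min_samples
        ));
    }
    Ok((xs, kept_cols))
}
```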