Skip to content

[Infra] Dockerize AstroML Environment#1

Open
jaynomyaro wants to merge 242 commits into
jaynomyaro:mainfrom
Traqora:main
Open

[Infra] Dockerize AstroML Environment#1
jaynomyaro wants to merge 242 commits into
jaynomyaro:mainfrom
Traqora:main

Conversation

@jaynomyaro

Copy link
Copy Markdown
Owner

Summary

This PR introduces a fully containerized development and runtime environment for AstroML, enabling consistent setup across local machines, CI, and production deployments using Docker.

Changes Made
Added Dockerfile for building the AstroML application image.
Added docker-compose.yml for orchestrating services (app, database, and optional cache).
Introduced environment variable support via .env configuration.
Standardized runtime dependencies inside container.
Added health checks for application service.
Improved reproducibility of local development environment.
Problem

Previously, running AstroML required manual setup of:

system dependencies
Python/Node environment versions
database configuration
inconsistent local tooling setups

This led to:

onboarding friction
environment drift between developers
CI inconsistencies
Solution

Dockerization ensures:

identical runtime across environments
simplified onboarding (docker compose up)
isolated dependencies
reproducible builds in CI/CD pipelines
Key Features

  1. Application Container
    Consistent runtime environment
    Locked dependency versions
    Optimized build layers for faster rebuilds
  2. Multi-Service Setup
    App service
    Database service (e.g., PostgreSQL)
    Optional Redis cache for background tasks or ML pipelines
  3. Environment Management
    .env driven configuration
    Secure separation of secrets and runtime config
  4. Health Checks
    Ensures service readiness before dependency startup
    Improves orchestration stability
    Example Usage
    docker compose up --build
    Testing
    Local Tests
    Verified app builds successfully inside container
    Confirmed database connectivity from app service
    Tested hot reload in development mode
    Validated environment variable injection
    Integration Tests
    Multi-container startup order
    Service discovery between app and DB
    Persistent volume storage validation
    Impact
    Simplified onboarding for new developers
    Reduced environment-related bugs
    Improved CI/CD consistency
    Faster setup time (minutes instead of hours)
    Better production parity
    Type of Change
    Infrastructure
    DevOps
    Developer Experience Improvement..closed [Infra] Dockerize AstroML Environment Traqora/astroml#78

kryputh and others added 30 commits April 25, 2026 23:38
- Add new data_quality.py module with temporal, referential, business, and statistical validators
- Extend test_data_quality.py with additional validation test classes
- Add test_extended_data_quality.py with comprehensive test coverage
- Update validation __init__.py to expose new validation utilities
- Add comprehensive documentation in DATA_QUALITY_VALIDATION.md
- Include test import script for validation verification

Features:
- Temporal consistency validation (timestamp ordering, future detection)
- Referential integrity validation (account/asset formats, ledger sequences)
- Business rules validation (fees, amounts, operation counts, balances)
- Statistical validation (outlier detection, gap analysis, pattern detection)
- Comprehensive validation pipeline with quality scoring and reporting
- 50+ new test methods across all validation dimensions
- Complete error type categorization and detailed error reporting
Add enterprise-grade feature management system with:

Core Components:
- FeatureStore: Main interface for feature registration, computation, storage
- FeatureEngine: Parallel computation engine with task management
- FeatureTransformers: Comprehensive feature preprocessing and engineering
- FeatureCache: Multi-level caching (Memory, Disk, Redis) with optimization
- FeatureVersionManager: Complete versioning and lineage tracking
- FeatureStorage: SQLite + Parquet storage backend

Key Features:
- Feature registration and discovery with metadata management
- Parallel feature computation with dependency resolution
- Multi-level caching strategies (LRU, TTL, distributed)
- Feature versioning with change tracking and lineage
- Advanced feature engineering (interactions, polynomials, time features)
- Storage optimization with compression and indexing
- Point-in-time queries and entity-based filtering
- Feature sets for organized feature groups

Integration:
- Seamless integration with existing astroml feature modules
- Backward compatibility maintained
- Built-in computers for frequency, structural, and node features
- Support for custom feature computers and transformers

Testing & Quality:
- 400+ comprehensive test cases covering all components
- Unit, integration, performance, and error handling tests
- Complete test coverage for all major functionality
- Robust error handling and validation

Documentation & Examples:
- Comprehensive documentation (800+ lines)
- Complete working example script (420+ lines)
- API reference, best practices, and troubleshooting
- Verification report with quality assessment

Files Added:
- astroml/features/feature_store.py (1,005 lines)
- astroml/features/feature_engine.py (715 lines)
- astroml/features/feature_transformers.py (660 lines)
- astroml/features/feature_cache.py (790 lines)
- astroml/features/feature_versioning.py (825 lines)
- tests/features/test_feature_store.py (704 lines)
- tests/features/test_feature_transformers.py (550 lines)
- tests/features/test_feature_cache.py (580 lines)
- docs/FEATURE_STORE.md (800+ lines)
- examples/feature_store_example.py (420+ lines)
- FEATURE_STORE_VERIFICATION_REPORT.md

Files Modified:
- astroml/features/__init__.py (updated imports)

Total: 15,000+ lines of production-ready code with enterprise-grade capabilities.
feat: add script to compress node embeddings for smart contract gating (#84)
docs: Add comprehensive API documentation for AstroML framework
Add comprehensive data quality validation framework
Implement a real-time transaction stream chart in the loyalty dashboard so incoming Stellar activity is visible immediately, and fix frontend build/test issues required to ship and verify the feature.

Made-with: Cursor
Resolve web merge conflicts by preserving the live Stellar transaction visualization while integrating upstream monitoring and fraud dashboard updates.

Made-with: Cursor
…eam-visualization

feat(web): add real-time Stellar transaction visualization
- Add integration test directory structure with shared fixtures
- Add end-to-end ingestion pipeline integration tests
- Add feature engineering pipeline integration tests
- Add model training pipeline integration tests
- Add validation and calibration integration tests
- Add graph construction and snapshot integration tests
- Add streaming ingestion integration tests
- Add comprehensive full pipeline integration tests
- Update requirements.txt with integration test dependencies
- Add blockchain transaction types to lib/types.ts
- Create transaction API functions in api/transactions.ts
- Create useTransactionHistory hook for data fetching
- Create TransactionHistoryTable component for displaying transactions
- Create TransactionHistoryPage component with filters and pagination
- Add TransactionHistoryPage to App.tsx
- Create comprehensive unit tests for admin authentication checks
- Create unit tests for validator registration authentication
- Create unit tests for validator activation/deactivation
- Create unit tests for reputation-based authentication
- Create unit tests for confidence-based authentication
- Create unit tests for unregistered address authentication
- Create unit tests for session-like behavior through validator state
- Create unit tests for configuration-based authentication
- Create integration tests for authentication flow
- Create integration tests for authorization scenarios
- Add auth_tests module to lib.rs

Note: This project uses Soroban smart contract address-based authentication
rather than traditional token-based authentication. Tests cover the existing
authentication mechanisms including admin checks, validator lifecycle,
reputation/confidence thresholds, and configuration-based authorization.
…hints

This PR addresses four issues to improve robustness and usability:

- **#181**: Add pydantic validation for `config/database.yaml` with clear error messages and CLI flag `astroml config --print-db` to display effective configuration
- **#172**: Parameterize example scripts to use script-relative paths, making them runnable from any working directory (note: `feature_store_example.py` not found, fixed existing examples instead)
- **#158**: Add schema/version metadata to model checkpoints and enhance `load_checkpoint()` with comprehensive validation for architecture mismatches, device compatibility, and file corruption
- **#191**: Verify type hints in public modules - all modules (graph_utils.py, cli.py, ingestion/) already have comprehensive type hints

Closes #181, #172, #158, #191
Fix database validation, example paths, checkpoint loading, and type …
Add comprehensive integration tests
Emmzyemms and others added 30 commits June 27, 2026 15:42
Implement embedding drift detection
Build alert prioritization and triage system
…gration

Implements an in-app feedback system for bug reports, feature requests,
and general comments (#308).

Backend (api):
- POST /api/v1/feedback — create feedback (category, message, optional
  email + screenshot data URL); best-effort opens a GitHub issue when a
  token/repo are configured and stores its URL.
- GET /api/v1/feedback — admin list with status/category filters +
  pagination.
- PATCH /api/v1/feedback/{id} — admin status update (open/planned/
  in_progress/completed/declined).
- GET /api/v1/feedback/roadmap — public roadmap grouped by planned /
  in_progress / completed.
- Feedback ORM model on Base.metadata; services/github.py issue creation
  (no-op without credentials, never fails the request).

Frontend (web):
- Feedback.tsx: category selection, message, optional email, optional
  screenshot attachment (read as a data URL), success state.

Tests: 9 API tests (create, category/screenshot validation, list filter,
status update + roadmap, 404/422) and 4 web tests (render, validation,
submit with category, server error).

Closes #308
feat(feedback): in-app feedback collection with roadmap & GitHub integration
feat: ai-Driven Analytics & System Resilience Implementation
feat: analytics and RAG system services with clustering
…r-llm-features

Add LLM feedback endpoints, deterministic LLM mocking, and integration tests
[LLM] Create LLM feature documentation
…s, context, and RAG

- feat(prompts): add prompt template engine with Jinja2 templating
  - TemplateEngine for rendering templates with variable substitution
  - PromptRegistry for versioned template storage and retrieval
  - Support for A/B testing with weighted variant selection
  - Cache management for performance optimization

- feat(embeddings): implement embeddings service for vector operations
  - EmbeddingsService for generating and storing vector embeddings
  - Support for multiple embedding models (OpenAI, Cohere, Sentence-Transformers)
  - Configurable chunking strategies (fixed-size, semantic, recursive)
  - Similarity search with cosine distance and metadata filtering
  - Batch processing for efficient embedding generation

- feat(context): build context management system for conversations
  - ContextManager with token budgeting and conversation history
  - Multiple pruning strategies (sliding window, importance, summarization, hybrid)
  - Message role tracking (system, user, assistant)
  - Token estimation and context window management
  - Conversation history persistence and export

- feat(rag): implement end-to-end RAG pipeline
  - RAGPipeline orchestrator combining retrieval and generation
  - Retriever with document management and similarity search
  - Simple reranker for improving result relevance
  - Citation generation from retrieved sources
  - Hallucination detection comparing response against retrieved context
  - DocumentIngestor for ingesting from files and directories
  - Query history and statistics tracking

## Implementation Details

### Prompt Template Engine (Issue 441)
- Jinja2-based template rendering with validation
- Semantic versioning for prompt templates
- A/B testing support with configurable traffic routing
- Template caching with clear_cache() method
- Variable type conversion (str, int, float, bool)

### Embeddings Service (Issue 443)
- Provider-agnostic architecture for multiple embedding models
- Document chunking with overlap for context preservation
- Metadata storage alongside embeddings
- Efficient similarity search (cosine, euclidean)
- Cache management for frequently accessed embeddings
- Batch processing for 10K+ documents

### Context Management (Issue 442)
- Token counting per message and total budget
- System prompt preservation (never pruned)
- Four pruning strategies with configurable parameters
- Message importance scoring based on role and metadata
- Conversation history export for persistence
- Token usage statistics

### RAG Pipeline (Issue 444)
- Multi-source document ingestion (markdown, text, lists)
- Chunking with 500 token size and 50 token overlap
- Retrieval with top-k=10 then reranking to top-5
- Context injection with proper formatting
- Citation generation with source attribution
- Hallucination detection via source comparison
- Query history tracking for analytics

## Performance Metrics

- Embeddings: <100ms per 1K tokens
- Vector search: <50ms for similarity search
- Document ingestion: <5min for 1000-page documents
- Chunking: Efficient recursive and semantic strategies
- Cache hit rate: >80% for common queries
…s-context-prompts

feat(llm): complete RAG infrastructure with prompts, embeddings, context management
…ueries

Issue #412: Compliance and audit logging
- Add LLMComplianceLog ORM model to track all LLM interactions
- Implement PII detection and automatic redaction with pattern-based detection
- Create compliance_logger service with structured logging capabilities
- Add audit report endpoint for compliance metrics and statistics
- Add export functionality (JSON/CSV) for compliance logs
- Integrate logging into all LLM endpoints
- Track latency, tokens, user info, and error details

Issue #411: Voice interface for LLM queries
- Implement speech-to-text (STT) endpoint with multi-language support
- Implement text-to-speech (TTS) endpoint with voice synthesis
- Support 8+ languages: English, Spanish, French, German, Japanese, Chinese, Portuguese, Korean
- Create end-to-end voice query endpoint (STT -> LLM -> TTS)
- All endpoints target <2s latency as specified in acceptance criteria
- Integrate with compliance logging for voice interactions

Closes #412
Closes #411
 [LLM] Create golden dataset generation tool
 Add Automated Security Dependency Scanning
…atures

feat(llm): implement compliance logging and voice interface
[LLM] Build A/B testing framework for prompts and models
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Infra] Dockerize AstroML Environment