sandeep-jay/README.md

Sandeep Jayaprakash

Data & AI Platform Architect & Engineering Leader

I build governed data and AI platforms: lakehouse architecture, RAG and LLM systems, and responsible-AI workflows. 15+ years in higher education, including ~10 years at UC Berkeley architecting the data infrastructure behind teaching, advising, and research.

Open to senior data & AI architecture, engineering, data science, and leadership roles.


Focus areas

Lakehouse & Data Architecture · RAG / LLM Systems · MLOps & Observability · Responsible AI & Governance · AWS · Azure · Python · Spark

Projects & reference implementations

I build end-to-end reference implementations — each takes one architecture or technique from raw data through to a working, governed system. Together they trace one arc: prototype the product → industrialize the data foundation → build the AI systems on top of it.

Flagship builds

  • Campus RAG Assistant — production-style enterprise RAG platform with a pluggable, multi-cloud LLM + RAG + vector-store provider registry supporting mixed deployments: (AWS Bedrock + OpenSearch) / (Azure OpenAI + AI Search) / mock mode. LangGraph orchestration, RAGAS evaluation, LangSmith tracing, and full CI/CD via GitHub Actions. Next.js + FastAPI. Documentation →
  • Scribe IQ Lakehouse — production-pattern healthcare data lakehouse on Synthea Coherent (FHIR R4). A Bronze → Silver → Gold medallion built twice
    1. Polars + delta-rs + DuckDB on a local laptop, orchestrated with Dagster, emitting one versioned, test-gated Gold data contract
    2. Microsoft Fabric-native Spark + Delta + Pipelines
    The two independent, engine-native implementations converge on the same contract by schema parity, not shared code. Documentation →
  • Scribe IQ — governed clinical-documentation AI: note-grounded RAG with enforced citations, structured note generation, and a first-class audit trail. Built entirely on synthetic data (Synthea + public clinical-note datasets). The lakehouse above is the principled rebuild of its data foundation. Documentation →
  • Fabric HLS Readmission Lakehouse (landing this week) — Microsoft Fabric-native medallion lakehouse and ML scoring for synthetic hospital-readmission analytics, with an explicit Databricks-to-Fabric pattern mapping. Code is ready; the public repository goes up shortly.
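For the Scribe IQ Lakehouse entry above, "converging on the same contract by schema parity, not shared code" can be illustrated with a minimal sketch. This is not code from the repository: the contract columns, the type mappings, and the function names `normalize` and `parity_violations` are all hypothetical, and the schemas are hand-written stand-ins for what each engine might report. The idea is that each implementation is checked independently against one engine-neutral contract.

```python
# Hypothetical Gold contract: column name -> engine-neutral logical type.
GOLD_CONTRACT_V1 = {
    "patient_id": "string",
    "encounter_count": "int64",
    "last_encounter_date": "date",
}

# Illustrative mappings from engine-specific type names to logical types.
POLARS_TO_LOGICAL = {"String": "string", "Int64": "int64", "Date": "date"}
SPARK_TO_LOGICAL = {"string": "string", "bigint": "int64", "date": "date"}


def normalize(schema: dict[str, str], mapping: dict[str, str]) -> dict[str, str]:
    """Translate engine-specific type names into logical types."""
    return {col: mapping.get(dtype, f"unknown({dtype})") for col, dtype in schema.items()}


def parity_violations(contract: dict[str, str], observed: dict[str, str]) -> list[str]:
    """List every way an observed (normalized) schema departs from the contract."""
    problems = []
    for col, expected in contract.items():
        if col not in observed:
            problems.append(f"missing column: {col}")
        elif observed[col] != expected:
            problems.append(
                f"type mismatch on {col}: expected {expected}, got {observed[col]}"
            )
    for col in observed:
        if col not in contract:
            problems.append(f"unexpected column: {col}")
    return problems


# Made-up schemas, as each pipeline might report its Gold table.
polars_gold = {"patient_id": "String", "encounter_count": "Int64", "last_encounter_date": "Date"}
spark_gold = {"patient_id": "string", "encounter_count": "bigint", "last_encounter_date": "date"}

polars_problems = parity_violations(GOLD_CONTRACT_V1, normalize(polars_gold, POLARS_TO_LOGICAL))
spark_problems = parity_violations(GOLD_CONTRACT_V1, normalize(spark_gold, SPARK_TO_LOGICAL))
print(polars_problems, spark_problems)
```

Run as a test gate in each pipeline, a check like this lets the two builds share nothing except the contract definition: a change on either side fails that side's own gate.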

Open source & community

UC Berkeley Educational Technology Services (@ets-berkeley-edu)

Contributor to the campus data and learning-platform ecosystem:

  • boac — Berkeley Online Advising (BOA), the award-winning academic advising platform
  • nessie — data pipeline and analytics engine
  • chabot — Lake Chabot, an RTL GenAI chatbot platform for support use cases (limited pilot)
  • data-loch — AWS data lake infrastructure for learning data
  • cloud-lrs — cloud-based Learning Record Store
  • cloudlrs-ingest-microservice — learning-events processing microservice for the LRS
  • bcourses-chatbot-poc — GenAI support chatbot for internal training/tutorial purposes

Apereo Learning Analytics Initiative

Analytics Liaison and Community Coordinator for the Apereo Foundation Learning Analytics Initiative; contributor to the open learning-analytics ecosystem:

  • LearningAnalyticsProcessor — an open-source, Java-based analytics workflow manager running Pentaho-based data-integration + ML pipelines; the first automation of OAAI research.

Background

  • 15+ years of experience in Higher Education
  • ~10 years at UC Berkeley, building governed cloud-native data and AI/ML platforms — the foundational RTL Data Lake and the data systems behind the award-winning Berkeley Online Advising
  • Lead Architect for enterprise data lakes and lakehouses; built data-mesh architectures supporting domain ownership and seamless connectivity across campus
  • Developed a high-throughput multi-tenant streaming platform processing ~5M events/day on average
  • Data Science and ML/NLP work supporting research enablement — built an MLOps pipeline for reproducible research
  • Now working on the campus's governed GenAI initiatives, building production-style knowledge assistants grounded in institutional data with Responsible-AI audit and provenance trails
  • Earlier, at Marist University, led the Gates Foundation-funded Open Academic Analytics Initiative, working with principal investigators to build open-source academic early-alert risk models
  • Scaled that research to multi-institution production deployments
  • Ten peer-reviewed publications in Learning Analytics and Educational Data Mining

Connect

LinkedIn · Google Scholar

Pinned

  1. campus-rag-assistant (Public)

    Production-minded multicloud RAG + agentic helpdesk platform for governed campus knowledge: LangGraph orchestration, AWS Bedrock / Azure AI Search providers, cited answers, HITL ticket filing, RAGA…

    Python

  2. scribe-iq (Public)

    Grounded clinical documentation AI prototype built on a synthetic Synthea patient spine, public clinical note corpora, RAG, pgvector, FastAPI, Next.js, AWS/Azure LLM providers, and governed LLM aud…

    Python

  3. scribe-iq-lakehouse (Public)

    Production-pattern healthcare data lakehouse: a Bronze→Silver→Gold medallion over Synthea Coherent FHIR, built twice — Polars + delta-rs locally and Spark + Delta on Microsoft Fabric — orchestrated…

    Python

  4. boac (Public)

    Forked from ets-berkeley-edu/boac

    Berkeley Online Advising (BOA) ✈️

    Python

  5. nessie (Public)

    Forked from ets-berkeley-edu/nessie

    Networked engines supply statistics in education.

    Python

  6. ets-berkeley-edu/data-loch (Public)

    AWS Data Lake infrastructure to store and process Learning Data

    JavaScript