I build governed data and AI platforms: lakehouse architecture, RAG and LLM systems, and responsible-AI workflows. 15+ years architecting data infrastructure in higher education, including about 10 years behind teaching, advising, and research at UC Berkeley.
Open to senior data & AI architecture, engineering, data science, and leadership roles.
Lakehouse & Data Architecture · RAG / LLM Systems · MLOps & Observability · Responsible AI & Governance · AWS · Azure · Python · Spark
I build end-to-end reference implementations — each takes one architecture or technique from raw data through to a working, governed system. Together they trace one arc: prototype the product → industrialize the data foundation → build the AI systems on top of it.
Flagship builds
- Campus RAG Assistant — production-style enterprise RAG platform with a pluggable, multi-cloud LLM + RAG + vector-store provider registry supporting mixed deployments: (AWS Bedrock + OpenSearch) / (Azure OpenAI + AI Search) / mock mode. LangGraph orchestration, RAGAS evaluation, LangSmith tracing, and full CI/CD via GitHub Actions. Next.js + FastAPI. Documentation →
- Scribe IQ Lakehouse: production-pattern healthcare data lakehouse on Synthea Coherent (FHIR R4). A Bronze → Silver → Gold medallion built twice, as two independent, engine-native implementations that converge on the same Gold data contract by schema parity, not shared code. Documentation →
  - Polars + delta-rs + DuckDB on a local laptop, orchestrated with Dagster, emitting one versioned, test-gated Gold data contract
  - Microsoft Fabric-native Spark + Delta + Pipelines
- Scribe IQ — governed clinical-documentation AI: note-grounded RAG with enforced citations, structured note generation, and a first-class audit trail. Built entirely on synthetic data (Synthea + public clinical-note datasets). The lakehouse above is the principled rebuild of its data foundation. Documentation →
- Fabric HLS Readmission Lakehouse (landing this week) — Microsoft Fabric-native medallion lakehouse and ML scoring for synthetic hospital-readmission analytics, with an explicit Databricks-to-Fabric pattern mapping. Code is ready; the public repository goes up shortly.
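
The "schema parity, not shared code" idea in the Scribe IQ Lakehouse entry can be illustrated with a small check. This is a minimal sketch, not the repository's actual contract or test suite: the `Column` type, the `GOLD_CONTRACT` columns, and the `parity_errors` helper are all hypothetical names invented for illustration. Each engine would report its own Gold schema, and both would be compared against one declared contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One column of a declared data contract (hypothetical type)."""
    name: str
    dtype: str
    nullable: bool = False


# Hypothetical Gold contract. The real contract's columns are not listed
# in this README, so these names and types are assumptions.
GOLD_CONTRACT = (
    Column("patient_id", "string"),
    Column("encounter_id", "string"),
    Column("encounter_start", "timestamp"),
    Column("note_text", "string", nullable=True),
)


def parity_errors(contract, observed):
    """Return a list of differences between a contract and an observed schema."""
    expected = {c.name: c for c in contract}
    actual = {c.name: c for c in observed}
    errors = []
    # Columns the contract requires but the implementation did not emit.
    for name in sorted(expected.keys() - actual.keys()):
        errors.append(f"missing column: {name}")
    # Columns the implementation emitted but the contract does not declare.
    for name in sorted(actual.keys() - expected.keys()):
        errors.append(f"unexpected column: {name}")
    # Columns present on both sides must agree on type and nullability.
    for name in sorted(expected.keys() & actual.keys()):
        want, got = expected[name], actual[name]
        if want.dtype != got.dtype:
            errors.append(f"{name}: dtype {got.dtype} != {want.dtype}")
        if want.nullable != got.nullable:
            errors.append(f"{name}: nullable {got.nullable} != {want.nullable}")
    return errors


# Stand-ins for the schemas each engine would report for its Gold table.
local_gold = list(GOLD_CONTRACT)
fabric_gold = [
    Column("patient_id", "string"),
    Column("encounter_id", "string"),
    Column("encounter_start", "date"),  # deliberate drift for the example
]

print(parity_errors(GOLD_CONTRACT, local_gold))
print(parity_errors(GOLD_CONTRACT, fabric_gold))
```

A check like this could gate each pipeline independently, so the two implementations never need to import each other's code: they only need to agree with the same declared contract.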
UC Berkeley Educational Technology Services (@ets-berkeley-edu)
Contributor to the campus data and learning-platform ecosystem:
- boac — Berkeley Online Advising (BOA), the award-winning academic advising platform
- nessie — data pipeline and analytics engine
- chabot — Lake Chabot, an RTL GenAI chatbot platform for support use cases (limited pilot)
- data-loch — AWS data lake infrastructure for learning data
- cloud-lrs — cloud-based Learning Record Store
- cloudlrs-ingest-microservice — learning-events processing microservice for the LRS
- bcourses-chatbot-poc — GenAI support chatbot for internal training/tutorial purposes
Apereo Learning Analytics Initiative
Analytics Liaison and Community Coordinator for the Apereo Foundation Learning Analytics Initiative; contributor to the open learning-analytics ecosystem:
- LearningAnalyticsProcessor — an open-source, Java-based analytics workflow manager running Pentaho-based data-integration + ML pipelines; the first automation of OAAI research.
- 15+ years of experience in Higher Education
- ~10 years at UC Berkeley, building governed cloud-native data and AI/ML platforms — the foundational RTL Data Lake and the data systems behind the award-winning Berkeley Online Advising
- Lead Architect for enterprise data lake and lakehouse platforms; built data-mesh architectures supporting domain ownership and seamless connectivity across campus
- Developed a high-throughput multi-tenant streaming platform processing ~5M events/day on average
- Data Science and ML/NLP work supporting research enablement — built an MLOps pipeline for reproducible research
- Now working on the campus's governed GenAI initiatives, building production-style knowledge assistants grounded in institutional data with Responsible-AI audit and provenance trails
- Earlier, at Marist University, led the Gates Foundation-funded Open Academic Analytics Initiative, working with principal investigators to build open-source academic early-alert risk models
- Scaled that research to multi-institution production deployments
- Ten peer-reviewed publications in Learning Analytics and Educational Data Mining




