
Add enterprise analytics tutorial#44

Open
smithjilks wants to merge 3 commits into ultravioletrs:main from smithjilks:feat-enterprise-tutorial

Conversation

@smithjilks
Contributor

What type of PR is this?

What does this do?

Which issue(s) does this PR fix/relate to?

Have you included tests for your changes?

Did you document any new/modified features?

Notes

@smithjilks requested a review from SammyOina March 26, 2026 09:00
@smithjilks self-assigned this Mar 26, 2026
@SammyOina requested a review from fbugarski March 27, 2026 11:20
Contributor

@SammyOina left a comment


Include Prism screenshots and CLI output as well.

- **Secure Computation (aTLS)** — Attested TLS verifies the TEE hardware and software stack before any data is uploaded
- **Multi-Party Computation** — Three independent data providers each upload proprietary datasets into the same encrypted enclave
- **Real-World Data** — Uses the [UCI Online Retail II](https://www.kaggle.com/datasets/mashlyn/online-retail-ii-uci) dataset (real European e-commerce transactions) split across simulated companies
- **Enterprise Value** — Benchmark proves the consortium model outperforms any single-company model

In practice this is not guaranteed. In my local run, the consortium model achieved lower R² than at least one of the individual models. The claim should be softened.
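For reference, R² here is the coefficient of determination used to compare the models. A minimal, dependency-light sketch of the metric itself (toy values, not from the tutorial) shows why a pooled model can legitimately score below a solo model on this scale:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Perfect predictions score 1.0; always predicting the mean scores 0.0.
y = [3.0, 1.0, 4.0, 1.0, 5.0]
print(r2(y, y))                      # 1.0
print(r2(y, [np.mean(y)] * len(y)))  # 0.0
```

Since R² is relative to the variance of each evaluation set, rankings between consortium and single-company models can flip across runs, which is why the "proves ... outperforms" wording should be hedged.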

To train the consortium model locally:

```bash
python train.py
```

This step is sensitive to the current working directory: running train.py from the repository root fails because the script expects datasets/ relative to the CWD. This is not clearly documented and makes the workflow unintuitive.
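One way to make the script CWD-independent (a sketch; the tutorial's actual path handling may differ, and `dataset_path` is a hypothetical helper) is to resolve datasets/ relative to the script file:

```python
from pathlib import Path

# Resolve datasets/ next to the script itself rather than the caller's CWD,
# so running the script from the repository root works too.
SCRIPT_DIR = Path(__file__).resolve().parent
DATASETS_DIR = SCRIPT_DIR / "datasets"

def dataset_path(name: str) -> Path:
    """Absolute path to a dataset file, independent of the CWD."""
    return DATASETS_DIR / name
```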


```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

The example references a requirements.txt, but no such file is present in the repository. This makes the setup incomplete.
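Judging only from the libraries visible in the quoted snippets (pandas, scikit-learn) and the Kaggle download step, a minimal requirements.txt might look like the following (the exact set is a guess and should be confirmed against train.py's imports):

```
pandas
scikit-learn
kaggle
```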


## Install

Fetch the data from Kaggle — [Online Retail II UCI](https://www.kaggle.com/datasets/mashlyn/online-retail-ii-uci) dataset:

The setup assumes Kaggle CLI + credentials, but does not clearly explain the need for a legacy API key (kaggle.json). This can be confusing with the newer Kaggle token system.
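For anyone following along, the legacy-key flow looks roughly like this (the token location is the kaggle CLI's documented default; the source path of kaggle.json is a placeholder, and the download ref matches the dataset linked above):

```bash
# Generate a legacy API token (kaggle.json) from kaggle.com -> Settings -> API,
# then place it where the kaggle CLI expects credentials:
mkdir -p ~/.kaggle
cp /path/to/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

# Download and extract the dataset into datasets/:
kaggle datasets download -d mashlyn/online-retail-ii-uci
unzip online-retail-ii-uci.zip -d datasets/
```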

```python
X = combined[FEATURE_COLS].values
y = combined[TARGET_COL].values

X_train, X_test, y_train, y_test = train_test_split(
```

Since this is presented as a demand forecasting example, a time-based split would likely be more appropriate than a random train/test split for evaluation. That would better reflect how the model performs on future periods rather than shuffled observations from the same overall time range.
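A chronological hold-out is easy to drop in. A sketch, assuming `combined` is a pandas DataFrame with the InvoiceDate column seen elsewhere in the diff (toy rows for illustration; the 80/20 cutoff is arbitrary):

```python
import pandas as pd

# Toy stand-in for `combined` (hypothetical values).
combined = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2011-03-10", "2010-01-15", "2011-11-20", "2010-06-01"]),
    "Quantity": [8, 5, 2, 3],
})

# Sort chronologically and hold out the most recent ~20% of rows,
# so the test set reflects future periods rather than shuffled samples.
combined = combined.sort_values("InvoiceDate").reset_index(drop=True)
split = int(len(combined) * 0.8)
train, test = combined.iloc[:split], combined.iloc[split:]

print(train["InvoiceDate"].max() < test["InvoiceDate"].min())  # True
```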


```python
df["WeekOfYear"] = df["InvoiceDate"].dt.isocalendar().week.astype(int)

# Aggregate to monthly product-level demand
monthly = (
```

The code comment mentions "monthly product-level demand", but including WeekOfYear in the grouping means the aggregation is no longer purely monthly. This effectively splits data within the same month, so it may be worth either adjusting the grouping or clarifying the description.
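If a purely monthly aggregation is what's intended, dropping WeekOfYear from the grouping keys would fix it. A sketch with hypothetical rows (column names follow the Online Retail II schema; the tutorial's grouping keys may differ):

```python
import pandas as pd

# Hypothetical transactions spanning two calendar months.
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2010-12-01", "2010-12-09", "2011-01-04"]),
    "StockCode": ["85123A", "85123A", "85123A"],
    "Quantity": [10, 5, 7],
})

# Group on product and calendar month only (no WeekOfYear key),
# so each (product, month) pair collapses to a single row.
df["Month"] = df["InvoiceDate"].dt.to_period("M")
monthly = df.groupby(["StockCode", "Month"], as_index=False)["Quantity"].sum()

print(monthly["Quantity"].tolist())  # [15, 7]
```

With a WeekOfYear key in the groupby, the December rows above would stay split across two rows, which is what the current code does despite the "monthly" comment.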

```python
"""

import os
import sys
```

The sys import appears to be unused and can likely be removed.


@fbugarski left a comment


Thanks for the detailed example — I was able to run the full pipeline locally end-to-end.

Overall, the example works well, but there are a few areas that would benefit from clarification or adjustment:

  • Some claims around consortium performance are stronger than what is observed in practice.
  • The local workflow is sensitive to the current working directory (datasets path).
  • Setup steps (requirements and Kaggle credentials) could be made more explicit.
  • There are a few minor inconsistencies between code and documentation (e.g. the aggregation wording).

