Add enterprise analytics tutorial #44
Conversation
SammyOina
left a comment
Include Prism screenshots and CLI output as well.
- **Secure Computation (aTLS)** — Attested TLS verifies the TEE hardware and software stack before any data is uploaded
- **Multi-Party Computation** — Three independent data providers each upload proprietary datasets into the same encrypted enclave
- **Real-World Data** — Uses the [UCI Online Retail II](https://www.kaggle.com/datasets/mashlyn/online-retail-ii-uci) dataset (real European e-commerce transactions) split across simulated companies
- **Enterprise Value** — Benchmark proves the consortium model outperforms any single-company model
In practice this is not guaranteed. In my local run, the consortium model achieved lower R² than at least one of the individual models. The claim should be softened.
To train the consortium model locally:

```bash
python train.py
```
This step is sensitive to the current working directory. Running train.py from the repository root fails because it expects datasets/ relative to CWD. This is not clearly documented and makes the workflow non-intuitive.
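One way to remove the CWD sensitivity is to resolve the data directory relative to the script itself. A minimal sketch — `data_dir` is a hypothetical helper, not the tutorial's actual code, and the `datasets/` layout is assumed from the review comment:

```python
from pathlib import Path

def data_dir(script_path: str) -> Path:
    """Resolve the datasets/ directory relative to the training script
    rather than the current working directory, so `python train.py`
    works from anywhere in the repository."""
    return Path(script_path).resolve().parent / "datasets"

# Inside train.py this would be used as:
# DATA_DIR = data_dir(__file__)
# df = pd.read_csv(DATA_DIR / "company_a.csv")  # hypothetical filename
```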
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
The example references a requirements.txt, but no such file is present in the repository. This makes the setup incomplete.
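A minimal `requirements.txt` could be added to the example. The package list below is inferred from the code shown in this review (pandas for the aggregation, scikit-learn for `train_test_split`, the kaggle CLI for data download) and is an assumption, not the repository's actual dependency set:

```
pandas
scikit-learn
kaggle
```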
## Install

Fetch the data from Kaggle — [Online Retail II UCI](https://www.kaggle.com/datasets/mashlyn/online-retail-ii-uci) dataset:
The setup assumes Kaggle CLI + credentials, but does not clearly explain the need for a legacy API key (kaggle.json). This can be confusing with the newer Kaggle token system.
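The kaggle CLI authenticates with a legacy API key: on kaggle.com, the account settings page has a "Create New Token" button that downloads `kaggle.json`. A sketch of the setup, assuming the token was saved to `~/Downloads`:

```shell
# Place the legacy API key where the kaggle CLI expects it
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# Download and unzip the dataset into datasets/
kaggle datasets download -d mashlyn/online-retail-ii-uci -p datasets --unzip
```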
```python
X = combined[FEATURE_COLS].values
y = combined[TARGET_COL].values

X_train, X_test, y_train, y_test = train_test_split(
```
Since this is presented as a demand forecasting example, a time-based split would likely be more appropriate than a random train/test split for evaluation. That would better reflect how the model performs on future periods rather than shuffled observations from the same overall time range.
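A time-based alternative to the random split could look like the sketch below. `InvoiceDate` is a column name from the dataset; `time_based_split` is a hypothetical helper, not part of the tutorial:

```python
import pandas as pd

def time_based_split(df: pd.DataFrame, date_col: str, test_frac: float = 0.2):
    """Hold out the most recent rows as the test set, so evaluation
    reflects forecasting future periods rather than shuffled history."""
    ordered = df.sort_values(date_col)
    cutoff = int(len(ordered) * (1 - test_frac))
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Usage sketch:
# train_df, test_df = time_based_split(combined, "InvoiceDate")
# X_train, y_train = train_df[FEATURE_COLS].values, train_df[TARGET_COL].values
```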
```python
df["WeekOfYear"] = df["InvoiceDate"].dt.isocalendar().week.astype(int)

# Aggregate to monthly product-level demand
monthly = (
```
The code comment mentions "monthly product-level demand", but including WeekOfYear in the grouping means the aggregation is no longer purely monthly. This effectively splits data within the same month, so it may be worth either adjusting the grouping or clarifying the description.
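A genuinely monthly aggregation would drop `WeekOfYear` from the group keys. A sketch, assuming the Online Retail II column names (`StockCode`, `InvoiceDate`, `Quantity`); `monthly_demand` is a hypothetical helper:

```python
import pandas as pd

def monthly_demand(df: pd.DataFrame) -> pd.DataFrame:
    """Sum quantities per product per calendar month, grouping only on
    product and month so each month yields one row per product."""
    out = df.copy()
    out["Month"] = out["InvoiceDate"].dt.to_period("M")
    return out.groupby(["StockCode", "Month"], as_index=False)["Quantity"].sum()
```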
```python
"""

import os
import sys
```
The sys import appears to be unused and can likely be removed.
fbugarski
left a comment
Thanks for the detailed example — I was able to run the full pipeline locally end-to-end.
Overall, the example works well, but there are a few areas that would benefit from clarification or adjustment:
- some claims around consortium performance are stronger than what is observed in practice
- the local workflow is sensitive to the current working directory (datasets path)
- setup steps (requirements and Kaggle credentials) could be made more explicit
- a few minor inconsistencies between code and documentation (e.g. aggregation wording)
What type of PR is this?
What does this do?
Which issue(s) does this PR fix/relate to?
Have you included tests for your changes?
Did you document any new/modified features?
Notes