diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md index 0248e3d3..636e26b4 100644 --- a/docs/tutorials/index.md +++ b/docs/tutorials/index.md @@ -9,6 +9,10 @@ If you have any questions or issues, please [let us know](https://github.com/flu The following tutorials are provided from their respective directories (and are not documented here): +### Machine Learning + + - [Foundry ML](https://github.com/flux-framework/flux-operator/tree/main/examples/machine-learning/foundry) + ### Simulations - [Laghos](https://github.com/flux-framework/flux-operator/tree/main/examples/simulations/laghos) diff --git a/examples/machine-learning/foundry/README.md b/examples/machine-learning/foundry/README.md new file mode 100755 index 00000000..36f1bf16 --- /dev/null +++ b/examples/machine-learning/foundry/README.md @@ -0,0 +1,143 @@ +# Foundry + +This tutorial shows how to use [Foundry](https://github.com/MLMI2-CSSI/foundry) to download a dataset and run an example. + +## Credentials + +You'll need to generate a credential file that we will provide to the job. This needs +to be done locally (it's recommended to make a Python environment): + +```bash +$ python -m venv env +$ source env/bin/activate +$ pip install foundry_ml +``` + +You'll next want to log in with Globus. Yes, this requires an account! I was able +to just log in via my institution. Running this command should open a web interface +to authenticate: + +```bash +$ python -c "from foundry import Foundry; f = Foundry()" +``` +This will generate a credential file in your home directory - let's copy it +here so we can provide it to the MiniCluster (do NOT add to git!) + +```bash +$ cp ~/.globus-native-apps.cfg . +``` + +## Kind Cluster + +We will want to bind the present working directory (with the examples) to our MiniCluster, +and that is easy to do with kind. Create a cluster with the included kind config. +Make sure you run this from this directory!
+ +```bash +$ kind create cluster --config kind-config.yaml +``` + +## Create MiniCluster + +Since we have several examples, let's create an interactive cluster so we can run (and watch them run) with flux submit. +If you were doing this at scale you would likely choose one workflow and run headlessly by removing `interactive: true` +and providing the [minicluster.yaml](minicluster.yaml) with a command. Let's create the namespace and install +the operator: + +```bash +$ kubectl create namespace flux-operator +$ kubectl apply -f ../../../examples/dist/flux-operator-dev.yaml +``` + +And then create the interactive cluster: + +```bash +$ kubectl apply -f minicluster.yaml +``` + +Watch the pods being created: + +```bash +$ kubectl get -n flux-operator pods +``` + +When the broker (index 0) is running, shell in! + +```bash +$ kubectl exec -it -n flux-operator flux-sample-0-fzml6 -- bash +``` + +You'll want to connect to the broker. + +```bash +$ sudo -E $(env) -E HOME=/home/fluxuser -u fluxuser flux proxy local:///run/flux/local bash +``` + +### Globus Credentials + +Export your Globus credentials (I'm not convinced this is necessary, but the testing example does it, so why not): + +```bash +$ export GLOBUS_CONFIG=$(cat .globus-native-apps.cfg) +``` + +### Run Examples + +Now let's cd into the examples directory and run a few! We will run these on the node, but they could +also be run with `flux submit --watch` and a certain number of nodes (`-N`) or tasks (`-n`). + +```bash +$ cd ./examples +$ ls +``` +```console +atom-position-finding bandgap dendrite-segmentation g4mp2-solvation oqmd publishing-guides qmc_ml zeolite +``` + +#### Atom Position Finding + +These interactions are run from inside the container: + +```bash +$ cd ./atom-position-finding +$ python atom_position_finding.py +``` + +![./examples/atom-position-finding/result.png](./examples/atom-position-finding/result.png) + + +#### Bandgap + + Note that downloading the data on this one froze my computer the first time, so be careful!
+ +```bash +$ cd ../bandgap +$ python bandgap_demo.py +``` + +![./examples/bandgap/result.png](./examples/bandgap/result.png) + +#### QMC ML + +Note that downloading the data on this one froze my computer the first time, so be careful! + +```bash +$ cd ../qmc_ml +$ python qmc_ml.py +``` + +![./examples/qmc_ml/result.png](./examples/qmc_ml/result.png) + + +And finally, clean up: + +```bash +$ kubectl delete -f minicluster.yaml +``` + +It's not clear yet how these machine learning runs can best integrate with Flux, beyond submitting a job +to Flux. We will need to think about this. One design that I think could work really nicely here is: + +1. Use Foundry for storing data, and download a dataset via the broker pre command. +2. Use flux filemap in the batch script (with batch: true and batchRaw: true) to map the data to nodes +3. Run some job that uses the data across the nodes (e.g., MPI or similar) \ No newline at end of file diff --git a/examples/machine-learning/foundry/examples/atom-position-finding/atom_position_finding.py b/examples/machine-learning/foundry/examples/atom-position-finding/atom_position_finding.py new file mode 100755 index 00000000..dfc1b8c9 --- /dev/null +++ b/examples/machine-learning/foundry/examples/atom-position-finding/atom_position_finding.py @@ -0,0 +1,48 @@ +#!/usr/bin/env python +# coding: utf-8 + +# # Installing Foundry +# First we'll need to install Foundry. We'll also be installing [Matplotlib](https://matplotlib.org/) for our visualizations. If you're using Google Colab, this code block will install this package into the Colab environment. +# +# +# If you are running locally, it will install this module onto your machine if you do not already have it. We also have a [requirements file](https://github.com/MLMI2-CSSI/foundry/tree/main/examples/atom-position-finding) included with this notebook. You can run `pip install -r requirements.txt` in your terminal to set up your environment locally.
+ + +# # Importing Packages +# Now we can import Foundry and Matplotlib so we can import the data and visualize it. + +# In[9]: + + +from foundry import Foundry +import matplotlib.pyplot as plt + +# # Instantiating Foundry +# To instantiate Foundry, you'll need a [Globus](https://www.globus.org) account. Once you have your account, you can instantiate Foundry using the code below. When you instantiate Foundry locally, be sure to have your Globus endpoint turned on (you can do that with [Globus Connect Personal](https://www.globus.org/globus-connect-personal)). When you instantiate Foundry on Google Colab, you'll be given a link in the cell's output and asked to enter the provided auth code. + +f = Foundry(index="mdf", no_local_server=True, no_browser=True) + +dataset_doi = '10.18126/e73h-3w6n' + +# download the data +f.load(dataset_doi, download=True, globus=False) + +# load the HDF5 image data into a local object +res = f.load_data() + +# using the 'train' split, 'input' or 'target' type, and Foundry Keys specified by the dataset publisher +# we can grab the atom images, metadata, and coordinates we desire +imgs = res['train']['input']['imgs'] +desc = res['train']['input']['metadata'] +coords = res['train']['target']['coords'] + +n_images = 3 +offset = 150 +key_list = list(res['train']['input']['imgs'].keys())[0+offset:n_images+offset] + +fig, axs = plt.subplots(1, n_images, figsize=(20,20)) +for i in range(n_images): + axs[i].imshow(imgs[key_list[i]]) + axs[i].scatter(coords[key_list[i]][:,0], coords[key_list[i]][:,1], s = 20, c = 'r', alpha=0.5) + +fig.savefig("result.png") \ No newline at end of file diff --git a/examples/machine-learning/foundry/examples/atom-position-finding/result.png b/examples/machine-learning/foundry/examples/atom-position-finding/result.png new file mode 100644 index 00000000..3a6dd8f1 Binary files /dev/null and b/examples/machine-learning/foundry/examples/atom-position-finding/result.png differ diff --git
a/examples/machine-learning/foundry/examples/bandgap/bandgap_demo.py b/examples/machine-learning/foundry/examples/bandgap/bandgap_demo.py new file mode 100755 index 00000000..0843aaa3 --- /dev/null +++ b/examples/machine-learning/foundry/examples/bandgap/bandgap_demo.py @@ -0,0 +1,168 @@ +#!/usr/bin/env python +# coding: utf-8 + +# + +# # Foundry Bandgap Data Quickstart for Beginners + +# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MLMI2-CSSI/foundry/blob/main/examples/bandgap/bandgap_demo.ipynb) + +# This introduction uses Foundry to: +# +# +# 1. Instantiate and authenticate a Foundry client locally or in the cloud +# 2. Aggregate data from the collected datasets +# 3. Build a simple predictive model + +# This notebook is set up to run as a [Google Colaboratory](https://colab.research.google.com/notebooks/intro.ipynb#scrollTo=5fCEDCU_qrC0) notebook, which allows you to run python code in the browser, or as a [Jupyter](https://jupyter.org/) notebook, which runs locally on your machine. +# +# The code in the next cell will detect your environment to make sure that only cells that match your environment will run. +# + +# # Environment Set Up +# First we'll need to install Foundry as well as a few other packages. If you're using Google Colab, this code block will install these packages into the Colab environment. +# If you are running locally, it will install these modules onto your machine if you do not already have them. We also have a [requirements file](https://github.com/MLMI2-CSSI/foundry/tree/main/examples/bandgap) included with this notebook. You can run `pip install -r requirements.txt` in your terminal to set up your environment locally. + + +# We need to import a few packages. 
We'll be using [Matplotlib](https://matplotlib.org/) to make visualizations of our data, [scikit-learn](https://scikit-learn.org/stable/) to create our model, and [pandas](https://pandas.pydata.org/) and [NumPy ](https://numpy.org/)to work with our data. + +from matplotlib.colors import LogNorm +from matplotlib import pyplot as plt +import pandas as pd +import numpy as np +import warnings +import glob +from matminer.featurizers.conversions import StrToComposition +from matminer.featurizers.base import MultipleFeaturizer +from matminer.featurizers import composition as cf +from sklearn.model_selection import cross_val_predict, GridSearchCV, ShuffleSplit, KFold +from sklearn.ensemble import RandomForestRegressor +from sklearn import metrics + + +warnings.filterwarnings('ignore') + +# # Instantiate and Authenticate Foundry +# Once the installations are complete, we can import Foundry. + +from foundry import Foundry + + +# We'll also need to instantiate it. To do so, you'll need a [Globus](https://www.globus.org) account. Once you have your account, you can instantiate Foundry using the code below. When you instantiate Foundry locally, be sure to have your Globus endpoint turned on (you can do that with [Globus Connect Personal](https://www.globus.org/globus-connect-personal)). When you instantiate Foundry on Google Colab, you'll be given a link in the cell's output and asked to enter the provided auth code. + +f = Foundry(no_local_server=True, no_browser=True, index="mdf") + + +# # Loading the Band Gap Data +# Now that we've installed and imported everything we'll need, it's time to load the data. We'll be loading 2 datasets from Foundry using `f.load` to load the data and then `f.load_data` to load the data into the client. Then we'll concatenate them using pandas. 
+globus = False + +f.load("foundry_mp_band_gaps_v1.1", globus=globus) +res = f.load_data() +X_mp,y_mp = res['train'][0], res['train'][1] + + +f.load("foundry_assorted_computational_band_gaps_v1.1", globus=globus) +res = f.load_data() +X_assorted,y_assorted = res['train'][0], res['train'][1] + + +X, y = pd.concat([X_mp, X_assorted]), pd.concat([y_mp, y_assorted]) + + +# Let's see the data! + +X.head() + + +# # Add Composition Features +# We need to pull out the composition data that will serve as our targets. + +n_datapoints = 300 +data = StrToComposition(target_col_id='composition_obj') +data = data.featurize_dataframe(X[0:n_datapoints], + 'composition', + ignore_errors=True) +y_subset = y[0:n_datapoints]['bandgap value (eV)'] + + +assert(len(y_subset) == len(data)) + + +# # Add Other Features +# Choose the features that we'll use in training. + +feature_calculators = MultipleFeaturizer([cf.Stoichiometry(), + cf.ElementProperty.from_preset("magpie"), + cf.ValenceOrbital(props=['avg']), + cf.IonProperty(fast=True)]) +feature_labels = feature_calculators.feature_labels() + +data = feature_calculators.featurize_dataframe(data, + col_id='composition_obj', + ignore_errors=False); + + +# # Grid Search and Fit Model +# Set up the grid search model using a random forest regressor as our estimator. Then, fit the model! + +quick_demo=False +est = RandomForestRegressor(n_estimators=30 if quick_demo else 150, n_jobs=-1) + +model = GridSearchCV(est, + param_grid=dict(max_features=range(8,15)), + scoring='neg_mean_squared_error', + cv=ShuffleSplit(n_splits=1, + test_size=0.1)) +model.fit(data[feature_labels], y_subset) + + +# # Cross Validation and Scoring +# Perform cross validation to ensure our error values are below the desired thresholds. 
+ +cv_prediction = cross_val_predict(model, + data[feature_labels], + y_subset, + cv=KFold(10, shuffle=True)) + + +for scorer in ['r2_score', 'mean_absolute_error', 'mean_squared_error']: + score = getattr(metrics,scorer)( y_subset, cv_prediction) + print(scorer, score) + + +# # Make Plots +# Plot the data for our bandgap analysis. + +fig, ax = plt.subplots() + +ax.hist2d(pd.to_numeric( y_subset), + cv_prediction, + norm=LogNorm(), + bins=64, + cmap='Blues', + alpha=0.8) + +ax.set_xlim(ax.get_ylim()) +ax.set_ylim(ax.get_xlim()) + +mae = metrics.mean_absolute_error( y_subset, + cv_prediction) +r2 = metrics.r2_score( y_subset, + cv_prediction) +ax.text(0.5, 0.1, 'MAE: {:.2f} eV/atom\n$R^2$: {:.2f}'.format(mae, r2), + transform=ax.transAxes, + bbox={'facecolor': 'w', 'edgecolor': 'k'}) + +ax.plot(ax.get_xlim(), ax.get_xlim(), 'k--') + +ax.set_xlabel('DFT $\Delta H_f$ (eV/atom)') +ax.set_ylabel('ML $\Delta H_f$ (eV/atom)') + +fig.set_size_inches(5, 5) +fig.tight_layout() +fig.savefig('result.png', dpi=320) + + + + diff --git a/examples/machine-learning/foundry/examples/bandgap/result.png b/examples/machine-learning/foundry/examples/bandgap/result.png new file mode 100644 index 00000000..0abfeedf Binary files /dev/null and b/examples/machine-learning/foundry/examples/bandgap/result.png differ diff --git a/examples/machine-learning/foundry/examples/g4mp2-solvation/g4mp2_solvation_demo.py b/examples/machine-learning/foundry/examples/g4mp2-solvation/g4mp2_solvation_demo.py new file mode 100755 index 00000000..700f37f4 --- /dev/null +++ b/examples/machine-learning/foundry/examples/g4mp2-solvation/g4mp2_solvation_demo.py @@ -0,0 +1,114 @@ +#!/usr/bin/env python +# coding: utf-8 + +# + +# # Foundry Solvation Energy Quickstart for Beginners +# +# *Original Paper:* https://doi.org/10.1021/acs.jpca.1c01960 +# +# *Dataset:* https://doi.org/10.18126/c5z9-zej7 +# +# +# +# This introduction uses Foundry to: +# +# +# 1. 
Instantiate and authenticate a Foundry client locally or in the cloud +# 2. Aggregate data from the G4MP2 solvation database +# 3. Perform basic data exploration +# +# + +# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MLMI2-CSSI/foundry/blob/main/examples/g4mp2-solvation/g4mp2_solvation_demo.ipynb) + +# This notebook is set up to run locally or as a [Google Colaboratory](https://colab.research.google.com/notebooks/intro.ipynb#scrollTo=5fCEDCU_qrC0) notebook, which allows you to run python code in the browser, or as a [Jupyter](https://jupyter.org/) notebook, which runs locally on your machine. +# +# The code in the next cell will detect your environment to make sure that only cells that match your environment will run. +# +# # Environment Set Up +# First we'll need to install Foundry as well as a few other packages. If you're using Google Colab, this code block will install these packages into the Colab environment. +# If you are running locally, it will install these modules onto your machine if you do not already have them. We also have a [requirements file](https://github.com/MLMI2-CSSI/foundry/tree/main/examples/bandgap) included with this notebook. You can run `pip install -r requirements.txt` in your terminal to set up your environment locally. + + +# We need to import a few packages. We'll be using [Matplotlib](https://matplotlib.org/) to make visualizations of our data, [scikit-learn](https://scikit-learn.org/stable/) to create our model, and [pandas](https://pandas.pydata.org/) and [NumPy ](https://numpy.org/)to work with our data. + + +import pandas as pd +import matplotlib.pyplot as plt +import seaborn as sns + +sns.set_context("poster") + + +# # Instantiate and Authenticate Foundry +# Once the installations are complete, we can import Foundry. + +from foundry import Foundry + + +# We'll also need to instantiate it. To do so, you'll need a [Globus](https://www.globus.org) account. 
Once you have your account, you can instantiate Foundry using the code below. When you instantiate Foundry locally, be sure to have your Globus endpoint turned on (you can do that with [Globus Connect Personal](https://www.globus.org/globus-connect-personal)). When you instantiate Foundry on Google Colab, you'll be given a link in the cell's output and asked to enter the provided auth code. + + +f = Foundry(no_local_server=True, no_browser=True, index="mdf") + + +# # Load the G4MP2 Solvation Database +# Now that we've installed and imported everything we'll need, it's time to load the data. We'll be loading 1 dataset from Foundry using `f.load` to load the data and then `f.load_data` to load the data into the client. + + +f.load("10.18126/jos5-wj65", globus=False) +res = f.load_data() + +X,y = res['train'] +df = pd.concat([X,y], axis=1) # sometimes easier to work with the two together + + +X.head() + + +y.head() + + +# # Data Exploration + + +sns.set_context('poster') +fig, ax = plt.subplots(figsize=(10,10)) + +ax.scatter( + X['u0_atom'], + y['g4mp2_atom'], + c=y['sol_acn'], + s=30, + alpha=0.5 +) + +plt.xlim(-1.75, -1.5) +plt.ylim(-1.75, -1.5) + +ax.set_xlabel("B3LYP atomization energy at 0K (Ha)") +ax.set_ylabel("G4MP2 atomization energy at 0K (Ha)") +sns.despine() + + +sns.set_context('poster') +fig, ax = plt.subplots(figsize=(10,10)) + +ax.scatter( + y['sol_water'], + y['sol_acn'], + c=y['sol_ethanol'], + s=35, + alpha=0.3 +) + +ax.set_xlabel("Solvation Energy in Water (kcal/mol)") +ax.set_ylabel("Solvation Energy in Acetonitrile (kcal/mol)") +sns.despine() + +sns.set_context('paper') +sns.pairplot(df[['sol_water', 'sol_acn','sol_ethanol','sol_dmso']]) + + + diff --git a/examples/machine-learning/foundry/examples/qmc_ml/qmc_ml.py b/examples/machine-learning/foundry/examples/qmc_ml/qmc_ml.py new file mode 100755 index 00000000..b105e1ab --- /dev/null +++ b/examples/machine-learning/foundry/examples/qmc_ml/qmc_ml.py @@ -0,0 +1,93 @@ +#!/usr/bin/env python +# coding: utf-8
+ +# + +# # Foundry Quantum Monte Carlo ML Quickstart +# +# *Original Paper:* https://arxiv.org/pdf/2210.06430.pdf +# +# *Dataset:* https://doi.org/10.18126/wg30-95z0 +# + +# This notebook is set up to run locally or as a [Google Colaboratory](https://colab.research.google.com/notebooks/intro.ipynb#scrollTo=5fCEDCU_qrC0) notebook, which allows you to run python code in the browser, or as a [Jupyter](https://jupyter.org/) notebook, which runs locally on your machine. +# +# The code in the next cell will detect your environment to make sure that only cells that match your environment will run. +# + + +no_local_server = True +no_browser = True +globus=False + + +# # Environment Set Up +# First we'll need to install Foundry as well as a few other packages. If you're using Google Colab, this code block will install these packages into the Colab environment. +# If you are running locally, it will install these modules onto your machine if you do not already have them. We also have a [requirements file](https://github.com/MLMI2-CSSI/foundry/tree/main/examples/bandgap) included with this notebook. You can run `pip install -r requirements.txt` in your terminal to set up your environment locally. +# We need to import a few packages. We'll be using [Matplotlib](https://matplotlib.org/) to make visualizations of our data, [scikit-learn](https://scikit-learn.org/stable/) to create our model, and [pandas](https://pandas.pydata.org/) and [NumPy ](https://numpy.org/)to work with our data. + + +import pandas as pd +import matplotlib.pyplot as plt +import seaborn as sns +import pymatgen as mg +from pymatgen.core import Molecule +import json + +sns.set_context("poster") + +# # Instantiate and Authenticate Foundry +# Once the installations are complete, we can import Foundry. + +from foundry import Foundry + + +# We'll also need to instantiate it. To do so, you'll need a [Globus](https://www.globus.org) account. 
Once you have your account, you can instantiate Foundry using the code below. When you instantiate Foundry locally, be sure to have your Globus endpoint turned on (you can do that with [Globus Connect Personal](https://www.globus.org/globus-connect-personal)). When you instantiate Foundry on Google Colab, you'll be given a link in the cell's output and asked to enter the provided auth code. + +f = Foundry(no_local_server=no_local_server, no_browser=no_browser, index="mdf") + + +# # Load the QMC ML Dataset +# Now that we've installed and imported everything we'll need, it's time to load the data. We'll be loading 1 dataset from Foundry using `f.load` to load the data and then `f.load_data` to load the data into the client. + +f.load("10.18126/wg30-95z0", globus=globus) +res = f.load_data() + + +X,y = res['train'] +df = pd.concat([X,y], axis=1) # sometimes easier to work with the two together + + +# # Read in Molecules to PyMatgen + +df['mols'] = df['pymatgen'].map(lambda x: Molecule.from_str(x, fmt="json")) + + +df['mols'].iloc[1] + + + +# # Data Exploration + +sns.set_context('poster') +fig, ax = plt.subplots(figsize=(7,7)) + +ax.scatter( + y['DMC(HF)'], + y['DMC(HF)_err'], + s=30, + alpha=0.1 +) + +# plt.xlim(-1.75, -1.5) +# plt.ylim(-1.75, -1.5) + +ax.set_xlabel("DMC(HF) (Ha)") +ax.set_ylabel("DMC(HF) error (Ha)") +sns.despine() + + +sns.set_context('poster') +ax = sns.pairplot(y[['PBE','HF','DMC(HF)','DMC(PBE)','DMC(PBE)_err']], hue='PBE') + +fig.savefig('result.png') \ No newline at end of file diff --git a/examples/machine-learning/foundry/examples/qmc_ml/result.png b/examples/machine-learning/foundry/examples/qmc_ml/result.png new file mode 100644 index 00000000..d7af5077 Binary files /dev/null and b/examples/machine-learning/foundry/examples/qmc_ml/result.png differ diff --git a/examples/machine-learning/foundry/kind-config.yaml b/examples/machine-learning/foundry/kind-config.yaml new file mode 100755 index 00000000..62cac8ea --- /dev/null +++
b/examples/machine-learning/foundry/kind-config.yaml @@ -0,0 +1,10 @@ +# Run this from this directory! +# kind create cluster --config kind-config.yaml +# kubectl apply -f ../../../examples/dist/flux-operator-dev.yaml +apiVersion: kind.x-k8s.io/v1alpha4 +kind: Cluster +nodes: + - role: control-plane + extraMounts: + - hostPath: "." + containerPath: /tmp/workflow \ No newline at end of file diff --git a/examples/machine-learning/foundry/minicluster.yaml b/examples/machine-learning/foundry/minicluster.yaml new file mode 100755 index 00000000..c7942a69 --- /dev/null +++ b/examples/machine-learning/foundry/minicluster.yaml @@ -0,0 +1,41 @@ +apiVersion: flux-framework.org/v1alpha1 +kind: MiniCluster +metadata: + name: flux-sample + namespace: flux-operator +spec: + + size: 4 + + # We have a lot of examples to try - easy to do interactively + + interactive: true + # This is created with the kind-config.yaml + # You should only need to pull once (the container is pulled to the bound volume) + volumes: + data: + storageClass: hostpath + path: /tmp/workflow + + # This is a list because a pod can support multiple containers + containers: + - image: ghcr.io/rse-ops/singularity:tag-mamba + workingDir: /tmp/workflow + commands: + pre: | + pip install foundry_ml + conda install -c conda-forge scikit-learn scikit-image + pip install keras-unet seaborn pandas pymatgen matminer opencv-python \ + tables tensorflow matplotlib + + fluxUser: + name: fluxuser + + # Container will be pre-pulled here only by the broker + volumes: + data: + path: /tmp/workflow + + # Running a container in a container + securityContext: + privileged: true \ No newline at end of file