The main purpose of following the recommended workflow for this project is to make it easier to share your datasets, code, and analyses in a reproducible, easy-to-use way.
We want you to share your work. We understand that your work may still be a work-in-progress when you first start to share it. We encourage that. There are three main ways to contribute to this repo:
- Filing and reporting issues: Please don't be shy here. Chances are that if you encounter an issue, someone else already has, or someone else will in the future. Reporting helps us find solutions that work for everyone; hacks and personal work-arounds are not reproducible. No issue is too small. Share the love and let us solve issues as best we can for everyone. Issues include anything from "I had trouble understanding and following the documentation", to feature requests, to bugs in the shared codebase itself.
  - First, make sure that you're working with the most up-to-date version of the codebase.
  - Check the troubleshooting guide to see whether a solution has already been documented.
  - Check whether the issue has already been reported. If so, comment on the existing issue to indicate that you're hitting it too.
  - Finally, if your issue still hasn't been resolved at this stage, file an issue. For bug reports, please include a reproducer.
- Submitting Pull Requests (PRs): This is the way to share your work if it involves any code. To prepare your PR, follow the contributor checklist. In the meantime, follow the recommended best practices to make your life easier when you are ready to share.
When is my work ready to share? Let's find out!
When you're ready to share your notebook or code with others, you should be able to tick all of the following boxes.
- Notebooks are in the `notebooks` directory, following the notebook naming convention.
- Notebooks load data via the `Dataset.load()` API to access an available `Dataset`.
- Functions are in `src/user_name` and accessed in notebooks via something like `from src.user_name import my_function`. If you have `def my_function` in your notebook, or anything more elaborate, there's a good chance it should be in the `src` module.
- Notebook cells run sequentially (i.e. Kernel->Restart & Run All runs to completion successfully).
- (Optional but generally recommended): All notebook cell output has been cleared before checking it in (i.e. Kernel->Restart & Clear Output before saving).
- Decide on a license for your data-derived work (e.g. images), and if it's not the same as that of the dataset you used, mark it appropriately as per your license of choice (assuming it's compatible with the dataset's license). By default, the license of derived work is the same as that of the dataset it came from.
- Share your conda environment: check in your `environment.yml` file if you've made any changes.
  - If there's any chance that you added something to the conda environment needed to run your code that was not added via your `environment.yml` file as per Setting up and Maintaining your Conda Environment (Reproducibly), delete your environment and recreate it.
- (Optional) Make sure all tests pass (run `make test`). This tests all of the dataset integrations, so if you don't have a lot of room on your machine (it will build all of the datasets if you haven't yet), you may want to skip this step.
- At the very least, make sure all of the tests for your code pass. To subselect your tests, you can run `pytest --pyargs src -k your_test_filename`.
- You've merged the latest version of `upstream/main` into your branch.
- You've submitted a PR via github.com in Draft status and checked the PR diff to make sure you aren't missing anything critical, you're not adding anything extraneous, and you don't have any merge conflicts.
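The test-subselection step in the checklist can be exercised on its own. Here is a minimal sketch using a throwaway test file; the file and function names are hypothetical stand-ins for your own tests, and it assumes `pytest` is available in your environment.

```shell
# Minimal sketch of subselecting tests with pytest's -k filter.
# The test file and function names below are hypothetical stand-ins.
set -e
tmp=$(mktemp -d)
cat > "$tmp/test_my_analysis.py" <<'EOF'
def test_my_function():
    assert 1 + 1 == 2
EOF
# Run only the tests whose names match the -k expression.
python -m pytest -q "$tmp" -k test_my_function
```

In the real repo you would point `-k` at your own test filename, as in `pytest --pyargs src -k your_test_filename`, so you don't trigger the dataset-building tests.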
Once this checklist is complete, take your PR out of Draft status. It's ready to go!
As a person who is trying to contribute and share your work with others, it may at times feel like this is a lot of work. We get that, and find it useful to think of it this way: for every 5 minutes extra that you put into making your work reproducible, everyone else who tries to run or use your work will spend at least 5 minutes less trying to get it to work for them. In other words, making your work reproducible is part of being a good citizen and helping us all to learn from each other as we go. Thank you for helping us to share and use your work!
Quick References:
- Keeping up-to-date: Our Git Workflow
- Recommended Git tutorial
There are several ways to use Git and github.com successfully, and a lot more ways to use them unsuccessfully, when working with lots of other people. Here are some best practices that we suggest to make your life, and our lives, easier. The workflow we suggest makes it easier to choose which changes go into a pull request, and helps you avoid nasty merge conflicts.
First off, follow the Getting Started instructions for setting yourself up to work from your own fork. The idea here will be to keep upstream/main, your local main and your origin/main all in sync with each other.
Any changes should be made in a separate branch---not your main---that you push up to your fork. Eventually, when you're ready to submit a PR, you'll do so from the branch that you've been working on. When you push to your origin/branch_name, you should get prompted in the terminal by git with a URL you can follow to submit a PR. To do so:
- Make sure your `main` is up-to-date with upstream: `git fetch upstream` and `git merge upstream/main`.
- Make sure your environment is up-to-date with upstream: `make update_environment`.
- Start your work (from your up-to-date `main`) in a new branch: `git checkout -b my_new_branch`.
- Commit all your changes to `my_new_branch` (as per the Easydata git Workflow). You can pretty much do this blindly by following the Easydata git Workflow religiously.
- Push to your github.com fork: `git push origin my_new_branch`.
- If this is the first time you push from `my_new_branch`, the terminal will print a URL you can follow to create a PR. Otherwise, go to github.com and you'll see a yellow banner at the top of the screen prompting you to submit a PR (as long as you're not out of sync with `upstream/main`; if you are, re-sync your branch first).
- You have the option to submit a PR in Draft status. Select this if your work is still in progress; it disables the ability to merge your PR.
- Once you submit your PR, there may be a yellow dot or red X beside it. This is because we have tests set up in CircleCI. If you are working in a private repo, you need to authorize CircleCI on your fork for tests to run successfully. To do so, follow the link to CircleCI and authorize github.com on your fork of the repo.
- When ready, take your PR out of Draft status.
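The steps above can be sketched end-to-end. The snippet below simulates the fork setup with throwaway local repositories so it can run anywhere; the remote names (`upstream`, `origin`) match the workflow above, but the repository paths, user identity, and `my_new_branch` are illustrative stand-ins.

```shell
# Sketch of the branch-based fork workflow, using throwaway local
# repositories in place of github.com.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q --bare team.git      # stands in for the upstream repo on github.com
git init -q --bare fork.git      # stands in for your fork (origin)

# Seed the team repo with an initial commit on main.
git clone -q team.git seed
cd seed
git symbolic-ref HEAD refs/heads/main
echo "readme" > README.md
git add README.md
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial commit"
git push -q origin main
cd ..

# Your local clone: upstream is the team repo, origin is your fork.
git clone -q -o upstream team.git work
cd work
git remote add origin "$tmp/fork.git"

# 1. Keep main in sync with upstream.
git checkout -q main
git fetch -q upstream
git merge -q upstream/main

# 2. Do your work in a new branch, never on main.
git checkout -q -b my_new_branch
echo "analysis notes" > notes.txt
git add notes.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add notes"

# 3. Push the branch to your fork; for a real github.com remote,
#    git prints a URL you can follow to open a PR.
git push -q origin my_new_branch
```

With a real github.com fork, the final push is the point where you'd follow the printed URL (or the yellow banner on github.com) to open a Draft PR.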
- Never commit your changes to your `main` branch. Always work from a branch; that way you always have a clean local copy of `upstream/main` to work from.
- Stick to basic git commands unless you really know what you're doing (e.g. `add`, `fetch`, `merge`, `commit`, `diff`, `rm`, `mv`).
- While sometimes convenient, avoid using `git pull` from remotes; in fact, avoid `git pull` in general. Use `git fetch` then `git merge` instead.
- Use `git add -p` instead of `git add` to break your commits up into logical pieces rather than one big snowball of changes.
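Here is a minimal sketch of the fetch-then-merge habit, using throwaway local repositories; the paths, user identities, and branch names are stand-ins. The point is that, unlike `git pull`, fetching first lets you review incoming changes before integrating them.

```shell
# Sketch: git fetch + git merge as a reviewable substitute for git pull.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q --bare team.git

# Alice seeds the shared repo with a first commit...
git clone -q team.git alice
(
  cd alice
  git symbolic-ref HEAD refs/heads/main
  echo "v1" > data.txt
  git add data.txt
  git -c user.name=alice -c user.email=a@example.com commit -q -m "v1"
  git push -q origin main
)

# ...Bob clones it...
git clone -q team.git bob
( cd bob && git checkout -q main )

# ...and Alice pushes a second commit that Bob doesn't have yet.
(
  cd alice
  echo "v2" >> data.txt
  git add data.txt
  git -c user.name=alice -c user.email=a@example.com commit -q -m "v2"
  git push -q origin main
)

cd bob
git fetch -q origin                  # download new commits; touch nothing local
git log --oneline main..origin/main  # review exactly what's incoming
git merge -q origin/main             # then integrate deliberately
```

With `git pull`, the download and the merge happen in one step; splitting them gives you the `git log` checkpoint in the middle, which is exactly where you catch surprises before they land in your working tree.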
Most of the infrastructure behind the scenes in this repo is set up for sharing datasets reliably and reproducibly without ever checking data in. We use recipes for building `Dataset` objects instead. In short: don't check in data, and use the `Dataset.load()` API.
To convert your data to a `Dataset` object, we will need to generate a catalog recipe that uses a custom function for processing your raw data. Doing so allows us to document all the munging, pre-processing, and data verification necessary to reproducibly build the dataset. Details on how to do this can be found in the cookiecutter-easydata repo, but it's likely better to ask the maintainers of this project to point you in the right direction for getting a `Dataset` added to this project.
For more on Dataset objects, see Getting and Using Datasets.
For more on licenses, see below.
In order to make sharing virtual environments easy, the repo includes `make` commands that you can use to manage your environment via an `environment.yml` file (and a corresponding `environment.${ARCH}.lock.yml` file). By setting up and maintaining your conda environment reproducibly, sharing your environment is as easy as including any changes to your `environment.yml` file in your PR.
If there's any chance that you added something to the conda environment needed to run your code that was not added via your `environment.yml`, delete your environment, recreate it, and then make the appropriate changes to your `environment.yml` file.
Remember to run `make update_environment` regularly after fetching and merging the upstream remote to keep your conda environment up-to-date with the shared (team) repo.
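For example, adding a new dependency means editing `environment.yml` (never installing ad hoc into the environment) and then rebuilding via `make update_environment`. The excerpt below is an illustrative sketch in the usual conda format; the environment name and package names are examples only, not what this repo actually pins.

```yaml
# environment.yml (illustrative excerpt; names are examples only)
name: my_project
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pandas              # add conda-installable packages here...
  - pip
  - pip:
      - some-pip-only-package   # ...and pip-only packages here
```

After editing the file, run `make update_environment` so the change is applied to your environment reproducibly and picked up in your PR.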
We're keen on sharing notebooks for sharing stories and analyses. Best practices can be found in using notebooks for sharing your analysis. A short list of reminders:
- Follow the notebook naming convention.
- Use the `Dataset.load()` API for accessing data.
- Put code in the `src` module under `src/xyz`, where `xyz` is your (the author's) initials (as in the notebook naming convention).
- Run Kernel->Restart & Run All, and optionally Kernel->Restart & Clear Output, before saving and checking in your notebooks.
Work in progress...Add some references