Skip to content

Add explode dataset task#121

Open
RobinM-code wants to merge 1 commit into
opentargets:mainfrom
thehyve:rm/explode_datasets-task
Open

Add explode dataset task#121
RobinM-code wants to merge 1 commit into
opentargets:mainfrom
thehyve:rm/explode_datasets-task

Conversation

@RobinM-code
Copy link
Copy Markdown
Contributor

This PR adds a explode_datasets task.
This task can be configured like this in config.yaml:

    - name: explode_datasets load all clickhouse datasets
      section: clickhouse
      dataset_config_path: config/internal_datasets.yaml
      each_placeholder: each
      do:
        - name: clickhouse_load ${each}
          dataset: ${each}
          clickhouse_database: ${database_namespace}
          data_dir_parent: ${clickhouse_input_data}
          dataset_config_path: config/internal_datasets.yaml

It will load all datasets defined in datasets.yaml without the need to re-specify all of them in config.yaml

@jdhayhurst
Copy link
Copy Markdown
Contributor

Thanks @RobinM-code, I see the advantage here.

There are few issues that would prevent us from using this.

  1. At OpenTargets, we have two products, platform and ppp, which have different datasets defined. To handle this, we would need to have a "product" annotation to assign datasets to each product.
  2. For tasks such as loading to clickhouse, one dataset can have dependencies on other datasets. If the intention is to clean up the config, it will not fully achieve this. Perhaps if we had another way to define the dependencies that would be beneficial.
  3. During development, it's very often the case that we want to load datasets in isolation, so having them appear in the config is useful.
  4. I think a cleaner solution would be to have the datasets defined in an array in the config scratchpad. We sort of do this using the yaml & and * syntax.

I'll wait for @javfg to comment on this. As it stands, we could merge it as a feature for the community, but as I said, I don't think we'd be able to use it in OpenTargets just now.

@RobinM-code
Copy link
Copy Markdown
Contributor Author

Thanks for the detailed feedback @jdhayhurst, those are some very valid points!

I agree this doesn’t fully cover Open Targets’ needs right now (especially around product separation and dataset dependencies), and the current setup still makes sense for those use cases.

That said, I think this is still a nice optional addition for simpler setups or external users, so I’d be happy to see it merged as a community feature.

Also worth noting this doesn’t change or break the existing approach, so there’s no need for the OT team to switch unless you want to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants