2. Getting Started
The marketplace serves as a centralized platform where users can share their datasets and models with others in the SEDIMARK community. The present documentation walks users through the process of creating an identity, browsing the catalogue, and managing offerings within the SEDIMARK Marketplace.
The first step to interact with the SEDIMARK Marketplace is to create an identity. This identity will be used to publish and consume offerings within the marketplace. A wizard is available to guide users through the creation of an identity, and is accessible via the Register button on the right of the marketplace navigation bar.
This wizard consists of three steps:
- an introduction to the prerequisites for creating an identity (no user action required),
- a form to fill in the identity details (name, description, URLs to public services, etc.),
- a summary of the created identity, featuring the user's Decentralized Identifier (DID) and Verifiable Credential (VC).
The form in step 2 requires the following fields to be filled in:
- a username: this will be the public name of the identity in the marketplace.
- the self-listing URL: this URL should point to the list of the user's published offerings, i.e. the offerings endpoint of the user's offering manager instance. It is used by catalogue coordinators to index the user's offerings. It must be publicly accessible.
- the connector data space protocol URL: this URL should point to the user's connector Data Space Protocol endpoint. It is used by connectors to exchange information and carry out transactions. It must be publicly accessible.
- the profile server URL: this URL should point to the user's profile server endpoint. It is used by other users to fetch information about the user. It must be publicly accessible.
This information will be stored in the user's DID document. All other information (names, website, profile picture, etc.) is optional and stored only in the user's profile server, which is hosted on the user's premises.
Once completed and submitted, the form triggers the creation of a DID and a VC for the user. The VC is stored in the user's DLT booth on the user's premises, so no action is required from the user to manage it. Because of this, the DLT booth must NOT be exposed to the public internet. In the final step of the wizard, the user is presented with their DID and VC.
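For orientation, a DID document carrying the service endpoints described above might look roughly like the following sketch (expressed here as a Python dict). The DID method, service type names, and URLs are illustrative assumptions based on the W3C DID Core service-entry structure, not the exact schema SEDIMARK produces.

```python
# Illustrative sketch of a DID document exposing the three service
# endpoints described above; DID method, service types and URLs are
# assumptions, not the actual SEDIMARK schema.
did_document = {
    "id": "did:example:sedimark-user-123",
    "service": [
        {
            "id": "did:example:sedimark-user-123#offerings",
            "type": "SelfListing",  # self-listing (offerings) endpoint
            "serviceEndpoint": "https://user.example.org/offerings",
        },
        {
            "id": "did:example:sedimark-user-123#dsp",
            "type": "DataSpaceProtocol",  # connector DSP endpoint
            "serviceEndpoint": "https://connector.example.org/dsp",
        },
        {
            "id": "did:example:sedimark-user-123#profile",
            "type": "ProfileServer",  # profile server endpoint
            "serviceEndpoint": "https://user.example.org/profile",
        },
    ],
}
```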
The SEDIMARK Marketplace catalogue is the central place to discover datasets and models shared by the SEDIMARK community. It can be accessed in two ways:
- via the Catalogue button in the marketplace navigation bar, from any page of the marketplace,
- via the search bar in the home page.
The catalogue features a search bar to filter offerings by their title or description. Searches can be further refined by filtering by keywords/tags or providers. Each offering is presented as a card, featuring its name, description, creation date, tags, and the provider's username. At the end of the search results, users can review a list of recommended offerings based on their search.
Selecting an offering card will redirect the user to the offering details page, where more information about the offering is presented, including:
- the offering's metadata (name, description, creation date, tags, provider, etc.),
- the offering's provider information (username, profile picture, description, website, etc.),
- a button to initiate the negotiation process to consume the offering.
At this stage, clicking the Negotiate button automatically creates a contract agreement with the offering's provider.
Any participant in the SEDIMARK Marketplace can be both a provider and a consumer of offerings. As a provider, users can publish their datasets and models to share them with the community. As a consumer, users can browse the catalogue and consume offerings published by others.
The offering publication form can be accessed via the Publish button in the marketplace navigation bar. Upon opening the form, users are presented with two options:
- creating an offering based on an existing asset (dataset or model): this option allows users who have already created datasets or models (for instance in Mage AI) to prefill the offering publication form with the asset's metadata.
- creating a new offering from scratch.
To create an offering, users must provide the following information:
- the metadata of the offering: this includes the offering's name, description, tags/keywords, etc.
- the location of the asset: this is the URL where the dataset or model can be fetched from. It should be accessible via an HTTP GET request; however, it does not need to be publicly accessible, as the SEDIMARK Connector handles secure data transfer to the consumers. It only needs to be accessible to the user's connector.
- the license and usage policy: this section allows users to specify the terms under which the offering can be used by others. It can include a link to a license document and/or a time-limited usage policy.
Once submitted, the offering will be published in the user's offering manager instance. It may take some time (~5 min) for catalogue coordinators to index the new offering and make it visible in the marketplace catalogue.
Users can manage their published offerings via the Dashboard button in the marketplace navigation bar. The dashboard features an Offerings section, where users can review their published offerings and delete them if needed, effectively removing them from the marketplace catalogue.
Similarly, the Contracts section of the dashboard allows users to review their active contract agreements, both as providers and consumers. Selecting an offering expands it to show the recent data transfers associated with the contract, along with their status (in progress, completed, failed, etc.). A more general overview of all the transfers the user is involved in can be found in the Overview section of the dashboard.
Users can access the data asset from an offering they have consumed by selecting it in the Consumed section of the Contracts Dashboard. Selecting an offering expands it to reveal a Start Transfer button. Clicking this button opens a dialog, where the user is presented with two methods to acquire the data:
- the push method: for this, the user must provide a URL where the data will be pushed to. This URL should point to an endpoint on the user's premises that is accessible by their connector; it does not have to be publicly accessible.
- the pull method: in this case, the user requests access to the data asset directly from the provider's connector. The provider's connector then makes the data available for download at a secure URL and issues a token to the user, which they can use to fetch the data, as sketched below.
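As an illustration of the pull method, the following minimal Python sketch fetches a consumed asset. The URL, token, and header scheme are hypothetical placeholders; the actual values are provided in the transfer dialog by the provider's connector.

```python
import requests

# Hypothetical values: the transfer dialog supplies the real download URL
# and the access token issued by the provider's connector.
download_url = "https://provider-connector.example.org/public/data"
access_token = "<token issued for this transfer>"

# Fetch the data asset, presenting the token as a bearer credential
# (the exact header scheme depends on the connector configuration).
response = requests.get(
    download_url,
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=60,
)
response.raise_for_status()

# Persist the asset locally for use in the Toolbox.
with open("consumed_asset.json", "wb") as f:
    f.write(response.content)
```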
Figure 1. SEDIMARK Toolbox architecture
As illustrated in the figure above, the toolbox is the intelligence part of the SEDIMARK platform, containing all the tools and components used to create, manipulate, and work with standardized datasets. Besides datasets, the tools inside the Toolbox can be used to create, train, and make predictions with AI models that are stored and managed using a model registry (MLflow).
Pipeline Management: The Toolbox uses automated pipelines to orchestrate workflows across all components:
- Data Pipelines: Handle data ingestion, preprocessing, validation, and standardization
- ML Pipelines: Manage feature engineering, model training, validation, and deployment
- Inference Pipelines: Execute real-time and batch predictions
These pipelines ensure seamless integration between tools, datasets, and models, providing automated dependency management, version control, and reproducible workflows throughout the entire process lifecycle.
The documentation on how to create and interact with a pipeline from Mage AI can be found at https://docs.mage.ai/design/data-pipeline-management
To provide a better and more intuitive user experience in the Orchestrator UI, MageAI pipelines are designed around a simpler single-flow structure. More complexity can be added by chaining subsequent pipelines, as described in the section Generic Pipeline Architecture.
Below are two examples of pipeline flows: a compliant single-flow pipeline in Figure 2 and a non-compliant multi-flow architecture in Figure 3.
Figure 2. ✅ Pipeline compliant with Orchestrator UI
Figure 3. ❌ Pipeline not compliant with Orchestrator UI
To achieve compliance between MageAI pipelines and Orchestrator UI workflows, a set of specifications was defined. One important specification concerns the tagging of MageAI pipelines, which makes workflows visible in the Orchestrator UI and groups them into categories based on their scope and purpose.
The available identification tags for MageAI pipelines that are compatible with Orchestrator are presented in the following table:
| Tag Name | Description |
|---|---|
| data_preprocessing | Tag for subsequent (child) preprocessing pipelines of the generic pipeline |
| data_manipulation | Tag for subsequent (child) manipulation pipelines of the generic pipeline |
| train | Training pipelines |
| predict | Inference pipelines |
| processing | Processing pipelines |
| streaming | Pipelines that run continuously and stream data, used by federated learning pipelines. |
Disclaimer! MageAI pipelines that do not contain any of the identification tags will not be shown in the Orchestrator UI. The process of creating and tagging a pipeline is illustrated in the following MageAI pop-up screenshot.
Figure 4. Pipeline tagging at the creation stage
Orchestrator UI workflow runs are configured through variables. Variable definitions help users configure and control the execution of their workflows at the block level. In Mage, variables for a pipeline are defined in the metadata.yaml pipeline file, which is found in the file explorer under the pipelines directory, as shown in the following screenshot:
Figure 5. metadata.yaml file location for pipeline anomaly_annotator
The metadata.yaml configuration file is automatically created by Mage after a new pipeline is instantiated and contains information that describes the pipeline structure. The anomaly_annotator pipeline example consists of 3 blocks (Data loader, Transformer and Data exporter), where each block contains the custom definitions of variables under the configuration attribute, as shown in the following code snippet:
```yaml
blocks:
- all_upstream_blocks_executed: true
color: null
configuration:
attrs:
default: https://vocab.sedimark.io/temperature
description: Filtering attributes to filter timeseries for the selected entity
type: string
end_time:
default: null
description: The end date of the time interval.
format: YYYY-MM-DDThh:mm:ssZ
type: date
entity_id:
      default: urn:ngsi-ld:Sedimark:Temperature:123456789
description: This is the ID of the entity that is stored in the NGSI-LD Broker
type: string
start_time:
default: '2022-11-16T07:00:00Z'
description: The start date of the time interval.
format: YYYY-MM-DDThh:mm:ssZ
type: date
get_data_from_broker:
default: true
description: If true, the data will be fetched from the NGSI-LD Broker.
type: boolean
downstream_blocks:
- anomaly_detection
- histogram_for_broker_loader_1707813944696
executor_config: null
executor_type: local_python
has_callback: false
language: python
name: broker_loader
retry_config: {}
status: executed
timeout: null
type: data_loader
upstream_blocks: []
uuid: broker_loader
- all_upstream_blocks_executed: true
color: null
configuration:
threshold_type:
default: AUCP
description: This is the threshold type for the anomaly detection algorithm.
type: drop_down
downstream_blocks:
- export_anomalies
executor_config: null
executor_type: local_python
has_callback: false
language: python
name: anomaly_detection
retry_config: {}
status: executed
timeout: null
type: transformer
upstream_blocks:
- broker_loader
uuid: anomaly_detection
- all_upstream_blocks_executed: true
color: null
configuration: {}
downstream_blocks: []
executor_config: null
executor_type: local_python
has_callback: false
language: python
name: export_anomalies
retry_config: {}
status: failed
timeout: null
type: data_exporter
upstream_blocks:
- anomaly_detection
uuid: export_anomalies
cache_block_output_in_memory: false
callbacks: []
concurrency_config: {}
conditionals: []
created_at: '2023-11-14 11:26:30.357670+00:00'
data_integration: null
description: data_preprocessing
executor_config: {}
executor_count: 1
executor_type: null
extensions: {}
name: anomaly_annotator
notification_config: {}
remote_variables_dir: null
retry_config: {}
run_pipeline_in_one_process: false
settings:
triggers:
save_in_code_automatically: true
spark_config: {}
tags:
- data_preprocessing
type: python
uuid: anomaly_annotator
variables_dir: /home/src/mage_data/default_repo
widgets: []
```
Variable definitions are represented as mappings in the Mage configuration file, using attributes such as default, description, type and format. Furthermore, there is currently support for 10 types of variables which users can choose from:
| Variable type | Description |
|---|---|
| string | Simple text input that can be used for general purpose string input |
| secret | Password input |
| number | Number input |
| multiple_selection | Drop down for multiple selections |
| drop_down | Drop down for a single selection |
| date | Date input |
| boolean | True or False value |
| array | A list of values |
| trigger | Child pipeline trigger reference |
| dictionary | Dictionary of key value pairs. |
string
The string type must contain a description of what the variable represents. It may also contain a regex entry specifying the expected format of the variable, which is used to validate user input in the Orchestrator UI.
Example:
```yaml
string_name:
  type: string
  description: About the variable
  default: ''
  regex: '^.*$'
```
secret
The secret type must contain the variable description; it renders a password input on the user interface.
Example:
```yaml
secret_name:
  type: secret
  description: What this secret is about
```
number
The number type must specify a description and may optionally include a default value and a range interval for the input.
Example:
```yaml
number_name:
  type: number
  range: [0, 10]
  description: The description
  default: 0
```
drop_down
The drop_down type specifies a list of values from which a single one can be selected.
Example:
```yaml
drop_down_name:
  type: drop_down
  description: The description
  default: value1
  values:
    - value1
    - value2
```
multiple_selection
The multiple_selection type is similar to the drop_down type, with additional support for selecting multiple values.
Example:
```yaml
multiple_selection_name:
  type: multiple_selection
  description: The description
  default: value1
  values:
    - value1
    - value2
```
date
The date type is used to specify a date following the specified format.
Example:
```yaml
date_name:
  type: date
  description: The description
  default: 2025-01-02
  format: "YYYY-MM-DD"
```
array
The array type is used to enumerate multiple values.
Example:
```yaml
array_name:
  type: array
  description: The description.
```
trigger
The trigger type enables selecting a child Mage pipeline for execution; the candidate pipelines are filtered using the tag entry.
Example:
```yaml
trigger_name:
  default: data_preprocessing_test # The actual pipeline_id to run by default.
  description: Trigger for the data preprocessing pipeline
  tag: data_preprocessing # The tag to specify the type of pipeline
  type: trigger
```
dictionary
The dictionary type is used to input key-value pairs.
Example:
```yaml
dictionary_name:
  type: dictionary
  description: The description.
```
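At runtime, Mage passes the values of these variables to the pipeline's block functions as keyword arguments. The following sketch shows how a data loader block, such as the broker_loader block from the anomaly_annotator example above, might read them; the broker-fetching logic itself is a placeholder.

```python
import pandas as pd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data(*args, **kwargs) -> pd.DataFrame:
    # Variables defined under the block's configuration attribute in
    # metadata.yaml arrive here as keyword arguments.
    entity_id = kwargs.get('entity_id')
    attrs = kwargs.get('attrs')
    start_time = kwargs.get('start_time')
    end_time = kwargs.get('end_time')
    get_data_from_broker = kwargs.get('get_data_from_broker', True)

    # Placeholder: a real block would query the NGSI-LD Broker here using
    # entity_id, attrs and the [start_time, end_time] interval.
    records = []
    return pd.DataFrame(records)
```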
To ensure faster development, seamless compatibility, and support across multiple use cases, a generic Mage pipeline was developed. The Generic pipeline enables cross-operability between various data preprocessing, data manipulation and data postprocessing tasks, allowing for a configurable execution process inside the Orchestrator UI application. Moreover, the generic pipeline ensures compatibility with the SEDIMARK ecosystem by providing the necessary tools for handling NGSI-LD assets, supporting both consumers and producers in leveraging their own resources. The architecture of the Generic pipeline is depicted in the following figure:
Figure 6. Generic pipeline architecture
To ensure compatibility between the generalized pipeline (parent) and the subsequently executed pipelines (children), the architecture is centered around a standardized data flow that uses pandas DataFrame objects, with specialized Data Interoperability blocks serving as conversion layers between the NGSI-LD format and DataFrame structures, enabling bidirectional data transformation (a minimal conversion sketch follows the list below). The system supports comprehensive data processing child pipelines through three primary categories:
- Data Preprocessing - which includes data cleaning, transformation, anonymization, feature engineering, time series preprocessing, and data validation operations
- Data Manipulation - encompassing AI model training, inference, data aggregation and summarization, and KPI computation capabilities
- Data Postprocessing - includes all the operations necessary either to apply the inverse of the processing stage, or to prepare the data for export back to the broker.
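As a rough illustration of what the Data Interoperability conversion layer does, the sketch below flattens one temporal attribute of a simplified NGSI-LD entity into a pandas DataFrame. The entity structure is deliberately simplified for illustration; the actual interoperability blocks handle the full NGSI-LD model in both directions.

```python
import pandas as pd


def ngsi_ld_to_dataframe(entity: dict, attribute: str) -> pd.DataFrame:
    """Flatten one temporal attribute of a (simplified) NGSI-LD entity."""
    observations = entity.get(attribute, [])
    rows = [
        {
            "entity_id": entity["id"],
            "observedAt": obs["observedAt"],
            attribute: obs["value"],
        }
        for obs in observations
    ]
    return pd.DataFrame(rows)


# Simplified example entity (illustrative values only).
entity = {
    "id": "urn:ngsi-ld:Sedimark:Temperature:123456789",
    "type": "Temperature",
    "temperature": [
        {"type": "Property", "value": 21.5, "observedAt": "2022-11-16T07:00:00Z"},
        {"type": "Property", "value": 21.9, "observedAt": "2022-11-16T08:00:00Z"},
    ],
}

df = ngsi_ld_to_dataframe(entity, "temperature")
```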
Furthermore, another important feature is the external data source integration which enables users to enrich their datasets and metadata information beyond marketplace offerings, while the integrated MLOps component provides essential model provisioning, storage, versioning, and lifecycle management throughout the pipeline execution. The output DataFrame maintains detailed variable information that serves as the foundation for generating comprehensive data asset metadata, facilitating the creation of new marketplace offerings based on processed results. This architecture ensures that regardless of pipeline complexity, users can leverage both the advanced capabilities of MageAI and the simplified interface of the Orchestrator UI according to their technical requirements.
To support all of this, the SEDIMARK Toolbox includes a built-in SEDIMARK generic execution pipeline, as well as pipelines for data preprocessing, data manipulation and data postprocessing operations; some of the techniques used by the child pipelines are shown on the right side of the architecture diagram.
To run a pipeline directly from Mage AI using the API, you first need to create a trigger for the desired pipeline, which can then be called with the configuration for that pipeline run.
To do this, open the pipeline and click on Triggers in the left menu:
Figure 7. Mage AI pipeline view
After that you need to create the trigger using the New Trigger button:
Figure 8. Creating a trigger
For a new trigger, the type must be set to API. The name and description can be specified, and the API endpoint that starts the pipeline is shown, as in the figure below:
Figure 9. Configuring the trigger
Enabling the trigger:
Figure 10. Activating the trigger for request
Making an API call to start the pipeline:
Figure 11. API call to the trigger to start a new run for the pipeline
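For reference, such a call could look like the following sketch, assuming Mage runs locally on its default port. The schedule ID and trigger token below are placeholders; the exact URL for your trigger is displayed in the trigger configuration view (see Figure 9).

```python
import requests

# Placeholder URL: copy the real one from the trigger configuration view.
trigger_url = (
    "http://localhost:6789/api/pipeline_schedules/"
    "<schedule_id>/pipeline_runs/<trigger_token>"
)

# Variables supplied for this run override the defaults from metadata.yaml.
payload = {
    "pipeline_run": {
        "variables": {
            "entity_id": "urn:ngsi-ld:Sedimark:Temperature:123456789",
            "start_time": "2022-11-16T07:00:00Z",
        }
    }
}

response = requests.post(trigger_url, json=payload, timeout=30)
response.raise_for_status()
print(response.json())  # contains the new pipeline run's details
```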
Checking the status in Mage AI:
Figure 12. Checking the status in the UI
MLflow serves as the primary model registry within the SEDIMARK Toolbox. It stores all models created through the Toolbox, along with their associated metadata—such as performance metrics, plots, and the actual model files.
Once deployed, MLflow can be accessed at http://localhost:5000/, using the default credentials specified in the .env file (or custom ones if they were updated).
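For instance, a training pipeline can log a run to this registry with the standard MLflow client. The sketch below assumes a scikit-learn model and uses illustrative experiment and model names:

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression

# Point the client at the Toolbox MLflow instance.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("anomaly_detection")  # illustrative experiment name

# Toy model standing in for a real training step.
model = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

with mlflow.start_run():
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_metric("rmse", 0.0)
    # Registering the model makes it appear on the Registered models page.
    mlflow.sklearn.log_model(
        model, "model", registered_model_name="temperature-forecaster"
    )
```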
The MLflow UI provides two key sections:
- Experiment page – Displays all model runs. A run corresponds to the process of creating a model through MLflow after training.
Figure 13. MLflow runs page
By clicking on a model we can see the metrics and metadata saved for a specific training epoch of the model.
Figure 14. Showing the metadata saved for an epoch in the UI
- Registered models page – Shows the verified models that are recognized as the most accurate and reliable for their intended tasks.
Figure 15. All the registered models and their current version
Figure 16. Showing the information of a registered model version
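To consume a registered model from an inference pipeline, it can be loaded back by name and version through MLflow's models URI scheme; the model name and version below are illustrative:

```python
import mlflow
import mlflow.pyfunc

# Point the client at the Toolbox MLflow instance.
mlflow.set_tracking_uri("http://localhost:5000")

# Load version 1 of an (illustrative) registered model.
model = mlflow.pyfunc.load_model("models:/temperature-forecaster/1")

predictions = model.predict([[3.0]])
print(predictions)
```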