diff --git a/CHANGELOG.md b/CHANGELOG.md
index 12caa0cc..ef3d985a 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,21 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [1.7.10] - 2026-04-04
+
+### Changed
+
+- **Logical schema vocabulary**: Vertex payloads use **`properties`** (a list of property names and/or typed `Field` entries) instead of **`fields`**. Edge payloads use **`properties`** for relationship attributes instead of nested **`weights`** / **`weights.direct`**. Internal DB projection still builds a `WeightConfig` where backends need it, but authored YAML/Python schemas should declare edge attributes on `Edge.properties` only.
+
+### Breaking
+
+- **`Vertex`**: The `fields` attribute was removed; use **`properties`** everywhere (manifest `graph.vertex_config.vertices[*].properties`, Python `Vertex(properties=[...])`).
+- **`Edge`**: The `weights` / `WeightConfig` shape on logical edges was removed; use **`properties`** for the same data (strings, `Field`, or dicts). Vertex-sourced edge payload wiring belongs in ingestion (**`EdgeActor`** / **`EdgeDerivation`**, edge derivation registry), not on the logical `Edge` model.
+
+### Documentation
+
+- README, docs landing page, concepts, manifest guide, and examples updated for **`properties`**-first schema authoring and clearer “what this project is” intros.
+
 ## [1.7.9] - 2026-04-01
 
 ### Added
diff --git a/README.md b/README.md
index c5c65915..710d6eec 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,8 @@
 # GraFlo
 
 graflo logo
 
-A **Graph Schema & Transformation Language (GSTL)** for Labeled Property Graphs (LPG).
+**GraFlo** is a **Graph Schema & Transformation Language (GSTL)** for **labeled property graphs (LPGs)**. You describe the graph once—**vertices and edges**, typed **`properties`**, identity, and optional backend hints—in **YAML or Python**. 
You describe how raw records become that graph using **resource** pipelines (an expressive sequence of **actors**: descend, transform, vertex, edge, and routers). **Connectors** attach files, SQL tables, SPARQL/RDF, APIs, or in-memory data to those pipelines. **`GraphEngine`** and **`Caster`** then infer schema when possible, project the logical model for a chosen database, and ingest. -GraFlo provides a declarative, database-agnostic specification for mapping heterogeneous data sources — tabular (CSV, SQL), hierarchical (JSON, XML), and RDF/SPARQL — to a unified LPG representation and ingesting it into ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, or NebulaGraph. +**Why it matters:** the **logical graph** is **database-agnostic**; the same manifest can target **ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, or NebulaGraph** without rewriting your transformation story. Backend-specific names, defaults, and indexes are applied only at **DB-aware projection** (`Schema.resolve_db_aware(...)`). > **Package Renamed**: This package was formerly known as `graphcast`. @@ -13,6 +13,15 @@ GraFlo provides a declarative, database-agnostic specification for mapping heter [![pre-commit](https://github.com/growgraph/graflo/actions/workflows/pre-commit.yml/badge.svg)](https://github.com/growgraph/graflo/actions/workflows/pre-commit.yml) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.15446131.svg)]( https://doi.org/10.5281/zenodo.15446131) +## Core ideas + +| Idea | What you get | +|------|----------------| +| **Logical LPG first** | One declarative **schema** (`Vertex` / `Edge` with **`properties`**) is the source of truth—not a particular vendor’s DDL. | +| **Expressive transformation** | **`Resource`** pipelines compose small **actors** so wide tables, nested JSON, RDF, or API payloads map cleanly to vertices and edges—reusable across sources. 
| +| **Separation of concerns** | **Sources** (connectors + `DataSourceRegistry`), **shape of the graph** (`Schema`), and **ingestion steps** (`IngestionModel`) evolve independently. | +| **Safe wiring** | Optional **`connector_connection`** maps connectors to **`conn_proxy`** labels so manifests stay free of secrets; a runtime **`ConnectionProvider`** supplies credentials. | + ## Overview GraFlo separates *what the graph looks like* from *where data comes from* and *which database stores it*. @@ -41,13 +50,13 @@ flowchart LR SI --> R --> GS --> GC --> DBA --> DB ``` -**Source Instance** → **Resource** → **Logical Graph Schema** → **Covariant Graph Representation** → **DB-aware Projection** → **Graph DB** +**Source Instance** → **Resource** (actors) → **Logical Graph Schema** → **Covariant Graph Representation** (`GraphContainer`) → **DB-aware Projection** → **Graph DB** | Stage | Role | Code | |-------|------|------| | **Source Instance** | A concrete data artifact — a CSV file, a PostgreSQL table, a SPARQL endpoint, a `.ttl` file. | `AbstractDataSource` subclasses (`FileDataSource`, `SQLDataSource`, `SparqlEndpointDataSource`, …) with a `DataSourceType`. | | **Resource** | A reusable transformation pipeline — actor steps (descend, transform, vertex, edge, vertex_router, edge_router) that map raw records to graph elements. Data sources bind to Resources by name via the `DataSourceRegistry`. | `Resource` (part of `IngestionModel`). | -| **Graph Schema** | Declarative logical vertex/edge definitions, identities, typed fields, and DB profile — defined in YAML or Python. | `Schema`, `VertexConfig`, `EdgeConfig`. | +| **Graph Schema** | Declarative logical vertex/edge definitions, identities, typed **properties**, and DB profile — defined in YAML or Python. | `Schema`, `VertexConfig`, `EdgeConfig`. | | **Covariant Graph Representation** | A database-independent collection of vertices and edges. | `GraphContainer`. 
| | **DB-aware Projection** | Resolves DB-specific naming/default/index behavior from logical schema + `DatabaseProfile`. | `Schema.resolve_db_aware()`, `VertexConfigDBAware`, `EdgeConfigDBAware`. | | **Graph DB** | The target LPG store — same API for all supported databases. | `ConnectionManager`, `DBWriter`, DB connectors. | @@ -69,12 +78,12 @@ ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, NebulaGraph — same API for al ## Features -- **Declarative LPG schema** — Define vertices, edges, vertex identity, secondary DB indexes, weights, and transforms in YAML or Python. The `Schema` is the single source of truth, independent of source or target. +- **Declarative LPG schema** — Define vertices, edges, vertex identity, secondary DB indexes, edge **properties**, and transforms in YAML or Python. The `Schema` is the single source of truth, independent of source or target. - **Database abstraction** — One logical schema, one API. Target ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, or NebulaGraph without rewriting pipelines. DB idiosyncrasies are handled in DB-aware projection (`Schema.resolve_db_aware(...)`) and connector/writer stages. - **Resource abstraction** — Each `Resource` defines a reusable actor pipeline (descend, transform, vertex, edge, plus **VertexRouter** and **EdgeRouter** for dynamic type-based routing) that maps raw records to graph elements. Data sources bind to Resources by name via the `DataSourceRegistry`, decoupling transformation logic from data retrieval. - **SPARQL & RDF support** — Query SPARQL endpoints (e.g. Apache Fuseki), read `.ttl`/`.rdf`/`.n3` files, and auto-infer schemas from OWL/RDFS ontologies (`rdflib` and `SPARQLWrapper` ship with the default package). -- **Schema inference** — Generate graph schemas from PostgreSQL 3NF databases (PK/FK heuristics) or from OWL/RDFS ontologies (`owl:Class` → vertices, `owl:ObjectProperty` → edges, `owl:DatatypeProperty` → vertex fields). 
-- **Typed fields** — Vertex fields and edge weights carry types (`INT`, `FLOAT`, `STRING`, `DATETIME`, `BOOL`) for validation and database-specific optimisation. +- **Schema inference** — Generate graph schemas from PostgreSQL 3NF databases (PK/FK heuristics) or from OWL/RDFS ontologies (`owl:Class` → vertices, `owl:ObjectProperty` → edges, `owl:DatatypeProperty` → vertex properties). +- **Typed properties** — Vertex and edge **`properties`** may carry types (`INT`, `FLOAT`, `STRING`, `DATETIME`, `BOOL`) for validation and database-specific optimisation. - **Parallel batch processing** — Configurable batch sizes and multi-core execution. - **Credential-free source contracts** — `Bindings.connector_connection` maps each `TableConnector` / `SparqlConnector` (by **connector name** or **hash**) to a `conn_proxy` label. Manifests stay free of secrets; a runtime `ConnectionProvider` resolves each proxy to concrete `GeneralizedConnConfig` (for example PostgreSQL or SPARQL endpoint settings). Ingestion resource names are separate and may map to multiple connectors. diff --git a/docs/concepts/backend_indexes.md b/docs/concepts/backend_indexes.md index eab23fa1..063f54e8 100644 --- a/docs/concepts/backend_indexes.md +++ b/docs/concepts/backend_indexes.md @@ -34,4 +34,4 @@ When `schema` is `None` in `define_vertex_indexes`, identity indexes cannot be e Vertex upserts use node keys from `Vertex` identity. For edges, endpoints are matched on those vertex keys; the relationship itself is merged using a **relationship property map** so parallel edges remain distinct. -GraFlo chooses property names for that map from the edge’s logical identity policy: the **first** entry in `Edge.identities` (excluding `source` / `target` tokens; including a `relation` token as the relationship’s `relation` property when applicable). If `identities` is empty or does not name any relationship fields, **all** `weights.direct` field names are used instead. 
Compile-time edge **indexes** from `identities` (via `database_features`) remain separate from this writer-time `MERGE` key selection; both should agree with your intended uniqueness for a given edge definition.
+GraFlo chooses property names for that map from the edge’s logical identity policy: the **first** entry in `Edge.identities` (excluding `source` / `target` tokens; including a `relation` token as the relationship’s `relation` property when applicable). If `identities` is empty or does not name any relationship fields, the names of **all** declared edge **`properties`** are used instead. Compile-time edge **indexes** from `identities` (via `database_features`) remain separate from this writer-time `MERGE` key selection; both should agree with your intended uniqueness for a given edge definition.
diff --git a/docs/concepts/index.md b/docs/concepts/index.md
index af334b38..46653a74 100644
--- a/docs/concepts/index.md
+++ b/docs/concepts/index.md
@@ -44,7 +44,7 @@
 - **Source Instance** — a concrete data artifact (a file, a table, a SPARQL endpoint), wrapped by an `AbstractDataSource` with a `DataSourceType` (`FILE`, `SQL`, `SPARQL`, `API`, `IN_MEMORY`).
 - **Resource** — a reusable transformation pipeline (actor steps: descend, transform, vertex, edge) that maps raw records to graph elements. Data sources bind to Resources by name via the `DataSourceRegistry`.
 - **GraphManifest** — the canonical top-level contract that composes `schema`, `ingestion_model`, and `bindings`.
-- **Schema** — the declarative logical graph model (`Schema`): vertex/edge definitions, identities, typed fields, and DB profile.
+- **Schema** — the declarative logical graph model (`Schema`): vertex/edge definitions, identities, typed **`properties`**, and DB profile.
 - **IngestionModel** — reusable resources and transforms used to map records into graph entities. 
- **Bindings** — named `FileConnector` / `TableConnector` / `SparqlConnector` list plus `resource_connector` (many rows per resource allowed: resource→0..n connectors) and optional `connector_connection` (connector **name** or **hash**→`conn_proxy` for runtime `ConnectionProvider` resolution without secrets in the manifest). Each connector exposes a **bound source modality** (`BoundSourceKind`: file, SQL table, SPARQL) for dispatch, distinct from the abstract ingestion **Resource**. - **Database-Independent Graph Representation** — a `GraphContainer` of vertices and edges, independent of any target database. @@ -243,7 +243,7 @@ classDiagram class Vertex { +name: str +identity: list~str~ - +fields: list~Field~ + +properties: list~Field~ +filters: FilterExpression? } @@ -261,9 +261,8 @@ classDiagram +source: str +target: str +identities: list~list~str~~ - +weights: WeightConfig? + +properties: list~Field~ +relation: str? - +relation_field: str? +filters: FilterExpression? } @@ -330,10 +329,11 @@ classDiagram IngestionModel *-- "0..*" ProtoTransform : transforms VertexConfig *-- "0..*" Vertex : vertices - Vertex *-- "0..*" Field : fields + Vertex *-- "0..*" Field : properties Vertex --> FilterExpression : filters EdgeConfig *-- "0..*" Edge : edges + Edge *-- "0..*" Field : properties Edge --> FilterExpression : filters Resource *-- ActorWrapper : root @@ -512,81 +512,66 @@ A `Vertex` describes vertices and their logical identity. It supports: ### Edge An `Edge` describes edges and their logical identities. 
It allows:
-
-- Definition at any level of a hierarchical document
-- Reliance on vertex principal index
-- Weight configuration using `direct` parameter (with optional type information)
-- Optional uniqueness semantics through `identities` (multiple candidate keys are allowed)
-### Edge Attributes and Configuration
+- Optional uniqueness semantics through **`identities`** (multiple candidate keys are allowed)
+- **`properties`**: relationship payload (names and optional types), same accepted forms as vertex properties (strings, `Field`, or dicts with at least `name`)
+- Optional static **`relation`** label (e.g. Neo4j relationship type) when it is not derived at ingest time
+
+Ingestion-only controls (**`relation_field`**, **`relation_from_key`**, **`match_source`**, **`match_target`**, vertex-sourced edge payload) live on **`EdgeActor`** / **`EdgeRouterActor`** steps and **`EdgeDerivation`**, not on the logical `Edge` model.
 
-Edges in graflo support a rich set of attributes that enable flexible relationship modeling:
+### Edge properties and configuration
 
-#### Basic Attributes
+#### Basic logical attributes
 
 - **`source`**: Source vertex name (required)
 - **`target`**: Target vertex name (required)
 - **`identities`**: Logical identity keys for the edge (each key can induce uniqueness)
-- **`weights`**: Optional weight configuration for edge properties
+- **`properties`**: Declared relationship attributes (typed or untyped)
 
-**Neo4j, Memgraph, FalkorDB — relationship `MERGE` keys:** Writers match source and target nodes on vertex identity, then `MERGE` the relationship. Which **relationship properties** participate in that `MERGE` (so multiple edges between the same two vertices do not collapse) is derived as follows: use the **first** `identities` key, keep only tokens that refer to relationship payload (skip `source` and `target`; the `relation` token becomes the `relation` property on the relationship where used). If that produces no fields—e.g. 
`identities` is empty—the writer falls back to **all** names in `weights.direct`. Declare `identities` when direct weights are a superset of what should define edge uniqueness. +**Neo4j, Memgraph, FalkorDB — relationship `MERGE` keys:** Writers match source and target nodes on vertex identity, then `MERGE` the relationship. Which **relationship properties** participate in that `MERGE` (so multiple edges between the same two vertices do not collapse) is derived as follows: use the **first** `identities` key, keep only tokens that refer to relationship payload (skip `source` and `target`; the `relation` token becomes the `relation` property on the relationship where used). If that produces no fields—e.g. `identities` is empty—the writer falls back to **all** names in **`Edge.properties`**. Declare `identities` when the full property list is a superset of what should define edge uniqueness. -#### Relationship Type Configuration -- **`relation`**: Explicit relationship name (primarily for Neo4j) -- **`relation_field`**: Field name containing relationship type values (for CSV/tabular data) -- **`relation_from_key`**: Use JSON key names as relationship types (for nested JSON data) +#### Relationship type at ingest time +- **`relation`** on the logical edge: static relationship type when applicable +- **`relation_field`** on an **edge actor** step: column/field holding dynamic relationship type values (CSV/tabular; see Example 3) +- **`relation_from_key`** on an **edge actor** step: use JSON object keys as relationship types (nested JSON; see Example 4) -#### Weight Configuration -- **`weights.vertices`**: List of weight configurations from vertex properties -- **`weights.direct`**: List of direct field mappings as edge properties - - Can be specified as strings (backward compatible), `Field` objects with types, or dicts - - Supports typed fields: `Field(name="date", type="DATETIME")` or `{"name": "date", "type": "DATETIME"}` - - Type information enables better validation 
and database-specific optimizations -- **`weights.source_fields`**: Fields from source vertex to use as weights (deprecated) -- **`weights.target_fields`**: Fields from target vertex to use as weights (deprecated) +#### Payload from vertices at ingest time +Vertex fields that should appear on edges are configured via **edge actor** options (e.g. **`vertex_weights`**, maps), not via a `weights` block on the logical `Edge`. DB layers may still use an internal `WeightConfig` built from `Edge.properties` for backends that need it. -#### Edge Behavior Control +#### Edge behavior control - Edge physical variants should be modeled with `database_features.edge_specs[*].purpose`. - `Edge.aux` is no longer a behavior switch. > DB-only physical edge metadata (including `purpose`) is configured under > `database_features.edge_specs`, not on `Edge`. -#### Matching and Filtering -- **`match_source`**: Select source items from a specific branch of json -- **`match_target`**: Select target items from a specific branch of json -- **`match`**: General matching field for edge creation +#### Matching and filtering (ingestion) +- **`match_source`** / **`match_target`** / **`match`**: edge **actor** options for branch selection when building edges from hierarchical documents -#### Advanced Configuration +#### Advanced logical configuration - **`type`**: Edge type (DIRECT or INDIRECT) - **`by`**: Vertex name for indirect edges - DB-specific edge storage/type names are resolved from `database_features` through DB-aware wrappers (`EdgeConfigDBAware`), not stored on `Edge`. -#### When to Use Different Attributes +#### When to use what **`relation_field`** (Example 3): - -- Use with CSV/tabular data -- When relationship types are stored in a dedicated column -- For data like: `company_a, company_b, relation, date` + +- Set on the **`source` / `target` edge step** in the resource pipeline when relationship types live in a column (e.g. `company_a, company_b, relation, date`). 
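
A minimal sketch of such a step (a fragment of a resource's actor list; the `company` vertex names mirror Example 3 and are otherwise illustrative):

```yaml
# edge step inside a resource pipeline (fragment)
- source: company
  target: company
  relation_field: relation  # the CSV column that carries the relationship type per row
```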
**`relation_from_key`** (Example 4): - -- Use with nested JSON data -- When relationship types are implicit in the data structure -- For data like: `{"dependencies": {"depends": [...], "conflicts": [...]}}` -**`weights.direct`**: - -- Use when you want to add properties directly to edges -- For temporal data (dates), quantitative values, or metadata -- Can specify types for better validation: `weights: {direct: [{"name": "date", "type": "DATETIME"}, {"name": "confidence_score", "type": "FLOAT"}]}` -- Backward compatible with strings: `weights: {direct: ["date", "confidence_score"]}` +- Set on the edge step for nested JSON where keys imply relationship types. -**`match_source`/`match_target`**: - -- For scenarios where we have multiple leaves of json containing the same vertex class -- Example: Creating edges between specific subsets of vertices +**`properties` on the logical edge:** + +- Declare every relationship attribute you want in the schema (dates, scores, metadata). +- Typed example: `properties: [{name: date, type: DATETIME}, {name: confidence_score, type: FLOAT}]` +- String list: `properties: [date, confidence_score]` + +**`match_source` / `match_target`:** + +- Edge **actor** options when multiple branches feed the same vertex types; use to restrict which branches participate in an edge. ### DataSource & DataSourceRegistry An `AbstractDataSource` subclass defines where data comes from and how it is retrieved. Each carries a `DataSourceType`. The `DataSourceRegistry` maps data sources to Resources by name. @@ -808,22 +793,22 @@ Transform steps are executed in the order they appear in `apply`. ## Key Features ### Schema & Abstraction -- **Declarative LPG schema** — `Schema` defines vertices, edges, identity rules, and weights in YAML or Python; the single source of truth for graph structure. Transforms/resources are defined in `IngestionModel`. 
+- **Declarative LPG schema** — `Schema` defines vertices, edges, identity rules, and edge **`properties`** in YAML or Python; the single source of truth for graph structure. Transforms/resources are defined in `IngestionModel`. - **Database abstraction** — one logical schema, multiple backends; DB-specific behavior is applied in DB-aware projection/writer stages (`Schema.resolve_db_aware(...)`, `VertexConfigDBAware`, `EdgeConfigDBAware`). - **Resource abstraction** — each `Resource` is a reusable actor pipeline that maps raw records to graph elements, decoupled from data retrieval. - **DataSourceRegistry** — pluggable `AbstractDataSource` adapters (`FILE`, `SQL`, `API`, `SPARQL`, `IN_MEMORY`) bound to Resources by name. ### Schema Features - **Flexible Identity + Indexing** — logical identity plus DB-specific secondary indexes. -- **Typed Fields** — optional type information for vertex fields and edge weights (INT, FLOAT, STRING, DATETIME, BOOL). -- **Hierarchical Edge Definition** — define edges at any level of nested documents. -- **Weighted Edges** — configure edge weights from document fields or vertex properties with optional type information. +- **Typed properties** — optional type information on vertex and edge **`properties`** (INT, FLOAT, STRING, DATETIME, BOOL). +- **Hierarchical Edge Definition** — define edges at any level of nested documents (via resource **edge** steps and actors). +- **Relationship payload** — logical edges declare **`properties`**; additional payload from vertices or row shape is wired in **edge actors** (`vertex_weights`, maps, etc.) with optional types. - **Blank Vertices** — create intermediate vertices for complex relationships. - **Actor Pipeline** — process documents through a sequence of specialised actors (descend, transform, vertex, edge). - **Reusable Transforms** — define and reference transformations by name across Resources. - **Vertex Filtering** — filter vertices based on custom conditions. 
- **PostgreSQL Schema Inference** — infer schemas from normalised PostgreSQL databases (3NF) with PK/FK constraints. -- **RDF / OWL Schema Inference** — infer schemas from OWL/RDFS ontologies: `owl:Class` → vertices, `owl:ObjectProperty` → edges, `owl:DatatypeProperty` → vertex fields. +- **RDF / OWL Schema Inference** — infer schemas from OWL/RDFS ontologies: `owl:Class` → vertices, `owl:ObjectProperty` → edges, `owl:DatatypeProperty` → vertex **properties**. - **SelectSpec** — declarative view specification for advanced filtering and projection of SQL data before feeding into Resources. Use `TableConnector.view` with `SelectSpec` (full SQL-like `select` or `type_lookup` shorthand for symmetric edge lookups with `source_type` / `target_type` columns) to control exactly what data is queried. Per-side `source_table` / `target_table` / `source_identity` / `target_identity` / `source_type_column` / `target_type_column` cover different lookup tables or join keys. When one endpoint’s type is static in `EdgeRouterActorConfig` only, use `kind="select"` for the view. Use `kind="select"` whenever the shorthand is not expressive enough. ### Schema Migration (v1) @@ -843,8 +828,8 @@ When you compare schemas, treat it like comparing two building blueprints: Another useful analogy is `git diff`, but for graph structure: -- Additive changes (new vertex type, new edge, new field, new index) are similar to adding code in a backward-compatible way. -- Destructive changes (removing fields/types, identity shifts) are similar to breaking API changes: they often require explicit migration steps, data sweeps, or rollouts. +- Additive changes (new vertex type, new edge, new property, new index) are similar to adding code in a backward-compatible way. +- Destructive changes (removing properties/types, identity shifts) are similar to breaking API changes: they often require explicit migration steps, data sweeps, or rollouts. 
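
To make the analogy concrete, here is a minimal sketch (vertex and property names are hypothetical, using the vocabulary of Example 1) of an additive change versus a destructive one in a vertex fragment:

```yaml
# v1 (baseline)
vertices:
  - name: person
    properties:
      - id
      - name
    identity:
      - id

# v2 (additive): one new property, identity untouched; backward-compatible
vertices:
  - name: person
    properties:
      - id
      - name
      - age
    identity:
      - id

# v2' (destructive): identity shifts from id to email; requires migration
vertices:
  - name: person
    properties:
      - email
      - name
    identity:
      - email
```

Comparing v1 to v2 only adds entries; comparing v1 to v2' changes the identity key, so existing vertex keys must be re-derived before ingestion continues.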
Practical comparison checklist: @@ -914,14 +899,14 @@ Schema comparison gives you a predictable transition path between versions. Inst 4. Configure appropriate batch sizes based on your data volume 5. Enable parallel processing for large datasets 6. Choose the right relationship attribute based on your data format: - - `relation_field` - extract relation from document field - - `relation_from_key` - extract relation from the key above - - `relation` for explicit relationship names -7. Use edge weights to capture temporal or quantitative relationship properties - - Specify types for weight fields when using databases that require type information (e.g., TigerGraph) - - Use typed `Field` objects or dicts with `type` key for better validation -8. Leverage key matching (`match_source`, `match_target`) for complex matching scenarios + - **`relation_field`** on an edge **actor** step — relation from a column/field + - **`relation_from_key`** on an edge **actor** step — relation from JSON keys + - **`relation`** on the logical edge — static relationship name when applicable +7. Use logical edge **`properties`** (and edge-actor payload options) for temporal or quantitative relationship attributes + - Specify types when the target DB requires them (e.g., TigerGraph) + - Use typed `Field` objects or dicts with a `type` key for better validation +8. Leverage key matching (`match_source`, `match_target`) on edge steps for complex matching scenarios 9. Use PostgreSQL schema inference for automatic schema generation from normalized databases (3NF) with proper PK/FK constraints 10. Use RDF/OWL schema inference (`infer_schema_from_rdf`) when ingesting data from SPARQL endpoints or `.ttl` files with a well-defined ontology -11. Specify field types for better validation and database-specific optimizations, especially when targeting TigerGraph +11. 
Specify property types for better validation and database-specific optimizations, especially when targeting TigerGraph diff --git a/docs/examples/example-1.md b/docs/examples/example-1.md index 6d732665..4fe3d178 100644 --- a/docs/examples/example-1.md +++ b/docs/examples/example-1.md @@ -17,19 +17,17 @@ Let's define vertices as ```yaml vertices: - name: person - fields: + properties: - id - name - age - indexes: - - fields: - - id + identity: + - id - name: department - fields: + properties: + - name + identity: - name - indexes: - - fields: - - name ``` and edges as @@ -49,7 +47,7 @@ Rendered graph: ![Rendered Graph](../assets/1-ingest-csv/figs/graph.png){ width="700" } -Let's define the mappings: we want to map document fields to vertex fields. Use vertex `from` to project document fields onto vertex fields and avoid name collisions (e.g. both `Person` and `Department` have a field called `name`): +Let's define the mappings: we want to map document keys onto vertex **properties**. Use vertex `from` to project source columns onto schema property names and avoid name collisions (e.g. 
both `Person` and `Department` have a property called `name`): ```yaml - name: people diff --git a/docs/examples/example-2.md b/docs/examples/example-2.md index f68d10c6..06766f4f 100644 --- a/docs/examples/example-2.md +++ b/docs/examples/example-2.md @@ -23,17 +23,23 @@ In this example we will be interested in how to create vertices `Work` and `Work Let's define vertices as ```yaml - vertices: - - name: work - fields: - - _key - - doi - indexes: - - fields: - - _key - - unique: false - fields: - - doi +# fragment of schema.graph — vertex logical model + secondary index profile +vertex_config: + vertices: + - name: work + properties: + - _key + - doi + - title + - created_date + identity: + - _key +db_profile: + vertex_indexes: + work: + - unique: false + fields: + - doi ``` The graph structure is quite simple: diff --git a/docs/examples/example-3.md b/docs/examples/example-3.md index a330851f..d87f5fe8 100644 --- a/docs/examples/example-3.md +++ b/docs/examples/example-3.md @@ -1,6 +1,6 @@ -# Example 3: CSV with Edge Weights and Multiple Relations +# Example 3: CSV with Edge Properties and Multiple Relations -This example demonstrates how to handle complex relationships where multiple edges can exist between the same pair of entities, each with different relation types and weights. +This example demonstrates how to handle complex relationships where multiple edges can exist between the same pair of entities, each with different relation types and relationship attributes. ## Data Structure @@ -20,39 +20,40 @@ We define a simple `company` vertex: vertex_config: vertices: - name: company - fields: + properties: + - name + identity: - name ``` ### Edges -The key feature here is using `relation_field` to dynamically create different edge types: +Logical edges declare **`properties`** (relationship attributes) and, when needed, an **`identities`** key so parallel relationships stay distinct. 
Dynamic relationship **types** from a CSV column are configured on the **edge step** in the resource pipeline with **`relation_field`** (not on the logical `Edge`).
 
 ```yaml
 edge_config:
   edges:
     - source: company
       target: company
-      relation_field: relation
-      weights:
-        direct:
-          - date
+      identities:
+        - - relation
+      properties:
+        - date
 ```
 
 ## Key Concepts
 
-### `relation_field` Attribute
-The `relation_field: relation` tells graflo to:
+### `relation_field` on the edge step
+In the resource pipeline, `relation_field: relation` on the `source`/`target` step tells GraFlo to:
 
 - Read the `relation` column from the CSV
-- Create different edge types based on the values in that column
-- Instead of a single edge type, we get multiple edge types: `invests_in`, `partners_with`, `acquires`, etc.
+- Use its values as the relationship type for that row
+- Produce multiple relationship types from one edge definition: `invests_in`, `partners_with`, `acquires`, etc.
 
-### Edge Weights
-The `weights.direct: [date]` configuration:
+### Edge `properties`
+The `properties: [date]` entry on the edge:
 
-- Adds the `date` field as a weight property on each edge
-- This allows temporal analysis of relationships
-- The date becomes a property that can be used for filtering, sorting, or analysis
+- Declares `date` as a relationship attribute on each emitted edge
+- Supports temporal analysis and filtering/sorting on that attribute
 
 ## Resource Mapping
 
@@ -66,6 +67,9 @@ resources:
         "from": {name: company_a}
       - vertex: company
         "from": {name: company_b}
+      - source: company
+        target: company
+        relation_field: relation
 ```
 
 This creates two company vertices for each row and establishes the relationship between them. 
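
With `identities: [[relation]]` plus `properties: [date]`, two rows that link the same pair of companies under different relations stay distinct. A sketch of what is emitted (company names and dates are illustrative; the actual stored form is backend-specific):

```yaml
# input rows, columns: company_a, company_b, relation, date
#   acme, globex, invests_in,    2023-05-01
#   acme, globex, partners_with, 2024-01-15

# resulting relationships (conceptual view)
- source: acme            # company vertex matched by identity `name`
  target: globex
  relation: invests_in    # taken from the `relation` column via relation_field
  date: "2023-05-01"      # declared in Edge.properties
- source: acme
  target: globex
  relation: partners_with
  date: "2024-01-15"
```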
@@ -115,11 +119,13 @@
 from graflo.architecture.contract.bindings import FileConnector
 import pathlib
 
 bindings = Bindings()
-people_connector = FileConnector(regex="^relations.*\.csv$", sub_path=pathlib.Path("."))
-bindings.add_connector(
-    people_connector,
+relations_connector = FileConnector(
+    name="relations_files",
+    regex="^relations.*\\.csv$",
+    sub_path=pathlib.Path("."),
 )
-bindings.bind_resource("people", people_connector)
+bindings.add_connector(relations_connector)
+bindings.bind_resource("relations", relations_connector)
 
 from graflo.hq.caster import IngestionParams
@@ -147,9 +153,9 @@ This connector is particularly useful for:
 
 ## Key Takeaways
 
-1. **`relation_field`** enables dynamic edge type creation from data
+1. **`relation_field`** on the edge step enables dynamic relationship types from data
 2. **Multiple edges** can exist between the same vertex pair
-3. **Edge weights** add temporal or quantitative properties to relationships
+3. **Edge `properties`** declare temporal or quantitative attributes on relationships
 4. 
**Flexible modeling** supports complex real-world business scenarios Please refer to [examples](https://github.com/growgraph/graflo/tree/main/examples/3-ingest-csv-edge-weights) diff --git a/docs/examples/example-4.md b/docs/examples/example-4.md index fe1a939d..b8eb08ee 100644 --- a/docs/examples/example-4.md +++ b/docs/examples/example-4.md @@ -40,28 +40,25 @@ We define three vertex types: vertex_config: vertices: - name: package - fields: + properties: - name - version - indexes: - - fields: - - name + identity: + - name - name: maintainer - fields: + properties: - name - email - indexes: - - fields: - - email + identity: + - email - name: bug - fields: + properties: - id - subject - severity - date - indexes: - - fields: - - id + identity: + - id ``` ### Edges diff --git a/docs/examples/example-5.md b/docs/examples/example-5.md index b2c100be..610332e2 100644 --- a/docs/examples/example-5.md +++ b/docs/examples/example-5.md @@ -89,7 +89,7 @@ The inference engine uses intelligent heuristics to classify tables. **These heu - **Require**: 2+ foreign key (FK) constraints decorated on the table - Foreign keys represent relationships between entities -- May have additional attributes (weights, timestamps, quantities) +- May have additional columns (timestamps, quantities, etc.) 
that become edge **properties** - Represent relationships or transactions between entities - Foreign keys point to vertex tables and become edge source/target mappings @@ -116,13 +116,13 @@ PostgreSQL types are automatically mapped to graflo Field types with proper type The inferred schema automatically includes: -- **Vertices**: `users`, `products` (with typed fields matching PostgreSQL columns) -- **Edges**: - - `users → products` (from `purchases` table) with weights: `purchase_date`, `quantity`, `total_amount` - - `users → users` (from `follows` table) with weight: `created_at` +- **Vertices**: `users`, `products` (with typed **properties** matching PostgreSQL columns) +- **Edges**: + - `users → products` (from `purchases` table) with **properties**: `purchase_date`, `quantity`, `total_amount` + - `users → users` (from `follows` table) with **properties**: `created_at` - **Resources**: Automatically created for each table with appropriate actors -- **Indexes**: Primary keys become vertex indexes, foreign keys become edge indexes -- **Weights**: Additional columns in edge tables become edge weight properties +- **Indexes**: Primary keys drive vertex identity / indexing; foreign keys drive edge mappings (see `database_features` for secondary indexes) +- **Edge payload**: Additional columns in edge tables become edge **properties** on the logical `Edge` ### Graph Structure Visualization @@ -139,7 +139,7 @@ This diagram shows: ### Vertex Fields Structure -Each vertex includes typed fields inferred from PostgreSQL columns: +Each vertex includes typed **properties** inferred from PostgreSQL columns: ![Vertex Fields](../assets/5-ingest-postgres/figs/public_vc2fields.png){ width="500" } @@ -288,11 +288,11 @@ ingestion_model = manifest.require_ingestion_model() The inferred schema will have: -- **Vertices**: `users`, `products` with typed fields matching PostgreSQL column types +- **Vertices**: `users`, `products` with typed **properties** matching PostgreSQL column 
types - **Edges**: - - `users → products` (from `purchases` table) with weight properties - - `users → users` (from `follows` table) with weight properties + - `users → products` (from `purchases` table) with edge **properties** from non-FK columns + - `users → users` (from `follows` table) with edge **properties** from non-FK columns - **Resources**: Automatically created in `ingestion_model` for each table with appropriate actors **What happens during inference:** @@ -570,7 +570,7 @@ schema: vertex_config: vertices: - name: products - fields: + properties: - name: id type: INT - name: name @@ -581,10 +581,9 @@ schema: type: STRING - name: created_at type: DATETIME - indexes: - - fields: [id] + identity: [id] - name: users - fields: + properties: - name: id type: INT - name: name @@ -593,22 +592,23 @@ schema: type: STRING - name: created_at type: DATETIME - indexes: - - fields: [id] + identity: [id] edge_config: edges: - source: users target: products - weights: - direct: - - name: purchase_date - - name: quantity - - name: total_amount + properties: + - name: purchase_date + type: DATETIME + - name: quantity + type: INT + - name: total_amount + type: FLOAT - source: users target: users - weights: - direct: - - name: created_at + properties: + - name: created_at + type: DATETIME ingestion_model: resources: - name: products @@ -657,9 +657,9 @@ Resources are automatically created for each table with appropriate actors: - **Vertex tables**: Create `VertexActor` to map rows to vertices - **Edge tables**: Create `EdgeActor` with proper field mappings for source and target vertices -### Type-Safe Field Definitions +### Type-safe property definitions -All fields in the inferred schema include type information, enabling: +All **properties** in the inferred schema include type information, enabling: - Better validation during ingestion - Database-specific optimizations @@ -848,7 +848,7 @@ After successful ingestion, you can explore your graph data using each database' 2. 
**Type mapping** ensures proper type handling across PostgreSQL and graph databases 3. **Direct database access** enables efficient data ingestion without intermediate files 4. **Flexible heuristics** automatically detect vertices and edges from table structure -5. **Type-safe fields** provide better validation and database-specific optimizations +5. **Type-safe properties** provide better validation and database-specific optimizations 6. **Resource generation** automatically creates appropriate actors for each table 7. **Schema customization** allows modifications after inference for specific use cases 8. **Multiple database support** allows you to try different graph databases and compare results diff --git a/docs/examples/example-6.md b/docs/examples/example-6.md index a65e7a9c..3d6dd75c 100644 --- a/docs/examples/example-6.md +++ b/docs/examples/example-6.md @@ -21,7 +21,7 @@ The example models a small academic knowledge graph with three entity types and ### Ontology (TBox) — `data/ontology.ttl` -The ontology declares classes, datatype properties (vertex fields), and object properties (edges) using standard OWL vocabulary: +The ontology declares classes, datatype properties (vertex **properties**), and object properties (edges) using standard OWL vocabulary: ```turtle @prefix owl: . @@ -34,7 +34,7 @@ ex:Researcher a owl:Class . ex:Publication a owl:Class . ex:Institution a owl:Class . -# Datatype properties (become vertex fields) +# Datatype properties (become vertex properties) ex:fullName a owl:DatatypeProperty ; rdfs:domain ex:Researcher ; rdfs:range xsd:string . ex:orcid a owl:DatatypeProperty ; rdfs:domain ex:Researcher ; rdfs:range xsd:string . ex:title a owl:DatatypeProperty ; rdfs:domain ex:Publication ; rdfs:range xsd:string . @@ -129,7 +129,7 @@ schema, ingestion_model = engine.infer_schema_from_rdf( **What happens during inference:** 1. **Class discovery** — `owl:Class` declarations become **vertices** (`Researcher`, `Publication`, `Institution`) -2. 
**Field discovery** — `owl:DatatypeProperty` declarations with `rdfs:domain` become **fields** on the corresponding vertex, plus automatic `_key` and `_uri` fields +2. **Property discovery** — `owl:DatatypeProperty` declarations with `rdfs:domain` become **properties** on the corresponding vertex, plus automatic `_key` and `_uri` properties 3. **Edge discovery** — `owl:ObjectProperty` declarations with `rdfs:domain` / `rdfs:range` become **edges** (`authorOf`, `affiliatedWith`, `cites`) 4. **Resource creation** — One resource per class is created, wiring the vertex and its outgoing edges @@ -145,11 +145,11 @@ schema: vertex_config: vertices: - name: Researcher - fields: [_key, _uri, fullName, orcid] + properties: [_key, _uri, fullName, orcid] - name: Publication - fields: [_key, _uri, title, year, doi] + properties: [_key, _uri, title, year, doi] - name: Institution - fields: [_key, _uri, instName, country] + properties: [_key, _uri, instName, country] edge_config: edges: - source: Researcher @@ -369,7 +369,7 @@ These modes are mutually exclusive. Use file mode for small-to-medium datasets s ## Key Takeaways -1. **OWL ontology inference** eliminates manual schema definition — `owl:Class` becomes vertices, `owl:DatatypeProperty` becomes fields, `owl:ObjectProperty` becomes edges +1. **OWL ontology inference** eliminates manual schema definition — `owl:Class` becomes vertices, `owl:DatatypeProperty` becomes vertex **properties**, `owl:ObjectProperty` becomes edges 2. **Explicit `SparqlConnector` mapping** gives full control over which class URI maps to which resource and data source 3. **Local file and remote endpoint** modes are both supported via the same `SparqlConnector` abstraction 4. 
**No intermediate formats** — RDF triples are converted directly to flat dicts and ingested into the graph database diff --git a/docs/examples/index.md b/docs/examples/index.md index 47fc754e..09b95c9c 100644 --- a/docs/examples/index.md +++ b/docs/examples/index.md @@ -7,5 +7,5 @@ 5. **[🚀 PostgreSQL Schema Inference and Ingestion](example-5.md)** - **Automatically infer graph schemas from normalized PostgreSQL databases (3NF)** with proper primary keys (PK) and foreign keys (FK). Uses intelligent heuristics to detect vertices and edges - no manual schema definition needed! Perfect for migrating relational data to graph databases. 6. **[🔗 RDF / Turtle Ingestion with Explicit Resource Mapping](example-6.md)** - **Infer graph schemas from OWL ontologies and ingest RDF data** using explicit `SparqlConnector` resource mapping. Supports local Turtle files and remote SPARQL endpoints. Perfect for knowledge graph pipelines built on semantic web standards. 7. **[Polymorphic Objects and Relations](example-7.md)** — **Route polymorphic entities and dynamic relations** using `vertex_router` and `edge_router`. One objects table (Person, Vehicle, Institution) and one relations table (EMPLOYED_BY, OWNS, FUNDS, etc.) map to a rich graph with type discriminators and `relation_map`. -8. **[Multi-Edge Weights with Filters and `dress` Transforms](example-8.md)** — **Ticker-style CSV to Neo4j** with vertex filters, multiple edge weights, and `dress`-scoped transforms on metric name/value pairs. +8. **[Multi-edge properties with filters and `dress` transforms](example-8.md)** — **Ticker-style CSV to Neo4j** with vertex filters, rich relationship payload, and `dress`-scoped transforms on metric name/value pairs. 9. **[Explicit `connector_connection` Proxy Wiring](example-9.md)** — Show how manifest proxy labels (`conn_proxy`) are resolved at runtime into real DB configs via `ConnectionProvider`. 
\ No newline at end of file diff --git a/docs/getting_started/creating_manifest.md b/docs/getting_started/creating_manifest.md index ef687ce9..3b386fc0 100644 --- a/docs/getting_started/creating_manifest.md +++ b/docs/getting_started/creating_manifest.md @@ -28,10 +28,10 @@ schema: vertex_config: vertices: - name: person - fields: [id, name, age] + properties: [id, name, age] identity: [id] - name: department - fields: [name] + properties: [name] identity: [name] edge_config: edges: @@ -62,8 +62,8 @@ bindings: {} Defines the graph contract. - `metadata`: human-facing identity (`name`, optional `version`) -- `graph.vertex_config`: vertex types, fields, identity keys -- `graph.edge_config`: source/target relationships, optional relation/weights +- `graph.vertex_config`: vertex types, **`properties`**, identity keys +- `graph.edge_config`: source/target relationships, optional `relation`, edge **`properties`**, `identities` - `db_profile`: DB-specific physical behavior (indexes, naming, backend details) Use `schema` for **what graph exists**. diff --git a/docs/index.md b/docs/index.md index b790bcfd..7f8d2bae 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,8 +1,13 @@ # GraFlo graflo logo -GraFlo is a **Graph Schema Transformation Language (GSTL)** for Labeled Property Graphs (LPG) - a domain-specific language (DSL) for defining graph structure and transformation logic in one manifest. +**GraFlo** is a **Python package** and **manifest format** (`GraphManifest`: YAML **`schema`** + **`ingestion_model`** + **`bindings`**) for **labeled property graphs**. It is a **Graph Schema & Transformation Language (GSTL)**: you encode the LPG **once** at the logical layer (vertices, edges, typed **`properties`**, identity), express **how** records become graph elements with **`Resource`** actor pipelines, and **project** that model per backend before load. 
**`GraphEngine`** covers inference, DDL, and ingest; **`Caster`** focuses on batching records into a **`GraphContainer`** and writing them out via **`DBWriter`**. -It combines a database-independent graph model, DB-specific details, and ingestion pipeline into a graph manifest and runs it across many systems. With declarative schemas and reusable `Resource` pipelines, GraFlo maps CSV/SQL, JSON/XML, RDF/SPARQL, REST APIs, and in-memory data into a single database-independent LPG model (`GraphContainer`), then projects it to supported graph databases: ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, and NebulaGraph. This keeps schema and transform logic portable across targets and helps teams avoid vendor lock-in. +## Why GraFlo + +- **DB-agnostic LPG** — The **logical schema** describes an LPG independent of ArangoDB, Neo4j, Cypher-family stores, TigerGraph, and so on. You do not fork your “graph design” per vendor; you fork only **projection** and connectors. +- **Expressive, composable transforms** — **`Resource`** pipelines chain **actors** (descend into nested data, apply named **transforms**, emit **vertices** and **edges**, route by type with **VertexRouter** / **EdgeRouter**). The same pipeline can be bound to CSV, PostgreSQL, SPARQL, or an API via **`Bindings`**. +- **Clear boundaries** — **`Schema`** is structure only. **`IngestionModel`** holds resources and shared transforms. **`Bindings`** map ingestion resource names to one or more **connectors** and optional **`conn_proxy`** labels—so manifests stay credential-free at rest. +- **Multi-target ingestion** — One code path and manifest can target **ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, or NebulaGraph**; backend quirks are handled in **DB-aware** types and writers, not in your logical model.
![Python](https://img.shields.io/badge/python-3.11%2B-blue.svg) [![PyPI version](https://badge.fury.io/py/graflo.svg)](https://badge.fury.io/py/graflo) @@ -16,13 +21,13 @@ It combines a database-independent graph model, DB-specific details, and ingesti ## Pipeline -**Source Instance** → **Resource** → **Graph Schema** → **Covariant Graph Representation** → **Graph DB** +**Source Instance** → **Resource** (actor pipeline) → **Logical Graph Schema** → **Covariant Graph Representation** (`GraphContainer`) → **DB-aware Projection** → **Graph DB** | Stage | Role | Code | |-------|------|------| | **Source Instance** | A concrete data artifact — a CSV file, a PostgreSQL table, a SPARQL endpoint, a `.ttl` file. | `AbstractDataSource` subclasses with a `DataSourceType` (`FILE`, `SQL`, `SPARQL`, `API`, `IN_MEMORY`). | | **Resource** | A reusable transformation pipeline — actor steps (descend, transform, vertex, edge, vertex_router, edge_router) that map raw records to graph elements. Data sources bind to Resources by name via the `DataSourceRegistry`. | `Resource` (part of `IngestionModel`). | -| **Graph Schema** | Declarative logical vertex/edge definitions, identities, typed fields, and DB profile. | `Schema`, `VertexConfig`, `EdgeConfig`. | +| **Graph Schema** | Declarative logical vertex/edge definitions, identities, typed **properties**, and DB profile. | `Schema`, `VertexConfig`, `EdgeConfig`. | | **Covariant Graph Representation** | A database-independent collection of vertices and edges. | `GraphContainer`. | | **DB-aware Projection** | Resolves DB-specific naming/default/index behavior from logical schema + `DatabaseProfile`. | `Schema.resolve_db_aware()`, `VertexConfigDBAware`, `EdgeConfigDBAware`. | | **Graph DB** | The target LPG store — same API for all supported databases. | `ConnectionManager`, `DBWriter`, DB connectors. 
| @@ -33,15 +38,15 @@ It combines a database-independent graph model, DB-specific details, and ingesti GraFlo targets the LPG model: -- **Vertices** — nodes with typed properties and unique identifiers. -- **Edges** — directed relationships between vertices, carrying their own properties and weights. +- **Vertices** — nodes with typed **properties** (manifest key: `properties`) and logical **identity** keys for upserts. +- **Edges** — directed relationships between vertices; relationship attributes are declared as **`properties`** on the logical edge (same list-of-names-or-`Field` shape as vertices). ### Schema The Schema is the single source of truth for the graph structure: -- **Vertex definitions** — vertex types, fields (optionally typed: `INT`, `FLOAT`, `STRING`, `DATETIME`, `BOOL`), and indexes. -- **Edge definitions** — relationships between vertex types, with optional weight fields. +- **Vertex definitions** — vertex types, **`properties`** (optionally typed: `INT`, `FLOAT`, `STRING`, `DATETIME`, `BOOL`), identity, and filters; secondary indexes live under **`database_features`**. +- **Edge definitions** — source/target (and optional `relation`), **`properties`** for relationship payload, and optional **`identities`** for parallel-edge / MERGE semantics. - **Schema inference** — generate schemas from PostgreSQL 3NF databases (PK/FK heuristics) or from OWL/RDFS ontologies. Resources and transforms are part of `IngestionModel`, not `Schema`. @@ -78,7 +83,7 @@ The `DataSourceRegistry` manages `AbstractDataSource` adapters, each carrying a ## Key Features -- **Declarative LPG schema DSL** — Define vertices, edges, indexes, weights, and transforms in YAML or Python. The `Schema` is the single source of truth, independent of source or target. +- **Declarative LPG schema DSL** — Define vertices, edges, indexes, edge **properties**, and transforms in YAML or Python. The `Schema` is the single source of truth, independent of source or target. 
- **Database abstraction** — One logical schema and transformation DSL, one API. Target ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, or NebulaGraph without rewriting pipelines. DB idiosyncrasies are handled in DB-aware projection (`Schema.resolve_db_aware(...)`) and connector/writer stages. - **Resource abstraction** — Each `Resource` defines a reusable actor pipeline that maps raw records to graph elements. Actor types include descend, transform, vertex, edge, plus **VertexRouter** and **EdgeRouter** for dynamic type-based routing (see [Concepts — Actor](concepts/index.md#actor)). Data sources bind to Resources by name via the `DataSourceRegistry`, decoupling transformation logic from data retrieval. - **DataSourceRegistry** — Register `FILE`, `SQL`, `API`, `IN_MEMORY`, or `SPARQL` data sources. Each `DataSourceType` plugs into the same Resource pipeline. @@ -86,7 +91,7 @@ The `DataSourceRegistry` manages `AbstractDataSource` adapters, each carrying a - **Schema inference** — Generate graph schemas from PostgreSQL 3NF databases (PK/FK heuristics) or from OWL/RDFS ontologies. See [Example 5](examples/example-5.md). - **Schema migration planning/execution** — Generate typed migration plans between schema versions, apply low-risk additive changes with risk gates, and track revision history via `migrate_schema`. - Compare `from` and `to` schemas before execution to preview structural deltas and blocked high-risk operations. -- **Typed fields** — Vertex fields and edge weights carry types for validation and database-specific optimisation. +- **Typed properties** — Vertex and edge **`properties`** carry optional types for validation and database-specific optimisation. - **Parallel batch processing** — Configurable batch sizes and multi-core execution. - **Advanced filtering** — Server-side filtering (e.g. TigerGraph REST++ API), client-side filter expressions, and **SelectSpec** for declarative SQL view/filter control before data reaches Resources. 
- **Blank vertices** — Create intermediate nodes for complex relationship modelling. diff --git a/docs/reference/architecture/contract/declarations/edge_derivation_registry.md b/docs/reference/architecture/contract/declarations/edge_derivation_registry.md new file mode 100644 index 00000000..d8d17a78 --- /dev/null +++ b/docs/reference/architecture/contract/declarations/edge_derivation_registry.md @@ -0,0 +1,3 @@ +# `graflo.architecture.contract.declarations.edge_derivation_registry` + +::: graflo.architecture.contract.declarations.edge_derivation_registry diff --git a/docs/reference/architecture/edge_derivation.md b/docs/reference/architecture/edge_derivation.md new file mode 100644 index 00000000..aee6abc6 --- /dev/null +++ b/docs/reference/architecture/edge_derivation.md @@ -0,0 +1,3 @@ +# `graflo.architecture.edge_derivation` + +::: graflo.architecture.edge_derivation diff --git a/examples/1-ingest-csv/manifest.yaml b/examples/1-ingest-csv/manifest.yaml index 76b329cf..40d33173 100644 --- a/examples/1-ingest-csv/manifest.yaml +++ b/examples/1-ingest-csv/manifest.yaml @@ -5,14 +5,14 @@ schema: vertex_config: vertices: - name: person - fields: + properties: - id - name - age identity: - id - name: department - fields: + properties: - name identity: - name diff --git a/examples/2-ingest-self-references/ingest.py b/examples/2-ingest-self-references/ingest.py index beb5da0e..ba543473 100644 --- a/examples/2-ingest-self-references/ingest.py +++ b/examples/2-ingest-self-references/ingest.py @@ -31,7 +31,7 @@ bindings = Bindings( connectors=[ - FileConnector(name="openalex", regex=r"\\Sjson$", sub_path=pathlib.Path(".")) + FileConnector(name="openalex", regex=r"\Sjson$", sub_path=pathlib.Path(".")) ], resource_connector=[{"resource": "work", "connector": "openalex"}], ) diff --git a/examples/2-ingest-self-references/manifest.yaml b/examples/2-ingest-self-references/manifest.yaml index 0e385655..d9715d45 100644 --- a/examples/2-ingest-self-references/manifest.yaml +++ 
b/examples/2-ingest-self-references/manifest.yaml @@ -5,7 +5,7 @@ schema: vertex_config: vertices: - name: work - fields: + properties: - _key - doi - title @@ -39,8 +39,6 @@ ingestion_model: call: use: keep_suffix_id - vertex: work - - source: work - target: work transforms: - name: keep_suffix_id foo: split_keep_part diff --git a/examples/3-ingest-csv-edge-weights/ingest.py b/examples/3-ingest-csv-edge-weights/ingest.py index 4ad579ef..c45a281c 100644 --- a/examples/3-ingest-csv-edge-weights/ingest.py +++ b/examples/3-ingest-csv-edge-weights/ingest.py @@ -1,8 +1,8 @@ from suthing import FileHandle from graflo import GraphManifest -from graflo.db import Neo4jConfig from graflo.hq import GraphEngine from graflo.hq.caster import IngestionParams +from graflo.db import TigergraphConfig import logging @@ -19,10 +19,12 @@ # Load config from docker/neo4j/.env (recommended) # This automatically reads NEO4J_BOLT_PORT, NEO4J_AUTH, etc. -conn_conf = Neo4jConfig.from_docker_env() -# from graflo.db import TigergraphConfig -# conn_conf = TigergraphConfig.from_docker_env() +# from graflo.db import Neo4jConfig +# conn_conf = Neo4jConfig.from_docker_env() + +conn_conf = TigergraphConfig.from_docker_env() +conn_conf.max_job_size = 5000 # Alternative: Create config directly or use environment variables # Set NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD, NEO4J_BOLT_PORT env vars diff --git a/examples/3-ingest-csv-edge-weights/manifest.yaml b/examples/3-ingest-csv-edge-weights/manifest.yaml index ca7bef62..5a3b9ee1 100644 --- a/examples/3-ingest-csv-edge-weights/manifest.yaml +++ b/examples/3-ingest-csv-edge-weights/manifest.yaml @@ -5,7 +5,7 @@ schema: vertex_config: vertices: - name: company - fields: + properties: - name identity: - name @@ -13,10 +13,10 @@ schema: edges: - source: company target: company - relation_field: relation - weights: - direct: - - date + identities: + - - relation + properties: + - date db_profile: {} ingestion_model: resources: @@ -28,6 +28,9 @@ 
ingestion_model: - vertex: company from: name: company_b + - source: company + target: company + relation_field: relation bindings: connectors: - name: relations_files diff --git a/examples/4-ingest-neo4j/manifest.yaml b/examples/4-ingest-neo4j/manifest.yaml index ded23704..8f185566 100644 --- a/examples/4-ingest-neo4j/manifest.yaml +++ b/examples/4-ingest-neo4j/manifest.yaml @@ -6,19 +6,19 @@ schema: vertex_config: vertices: - name: package - fields: + properties: - name - version identity: - name - name: maintainer - fields: + properties: - name - email identity: - email - name: bug - fields: + properties: - id - subject - severity diff --git a/examples/5-ingest-postgres/generated-manifest.yaml b/examples/5-ingest-postgres/generated-manifest.yaml index cf205276..9475b026 100644 --- a/examples/5-ingest-postgres/generated-manifest.yaml +++ b/examples/5-ingest-postgres/generated-manifest.yaml @@ -1,27 +1,28 @@ core_schema: edge_config: edges: - - relation: follows + - properties: + - name: created_at + type: DATETIME + relation: follows source: users target: users - weights: - direct: - - name: created_at - type: DATETIME - - relation: purchases + - properties: + - name: purchase_date + type: DATETIME + - name: quantity + type: INT + - name: total_amount + type: FLOAT + relation: purchases source: users target: products - weights: - direct: - - name: purchase_date - type: DATETIME - - name: quantity - type: INT - - name: total_amount - type: FLOAT vertex_config: vertices: - - fields: + - identity: + - id + name: products + properties: - name: id type: INT - name: name @@ -32,10 +33,10 @@ core_schema: type: STRING - name: created_at type: DATETIME - identity: + - identity: - id - name: products - - fields: + name: users + properties: - name: id type: INT - name: name @@ -44,9 +45,6 @@ core_schema: type: STRING - name: created_at type: DATETIME - identity: - - id - name: users db_profile: db_flavor: tigergraph vertex_storage_names: diff --git 
a/examples/6-ingest-rdf/generated-manifest.yaml b/examples/6-ingest-rdf/generated-manifest.yaml index 54151643..aa752dc1 100644 --- a/examples/6-ingest-rdf/generated-manifest.yaml +++ b/examples/6-ingest-rdf/generated-manifest.yaml @@ -1,4 +1,4 @@ -graph: +core_schema: edge_config: edges: - relation: authorOf @@ -12,40 +12,40 @@ graph: target: Publication vertex_config: vertices: - - fields: - - name: _key - - name: _uri - - name: title - - name: year - - name: doi - identity: + - identity: - _key - _uri - - title - - year - - doi - name: Publication - - fields: + - fullName + - orcid + name: Researcher + properties: - name: _key - name: _uri - - name: instName - - name: country - identity: + - name: fullName + - name: orcid + - identity: - _key - _uri - instName - country name: Institution - - fields: + properties: - name: _key - name: _uri - - name: fullName - - name: orcid - identity: + - name: instName + - name: country + - identity: - _key - _uri - - fullName - - orcid - name: Researcher + - title + - year + - doi + name: Publication + properties: + - name: _key + - name: _uri + - name: title + - name: year + - name: doi metadata: name: academic_kg diff --git a/examples/6-ingest-rdf/generated-schema.yaml b/examples/6-ingest-rdf/generated-schema.yaml index 54151643..dbb1f233 100644 --- a/examples/6-ingest-rdf/generated-schema.yaml +++ b/examples/6-ingest-rdf/generated-schema.yaml @@ -12,7 +12,7 @@ graph: target: Publication vertex_config: vertices: - - fields: + - properties: - name: _key - name: _uri - name: title @@ -25,7 +25,7 @@ graph: - year - doi name: Publication - - fields: + - properties: - name: _key - name: _uri - name: instName @@ -36,7 +36,7 @@ graph: - instName - country name: Institution - - fields: + - properties: - name: _key - name: _uri - name: fullName diff --git a/examples/7-objects-relations/manifest.yaml b/examples/7-objects-relations/manifest.yaml index 11f1a374..105d8137 100644 --- a/examples/7-objects-relations/manifest.yaml +++ 
b/examples/7-objects-relations/manifest.yaml @@ -5,7 +5,7 @@ schema: vertex_config: vertices: - name: person - fields: + properties: - id - name - age @@ -16,7 +16,7 @@ schema: identity: - id - name: vehicle - fields: + properties: - id - name - license_plate @@ -27,7 +27,7 @@ schema: identity: - id - name: institution - fields: + properties: - id - name - email diff --git a/examples/8-multi-edges-weights/manifest.yaml b/examples/8-multi-edges-weights/manifest.yaml index 3265194e..f8f9eafd 100644 --- a/examples/8-multi-edges-weights/manifest.yaml +++ b/examples/8-multi-edges-weights/manifest.yaml @@ -5,12 +5,12 @@ schema: vertex_config: vertices: - name: ticker - fields: + properties: - oftic identity: - oftic - name: metric - fields: + properties: - name - value identity: @@ -35,13 +35,8 @@ schema: edges: - source: ticker target: metric - weights: - direct: - - t_obs - vertices: - - name: metric - fields: - - name + properties: + - t_obs db_profile: vertex_storage_names: ticker: tickers @@ -111,5 +106,14 @@ ingestion_model: - transform: rename: ticker: oftic + - vertex: ticker - vertex: metric + - source: ticker + target: metric + properties: + - t_obs + vertex_weights: + - name: metric + fields: + - name bindings: {} diff --git a/graflo/architecture/contract/declarations/edge_derivation_registry.py b/graflo/architecture/contract/declarations/edge_derivation_registry.py new file mode 100644 index 00000000..377c9e0d --- /dev/null +++ b/graflo/architecture/contract/declarations/edge_derivation_registry.py @@ -0,0 +1,64 @@ +"""Runtime registry for ingestion-only edge derivation (per resource).""" + +from __future__ import annotations + +import json +from typing import Any + +from graflo.architecture.graph_types import EdgeId, Weight + + +class EdgeDerivationRegistry: + """Mutable store for ingestion-time edge behavior keyed by :class:`EdgeId`. 
+ + Lives under the ingestion layer (typically one instance per :class:`Resource`), + not on :class:`~graflo.architecture.schema.core.CoreSchema`. + """ + + def __init__(self) -> None: + self._relation_from_key: dict[EdgeId, bool] = {} + self._vertex_weights: dict[EdgeId, list[Weight]] = {} + + def mark_relation_from_key(self, edge_id: EdgeId) -> None: + self._relation_from_key[edge_id] = True + + def uses_relation_from_key(self, edge_id: EdgeId) -> bool: + return self._relation_from_key.get(edge_id, False) + + def merge_vertex_weights(self, edge_id: EdgeId, rules: list[Weight]) -> None: + """Append vertex weight rules for *edge_id*, deduplicating by stable fingerprint.""" + if not rules: + return + bucket = self._vertex_weights.setdefault(edge_id, []) + seen = {_weight_fingerprint(w) for w in bucket} + for w in rules: + fp = _weight_fingerprint(w) + if fp in seen: + continue + seen.add(fp) + bucket.append(w) + + def vertex_weights_for(self, edge_id: EdgeId) -> list[Weight]: + return list(self._vertex_weights.get(edge_id, ())) + + def copy(self) -> EdgeDerivationRegistry: + out = EdgeDerivationRegistry() + out._relation_from_key = dict(self._relation_from_key) + out._vertex_weights = { + k: [w.model_copy(deep=True) for w in v] + for k, v in self._vertex_weights.items() + } + return out + + def merge_from(self, other: EdgeDerivationRegistry) -> None: + for eid, flag in other._relation_from_key.items(): + if flag: + self.mark_relation_from_key(eid) + for eid, weights in other._vertex_weights.items(): + self.merge_vertex_weights(eid, weights) + + +def _weight_fingerprint(w: Weight) -> str: + """JSON-stable fingerprint for deduplication.""" + payload: dict[str, Any] = w.model_dump(mode="json") + return json.dumps(payload, sort_keys=True, default=str) diff --git a/graflo/architecture/contract/declarations/ingestion_model/model.py b/graflo/architecture/contract/declarations/ingestion_model/model.py index ddc39073..0f69a811 100644 --- 
a/graflo/architecture/contract/declarations/ingestion_model/model.py +++ b/graflo/architecture/contract/declarations/ingestion_model/model.py @@ -3,12 +3,14 @@ from __future__ import annotations from collections import Counter -from typing import TYPE_CHECKING +from typing import TYPE_CHECKING, Literal from pydantic import Field as PydanticField, PrivateAttr, model_validator from graflo.architecture.base import ConfigBaseModel +from graflo.onto import DBType +from ..edge_derivation_registry import EdgeDerivationRegistry from ..resource import Resource from ..transform import ProtoTransform @@ -19,6 +21,16 @@ class IngestionModel(ConfigBaseModel): """Ingestion model (C): resources and transform registry.""" + edges_on_duplicate: Literal["ignore", "upsert"] = PydanticField( + default="ignore", + description=( + "How batch edge writes tolerate an already-matching edge. Passed through to " + ":meth:`~graflo.db.conn.Connection.insert_edges_batch` where the target backend " + "supports it. Today ArangoDB maps ``ignore`` to INSERT with ignoreErrors and " + "``upsert`` to AQL UPSERT (with schema merge keys as ``uniq_weight_fields`` when " + "present). Other databases may interpret the same values later." 
+ ), + ) resources: list[Resource] = PydanticField( default_factory=list, description="List of resource definitions (data pipelines mapping to vertices/edges).", @@ -30,6 +42,9 @@ class IngestionModel(ConfigBaseModel): _resources: dict[str, Resource] = PrivateAttr() _transforms: dict[str, ProtoTransform] = PrivateAttr(default_factory=dict) + _combined_edge_derivation: EdgeDerivationRegistry = PrivateAttr( + default_factory=EdgeDerivationRegistry + ) @model_validator(mode="after") def _init_model(self) -> IngestionModel: @@ -74,6 +89,7 @@ def finish_init( strict_references: bool = False, dynamic_edge_feedback: bool = False, allowed_vertex_names: set[str] | None = None, + target_db_flavor: DBType | None = None, ) -> None: """Initialize resources against graph model and transform library.""" self._rebuild_runtime_state() @@ -85,6 +101,7 @@ def finish_init( strict_references=strict_references, dynamic_edge_feedback=dynamic_edge_feedback, allowed_vertex_names=allowed_vertex_names, + target_db_flavor=target_db_flavor, ) def _rebuild_runtime_state(self) -> None: diff --git a/graflo/architecture/contract/declarations/resource.py b/graflo/architecture/contract/declarations/resource.py index d1310743..1151a92a 100644 --- a/graflo/architecture/contract/declarations/resource.py +++ b/graflo/architecture/contract/declarations/resource.py @@ -40,10 +40,13 @@ EdgeId, EncodingType, GraphEntity, + Weight, ) from graflo.architecture.schema.edge import Edge, EdgeConfig from graflo.architecture.schema.vertex import VertexConfig +from graflo.onto import DBType +from .edge_derivation_registry import EdgeDerivationRegistry from .transform import ProtoTransform if TYPE_CHECKING: @@ -143,6 +146,39 @@ def matches(self, edge_id: EdgeId) -> bool: ) +class ResourceExtraWeightEntry(ConfigBaseModel): + """Schema edge plus optional vertex-derived weight rules for DB enrichment.""" + + edge: Edge + vertex_weights: list[Weight] = PydanticField(default_factory=list) + + 
@model_validator(mode="before") + @classmethod + def _from_yaml(cls, data: Any) -> Any: + if data is None: + return data + if isinstance(data, Edge): + return {"edge": data, "vertex_weights": []} + if not isinstance(data, dict): + raise TypeError( + f"extra_weights item must be dict or Edge, got {type(data)}" + ) + d = dict(data) + vw_raw = d.pop("vertex_weights", None) or [] + if not isinstance(vw_raw, list): + vw_raw = [vw_raw] + v_w = [Weight.model_validate(x) for x in vw_raw] + if "edge" in d and isinstance(d["edge"], dict): + edge = Edge.model_validate(dict(d.pop("edge"))) + if d: + raise ValueError( + f"extra_weights entry has unexpected keys with 'edge': {sorted(d)}" + ) + return {"edge": edge, "vertex_weights": v_w} + edge = Edge.model_validate(d) + return {"edge": edge, "vertex_weights": v_w} + + class Resource(ConfigBaseModel): """Resource configuration and processing. @@ -153,6 +189,9 @@ class Resource(ConfigBaseModel): Dynamic vertex-type routing is handled by ``vertex_router`` steps in the pipeline (see :class:`~graflo.architecture.pipeline.runtime.actor.VertexRouterActor`). + Per-row relationship labels and location matching for edges belong on + ``edge`` pipeline steps (:class:`~graflo.architecture.edge_derivation.EdgeDerivation`), + not on ``Resource``. 
""" model_config = {"extra": "forbid"} @@ -175,9 +214,9 @@ class Resource(ConfigBaseModel): default_factory=list, description="List of collection names to merge when writing to the graph.", ) - extra_weights: list[Edge] = PydanticField( + extra_weights: list[ResourceExtraWeightEntry] = PydanticField( default_factory=list, - description="Additional edge weight configurations for this resource.", + description="Additional edge attribute / vertex-weight enrichment for this resource.", ) types: dict[str, str] = PydanticField( default_factory=dict, @@ -218,6 +257,7 @@ class Resource(ConfigBaseModel): _edge_config: EdgeConfig = PrivateAttr() _executor: ActorExecutor = PrivateAttr() _initialized: bool = PrivateAttr(default=False) + _edge_derivation_registry: EdgeDerivationRegistry | None = PrivateAttr(default=None) @model_validator(mode="after") def _build_root_and_types(self) -> Resource: @@ -292,6 +332,7 @@ def finish_init( strict_references: bool = False, dynamic_edge_feedback: bool = False, allowed_vertex_names: set[str] | None = None, + target_db_flavor: DBType | None = None, ) -> None: """Complete resource initialization. @@ -302,6 +343,7 @@ def finish_init( vertex_config: Configuration for vertices edge_config: Configuration for edges transforms: Dictionary of available transforms + target_db_flavor: Target graph DB flavor (for ingestion-time defaults, e.g. TigerGraph). 
""" self._rebuild_runtime( vertex_config=vertex_config, @@ -310,6 +352,7 @@ def finish_init( strict_references=strict_references, dynamic_edge_feedback=dynamic_edge_feedback, allowed_vertex_names=allowed_vertex_names, + target_db_flavor=target_db_flavor, ) def _edge_ids_from_edge_actors(self) -> set[EdgeId]: @@ -361,6 +404,7 @@ def _rebuild_runtime( strict_references: bool = False, dynamic_edge_feedback: bool = False, allowed_vertex_names: set[str] | None = None, + target_db_flavor: DBType | None = None, ) -> None: """Rebuild runtime actor initialization state from typed context.""" # Keep the full schema vertex_config for correctness validations, but @@ -386,16 +430,21 @@ def _rebuild_runtime( from graflo.architecture.pipeline.runtime.actor import ActorInitContext + edge_derivation_registry = EdgeDerivationRegistry() + object.__setattr__(self, "_edge_derivation_registry", edge_derivation_registry) + logger.debug("total resource actor count : %s", self.root.count()) init_ctx = ActorInitContext( vertex_config=runtime_vertex_config, edge_config=self._edge_config, transforms=transforms, + edge_derivation=edge_derivation_registry, allowed_vertex_names=allowed_vertex_names, infer_edges=self.infer_edges, infer_edge_only={spec.edge_id for spec in self.infer_edge_only}, infer_edge_except=infer_edge_except, strict_references=strict_references, + target_db_flavor=target_db_flavor, ) self.root.finish_init(init_ctx=init_ctx) object.__setattr__(self, "_initialized", True) @@ -414,8 +463,11 @@ def _rebuild_runtime( logger.debug("total resource actor count (after finit): %s", self.root.count()) - for e in self.extra_weights: - e.finish_init(vertex_config) + reg = self._edge_derivation_registry + for entry in self.extra_weights: + entry.edge.finish_init(vertex_config) + if reg is not None and entry.vertex_weights: + reg.merge_vertex_weights(entry.edge.edge_id, entry.vertex_weights) def __call__(self, doc: dict) -> defaultdict[GraphEntity, list]: """Process a document through the 
resource pipeline. diff --git a/graflo/architecture/contract/manifest.py b/graflo/architecture/contract/manifest.py index 9843fc0f..93d61cbc 100644 --- a/graflo/architecture/contract/manifest.py +++ b/graflo/architecture/contract/manifest.py @@ -65,6 +65,7 @@ def finish_init( self.graph_schema.core_schema, strict_references=strict_references, dynamic_edge_feedback=dynamic_edge_feedback, + target_db_flavor=self.graph_schema.db_profile.db_flavor, ) def require_schema(self) -> Schema: diff --git a/graflo/architecture/database_features.py b/graflo/architecture/database_features.py index 65d0f7ef..eea11f60 100644 --- a/graflo/architecture/database_features.py +++ b/graflo/architecture/database_features.py @@ -56,6 +56,14 @@ class DatabaseProfile(ConfigBaseModel): default=DBType.ARANGO, description="Target DB flavor used for physical naming and defaults.", ) + target_namespace: str | None = PydanticField( + default=None, + description=( + "Runtime target LPG namespace when the connection config leaves it unset: " + "Arango/Neo4j/FalkorDB/Memgraph database, TigerGraph graph name, Nebula space. " + "GraphEngine uses this before falling back to schema.metadata.name." + ), + ) vertex_storage_names: dict[str, str] = PydanticField( default_factory=dict, description="Physical vertex collection/label names keyed by logical vertex name.", diff --git a/graflo/architecture/edge_derivation.py b/graflo/architecture/edge_derivation.py new file mode 100644 index 00000000..a873a3cf --- /dev/null +++ b/graflo/architecture/edge_derivation.py @@ -0,0 +1,66 @@ +"""Ingestion-time wiring: how an edge step binds to extracted locations and documents. + +These fields do **not** belong in schema ``edge_config`` / :class:`~graflo.architecture.schema.edge.Edge`. 
+They are set on edge pipeline steps (:class:`~graflo.architecture.pipeline.runtime.actor.config.models.EdgeActorConfig`), +threaded through :class:`~graflo.architecture.graph_types.EdgeIntent`, and used in +:func:`~graflo.architecture.pipeline.runtime.actor.edge_render.render_edge`. + +When :attr:`EdgeDerivation.relation_from_key` is true, the ingestion +:class:`~graflo.architecture.contract.declarations.edge_derivation_registry.EdgeDerivationRegistry` +records the edge id so :class:`~graflo.architecture.schema.db_aware.EdgeConfigDBAware` (with overlay) +can align TigerGraph DDL with runtime. +""" + +from __future__ import annotations + +from pydantic import Field + +from graflo.architecture.base import ConfigBaseModel + + +class EdgeDerivation(ConfigBaseModel): + """How this edge step selects vertex locations and reads per-row relation from data.""" + + match_source: str | None = Field( + default=None, + description="Require this path segment in source vertex locations.", + ) + match_target: str | None = Field( + default=None, + description="Require this path segment in target vertex locations.", + ) + exclude_source: str | None = Field( + default=None, + description="Exclude source locations containing this path segment.", + ) + exclude_target: str | None = Field( + default=None, + description="Exclude target locations containing this path segment.", + ) + match: str | None = Field( + default=None, + description="Require this segment in both source and target locations.", + ) + relation_field: str | None = Field( + default=None, + description="Document/ctx field name for per-row relationship label when schema relation is unset.", + ) + relation_from_key: bool = Field( + default=False, + description="If True, derive the per-row relation label from the location key during assembly.", + ) + + def is_empty(self) -> bool: + if self.relation_from_key: + return False + return all( + getattr(self, name) is None + for name in ( + "match_source", + "match_target", + 
"exclude_source", + "exclude_target", + "match", + "relation_field", + ) + ) diff --git a/graflo/architecture/graph_types.py b/graflo/architecture/graph_types.py index 29d57e33..1ff8e176 100644 --- a/graflo/architecture/graph_types.py +++ b/graflo/architecture/graph_types.py @@ -37,6 +37,7 @@ from pydantic import ConfigDict, Field, model_validator from graflo.architecture.base import ConfigBaseModel +from graflo.architecture.edge_derivation import EdgeDerivation from graflo.onto import BaseEnum from graflo.onto import DBType @@ -418,6 +419,7 @@ class EdgeIntent(ConfigBaseModel): edge: Any location: LocationIndex | None = None provenance: ProvenancePath | None = None + derivation: EdgeDerivation | None = None class LocationIndex(ConfigBaseModel): @@ -549,12 +551,19 @@ def record_transform_observation( ) ) - def record_edge_intent(self, *, edge: Any, location: LocationIndex) -> None: + def record_edge_intent( + self, + *, + edge: Any, + location: LocationIndex, + derivation: EdgeDerivation | None = None, + ) -> None: self.edge_intents.append( EdgeIntent( edge=edge, location=location, provenance=ProvenancePath.from_lindex(location), + derivation=derivation, ) ) diff --git a/graflo/architecture/pipeline/runtime/actor/base.py b/graflo/architecture/pipeline/runtime/actor/base.py index 5de03d09..53ed3ee4 100644 --- a/graflo/architecture/pipeline/runtime/actor/base.py +++ b/graflo/architecture/pipeline/runtime/actor/base.py @@ -5,10 +5,14 @@ from abc import ABC, abstractmethod from dataclasses import dataclass, field +from graflo.architecture.contract.declarations.edge_derivation_registry import ( + EdgeDerivationRegistry, +) from graflo.architecture.schema.edge import EdgeConfig from graflo.architecture.graph_types import EdgeId, ExtractionContext, LocationIndex from graflo.architecture.contract.declarations.transform import ProtoTransform from graflo.architecture.schema.vertex import VertexConfig +from graflo.onto import DBType class ActorConstants: @@ -25,11 +29,15 @@ 
class ActorInitContext: vertex_config: VertexConfig edge_config: EdgeConfig transforms: dict[str, ProtoTransform] + edge_derivation: EdgeDerivationRegistry = field( + default_factory=EdgeDerivationRegistry + ) allowed_vertex_names: set[str] | None = None infer_edges: bool = True infer_edge_only: set[EdgeId] = field(default_factory=set) infer_edge_except: set[EdgeId] = field(default_factory=set) strict_references: bool = False + target_db_flavor: DBType | None = None class Actor(ABC): diff --git a/graflo/architecture/pipeline/runtime/actor/config/models.py b/graflo/architecture/pipeline/runtime/actor/config/models.py index c4559aa5..11cc9d78 100644 --- a/graflo/architecture/pipeline/runtime/actor/config/models.py +++ b/graflo/architecture/pipeline/runtime/actor/config/models.py @@ -7,8 +7,8 @@ from pydantic import Field as PydanticField, TypeAdapter, model_validator from graflo.architecture.base import ConfigBaseModel -from graflo.architecture.schema.edge import EdgeBase from graflo.architecture.contract.declarations.transform import DressConfig +from graflo.architecture.edge_derivation import EdgeDerivation from .normalize import normalize_actor_step @@ -299,8 +299,8 @@ def validate_target(self) -> "TransformCallConfig": return self -class EdgeActorConfig(EdgeBase): - """Configuration for an EdgeActor.""" +class EdgeActorConfig(ConfigBaseModel): + """Configuration for an EdgeActor (logical edge + ingestion derivation; flat YAML).""" type: Literal["edge"] = PydanticField( default="edge", description="Actor type discriminator" @@ -309,9 +309,63 @@ class EdgeActorConfig(EdgeBase): ..., alias="from", description="Source vertex type name" ) target: str = PydanticField(..., alias="to", description="Target vertex type name") - weights: dict[str, list[str]] | None = PydanticField( - default=None, description="Weight configuration" + relation: str | None = PydanticField( + default=None, + description="Optional fixed logical relation / edge type name.", + ) + 
relation_from_key: bool = PydanticField( + default=False, + description="Ingestion: derive per-row relation label from the location key during assembly.", + ) + description: str | None = PydanticField( + default=None, + description="Optional semantic description (merged into schema Edge).", + ) + relation_field: str | None = PydanticField( + default=None, + description="Ingestion: document field name for per-row relationship type.", + ) + match_source: str | None = PydanticField( + default=None, + description="Ingestion: require this path segment in source locations.", + ) + match_target: str | None = PydanticField( + default=None, + description="Ingestion: require this path segment in target locations.", + ) + exclude_source: str | None = PydanticField( + default=None, + description="Ingestion: exclude source locations containing this segment.", ) + exclude_target: str | None = PydanticField( + default=None, + description="Ingestion: exclude target locations containing this segment.", + ) + match: str | None = PydanticField( + default=None, + description="Ingestion: require this segment on both source and target locations.", + ) + properties: list[Any] = PydanticField( + default_factory=list, + description="Edge properties merged into schema Edge (same forms as Edge.properties).", + ) + vertex_weights: list[Any] = PydanticField( + default_factory=list, + description="Vertex-derived weight rules registered in EdgeDerivationRegistry.", + ) + + @property + def derivation(self) -> EdgeDerivation: + """Normalized ingestion-only fields for assembly/render.""" + return EdgeDerivation( + match_source=self.match_source, + match_target=self.match_target, + exclude_source=self.exclude_source, + exclude_target=self.exclude_target, + match=self.match, + relation_field=self.relation_field, + relation_from_key=self.relation_from_key, + ) @model_validator(mode="before") @classmethod diff --git a/graflo/architecture/pipeline/runtime/actor/edge.py 
b/graflo/architecture/pipeline/runtime/actor/edge.py index 61500c22..8cd04f4f 100644 --- a/graflo/architecture/pipeline/runtime/actor/edge.py +++ b/graflo/architecture/pipeline/runtime/actor/edge.py @@ -6,38 +6,69 @@ from .base import Actor, ActorInitContext from .config import EdgeActorConfig +from graflo.architecture.edge_derivation import EdgeDerivation from graflo.architecture.schema.edge import Edge -from graflo.architecture.graph_types import ExtractionContext, LocationIndex +from graflo.architecture.graph_types import ExtractionContext, LocationIndex, Weight class EdgeActor(Actor): """Actor for processing edge data.""" def __init__(self, config: EdgeActorConfig): - kwargs = config.model_dump(by_alias=False, exclude_none=True) - kwargs.pop("type", None) - self.edge = Edge.from_dict(kwargs) + self.derivation: EdgeDerivation = config.derivation + self._pending_vertex_weights: list[Weight] = [] + payload: dict[str, Any] = { + "source": config.source, + "target": config.target, + } + if config.relation is not None: + payload["relation"] = config.relation + if config.description is not None: + payload["description"] = config.description + if config.properties: + payload["properties"] = config.properties + for item in config.vertex_weights: + self._pending_vertex_weights.append(Weight.model_validate(item)) + self.edge = Edge.from_dict(payload) self.vertex_config: Any = None self.allowed_vertex_names: set[str] | None = None + @property + def relation_field(self) -> str | None: + """Alias for tooling (e.g. 
plot labels).""" + return self.derivation.relation_field + @classmethod def from_config(cls, config: EdgeActorConfig) -> EdgeActor: return cls(config) def fetch_important_items(self) -> dict[str, Any]: return { - k: self.edge.__dict__[k] - for k in ["source", "target", "match_source", "match_target"] - if k in self.edge.__dict__ + k: v + for k, v in { + "source": self.edge.source, + "target": self.edge.target, + "match_source": self.derivation.match_source, + "match_target": self.derivation.match_target, + }.items() + if v is not None } def finish_init(self, init_ctx: ActorInitContext) -> None: self.vertex_config = init_ctx.vertex_config self.allowed_vertex_names = init_ctx.allowed_vertex_names if self.vertex_config is not None: + edge_id = self.edge.edge_id init_ctx.edge_config.update_edges( self.edge, vertex_config=self.vertex_config ) + if self.derivation.relation_from_key: + init_ctx.edge_derivation.mark_relation_from_key(edge_id) + if self._pending_vertex_weights: + init_ctx.edge_derivation.merge_vertex_weights( + edge_id, self._pending_vertex_weights + ) + self.edge = init_ctx.edge_config.edge_for(edge_id) def __call__( self, ctx: ExtractionContext, lindex: LocationIndex, *nargs: Any, **kwargs: Any @@ -59,7 +90,12 @@ def __call__( return ctx ctx.edge_requests.append((self.edge, lindex)) - ctx.record_edge_intent(edge=self.edge, location=lindex) + der = None if self.derivation.is_empty() else self.derivation + ctx.record_edge_intent( + edge=self.edge, + location=lindex, + derivation=der, + ) return ctx def references_vertices(self) -> set[str]: diff --git a/graflo/architecture/pipeline/runtime/actor/edge_render.py b/graflo/architecture/pipeline/runtime/actor/edge_render.py index 90097c31..67cbad3c 100644 --- a/graflo/architecture/pipeline/runtime/actor/edge_render.py +++ b/graflo/architecture/pipeline/runtime/actor/edge_render.py @@ -8,6 +8,7 @@ from itertools import combinations, product, zip_longest from typing import Any, Callable, Iterable, Iterator +from 
graflo.architecture.edge_derivation import EdgeDerivation from graflo.architecture.schema.edge import Edge from graflo.architecture.graph_types import ( ActionContext, @@ -16,6 +17,7 @@ LocationIndex, TransformPayload, VertexRep, + Weight, ) from graflo.architecture.util import project_dict from graflo.architecture.schema.vertex import VertexConfig @@ -33,7 +35,7 @@ def add_blank_collections( for vname in vertex_conf.blank_vertices: v = vertex_conf[vname] for item in buffer_transforms: - prep_doc = {f: item[f] for f in v.field_names if f in item} + prep_doc = {f: item[f] for f in v.property_names if f in item} if vname not in ctx.acc_global: ctx.acc_global[vname] = [prep_doc] return ctx @@ -99,22 +101,23 @@ def count_unique_by_position_variable(tuples_list: list, fillvalue: Any = None) def _filter_source_target_lindexes( - edge: Edge, + derivation: EdgeDerivation | None, source_locs: list[LocationIndex], target_locs: list[LocationIndex], ) -> tuple[list[LocationIndex], list[LocationIndex]]: - """Apply match/exclude filters from edge config to source and target locations.""" - if edge.match_source is not None: - source_locs = [loc for loc in source_locs if edge.match_source in loc] - if edge.exclude_source is not None: - source_locs = [loc for loc in source_locs if edge.exclude_source not in loc] - if edge.match_target is not None: - target_locs = [loc for loc in target_locs if edge.match_target in loc] - if edge.exclude_target is not None: - target_locs = [loc for loc in target_locs if edge.exclude_target not in loc] - if edge.match is not None: - source_locs = [loc for loc in source_locs if edge.match in loc] - target_locs = [loc for loc in target_locs if edge.match in loc] + """Apply match/exclude filters from ingestion derivation to source/target locations.""" + d = derivation or EdgeDerivation() + if d.match_source is not None: + source_locs = [loc for loc in source_locs if d.match_source in loc] + if d.exclude_source is not None: + source_locs = [loc for loc 
in source_locs if d.exclude_source not in loc] + if d.match_target is not None: + target_locs = [loc for loc in target_locs if d.match_target in loc] + if d.exclude_target is not None: + target_locs = [loc for loc in target_locs if d.exclude_target not in loc] + if d.match is not None: + source_locs = [loc for loc in source_locs if d.match in loc] + target_locs = [loc for loc in target_locs if d.match in loc] return source_locs, target_locs @@ -162,7 +165,7 @@ def _compute_location_groups( def _iter_emitter_receiver_group_pairs( source_groups: list[list[LocationIndex]], target_groups: list[list[LocationIndex]], - edge: Edge, + derivation: EdgeDerivation | None, source_name: str, target_name: str, ) -> Iterator[tuple[list[LocationIndex], list[LocationIndex]]]: @@ -171,7 +174,8 @@ def _iter_emitter_receiver_group_pairs( yield from zip(source_groups, target_groups) return - if edge.match_source is not None and edge.match_target is not None: + d = derivation or EdgeDerivation() + if d.match_source is not None and d.match_target is not None: yield from zip(source_groups, target_groups) return @@ -225,8 +229,17 @@ def render_edge( vertex_config: VertexConfig, ctx: AssemblyContext | ActionContext, lindex: LocationIndex | None = None, + *, + relation_input_field: str | None = None, + derivation: EdgeDerivation | None = None, ) -> defaultdict[str | None, list]: - """Create edges between source and target vertices.""" + """Create edges between source and target vertices. + + Args: + relation_input_field: Document/ctx field for per-row relationship labels when + ``edge.relation`` is unset (e.g. TigerGraph default column). + derivation: Ingestion-only location / field wiring (edge pipeline step). 
+ """ acc_vertex = ctx.acc_vertex buffer_transforms = ctx.buffer_transforms source = edge.source @@ -248,7 +261,7 @@ def render_edge( target_locs = sorted(lindex.filter(target_locs)) source_locs, target_locs = _filter_source_target_lindexes( - edge, source_locs, target_locs + derivation, source_locs, target_locs ) if not (source_locs and target_locs): @@ -283,7 +296,7 @@ def render_edge( ) for source_group, target_group in _iter_emitter_receiver_group_pairs( - source_groups, target_groups, edge, source, target + source_groups, target_groups, derivation, source, target ): for source_loc in source_group: source_items = source_dressed[source_loc] @@ -305,43 +318,50 @@ def render_edge( v_doc = v_rep.vertex weight: dict[str, Any] = {} - if edge.weights is not None: - for field in edge.weights.direct: + if edge.properties: + for field in edge.properties: field_name = field.name - if field in u_rep.ctx: - weight[field_name] = u_rep.ctx[field] - if field in v_rep.ctx: - weight[field_name] = v_rep.ctx[field] - if field in u_tr: - weight[field_name] = u_tr[field] - if field in v_tr: - weight[field_name] = v_tr[field] + # Direct weights may live on observation ctx (row leftovers) or on + # merged vertex docs (passthrough fields merged in VertexActor). 
+ if field_name in u_rep.ctx: + weight[field_name] = u_rep.ctx[field_name] + if field_name in v_rep.ctx: + weight[field_name] = v_rep.ctx[field_name] + if field_name in u_doc: + weight[field_name] = u_doc[field_name] + if field_name in v_doc: + weight[field_name] = v_doc[field_name] + if field_name in u_tr: + weight[field_name] = u_tr[field_name] + if field_name in v_tr: + weight[field_name] = v_tr[field_name] source_proj = project_dict(u_doc, source_identity) target_proj = project_dict(v_doc, target_identity) extracted_relation = None - if edge.relation_field is not None: - u_relation = u_rep.ctx.pop(edge.relation_field, None) + if relation_input_field is not None: + u_relation = u_rep.ctx.pop(relation_input_field, None) if u_relation is None: - v_relation = v_rep.ctx.pop(edge.relation_field, None) + v_relation = v_rep.ctx.pop(relation_input_field, None) if v_relation is not None: source_proj, target_proj = target_proj, source_proj extracted_relation = v_relation else: extracted_relation = u_relation - if ( - extracted_relation is None - and edge.relation_from_key - and len(target_loc) > 1 - ): + use_key = ( + derivation.relation_from_key + if derivation is not None + else False + ) + if extracted_relation is None and use_key and len(target_loc) > 1: extracted_relation = _extract_relation_from_key( source_loc, target_loc, source_min_depth, target_min_depth ) - if edge.relation_from_key and extracted_relation is None: + if use_key and extracted_relation is None: continue relation = ( @@ -358,9 +378,11 @@ def render_weights( vertex_config: VertexConfig, acc_vertex: defaultdict[str, defaultdict[LocationIndex, list]], edges: defaultdict[str | None, list], + *, + vertex_weights: list[Weight] | None = None, ) -> defaultdict[str | None, list]: """Process and apply weights to edge documents.""" - vertex_weights = [] if edge.weights is None else edge.weights.vertices + vertex_weights = vertex_weights or [] weights: list = [] for w in vertex_weights: diff --git 
a/graflo/architecture/pipeline/runtime/actor/edge_router.py b/graflo/architecture/pipeline/runtime/actor/edge_router.py index 5739b000..8492e631 100644 --- a/graflo/architecture/pipeline/runtime/actor/edge_router.py +++ b/graflo/architecture/pipeline/runtime/actor/edge_router.py @@ -7,6 +7,7 @@ from .base import Actor, ActorInitContext from .config import EdgeRouterActorConfig +from graflo.architecture.edge_derivation import EdgeDerivation from graflo.architecture.schema.edge import Edge, EdgeConfig from graflo.architecture.graph_types import ExtractionContext, LocationIndex, VertexRep from graflo.architecture.schema.vertex import VertexConfig @@ -181,7 +182,12 @@ def __call__( edge = self._get_or_create_edge(source_name, target_name, relation) ctx.edge_requests.append((edge, lindex)) - ctx.record_edge_intent(edge=edge, location=lindex) + router_derivation = EdgeDerivation(relation_field=self.relation_field) + ctx.record_edge_intent( + edge=edge, + location=lindex, + derivation=None if router_derivation.is_empty() else router_derivation, + ) return ctx def references_vertices(self) -> set[str]: diff --git a/graflo/architecture/pipeline/runtime/actor/vertex.py b/graflo/architecture/pipeline/runtime/actor/vertex.py index 017d306e..4c395fce 100644 --- a/graflo/architecture/pipeline/runtime/actor/vertex.py +++ b/graflo/architecture/pipeline/runtime/actor/vertex.py @@ -139,7 +139,7 @@ def __call__( ): return ctx - vertex_keys_list = self.vertex_config.fields_names(self.name) + vertex_keys_list = self.vertex_config.property_names(self.name) vertex_keys: tuple[str, ...] 
= tuple(vertex_keys_list) agg = [] diff --git a/graflo/architecture/pipeline/runtime/actor/wrapper.py b/graflo/architecture/pipeline/runtime/actor/wrapper.py index 3f52fb8a..d269c017 100644 --- a/graflo/architecture/pipeline/runtime/actor/wrapper.py +++ b/graflo/architecture/pipeline/runtime/actor/wrapper.py @@ -31,6 +31,7 @@ LocationIndex, ) from graflo.architecture.schema.vertex import VertexConfig +from graflo.onto import DBType from graflo.util.merge import merge_doc_basis from graflo.util.transform import pick_unique_dict @@ -71,6 +72,10 @@ def infer_edge_only(self) -> set[EdgeId]: def infer_edge_except(self) -> set[EdgeId]: return self.init_ctx.infer_edge_except + @property + def target_db_flavor(self) -> DBType | None: + return self.init_ctx.target_db_flavor + def init_transforms(self, init_ctx: ActorInitContext) -> None: self.init_ctx = init_ctx self.actor.init_transforms(init_ctx) @@ -145,6 +150,8 @@ def assemble( infer_edges=self.infer_edges, infer_edge_only=self.infer_edge_only, infer_edge_except=self.infer_edge_except, + target_db_flavor=self.target_db_flavor, + edge_derivation=self.init_ctx.edge_derivation, ) for vertex_name, dd in assembly_ctx.acc_vertex.items(): diff --git a/graflo/architecture/pipeline/runtime/assemble.py b/graflo/architecture/pipeline/runtime/assemble.py index b0d8e20a..b0b83f4a 100644 --- a/graflo/architecture/pipeline/runtime/assemble.py +++ b/graflo/architecture/pipeline/runtime/assemble.py @@ -5,12 +5,37 @@ from typing import Any from .actor.edge_render import render_edge, render_weights -from graflo.architecture.schema.edge import EdgeConfig +from graflo.architecture.contract.declarations.edge_derivation_registry import ( + EdgeDerivationRegistry, +) +from graflo.architecture.schema.edge import ( + DEFAULT_TIGERGRAPH_RELATION_WEIGHTNAME, + Edge, + EdgeConfig, +) +from graflo.architecture.edge_derivation import EdgeDerivation from graflo.architecture.graph_types import AssemblyContext, EdgeId, LocationIndex from 
graflo.architecture.schema.vertex import VertexConfig +from graflo.onto import DBType from graflo.util.merge import merge_doc_basis +def _resolved_relation_input_field( + edge: Edge, + *, + derivation: EdgeDerivation | None, + target_db_flavor: DBType | None, +) -> str | None: + """Document/ctx field used to read per-row relation when schema relation is unset.""" + if edge.relation is not None: + return None + if derivation is not None and derivation.relation_field is not None: + return derivation.relation_field + if target_db_flavor == DBType.TIGERGRAPH: + return DEFAULT_TIGERGRAPH_RELATION_WEIGHTNAME + return None + + def _merge_vertices_for_edge( ctx: AssemblyContext, vertex_config: VertexConfig, source: str, target: str ) -> None: @@ -27,10 +52,25 @@ def _emit_edge_documents( vertex_config: VertexConfig, edge: Any, lindex: LocationIndex | None, + relation_input_field: str | None = None, + derivation: EdgeDerivation | None = None, + edge_derivation: EdgeDerivationRegistry | None = None, ) -> bool: _merge_vertices_for_edge(ctx, vertex_config, edge.source, edge.target) - edges = render_edge(edge=edge, vertex_config=vertex_config, ctx=ctx, lindex=lindex) - edges = render_weights(edge, vertex_config, ctx.acc_vertex, edges) + edges = render_edge( + edge=edge, + vertex_config=vertex_config, + ctx=ctx, + lindex=lindex, + relation_input_field=relation_input_field, + derivation=derivation, + ) + vertex_rules: list = [] + if edge_derivation is not None: + vertex_rules = edge_derivation.vertex_weights_for(edge.edge_id) + edges = render_weights( + edge, vertex_config, ctx.acc_vertex, edges, vertex_weights=vertex_rules + ) emitted = False for relation, edocs in edges.items(): if edocs: @@ -70,6 +110,8 @@ def assemble_edges( infer_edges: bool, infer_edge_only: set[EdgeId] | None = None, infer_edge_except: set[EdgeId] | None = None, + target_db_flavor: DBType | None = None, + edge_derivation: EdgeDerivationRegistry | None = None, ) -> None: """Assemble all edge documents after 
extraction finishes.""" if infer_edge_only is None: @@ -79,20 +121,43 @@ def assemble_edges( emitted_pairs: set[tuple[str, str]] = set() - explicit_requests: list[tuple[Any, LocationIndex | None]] = [ - (intent.edge, intent.location) for intent in ctx.edge_intents - ] - if not explicit_requests: - explicit_requests = list(ctx.edge_requests) - - for edge, lindex in explicit_requests: - if _emit_edge_documents( - ctx=ctx, - vertex_config=vertex_config, - edge=edge, - lindex=lindex, - ): - emitted_pairs.add((edge.source, edge.target)) + if ctx.edge_intents: + for intent in ctx.edge_intents: + edge = intent.edge + relation_input = _resolved_relation_input_field( + edge, + derivation=intent.derivation, + target_db_flavor=target_db_flavor, + ) + if _emit_edge_documents( + ctx=ctx, + vertex_config=vertex_config, + edge=edge, + lindex=intent.location, + relation_input_field=relation_input, + derivation=intent.derivation, + edge_derivation=edge_derivation, + ): + emitted_pairs.add((edge.source, edge.target)) + else: + for item in ctx.edge_requests: + edge = item[0] + lindex = item[1] + relation_input = _resolved_relation_input_field( + edge, + derivation=None, + target_db_flavor=target_db_flavor, + ) + if _emit_edge_documents( + ctx=ctx, + vertex_config=vertex_config, + edge=edge, + lindex=lindex, + relation_input_field=relation_input, + derivation=None, + edge_derivation=edge_derivation, + ): + emitted_pairs.add((edge.source, edge.target)) ctx.edge_requests = [] ctx.extraction.edge_intents = [] @@ -110,10 +175,18 @@ def assemble_edges( infer_edge_except=infer_edge_except, ): continue + relation_input = _resolved_relation_input_field( + edge, + derivation=None, + target_db_flavor=target_db_flavor, + ) if _emit_edge_documents( ctx=ctx, vertex_config=vertex_config, edge=edge, lindex=None, + relation_input_field=relation_input, + derivation=None, + edge_derivation=edge_derivation, ): emitted_pairs.add((s, t)) diff --git a/graflo/architecture/schema/core.py 
b/graflo/architecture/schema/core.py index 3c469385..008ed1ad 100644 --- a/graflo/architecture/schema/core.py +++ b/graflo/architecture/schema/core.py @@ -14,7 +14,7 @@ class CoreSchema(ConfigBaseModel): vertex_config: VertexConfig = PydanticField( ..., - description="Configuration for vertex collections (vertices, identities, fields).", + description="Configuration for vertex collections (vertices, identities, properties).", ) edge_config: EdgeConfig = PydanticField( ..., diff --git a/graflo/architecture/schema/db_aware.py b/graflo/architecture/schema/db_aware.py index 30198f78..4503abe4 100644 --- a/graflo/architecture/schema/db_aware.py +++ b/graflo/architecture/schema/db_aware.py @@ -7,10 +7,12 @@ from __future__ import annotations from dataclasses import dataclass -from typing import Iterator +from typing import Iterator, Protocol, runtime_checkable, Any + +from pydantic import Field as PydanticField, field_validator from graflo.architecture.database_features import DatabaseProfile -from graflo.architecture.graph_types import EdgeId, Index +from graflo.architecture.graph_types import EdgeId, Index, Weight from graflo.onto import DBType from .edge import ( @@ -18,9 +20,17 @@ DEFAULT_TIGERGRAPH_RELATION_WEIGHTNAME, Edge, EdgeConfig, - WeightConfig, + _normalize_direct_item, ) from .vertex import Field, FieldType, VertexConfig +from ..base import ConfigBaseModel + + +@runtime_checkable +class EdgeIngestionOverlay(Protocol): + """Ingestion-only signals that affect DB projection (e.g. TigerGraph DDL).""" + + def uses_relation_from_key(self, edge_id: EdgeId) -> bool: ... 
@dataclass(frozen=True) @@ -97,18 +107,73 @@ def identity_fields(self, vertex_name: str) -> list[str]: return ["_key"] if self.db_profile.db_flavor == DBType.ARANGO else ["id"] return identity - def fields(self, vertex_name: str) -> list[Field]: - fields = self.logical.fields(vertex_name) + def properties(self, vertex_name: str) -> list[Field]: + props = self.logical.properties(vertex_name) if self.db_profile.db_flavor != DBType.TIGERGRAPH: - return fields + return props # TigerGraph needs explicit scalar defaults for schema definition. return [ Field(name=f.name, type=FieldType.STRING if f.type is None else f.type) - for f in fields + for f in props ] - def fields_names(self, vertex_name: str) -> list[str]: - return [f.name for f in self.fields(vertex_name)] + def property_names(self, vertex_name: str) -> list[str]: + return [f.name for f in self.properties(vertex_name)] + + +class WeightConfig(ConfigBaseModel): + """Configuration for edge weights and relationships. + + This class manages the configuration of weights and relationships for edges, + including source and target field mappings. + + Attributes: + vertices: List of weight configurations + direct: List of direct field mappings. Can be specified as strings, Field objects, or dicts. + Will be normalized to Field objects by the validator. + After initialization, this is always list[Field] (type checker sees this). + + Examples: + >>> # List of strings + >>> wc1 = WeightConfig(direct=["date", "weight"]) + + >>> # Typed fields: list of Field objects + >>> wc2 = WeightConfig(direct=[ + ... Field(name="date", type="DATETIME"), + ... Field(name="weight", type="FLOAT") + ... ]) + + >>> # From dicts (e.g., from YAML/JSON) + >>> wc3 = WeightConfig(direct=[ + ... {"name": "date", "type": "DATETIME"}, + ... {"name": "weight"} # defaults to None type + ... 
]) + """ + + vertices: list[Weight] = PydanticField( + default_factory=list, + description="List of weight definitions for vertex-based edge attributes.", + ) + direct: list[Field] = PydanticField( + default_factory=list, + description="Direct edge attributes (field names, Field objects, or dicts). Normalized to Field objects.", + ) + + @field_validator("direct", mode="before") + @classmethod + def normalize_direct(cls, v: Any) -> Any: + if not isinstance(v, list): + return v + return [_normalize_direct_item(item) for item in v] + + @property + def direct_names(self) -> list[str]: + """Get list of direct field names (as strings). + + Returns: + list[str]: List of field names + """ + return [field.name for field in self.direct] class EdgeConfigDBAware: @@ -119,10 +184,17 @@ def __init__( logical: EdgeConfig, vertex_config: VertexConfigDBAware, database_features: DatabaseProfile, + ingestion_overlay: EdgeIngestionOverlay | None = None, ): self.logical = logical self.vertex_config = vertex_config self.db_profile = database_features + self.ingestion_overlay = ingestion_overlay + + def _uses_relation_from_key(self, edge_id: EdgeId) -> bool: + if self.ingestion_overlay is not None: + return self.ingestion_overlay.uses_relation_from_key(edge_id) + return False @property def edges(self) -> list[Edge]: @@ -151,54 +223,61 @@ def relation_dbname(self, edge: Edge) -> str | None: ) def effective_weights(self, edge: Edge) -> WeightConfig | None: - if self.db_profile.db_flavor != DBType.TIGERGRAPH: - return edge.weights - - relation_field = edge.relation_field - if relation_field is None and edge.relation_from_key: - relation_field = DEFAULT_TIGERGRAPH_RELATION_WEIGHTNAME + def _as_weight_config() -> WeightConfig | None: + if not edge.properties: + return None + return WeightConfig( + direct=[f.model_copy(deep=True) for f in edge.properties], + ) - if relation_field is None: - return edge.weights + if self.db_profile.db_flavor != DBType.TIGERGRAPH: + return _as_weight_config() - 
base = ( - edge.weights.model_copy(deep=True) - if edge.weights is not None - else WeightConfig() + # Typed TigerGraph edge: per-row relation label stored under a stable attribute. + needs_relation_attr = edge.relation is None or self._uses_relation_from_key( + edge.edge_id ) - if relation_field not in base.direct_names: - base.direct.append(Field(name=relation_field, type=FieldType.STRING)) + if not needs_relation_attr: + return _as_weight_config() + + base = _as_weight_config() or WeightConfig() + if DEFAULT_TIGERGRAPH_RELATION_WEIGHTNAME not in base.direct_names: + base.direct.append( + Field( + name=DEFAULT_TIGERGRAPH_RELATION_WEIGHTNAME, type=FieldType.STRING + ) + ) return base def runtime(self, edge: Edge) -> EdgeRuntime: + needs_tg_relation_attr = self.db_profile.db_flavor == DBType.TIGERGRAPH and ( + edge.relation is None or self._uses_relation_from_key(edge.edge_id) + ) runtime = EdgeRuntime( edge=edge, source_storage=self.vertex_config.vertex_dbname(edge.source), target_storage=self.vertex_config.vertex_dbname(edge.target), relation_name=self.relation_dbname(edge), - store_extracted_relation_as_weight=( - self.db_profile.db_flavor == DBType.TIGERGRAPH - ), + store_extracted_relation_as_weight=needs_tg_relation_attr, effective_relation_field=( - edge.relation_field - if edge.relation_field is not None - else ( - DEFAULT_TIGERGRAPH_RELATION_WEIGHTNAME - if self.db_profile.db_flavor == DBType.TIGERGRAPH - and edge.relation_from_key - else None - ) + DEFAULT_TIGERGRAPH_RELATION_WEIGHTNAME + if needs_tg_relation_attr + else None ), db_profile=self.db_profile, ) return runtime def relationship_merge_property_names(self, edge: Edge) -> list[str]: - """Relationship properties that distinguish parallel edges (Cypher MERGE, etc.). + """Relationship properties used for edge upsert/MERGE keys (per backend). - Uses the first logical ``identities`` key when present (endpoints omitted — - they are already matched on nodes). 
If that key yields no relationship - fields, or ``identities`` is empty, falls back to all direct weight names. + Uniqueness is ``(source_id, *identity_fields, target_id)`` for the **first** + logical ``identities`` key (endpoints are matched separately on vertices). + Additional ``identities`` keys are compiled into separate unique indexes + via :meth:`compile_identity_indexes` but do not change the writer merge key. + + If that key yields no relationship fields, or ``identities`` is empty, + falls back to all declared edge attribute names. """ db_flavor = self.db_profile.db_flavor if edge.identities: @@ -207,8 +286,8 @@ def relationship_merge_property_names(self, edge: Edge) -> list[str]: ) if props: return props - if edge.weights is not None and edge.weights.direct_names: - return list(edge.weights.direct_names) + if edge.property_names: + return list(edge.property_names) return [] @staticmethod @@ -239,9 +318,14 @@ def compile_identity_indexes(self) -> None: ) if not identity_fields: continue + fields, unique = self._normalize_edge_identity_index( + identity_fields, db_flavor + ) + if not fields: + continue self.db_profile.add_edge_index( edge.edge_id, - Index(fields=identity_fields, unique=True), + Index(fields=fields, unique=unique), purpose=None, ) @@ -267,6 +351,37 @@ def _identity_key_index_fields( deduped.append(field) return deduped + @staticmethod + def _normalize_edge_identity_index( + fields: list[str], db_flavor: DBType + ) -> tuple[list[str], bool]: + """Map logical edge identity to physical index fields and DB uniqueness. + + Logical uniqueness is always ``(source, *relationship_fields, target)``. + + * **ArangoDB** — Edge documents carry ``_from`` / ``_to``. Unique persistent + indexes must include them before other fields, even when the YAML + ``identities`` entry lists only relationship tokens (e.g. ``_role``). 
+ * **Neo4j, FalkorDB, Memgraph, Nebula** — Indexed columns are relationship / + edge-type properties only; they cannot express endpoint scope. We still + register the property fields for lookups but set ``unique=False`` so the + database is not asked to enforce a misleading global uniqueness on those + properties alone. (Application MERGE / ingest semantics remain authoritative.) + * **TigerGraph** — Edge secondary indexes are not applied by the driver today; + fields are kept for profiling; uniqueness is preserved for consistency. + """ + rest = [f for f in fields if f not in ("_from", "_to")] + if db_flavor == DBType.ARANGO: + return (["_from", "_to", *rest], True) + if db_flavor in ( + DBType.NEO4J, + DBType.FALKORDB, + DBType.MEMGRAPH, + DBType.NEBULA, + ): + return (fields, False) + return (fields, True) + @dataclass(frozen=True) class SchemaDBAware: diff --git a/graflo/architecture/schema/edge.py b/graflo/architecture/schema/edge.py index b6e895f4..e8e0dfae 100644 --- a/graflo/architecture/schema/edge.py +++ b/graflo/architecture/schema/edge.py @@ -5,10 +5,10 @@ The module supports both ArangoDB and Neo4j through the DBType enum. Key Components: - - EdgeBase: Shared base for edge-like configs (Edge and EdgeActorConfig) - - Edge: Represents an edge with its source, target, and configuration + - Edge: Abstract graph edge kind (schema / ``edge_config`` only) + - EdgeDerivation: Ingestion wiring (see ``graflo.architecture.edge_derivation``) - EdgeConfig: Manages collections of edges and their configurations - - WeightConfig: Configuration for edge weights and relationships + - WeightConfig: DTO for DB projection helpers (e.g. 
effective weights); schema uses ``properties`` Example: >>> edge = Edge(source="user", target="post") @@ -31,7 +31,6 @@ from graflo.architecture.graph_types import ( EdgeId, EdgeType, - Weight, ) from graflo.architecture.schema.vertex import Field, VertexConfig @@ -61,66 +60,12 @@ def _normalize_direct_item(item: str | Field | dict[str, Any]) -> Field: raise TypeError(f"Field must be str, Field, or dict, got {type(item)}") -class WeightConfig(ConfigBaseModel): - """Configuration for edge weights and relationships. +class Edge(ConfigBaseModel): + """Abstract graph edge kind (schema / ``edge_config`` only). - This class manages the configuration of weights and relationships for edges, - including source and target field mappings. - - Attributes: - vertices: List of weight configurations - direct: List of direct field mappings. Can be specified as strings, Field objects, or dicts. - Will be normalized to Field objects by the validator. - After initialization, this is always list[Field] (type checker sees this). - - Examples: - >>> # Backward compatible: list of strings - >>> wc1 = WeightConfig(direct=["date", "weight"]) - - >>> # Typed fields: list of Field objects - >>> wc2 = WeightConfig(direct=[ - ... Field(name="date", type="DATETIME"), - ... Field(name="weight", type="FLOAT") - ... ]) - - >>> # From dicts (e.g., from YAML/JSON) - >>> wc3 = WeightConfig(direct=[ - ... {"name": "date", "type": "DATETIME"}, - ... {"name": "weight"} # defaults to None type - ... ]) - """ - - vertices: list[Weight] = PydanticField( - default_factory=list, - description="List of weight definitions for vertex-based edge attributes.", - ) - direct: list[Field] = PydanticField( - default_factory=list, - description="Direct edge attributes (field names, Field objects, or dicts). 
Normalized to Field objects.", - ) - - @field_validator("direct", mode="before") - @classmethod - def normalize_direct(cls, v: Any) -> Any: - if not isinstance(v, list): - return v - return [_normalize_direct_item(item) for item in v] - - @property - def direct_names(self) -> list[str]: - """Get list of direct field names (as strings). - - Returns: - list[str]: List of field names - """ - return [field.name for field in self.direct] - - -class EdgeBase(ConfigBaseModel): - """Shared base for edge-like configs (Edge schema and EdgeActorConfig). - - Holds the common scalar fields so Edge and EdgeActorConfig stay in sync - without duplication. + Ingestion-only behavior (location filters, relation column, relation from + key, etc.) belongs on :class:`~graflo.architecture.edge_derivation.EdgeDerivation` + in pipeline edge steps, not on this model. """ source: str = PydanticField( @@ -131,72 +76,33 @@ class EdgeBase(ConfigBaseModel): ..., description="Target vertex type name (e.g. post, company).", ) - match_source: str | None = PydanticField( - default=None, - description="Field used to match source vertices when creating edges.", - ) - match_target: str | None = PydanticField( - default=None, - description="Field used to match target vertices when creating edges.", - ) relation: str | None = PydanticField( default=None, description="Relation/edge type name (e.g. Neo4j relationship type). For ArangoDB used as weight.", ) - relation_field: str | None = PydanticField( - default=None, - description="Field name to store or read relation type (e.g. 
for TigerGraph).", - ) - relation_from_key: bool = PydanticField( - default=False, - description="If True, derive relation value from the location key during ingestion.", - ) - exclude_source: str | None = PydanticField( - default=None, - description="Exclude source vertices matching this field from edge creation.", - ) - exclude_target: str | None = PydanticField( - default=None, - description="Exclude target vertices matching this field from edge creation.", - ) - match: str | None = PydanticField( - default=None, - description="Match discriminant for edge creation.", - ) description: str | None = PydanticField( default=None, description="Optional semantic description of edge intent, direction semantics, and business meaning.", ) - -class Edge(EdgeBase): - """Represents an edge in the graph database. - - An edge connects two vertices and can have various configurations for - identities, weights, and relationship types. - - Attributes: - source: Source vertex name - target: Target vertex name - identities: Logical candidate identity keys for the edge - weights: Optional weight configuration - relation: Optional relation name (for Neo4j) - match_source: Optional source discriminant field - match_target: Optional target discriminant field - type: Edge type (DIRECT or INDIRECT) - by: Optional vertex name for indirect edges - """ - identities: list[list[str]] = PydanticField( default_factory=list, description=( - "Logical candidate identity keys for this edge. " - "Each key is a list of identity tokens/fields." + "Logical uniqueness keys for this edge: each key names fields that, " + "together with the resolved source and target vertex ids, must be unique " + "(``source`` / ``target`` tokens stand for endpoints; other tokens are edge " + "attributes). Multiple keys define multiple uniqueness constraints. " + "Non-endpoint tokens are merged into ``properties`` during " + ":meth:`finish_init` if not already declared (same idea as vertex identity)." 
), ) - weights: WeightConfig | None = PydanticField( - default=None, - description="Optional edge weight/attribute configuration (direct fields and vertex-based weights).", + properties: list[Field] = PydanticField( + default_factory=list, + description=( + "Edge property names/types (relationship properties). " + "Vertex-derived bindings belong in ingestion (:class:`~graflo.architecture.contract." + "declarations.edge_derivation_registry.EdgeDerivationRegistry`)." + ), ) type: EdgeType = PydanticField( @@ -209,6 +115,13 @@ class Edge(EdgeBase): description="For INDIRECT edges: vertex type name used to define the edge.", ) + @field_validator("properties", mode="before") + @classmethod + def normalize_properties(cls, v: Any) -> Any: + if not isinstance(v, list): + return v + return [_normalize_direct_item(item) for item in v] + @field_validator("identities", mode="before") @classmethod def normalize_identities(cls, v: Any) -> Any: @@ -249,18 +162,35 @@ def normalize_identity_keys(self) -> "Edge": def finish_init(self, vertex_config: VertexConfig): """Complete logical edge initialization with vertex configuration.""" _ = vertex_config + self._merge_identity_fields_into_properties() self._validate_identity_tokens() + def _merge_identity_fields_into_properties(self) -> None: + """Append :class:`Field` entries for identity tokens not already declared. + + Endpoint tokens ``source`` and ``target`` are not edge properties; every + other token (including ``relation``) is materialized like vertex identity. 
+ """ + endpoint_tokens = frozenset({"source", "target"}) + seen_names = {f.name for f in self.properties} + augmented = list(self.properties) + for key in self.identities: + for token in key: + if token in endpoint_tokens: + continue + if token not in seen_names: + augmented.append(Field(name=token, type=None)) + seen_names.add(token) + object.__setattr__(self, "properties", augmented) + def _validate_identity_tokens(self) -> None: """Validate edge identity keys against reserved tokens and declared edge fields.""" reserved = {"source", "target", "relation"} - direct_weight_fields = set() - if self.weights is not None: - direct_weight_fields = set(self.weights.direct_names) - relation_field = ( - {self.relation_field} if self.relation_field is not None else set() - ) - allowed_fields = reserved | direct_weight_fields | relation_field + direct_weight_fields = set(self.property_names) + # Identity token "relation" maps to the default TigerGraph attribute name + # when physical fields are declared (see EdgeConfigDBAware.effective_weights). + logical_relation_attr = {DEFAULT_TIGERGRAPH_RELATION_WEIGHTNAME} + allowed_fields = reserved | direct_weight_fields | logical_relation_attr unknown_by_key = [ [token for token in key if token not in allowed_fields] for key in self.identities @@ -269,7 +199,7 @@ def _validate_identity_tokens(self) -> None: if unknown_by_key: raise ValueError( "Edge identity key fields must use reserved tokens " - "('source', 'target', 'relation') or declared edge direct/relation fields. " + "('source', 'target', 'relation') or declared edge property / relation fields. 
" f"Edge ({self.source}, {self.target}, {self.relation}) has unknown identity fields: {unknown_by_key}" ) @@ -287,6 +217,11 @@ def edge_id(self) -> EdgeId: """Alias for edge_id.""" return self.source, self.target, self.relation + @property + def property_names(self) -> list[str]: + """Declared materialized edge property names.""" + return [f.name for f in self.properties] + class EdgeConfig(ConfigBaseModel): """Configuration for managing collections of edges. @@ -300,7 +235,7 @@ class EdgeConfig(ConfigBaseModel): edges: list[Edge] = PydanticField( default_factory=list, - description="List of edge definitions (source, target, identities, weights, relation, etc.).", + description="List of edge definitions (source, target, identities, properties, relation, etc.).", ) _edges_map: dict[EdgeId, Edge] = PrivateAttr() @@ -364,6 +299,16 @@ def update_edges( vertex_config=vertex_config, ) + def edge_for(self, edge_id: EdgeId) -> Edge: + """Return the config-owned :class:`Edge` instance for ``edge_id`` after merges. + + Pipeline actors may construct a partial :class:`Edge` that is merged into the + schema edge via :meth:`update_edges`. Callers that need properties, identities, + etc. must use this object (same reference as in :meth:`items`), not the + pre-merge actor copy. + """ + return self._edges_map[edge_id] + @property def vertices(self): """Get set of vertex names involved in edges. diff --git a/graflo/architecture/schema/vertex.py b/graflo/architecture/schema/vertex.py index fc7370c7..9101de47 100644 --- a/graflo/architecture/schema/vertex.py +++ b/graflo/architecture/schema/vertex.py @@ -1,18 +1,18 @@ """Vertex configuration and management for graph databases. This module provides classes and utilities for managing vertices in graph databases. -It handles vertex configuration, field management, identity, and filtering operations. +It handles vertex configuration, property management, identity, and filtering operations. 
The module supports both ArangoDB and Neo4j through the DBType enum. Key Components: - - Vertex: Represents a vertex with its fields and identity + - Vertex: Represents a vertex with its properties and identity - VertexConfig: Manages vertices and their configurations Example: - >>> vertex = Vertex(name="user", fields=["id", "name"]) + >>> vertex = Vertex(name="user", properties=["id", "name"]) >>> config = VertexConfig(vertices=[vertex]) - >>> fields = config.fields("user") # Returns list[Field] - >>> field_names = config.fields_names("user") # Returns list[str] + >>> props = config.properties("user") # Returns list[Field] + >>> prop_names = config.property_names("user") # Returns list[str] """ from __future__ import annotations @@ -36,8 +36,8 @@ logger = logging.getLogger(__name__) -# Type accepted for fields before normalization (for use by Edge/WeightConfig) -FieldsInputType = list[str] | list["Field"] | list[dict[str, Any]] +# Type accepted for vertex properties before normalization (for use by Edge/WeightConfig) +PropertiesInputType = list[str] | list["Field"] | list[dict[str, Any]] class FieldType(BaseEnum): @@ -187,30 +187,30 @@ def _normalize_fields_item(item: str | Field | dict[str, Any]) -> Field: class Vertex(ConfigBaseModel): """Represents a vertex in the graph database. - A vertex is a fundamental unit in the graph that can have fields, identity, - and filters. Fields can be specified as strings, Field objects, or dicts. - Internally, fields are stored as Field objects but behave like strings - for backward compatibility. + A vertex is a fundamental unit in the graph that can have properties, identity, + and filters. Properties can be specified as strings, Field objects, or dicts. + Internally, properties are always stored as Field objects; use + :attr:`property_names` when plain string names are needed. Attributes: name: Name of the vertex - fields: List of field names (str), Field objects, or dicts. 
+ properties: List of field names (str), Field objects, or dicts. Will be normalized to Field objects by the validator. - identity: List of fields forming logical primary identity + identity: List of property names forming logical primary identity filters: List of filter expressions Examples: - >>> # Backward compatible: list of strings - >>> v1 = Vertex(name="user", fields=["id", "name"]) + >>> # List of strings + >>> v1 = Vertex(name="user", properties=["id", "name"]) - >>> # Typed fields: list of Field objects - >>> v2 = Vertex(name="user", fields=[ + >>> # Typed properties: list of Field objects + >>> v2 = Vertex(name="user", properties=[ ... Field(name="id", type="INT"), ... Field(name="name", type="STRING") ... ]) >>> # From dicts (e.g., from YAML/JSON) - >>> v3 = Vertex(name="user", fields=[ + >>> v3 = Vertex(name="user", properties=[ ... {"name": "id", "type": "INT"}, ... {"name": "name"} # defaults to None type ... ]) @@ -223,13 +223,13 @@ class Vertex(ConfigBaseModel): ..., description="Name of the vertex type (e.g. user, post, company).", ) - fields: list[Field] = PydanticField( + properties: list[Field] = PydanticField( default_factory=list, description="List of fields (names, Field objects, or dicts). 
Normalized to Field objects.", ) identity: list[str] = PydanticField( default_factory=list, - description="Logical identity fields (primary key semantics for matching/upserts).", + description="Logical identity property names (primary key semantics for matching/upserts).", ) filters: list[FilterExpression] = PydanticField( default_factory=list, @@ -240,11 +240,11 @@ class Vertex(ConfigBaseModel): description="Optional semantic description of the vertex meaning, role, and intended interpretation.", ) - @field_validator("fields", mode="before") + @field_validator("properties", mode="before") @classmethod - def convert_to_fields(cls, v: Any) -> Any: + def convert_to_properties(cls, v: Any) -> Any: if not isinstance(v, list): - raise ValueError("fields must be a list") + raise ValueError("properties must be a list") return [_normalize_fields_item(item) for item in v] @field_validator("filters", mode="before") @@ -277,27 +277,27 @@ def convert_identity(cls, v: Any) -> Any: @model_validator(mode="after") def set_identity(self) -> "Vertex": - identity_fields = list(self.identity) - if not identity_fields: - identity_fields = [f.name for f in self.fields] - object.__setattr__(self, "identity", identity_fields) - - seen_names = {f.name for f in self.fields} - new_fields = list(self.fields) - for field_name in identity_fields: - if field_name not in seen_names: - new_fields.append(Field(name=field_name, type=None)) - seen_names.add(field_name) - object.__setattr__(self, "fields", new_fields) + identity_names = list(self.identity) + if not identity_names: + identity_names = [f.name for f in self.properties] + object.__setattr__(self, "identity", identity_names) + + seen_names = {f.name for f in self.properties} + augmented = list(self.properties) + for name in identity_names: + if name not in seen_names: + augmented.append(Field(name=name, type=None)) + seen_names.add(name) + object.__setattr__(self, "properties", augmented) return self @property - def field_names(self) -> 
list[str]: - """Get list of field names (as strings).""" - return [field.name for field in self.fields] + def property_names(self) -> list[str]: + """Property names as strings (Field.name for each entry).""" + return [field.name for field in self.properties] - def get_fields(self) -> list[Field]: - return self.fields + def get_properties(self) -> list[Field]: + return self.properties def finish_init(self): """Complete logical initialization for vertex.""" @@ -321,7 +321,7 @@ class VertexConfig(ConfigBaseModel): vertices: list[Vertex] = PydanticField( ..., - description="List of vertex type definitions (name, fields, identity, filters).", + description="List of vertex type definitions (name, properties, identity, filters).", ) blank_vertices: list[str] = PydanticField( default_factory=list, @@ -358,9 +358,9 @@ def _normalize_vertex_identities( vertex.identity = [blank_id_field] if not vertex.identity: raise ValueError(f"Vertex '{vertex.name}' must define identity fields") - missing = [f for f in vertex.identity if f not in vertex.field_names] + missing = [f for f in vertex.identity if f not in vertex.property_names] for field_name in missing: - vertex.fields.append(Field(name=field_name, type=None)) + vertex.properties.append(Field(name=field_name, type=None)) def _get_vertices_map(self) -> dict[str, Vertex]: """Return the vertices map (set by model validator).""" @@ -401,33 +401,21 @@ def identity_fields(self, vertex_name: str) -> list[str]: """Get identity fields for a vertex.""" return list(self._get_vertices_map()[vertex_name].identity) - def fields(self, vertex_name: str) -> list[Field]: - """Get fields for a vertex. 
+ def properties(self, vertex_name: str) -> list[Field]: + """Vertex properties as Field objects.""" - Args: - vertex_name: Name of the vertex or storage name - - Returns: - list[Field]: List of Field objects - """ vertex = self._get_vertex_by_name(vertex_name) - return vertex.fields + return vertex.properties - def fields_names( + def property_names( self, vertex_name: str, ) -> list[str]: - """Get field names for a vertex as strings. + """Vertex property names as strings.""" - Args: - vertex_name: Name of the vertex or storage name - - Returns: - list[str]: List of field names as strings - """ vertex = self._get_vertex_by_name(vertex_name) - return vertex.field_names + return vertex.property_names def numeric_fields_list(self, vertex_name): """Get list of numeric fields for a vertex. diff --git a/graflo/db/arango/conn.py b/graflo/db/arango/conn.py index 9b364ba9..2cfd9dc6 100644 --- a/graflo/db/arango/conn.py +++ b/graflo/db/arango/conn.py @@ -55,6 +55,59 @@ _json_serializer = json_serializer +def _arango_edge_endpoint_aql( + vertex_class: str, + match_keys: tuple[str, ...], + slot: int, +) -> tuple[str, str]: + """Build the endpoint _id expression and optional LET … FOR v IN … prefix.""" + cell = f"edge[{slot}]" + if match_keys[0] == "_key": + return f'CONCAT("{vertex_class}/", {cell}._key)', "" + bind = "sources" if slot == 0 else "targets" + filt = " && ".join(f"v.{k} == {cell}.{k}" for k in match_keys) + prefix = f"LET {bind} = (FOR v IN {vertex_class} FILTER {filt} LIMIT 1 RETURN v)" + return f"{bind}[0]._id", prefix + + +def _arango_edge_upsert_key_exprs( + result_from: str, + result_to: str, + source_prefix: str, + target_prefix: str, +) -> tuple[str, str]: + """Use literal endpoint expressions when known; otherwise match on persisted doc.""" + ups_from = result_from if source_prefix else "doc._from" + ups_to = result_to if target_prefix else "doc._to" + return ups_from, ups_to + + +def _arango_edge_uniq_weight_keys( + uniq_weight_fields: Any, + 
uniq_weight_collections: Any, + relation_name: str | None, +) -> list[str]: + keys: list[str] = [] + if uniq_weight_fields is not None: + keys.extend(uniq_weight_fields) + if uniq_weight_collections is not None: + keys.extend(uniq_weight_collections) + if relation_name is not None: + keys.append("relation") + return keys + + +def _arango_edge_upsert_match_literal( + ups_from: str, + ups_to: str, + weight_keys: list[str], +) -> str: + inner = f"'_from': {ups_from}, '_to': {ups_to}" + if weight_keys: + inner += ", " + ", ".join(f"'{k}': edge.{k}" for k in weight_keys) + return "{" + inner + "}" + + class ArangoConnection(Connection): """ArangoDB-specific implementation of the Connection interface. @@ -639,9 +692,6 @@ def _graph_name(graph_item: Any) -> str | None: logger.info([]) if delete_all: - logger.warning( - "delete_graph_structure(delete_all=True) will remove all non-system ArangoDB graphs and collections in the selected database" - ) collections_result = self.conn.collections() graphs_result = self.conn.graphs() cnames = [] @@ -821,101 +871,67 @@ def insert_edges_batch( **kwargs: Additional options: - dry: If True, don't execute the query - collection_name: Edge collection name (defaults to {source_class}_{target_class}_edges if not provided) - - uniq_weight_fields: Fields to consider for uniqueness - - uniq_weight_collections: Classes to consider for uniqueness - - upsert_option: If True, use upsert instead of insert + - uniq_weight_fields: Fields included in UPSERT match (parallel edges) + - uniq_weight_collections: Extra match keys (legacy; list of names) + - on_duplicate: ``\"upsert\"`` (AQL UPSERT) or ``\"ignore\"`` (default: + ``INSERT`` with ``ignoreErrors``) - relationship_merge_properties: Ignored (Cypher backends only) """ opts = consume_insert_edges_kwargs(kwargs) - dry = opts.dry - collection_name = opts.collection_name - if collection_name is None: - collection_name = f"{source_class}_{target_class}_edges" - - uniq_weight_fields = 
opts.uniq_weight_fields - uniq_weight_collections = opts.uniq_weight_collections - upsert_option = opts.upsert_option - - if isinstance(docs_edges, list): - if docs_edges: - logger.debug(f" docs_edges[0] = {docs_edges[0]}") - if head is not None: - docs_edges = docs_edges[:head] - if filter_uniques: - docs_edges = pick_unique_dict(docs_edges) - docs_edges_str = json.dumps(docs_edges) - else: + collection_name = opts.collection_name or f"{source_class}_{target_class}_edges" + + if not isinstance(docs_edges, list): return + if docs_edges: + logger.debug(" docs_edges[0] = %s", docs_edges[0]) + if head is not None: + docs_edges = docs_edges[:head] + if filter_uniques: + docs_edges = pick_unique_dict(docs_edges) + docs_edges_str = json.dumps(docs_edges) - if match_keys_source[0] == "_key": - result_from = f'CONCAT("{source_class}/", edge[0]._key)' - source_filter = "" - else: - result_from = "sources[0]._id" - filter_source = " && ".join( - [f"v.{k} == edge[0].{k}" for k in match_keys_source] - ) - source_filter = ( - f"LET sources = (FOR v IN {source_class} FILTER" - f" {filter_source} LIMIT 1 RETURN v)" - ) + result_from, source_filter = _arango_edge_endpoint_aql( + source_class, match_keys_source, 0 + ) + result_to, target_filter = _arango_edge_endpoint_aql( + target_class, match_keys_target, 1 + ) + doc_definition = f"MERGE({{_from : {result_from}, _to : {result_to}}}, edge[2])" + logger.debug(" source_filter = %s", source_filter) + logger.debug(" target_filter = %s", target_filter) + logger.debug(" doc = %s", doc_definition) - if match_keys_target[0] == "_key": - result_to = f'CONCAT("{target_class}/", edge[1]._key)' - target_filter = "" - else: - result_to = "targets[0]._id" - filter_target = " && ".join( - [f"v.{k} == edge[1].{k}" for k in match_keys_target] + if opts.on_duplicate == "upsert": + ups_from, ups_to = _arango_edge_upsert_key_exprs( + result_from, result_to, source_filter, target_filter ) - target_filter = ( - f"LET targets = (FOR v IN {target_class} 
FILTER" - f" {filter_target} LIMIT 1 RETURN v)" + weight_keys = _arango_edge_uniq_weight_keys( + opts.uniq_weight_fields, + opts.uniq_weight_collections, + relation_name, ) - - doc_definition = f"MERGE({{_from : {result_from}, _to : {result_to}}}, edge[2])" - - logger.debug(f" source_filter = {source_filter}") - logger.debug(f" target_filter = {target_filter}") - logger.debug(f" doc = {doc_definition}") - - if upsert_option: - ups_from = result_from if source_filter else "doc._from" - ups_to = result_to if target_filter else "doc._to" - - weight_fs = [] - if uniq_weight_fields is not None: - weight_fs += uniq_weight_fields - if uniq_weight_collections is not None: - weight_fs += uniq_weight_collections - if relation_name is not None: - weight_fs += ["relation"] - - if weight_fs: - weights_clause = ", " + ", ".join( - [f"'{x}' : edge.{x}" for x in weight_fs] - ) - else: - weights_clause = "" - - upsert = f"{{'_from': {ups_from}, '_to': {ups_to}" + weights_clause + "}" - logger.debug(f" upsert clause: {upsert}") - clauses = f"UPSERT {upsert} INSERT doc UPDATE {{}}" + upsert_literal = _arango_edge_upsert_match_literal( + ups_from, ups_to, weight_keys + ) + logger.debug(" upsert clause: %s", upsert_literal) + clauses = f"UPSERT {upsert_literal} INSERT doc UPDATE {{}}" options = "OPTIONS {exclusive: true}" - else: + elif opts.on_duplicate == "ignore": if relation_name is None: doc_clause = "doc" else: doc_clause = f"MERGE(doc, {{'relation': '{relation_name}' }})" clauses = f"INSERT {doc_clause}" options = "OPTIONS {exclusive: true, ignoreErrors: true}" + else: + raise AssertionError(f"unexpected on_duplicate: {opts.on_duplicate!r}") q_update = f""" FOR edge in {docs_edges_str} {source_filter} {target_filter} LET doc = {doc_definition} {clauses} in {collection_name} {options}""" - if not dry: + if not opts.dry: self.execute(q_update) def insert_return_batch(self, docs: list[dict[str, Any]], class_name: str) -> str: diff --git a/graflo/db/arango/util.py 
b/graflo/db/arango/util.py index e4fae355..df5b9c25 100644 --- a/graflo/db/arango/util.py +++ b/graflo/db/arango/util.py @@ -42,17 +42,16 @@ def define_extra_edges(g: Edge): >>> # Generates query to create user->post edges through comments """ ucol, vcol, wcol = g.source, g.target, g.by - weight = g.weights + props = g.properties s = f"""FOR w IN {wcol} LET uset = (FOR u IN 1..1 INBOUND w {ucol}_{wcol}_edges RETURN u) LET vset = (FOR v IN 1..1 INBOUND w {vcol}_{wcol}_edges RETURN v) FOR u in uset FOR v in vset """ - if weight is None: - raise ValueError("WeightConfig is required for edge list rendering") - # WeightConfig.direct is list[Field]; AQL copies each field from w to the edge - s_ins_ = ", ".join([f"{f.name}: w.{f.name}" for f in weight.direct]) + if not props: + raise ValueError("Edge attributes are required for edge list rendering") + s_ins_ = ", ".join([f"{f.name}: w.{f.name}" for f in props]) s_ins_ = f"_from: u._id, _to: v._id, {s_ins_}" s_ins = f" INSERT {{{s_ins_}}} " s_last = f"IN {ucol}_{vcol}_edges" diff --git a/graflo/db/conn.py b/graflo/db/conn.py index 247393a4..280b6f3e 100644 --- a/graflo/db/conn.py +++ b/graflo/db/conn.py @@ -53,7 +53,7 @@ import abc import logging from dataclasses import dataclass -from typing import Any, ClassVar, TypeVar +from typing import Any, ClassVar, Literal, TypeVar from graflo.architecture.schema.edge import Edge from graflo.architecture.schema import Schema @@ -69,6 +69,19 @@ ConnectionType = TypeVar("ConnectionType", bound="Connection") +def _parse_on_duplicate(value: Any) -> Literal["upsert", "ignore"]: + if not isinstance(value, str): + raise TypeError( + "on_duplicate must be str ('upsert' or 'ignore'), " + f"got {type(value).__name__}" + ) + if value == "upsert": + return "upsert" + if value == "ignore": + return "ignore" + raise ValueError(f"on_duplicate must be 'upsert' or 'ignore', got {value!r}") + + @dataclass(frozen=True) class InsertEdgesKwArgs: """Keyword arguments shared by
:meth:`Connection.insert_edges_batch` implementations.""" @@ -77,7 +90,7 @@ class InsertEdgesKwArgs: collection_name: str | None uniq_weight_fields: Any uniq_weight_collections: Any - upsert_option: bool + on_duplicate: Literal["upsert", "ignore"] relationship_merge_properties: Any @@ -93,7 +106,7 @@ def consume_insert_edges_kwargs(kwargs: dict[str, Any]) -> InsertEdgesKwArgs: collection_name=kwargs.pop("collection_name", None), uniq_weight_fields=kwargs.pop("uniq_weight_fields", None), uniq_weight_collections=kwargs.pop("uniq_weight_collections", None), - upsert_option=bool(kwargs.pop("upsert_option", False)), + on_duplicate=_parse_on_duplicate(kwargs.pop("on_duplicate", "ignore")), relationship_merge_properties=kwargs.pop("relationship_merge_properties", None), ) if kwargs: @@ -290,9 +303,11 @@ def insert_edges_batch( :func:`consume_insert_edges_kwargs`): - dry: If True, do not execute writes (supported where implemented) - collection_name: Edge collection (ArangoDB) or unused type-specific name - - uniq_weight_fields: Uniqueness fields (ArangoDB upsert) - - uniq_weight_collections: Uniqueness collections (ArangoDB upsert) - - upsert_option: Use upsert instead of insert (ArangoDB) + - uniq_weight_fields: Uniqueness fields (ArangoDB UPSERT match) + - uniq_weight_collections: Uniqueness collections (ArangoDB UPSERT) + - on_duplicate: ArangoDB only. ``\"ignore\"`` (default): ``INSERT`` with + ``ignoreErrors``; ``\"upsert\"``: AQL ``UPSERT`` when a matching edge + may already exist (align match keys with a unique index). 
- relationship_merge_properties: Property names for Cypher MERGE (Neo4j, FalkorDB, Memgraph) so parallel edges differ by weights """ diff --git a/graflo/db/falkordb/conn.py b/graflo/db/falkordb/conn.py index 4e413f69..c357d384 100644 --- a/graflo/db/falkordb/conn.py +++ b/graflo/db/falkordb/conn.py @@ -447,10 +447,6 @@ def delete_graph_structure( delete_all: If True, delete all nodes and relationships """ if delete_all: - logger.warning( - "delete_graph_structure(delete_all=True) will remove all nodes and relationships in the selected FalkorDB graph" - ) - # Delete all nodes and relationships in current graph try: self.execute("MATCH (n) DETACH DELETE n") logger.debug("Deleted all nodes and relationships from graph") diff --git a/graflo/db/manager.py b/graflo/db/manager.py index 634d7851..618192d7 100644 --- a/graflo/db/manager.py +++ b/graflo/db/manager.py @@ -63,13 +63,6 @@ class ConnectionManager: DBType.NEBULA: NebulaConnection, } - # Source database connections (INPUT) - to be implemented - # source_conn_mapping = { - # DBType.POSTGRES: PostgresConnection, - # DBType.MYSQL: MySQLConnection, - # DBType.MONGODB: MongoDBConnection, - # } - def __init__( self, connection_config: DBConfig, diff --git a/graflo/db/nebula/conn.py b/graflo/db/nebula/conn.py index 51517da3..c008f983 100644 --- a/graflo/db/nebula/conn.py +++ b/graflo/db/nebula/conn.py @@ -264,7 +264,7 @@ def define_schema(self, schema: Schema) -> None: def define_vertex_classes(self, schema: Schema) -> None: for vname in schema.core_schema.vertex_config.vertex_set: - fields = schema.core_schema.vertex_config.fields(vname) + fields = schema.core_schema.vertex_config.properties(vname) stmt = create_tag_ngql(vname, fields) self._execute(stmt) self._tag_fields[vname] = [f.name for f in fields] @@ -281,8 +281,8 @@ def define_edge_classes(self, edges: list[Edge]) -> None: if rel in created: continue edge_fields = [] - if edge.weights and edge.weights.direct: - edge_fields = list(edge.weights.direct) + if 
edge.properties: + edge_fields = list(edge.properties) stmt = create_edge_type_ngql(rel, edge_fields) self._execute(stmt) created.add(rel) @@ -304,7 +304,7 @@ def define_vertex_indexes( "Schema is None: identity indexes cannot be ensured without schema" ) for vname in vertex_config.vertex_set: - fields = vertex_config.fields(vname) + fields = vertex_config.properties(vname) string_fields = {f.name for f in fields if f.type == FieldType.STRING} index_list = ( schema.db_profile.vertex_secondary_indexes(vname) diff --git a/graflo/db/neo4j/conn.py b/graflo/db/neo4j/conn.py index e9cd2ae8..38d55fd1 100644 --- a/graflo/db/neo4j/conn.py +++ b/graflo/db/neo4j/conn.py @@ -364,9 +364,6 @@ def delete_graph_structure( return if delete_all: - logger.warning( - "delete_graph_structure(delete_all=True) will remove all nodes and relationships in the selected Neo4j database" - ) self._drop_all_user_indexes_and_constraints() self.execute("MATCH (n) DETACH DELETE n") return @@ -563,7 +560,7 @@ def insert_edges_batch( - collection_name: Unused in Neo4j (kept for interface compatibility) - uniq_weight_fields: Unused (ArangoDB upsert); use relationship_merge_properties instead - uniq_weight_collections: Unused in Neo4j (ArangoDB-specific) - - upsert_option: Unused in Neo4j (ArangoDB-specific, MERGE is always upsert) + - on_duplicate: Unused in Neo4j (ArangoDB-specific AQL policy) - relationship_merge_properties: Property names included in ``MERGE`` so parallel edges (same endpoints and type, different weights) are distinct. 
""" diff --git a/graflo/db/postgres/schema_inference.py b/graflo/db/postgres/schema_inference.py index 5bf39f20..b105f6d9 100644 --- a/graflo/db/postgres/schema_inference.py +++ b/graflo/db/postgres/schema_inference.py @@ -10,7 +10,7 @@ from typing import TYPE_CHECKING -from graflo.architecture.schema.edge import Edge, EdgeConfig, WeightConfig +from graflo.architecture.schema.edge import Edge, EdgeConfig from graflo.architecture.database_features import DatabaseProfile from graflo.architecture.schema import CoreSchema, GraphMetadata, Schema from graflo.architecture.schema.vertex import Field, FieldType, Vertex, VertexConfig @@ -78,7 +78,7 @@ def infer_vertex_config( # Create vertex vertex = Vertex( name=table_name, - fields=fields, + properties=fields, identity=list(pk_columns), ) @@ -220,8 +220,8 @@ def _infer_type_from_samples( ) return mapped_type - def infer_edge_weights(self, edge_table_info: EdgeTableInfo) -> WeightConfig | None: - """Infer edge weights from edge table columns with types. + def infer_edge_weights(self, edge_table_info: EdgeTableInfo) -> list[Field] | None: + """Infer edge attributes from edge table columns with types. Uses PostgreSQL column types and optionally samples data to infer accurate types. @@ -229,7 +229,7 @@ def infer_edge_weights(self, edge_table_info: EdgeTableInfo) -> WeightConfig | N edge_table_info: Edge table information from introspection Returns: - WeightConfig if there are weight columns, None otherwise + List of attribute fields if there are non-key columns, None otherwise. 
""" columns = edge_table_info.columns pk_columns = set(edge_table_info.primary_key) @@ -264,7 +264,7 @@ def infer_edge_weights(self, edge_table_info: EdgeTableInfo) -> WeightConfig | N f"'{edge_table_info.name}': {[f.name for f in direct_weights]}" ) - return WeightConfig(direct=direct_weights) + return direct_weights def infer_edge_config( self, @@ -305,13 +305,11 @@ def infer_edge_config( ) continue - # Infer weights - weights = self.infer_edge_weights(edge_table_info) - # Create edge + attrs = self.infer_edge_weights(edge_table_info) or [] edge = Edge( source=source_table, target=target_table, - weights=weights, + properties=attrs, relation=edge_table_info.relation, ) diff --git a/graflo/db/tigergraph/conn.py b/graflo/db/tigergraph/conn.py index 88bd53ca..ec1cde9c 100644 --- a/graflo/db/tigergraph/conn.py +++ b/graflo/db/tigergraph/conn.py @@ -38,7 +38,8 @@ # Removed pyTigerGraph dependency - using direct REST API calls instead -from graflo.architecture.schema.edge import Edge +from graflo.architecture.schema.db_aware import EdgeConfigDBAware +from graflo.architecture.schema.edge import DEFAULT_TIGERGRAPH_RELATION_WEIGHTNAME, Edge from graflo.architecture.database_features import DatabaseProfile from graflo.architecture.schema import VertexConfigDBAware from graflo.architecture.graph_types import Index @@ -1075,9 +1076,11 @@ def _upsert_edge( Response from API """ graph_name = graph_name or self.graphname + # TigerGraph 4.2+: .../edges/{source_type}/{source_id}/{edge_type}/{target_type}/{target_id} endpoint = ( - f"/graph/{graph_name}/edges/{edge_type}/" + f"/graph/{graph_name}/edges/" f"{source_type}/{quote(str(source_id))}/" + f"{edge_type}/" f"{target_type}/{quote(str(target_id))}" ) data = attributes if attributes else {} @@ -1635,7 +1638,7 @@ def _get_vertex_add_statement(self, vertex: Vertex, vertex_config) -> str: # Get field type for primary key field(s) - convert FieldType enum to string field_type_map = {} - for f in vertex.fields: + for f in 
vertex.properties: if f.type: field_type_map[f.name] = ( f.type.value if hasattr(f.type, "value") else str(f.type) @@ -1645,7 +1648,7 @@ def _get_vertex_add_statement(self, vertex: Vertex, vertex_config) -> str: # Format all fields all_fields = [] - for field in vertex.fields: + for field in vertex.properties: if field.type: field_type = ( field.type.value @@ -1694,6 +1697,16 @@ def _get_vertex_add_statement(self, vertex: Vertex, vertex_config) -> str: f' ) WITH STATS="OUTDEGREE_BY_EDGETYPE"' ) + def _edge_for_tigergraph_ddl(self, edge: Edge, ec_db: EdgeConfigDBAware) -> Edge: + """Deep-copy edge with TigerGraph-effective weights for GSQL (non-mutating on schema).""" + ew = ec_db.effective_weights(edge) + edge_copy = edge.model_copy(deep=True) + if ew is not None: + edge_copy.properties = [f.model_copy(deep=True) for f in ew.direct] + else: + edge_copy.properties = [] + return edge_copy + def _format_edge_attributes( self, edge: Edge, exclude_fields: set[str] | None = None ) -> str: @@ -1706,14 +1719,14 @@ def _format_edge_attributes( Returns: str: Formatted attribute string (e.g., " date STRING,\n relation STRING") """ - if not edge.weights or not edge.weights.direct: + if not edge.properties: return "" if exclude_fields is None: exclude_fields = set() attr_parts = [] - for field in edge.weights.direct: + for field in edge.properties: field_name = field.name if field_name not in exclude_fields: field_type = self._get_tigergraph_type(field.type) @@ -1729,11 +1742,7 @@ def _edge_identity_discriminator_fields(self, edge: Edge) -> set[str]: if token in {"source", "target"}: continue if token == "relation": - relation_field = edge.relation_field - if relation_field is None and edge.relation_from_key: - relation_field = "relation" - if relation_field is not None: - fields.add(relation_field) + fields.add(DEFAULT_TIGERGRAPH_RELATION_WEIGHTNAME) continue if token not in {"_from", "_to"}: fields.add(token) @@ -1759,32 +1768,18 @@ def _get_edge_add_statement( 
indexed_field_names = self._edge_identity_discriminator_fields(edge) # IMPORTANT: In TigerGraph, discriminator fields MUST also be edge attributes. - # If an indexed field is not in weights.direct, we need to add it. - # Initialize weights if not present - if edge.weights is None: - from graflo.architecture.schema.edge import WeightConfig, Field - - edge.weights = WeightConfig() - - # Type assertion: weights is guaranteed to be WeightConfig after assignment - if edge.weights is None: - raise RuntimeError("weights should be initialized") - # Get existing weight field names - existing_weight_names = set() - if edge.weights.direct: - existing_weight_names = {field.name for field in edge.weights.direct} - - # Add any indexed fields that are missing from weights + # If an indexed field is not in attributes, we need to add it. + existing_weight_names = {f.name for f in edge.properties} + + # Add any indexed fields that are missing from attributes for field_name in indexed_field_names: if field_name not in existing_weight_names: - # Add the field to weights with STRING type (default) from graflo.architecture.schema.edge import Field - edge.weights.direct.append( - Field(name=field_name, type=FieldType.STRING) - ) + edge.properties.append(Field(name=field_name, type=FieldType.STRING)) + existing_weight_names.add(field_name) logger.info( - f"Added indexed field '{field_name}' to edge weights for discriminator compatibility" + f"Added indexed field '{field_name}' to edge attributes for discriminator compatibility" ) # Format edge attributes, excluding discriminator fields (they're in DISCRIMINATOR clause) @@ -1798,8 +1793,8 @@ def _get_edge_add_statement( # Get field types for discriminator fields field_types = {} - if edge.weights and edge.weights.direct: - for field in edge.weights.direct: + if edge.properties: + for field in edge.properties: field_types[field.name] = self._get_tigergraph_type(field.type) # Use sanitized dbname for schema names when available @@ -1826,8 
+1821,7 @@ def _get_edge_add_statement( else: logger.debug( f"No identity discriminator fields found for edge {relation_db}. " - f"Identities: {edge.identities}, " - f"relation_field: {edge.relation_field}" + f"Identities: {edge.identities}, relation: {edge.relation}" ) # Combine FROM/TO and discriminator with commas @@ -1878,25 +1872,17 @@ def _get_edge_group_create_statement( # Collect identity discriminator fields (same logic as _get_edge_add_statement) indexed_field_names = self._edge_identity_discriminator_fields(first_edge) - # Ensure indexed fields are in weights (same logic as _get_edge_add_statement) - if first_edge.weights is None: - from graflo.architecture.schema.edge import WeightConfig - - first_edge.weights = WeightConfig() - - if first_edge.weights is None: - raise RuntimeError("weights should be initialized") - existing_weight_names = set() - if first_edge.weights.direct: - existing_weight_names = {field.name for field in first_edge.weights.direct} + # Ensure indexed fields are in attributes (same logic as _get_edge_add_statement) + existing_weight_names = {f.name for f in first_edge.properties} for field_name in indexed_field_names: if field_name not in existing_weight_names: from graflo.architecture.schema.edge import Field - first_edge.weights.direct.append( + first_edge.properties.append( Field(name=field_name, type=FieldType.STRING) ) + existing_weight_names.add(field_name) # Format edge attributes, excluding discriminator fields edge_attrs = self._format_edge_attributes( @@ -1905,8 +1891,8 @@ def _get_edge_group_create_statement( # Get field types for discriminator fields field_types = {} - if first_edge.weights and first_edge.weights.direct: - for field in first_edge.weights.direct: + if first_edge.properties: + for field in first_edge.properties: field_types[field.name] = self._get_tigergraph_type(field.type) # Build FROM/TO pairs for all edges, separated by | @@ -2095,11 +2081,23 @@ def _define_schema_local(self, schema: Schema) -> None: # 
Create one statement per relation with all FROM/TO pairs for relation, edge_group in edges_by_relation.items(): + ddl_edges = [ + self._edge_for_tigergraph_ddl(e, db_schema.edge_config) + for e in edge_group + ] + ddl_source_vertices = { + id(de): source_vertices[id(og)] + for de, og in zip(ddl_edges, edge_group, strict=True) + } + ddl_target_vertices = { + id(de): target_vertices[id(og)] + for de, og in zip(ddl_edges, edge_group, strict=True) + } stmt = self._get_edge_group_create_statement( - edge_group, + ddl_edges, relation_name=relation, - source_vertices=source_vertices, - target_vertices=target_vertices, + source_vertices=ddl_source_vertices, + target_vertices=ddl_target_vertices, ) edge_stmts.append(stmt) @@ -2504,7 +2502,7 @@ def _format_vertex_fields(self, vertex: Vertex) -> str: Returns: str: Formatted field definitions for GSQL CREATE VERTEX statement """ - fields = vertex.fields + fields = vertex.properties if not fields: # Default fields if none specified @@ -2524,14 +2522,13 @@ def _format_edge_attributes_for_create(self, edge: Edge) -> str: """ Format edge attributes for GSQL CREATE EDGE statement. - Edge weights/attributes come from edge.weights.direct (list of Field objects). - Each weight field needs to be included in the CREATE EDGE statement with its type. + Edge properties come from edge.properties (list of Field objects). + Each attribute field needs to be included in the CREATE EDGE statement with its type. 
""" attrs = [] - # Get weight fields from edge.weights.direct - if edge.weights and edge.weights.direct: - for field in edge.weights.direct: + if edge.properties: + for field in edge.properties: # Field objects have name and type attributes field_name = field.name # Get TigerGraph type - FieldType enum values are already in TigerGraph format @@ -3563,7 +3560,7 @@ def insert_edges_batch( - collection_name: Alternative edge type name (used if relation_name is None) - uniq_weight_fields: Unused in TigerGraph (ArangoDB-specific) - uniq_weight_collections: Unused in TigerGraph (ArangoDB-specific) - - upsert_option: Unused in TigerGraph (ArangoDB-specific, always upserts by default) + - on_duplicate: Unused in TigerGraph (ArangoDB-specific AQL policy) - relationship_merge_properties: Unused (Cypher property-graph backends only) """ opts = consume_insert_edges_kwargs(kwargs) @@ -3842,7 +3839,9 @@ def fetch_docs( vertex_config = kwargs.get("vertex_config") if field_types is None and vertex_config is not None: - field_types = {f.name: f.type for f in vertex_config.fields(class_name)} + field_types = { + f.name: f.type for f in vertex_config.properties(class_name) + } # Build REST++ filter string with field type information filter_str = self._render_rest_filter(filters, field_types=field_types) diff --git a/graflo/hq/auto_join.py b/graflo/hq/auto_join.py index f0ac1034..1b654f2d 100644 --- a/graflo/hq/auto_join.py +++ b/graflo/hq/auto_join.py @@ -1,6 +1,6 @@ """Auto-JOIN generation for edge resources. 
-When a Resource's pipeline contains an EdgeActor whose edge has +When a Resource's pipeline contains an EdgeActor whose ``derivation`` declares ``match_source`` / ``match_target``, and the source/target vertex types have known table connectors, this module can auto-generate JoinClauses and IS_NOT_NULL filters on the edge resource's table connector so that the @@ -62,7 +62,8 @@ def enrich_edge_connector_with_joins( for ea in edge_actors: edge = ea.edge - if not edge.match_source or not edge.match_target: + der = ea.derivation + if not der.match_source or not der.match_target: continue source_info = _vertex_table_info(edge.source, bindings, vertex_config) @@ -86,7 +87,7 @@ def enrich_edge_connector_with_joins( table=src_table, schema_name=src_schema, alias=src_alias, - on_self=edge.match_source, + on_self=der.match_source, on_other=src_pk, join_type="LEFT", ) @@ -96,7 +97,7 @@ def enrich_edge_connector_with_joins( table=tgt_table, schema_name=tgt_schema, alias=tgt_alias, - on_self=edge.match_target, + on_self=der.match_target, on_other=tgt_pk, join_type="LEFT", ) diff --git a/graflo/hq/caster.py b/graflo/hq/caster.py index 2744cb40..7edb5309 100644 --- a/graflo/hq/caster.py +++ b/graflo/hq/caster.py @@ -566,6 +566,7 @@ def ingest( strict_references=ingestion_params.strict_references, dynamic_edge_feedback=ingestion_params.dynamic_edges, allowed_vertex_names=self._allowed_vertex_names, + target_db_flavor=db_flavor, ) registry = RegistryBuilder(self.schema, self.ingestion_model).build( diff --git a/graflo/hq/db_writer.py b/graflo/hq/db_writer.py index cb9e253f..48d9e858 100644 --- a/graflo/hq/db_writer.py +++ b/graflo/hq/db_writer.py @@ -183,10 +183,11 @@ async def _enrich_extra_weights( def _sync(): with ConnectionManager(connection_config=conn_conf) as db: - for edge in resource.extra_weights: - if edge.weights is None: + for entry in resource.extra_weights: + edge = entry.edge + if not entry.vertex_weights: continue - for weight in edge.weights.vertices: + for 
weight in entry.vertex_weights: if weight.name not in vc.vertex_set: logger.error(f"{weight.name} not a valid vertex") continue @@ -197,7 +198,7 @@ def _sync(): class_name=vc.vertex_dbname(weight.name), batch=gc.vertices[weight.name], match_keys=index_fields, - keep_keys=weight.fields, + keep_keys=weight.properties, ) for j, item in enumerate(gc.linear): weights = weights_per_item[j] @@ -226,14 +227,9 @@ def _sync(): with ConnectionManager(connection_config=conn_conf) as db: runtime = ec.runtime(edge) merge_props: tuple[str, ...] | None = None - if conn_conf.connection_type in ( - DBType.NEO4J, - DBType.FALKORDB, - DBType.MEMGRAPH, - ): - mp = ec.relationship_merge_property_names(edge) - if mp: - merge_props = tuple(mp) + mp = ec.relationship_merge_property_names(edge) + if mp: + merge_props = tuple(mp) for ee in gc.loop_over_relations(edge_id): _, _, relation = ee if not self.dry: @@ -248,10 +244,25 @@ def _sync(): "dry": self.dry, "collection_name": runtime.storage_name(), } - if merge_props is not None: - edge_kw["relationship_merge_properties"] = ( - merge_props - ) + if conn_conf.connection_type in ( + DBType.NEO4J, + DBType.FALKORDB, + DBType.MEMGRAPH, + ): + if merge_props is not None: + edge_kw["relationship_merge_properties"] = ( + merge_props + ) + elif conn_conf.connection_type == DBType.ARANGO: + if ( + self.ingestion_model.edges_on_duplicate + == "upsert" + ): + edge_kw["on_duplicate"] = "upsert" + if merge_props is not None: + edge_kw["uniq_weight_fields"] = list( + merge_props + ) db.insert_edges_batch( docs_edges=data, source_class=vc.vertex_dbname(edge.source), diff --git a/graflo/hq/graph_engine.py b/graflo/hq/graph_engine.py index 1f0cd1b5..f448c6ca 100644 --- a/graflo/hq/graph_engine.py +++ b/graflo/hq/graph_engine.py @@ -30,6 +30,49 @@ logger = logging.getLogger(__name__) +def _graph_target_namespace_unset(target_db_config: DBConfig) -> bool: + """Return True if the connection has no graph/database/space name yet (per DB kind).""" + db_type = 
target_db_config.connection_type + if db_type == DBType.MEMGRAPH: + return target_db_config.database is None + if db_type in (DBType.TIGERGRAPH, DBType.NEBULA): + return target_db_config.schema_name is None + return target_db_config.database is None + + +def _assign_graph_target_namespace(target_db_config: DBConfig, namespace: str) -> None: + """Write ``namespace`` to the DB-specific field that names the target graph/space.""" + db_type = target_db_config.connection_type + if db_type in (DBType.TIGERGRAPH, DBType.NEBULA): + target_db_config.schema_name = namespace + else: + target_db_config.database = namespace + + +def _resolve_graph_target_namespace( + schema: Schema, graph_target_namespace: str | None +) -> str: + """Prefer explicit call arg, then profile target_namespace, then metadata name.""" + if graph_target_namespace is not None: + return graph_target_namespace + if schema.db_profile.target_namespace is not None: + return schema.db_profile.target_namespace + return schema.metadata.name + + +def _ensure_graph_target_namespace( + schema: Schema, + target_db_config: DBConfig, + graph_target_namespace: str | None, +) -> None: + if not target_db_config.can_be_target(): + return + if not _graph_target_namespace_unset(target_db_config): + return + resolved = _resolve_graph_target_namespace(schema, graph_target_namespace) + _assign_graph_target_namespace(target_db_config, resolved) + + class GraphEngine: """Orchestrator for graph database operations. @@ -164,6 +207,7 @@ def define_schema( manifest: GraphManifest, target_db_config: DBConfig, recreate_schema: bool = False, + graph_target_namespace: str | None = None, ) -> None: """Define schema in the target database. @@ -179,22 +223,13 @@ def define_schema( target_db_config: Target database connection configuration recreate_schema: If True, drop existing schema and define new one. If False and schema/graph already exists, raises SchemaExistsError. 
+ graph_target_namespace: Optional target graph/database/space name (e.g. temp + schema). Applied only when the config omits the namespace; takes precedence + over ``schema.db_profile.target_namespace``, which in turn overrides ``schema.metadata.name``. """ schema = manifest.require_schema() - # If effective_schema is not set, use schema.metadata.name as fallback - if ( - target_db_config.can_be_target() - and target_db_config.effective_schema is None - ): - schema_name = schema.metadata.name - # Map to the appropriate field based on DB type - if target_db_config.connection_type == DBType.TIGERGRAPH: - # TigerGraph uses 'schema_name' field - target_db_config.schema_name = schema_name - else: - # ArangoDB, Neo4j use 'database' field (which maps to effective_schema) - target_db_config.database = schema_name + _ensure_graph_target_namespace(schema, target_db_config, graph_target_namespace) # Ensure schema reflects target DB so finish_init applies DB-specific defaults. schema.db_profile.db_flavor = target_db_config.connection_type @@ -214,6 +249,7 @@ def define_and_ingest( connection_provider: ConnectionProvider | None = None, recreate_schema: bool | None = None, clear_data: bool | None = None, + graph_target_namespace: str | None = None, ) -> None: """Define schema and ingest data into the graph database in one operation. @@ -230,6 +266,8 @@ define_schema raises SchemaExistsError and the script halts. clear_data: If True, remove existing data before ingestion (schema unchanged). If None, uses ingestion_params.clear_data. + graph_target_namespace: Optional target graph/database/space name; passed + to both ``define_schema`` and ``ingest`` for consistent resolution.
         """
         ingestion_params = ingestion_params or IngestionParams()
         if clear_data is None:
@@ -242,6 +280,7 @@
         define_schema(
             manifest=manifest,
             target_db_config=target_db_config,
             recreate_schema=recreate_schema,
+            graph_target_namespace=graph_target_namespace,
         )
 
         # Then ingest data (clear_data is applied inside ingest() when ingestion_params.clear_data)
@@ -253,6 +292,7 @@
             target_db_config=target_db_config,
             ingestion_params=ingestion_params,
             connection_provider=connection_provider,
+            graph_target_namespace=graph_target_namespace,
         )
 
     def ingest(
@@ -261,6 +301,7 @@
         target_db_config: DBConfig,
         ingestion_params: IngestionParams | None = None,
         connection_provider: ConnectionProvider | None = None,
+        graph_target_namespace: str | None = None,
     ) -> None:
         """Ingest data into the graph database.
 
@@ -272,11 +313,15 @@
             target_db_config: Target database connection configuration
             ingestion_params: IngestionParams instance with ingestion configuration.
                 If None, uses default IngestionParams()
+            graph_target_namespace: Same semantics as ``define_schema``; use when
+                calling ``ingest`` without a prior ``define_schema`` on this config.
         """
         schema = manifest.require_schema()
         ingestion_model = manifest.require_ingestion_model()
         bindings = manifest.bindings
+
+        _ensure_graph_target_namespace(schema, target_db_config, graph_target_namespace)
+
         ingestion_params = ingestion_params or IngestionParams()
         if ingestion_params.clear_data:
             with ConnectionManager(connection_config=target_db_config) as db_client:
diff --git a/graflo/hq/rdf_inferencer.py b/graflo/hq/rdf_inferencer.py
index 9897b46e..1573e4b7 100644
--- a/graflo/hq/rdf_inferencer.py
+++ b/graflo/hq/rdf_inferencer.py
@@ -177,7 +177,7 @@ def infer_schema(
     vertices = []
     for cls_name, fields in fields_by_class.items():
         vertex_fields = [VertexField(name=f) for f in fields]
-        vertices.append(Vertex(name=cls_name, fields=vertex_fields))
+        vertices.append(Vertex(name=cls_name, properties=vertex_fields))
 
     vertex_config = VertexConfig(vertices=vertices)
diff --git a/graflo/hq/sanitizer.py b/graflo/hq/sanitizer.py
index 41abca72..44c2a40d 100644
--- a/graflo/hq/sanitizer.py
+++ b/graflo/hq/sanitizer.py
@@ -93,7 +93,7 @@ def sanitize(
         # Second pass: Sanitize vertex field names
         for vertex in schema.core_schema.vertex_config.vertices:
-            for field in vertex.fields:
+            for field in vertex.properties:
                 original_name = field.name
                 sanitized_name = sanitize_attribute_name(
                     original_name, self.reserved_words
                 )
@@ -324,22 +324,22 @@ def _normalize_vertex_indexes(
         # Update vertex index and fields
         vertex = schema.core_schema.vertex_config[vertex_name]
-        existing_field_names = {f.name for f in vertex.fields}
+        existing_field_names = {f.name for f in vertex.properties}
 
         # Add new fields that don't exist
         for new_field in most_popular_index:
             if new_field not in existing_field_names:
-                vertex.fields.append(Field(name=new_field, type=None))
+                vertex.properties.append(Field(name=new_field, type=None))
                 existing_field_names.add(new_field)
 
         # Remove old fields that are being replaced (not in new index)
         fields_to_remove = [
             f
-            for f in vertex.fields
+            for f in vertex.properties
             if f.name in old_fields and f.name not in new_fields
         ]
         for field_to_remove in fields_to_remove:
-            vertex.fields.remove(field_to_remove)
+            vertex.properties.remove(field_to_remove)
 
         # Update logical identity to match the most popular one.
         vertex.identity = list(most_popular_index)
diff --git a/graflo/migrate/diff.py b/graflo/migrate/diff.py
index 1e8fbbdc..0af6289e 100644
--- a/graflo/migrate/diff.py
+++ b/graflo/migrate/diff.py
@@ -155,8 +155,8 @@ def _diff_vertices(
                 )
             )
 
-        old_fields = _field_map(old_vertex.fields)
-        new_fields = _field_map(new_vertex.fields)
+        old_fields = _field_map(old_vertex.properties)
+        new_fields = _field_map(new_vertex.properties)
         old_field_names = set(old_fields)
         new_field_names = set(new_fields)
@@ -245,8 +245,8 @@ def _diff_edges(self, conflicts: list[SchemaConflict]) -> list[MigrationOperatio
                 )
             )
 
-        old_direct = _field_map(old_edge.weights.direct if old_edge.weights else [])
-        new_direct = _field_map(new_edge.weights.direct if new_edge.weights else [])
+        old_direct = _field_map(old_edge.properties)
+        new_direct = _field_map(new_edge.properties)
         old_names = set(old_direct)
         new_names = set(new_direct)
diff --git a/graflo/migrate/io.py b/graflo/migrate/io.py
index a71a243d..8387fc23 100644
--- a/graflo/migrate/io.py
+++ b/graflo/migrate/io.py
@@ -33,7 +33,10 @@ def load_ingestion_model(
     manifest = load_manifest(path)
     ingestion_model = manifest.require_ingestion_model()
     if schema is not None:
-        ingestion_model.finish_init(schema.core_schema)
+        ingestion_model.finish_init(
+            schema.core_schema,
+            target_db_flavor=schema.db_profile.db_flavor,
+        )
     return ingestion_model
diff --git a/graflo/plot/plotter.py b/graflo/plot/plotter.py
index 6c685ab9..fb3568a8 100644
--- a/graflo/plot/plotter.py
+++ b/graflo/plot/plotter.py
@@ -26,6 +26,8 @@
 from suthing import FileHandle
 
 from graflo.architecture import GraphManifest
+from graflo.architecture.graph_types import EdgeId
+from graflo.architecture.schema.edge import Edge
 from graflo.architecture.pipeline.runtime.actor import (
     ActorWrapper,
     DescendActor,
@@ -347,17 +349,23 @@ def _draw(self, ag, stem: str, prog: str = "dot") -> None:
             prog=prog,
         )
 
-    def _discover_edges_from_resources(self):
+    def _discover_edges_from_resources(
+        self,
+    ) -> tuple[dict[EdgeId, Edge], dict[EdgeId, str], dict[EdgeId, bool]]:
         """Discover edges from resources by walking through ActorWrappers.
 
         This method finds all EdgeActors in resources and extracts their edges,
-        which may include edges with dynamic relations (relation_field, relation_from_key)
-        that aren't fully represented in edge_config.
+        which may include edges with dynamic relations (EdgeActor derivation) that
+        aren't fully represented in edge_config.
 
         Returns:
-            dict: Dictionary mapping (source, target, relation) to Edge objects
+            discovered_edges: map edge_id → Edge
+            relation_source_by_edge_id: document field for per-row relation (plot hint).
+            relation_from_key_by_edge_id: True when an edge step derives relation from keys.
         """
-        discovered_edges = {}
+        discovered_edges: dict[EdgeId, Edge] = {}
+        relation_source_by_edge_id: dict[EdgeId, str] = {}
+        relation_from_key_by_edge_id: dict[EdgeId, bool] = {}
 
         for resource in self.ingestion_model.resources:
             # Collect all actors from the resource's ActorWrapper
@@ -371,17 +379,30 @@
                 # but allowing resource edges to supplement
                 if edge_id not in discovered_edges:
                     discovered_edges[edge_id] = edge
-
-        return discovered_edges
+                if actor.relation_field is not None:
+                    relation_source_by_edge_id[edge_id] = actor.relation_field
+                if actor.derivation.relation_from_key:
+                    relation_from_key_by_edge_id[edge_id] = True
+
+        return (
+            discovered_edges,
+            relation_source_by_edge_id,
+            relation_from_key_by_edge_id,
+        )
 
     @staticmethod
-    def _edge_label(edge) -> str | None:
+    def _edge_label(
+        edge: Edge,
+        *,
+        relation_source_field: str | None = None,
+        relation_from_key: bool = False,
+    ) -> str | None:
         """Build the human-readable edge label for plotting."""
         if edge.relation is not None:
             return edge.relation
-        if edge.relation_field is not None:
-            return f"[{edge.relation_field}]"
-        if edge.relation_from_key:
+        if relation_source_field is not None:
+            return f"[{relation_source_field}]"
+        if relation_from_key:
             return "[key]"
         return None
@@ -607,7 +628,7 @@ def plot_vc2fields(self):
         kwargs = {"vfield": True, "vertex_sh": vertex_prefix_dict}
         for k in vconf.vertex_set:
             index_fields = vconf.identity_fields(k)
-            fields = vconf.fields_names(k)
+            fields = vconf.property_names(k)
             kwargs["vertex"] = k
             nodes_collection = [
                 (
@@ -807,7 +828,11 @@ def plot_vc2vc(
         nodes: list[tuple[str, dict[str, object]]] = []
         rendered_edges: list[tuple] = []
 
-        discovered_edges = self._discover_edges_from_resources()
+        (
+            discovered_edges,
+            relation_source_by_edge_id,
+            relation_from_key_by_edge_id,
+        ) = self._discover_edges_from_resources()
         configured_edges = dict(self.schema.core_schema.edge_config.items())
         all_edges = self._merge_edges(configured_edges, discovered_edges)
@@ -825,8 +850,13 @@ def plot_vc2vc(
             )
         edge_pairs = [(source, target) for (source, target, _relation) in valid_edges]
 
-        for (source, target, _relation), edge in valid_edges.items():
-            label = self._edge_label(edge)
+        for edge_id, edge in valid_edges.items():
+            label = self._edge_label(
+                edge,
+                relation_source_field=relation_source_by_edge_id.get(edge_id),
+                relation_from_key=relation_from_key_by_edge_id.get(edge_id, False),
+            )
+            source, target = edge_id[0], edge_id[1]
             source_id = get_auxnode_id(AuxNodeType.VERTEX, vertex=source)
             target_id = get_auxnode_id(AuxNodeType.VERTEX, vertex=target)
             if label is None:
diff --git a/pyproject.toml b/pyproject.toml
index 89a7efad..0beeefb9 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -38,7 +38,7 @@ description = "A framework for transforming tabular (CSV, SQL) and hierarchical
 name = "graflo"
 readme = "README.md"
 requires-python = ">=3.11"
-version = "1.7.9"
+version = "1.7.10"
 
 [project.optional-dependencies]
 dev = [
diff --git a/test/architecture/_test_blank_vertices.py b/test/architecture/_test_blank_vertices.py
index 22dd1063..7c3deb12 100644
--- a/test/architecture/_test_blank_vertices.py
+++ b/test/architecture/_test_blank_vertices.py
@@ -63,10 +63,9 @@ def schema_ibes_edges():
            target: ticker
        - source: analyst
          target: agency
-          weights:
-            direct:
-              - datetime_review
-              - datetime_announce
+          properties:
+            - datetime_review
+            - datetime_announce
        - source: analyst
          target: publication
        - source: publication
diff --git a/test/architecture/conftest.py b/test/architecture/conftest.py
index 5acf6186..7e503a77 100644
--- a/test/architecture/conftest.py
+++ b/test/architecture/conftest.py
@@ -26,7 +26,7 @@ def vertex_pub():
        """
    name: publication
    dbname: publications
-    fields:
+    properties:
        - arxiv
        - doi
        - created
@@ -92,27 +92,6 @@ def vertex_helper_b():
     return tc
 
 
-@pytest.fixture()
-def edge_with_weights():
-    tc = yaml.safe_load(
-        """
-    source: analyst
-    target: agency
-    weights:
-        vertices:
-            -
-                name: ticker
-                fields:
-                    - cusip
-            -
-                fields:
-                    - datetime_review
-                    - datetime_announce
-    """
-    )
-    return tc
-
-
 @pytest.fixture()
 def vertex_config_kg():
     vc = yaml.safe_load(
@@ -120,7 +99,7 @@
    vertices:
        - name: publication
          dbname: publications
-          fields:
+          properties:
            - arxiv
            - doi
            - created
@@ -130,7 +109,7 @@
            - doi
        - name: entity
          dbname: entities
-          fields:
+          properties:
            - linker_type
            - ent_db_type
            - id
@@ -142,7 +121,7 @@
            - ent_type
        - name: mention
          dbname: mentions
-          fields:
+          properties:
            - text
          identity:
            - _key
@@ -195,7 +174,7 @@ def schema_vc_openalex():
    vertices:
        - name: author
          dbname: authors
-          fields:
+          properties:
            - _key
            - display_name
            - updated_date
@@ -203,7 +182,7 @@
            - _key
        - name: concept
          dbname: concepts
-          fields:
+          properties:
            - _key
            - wikidata
            - display_name
@@ -215,7 +194,7 @@
            - _key
        - name: institution
          dbname: institutions
-          fields:
+          properties:
            - _key
            - display_name
            - country
@@ -230,7 +209,7 @@
            - _key
        - name: source
          dbname: sources
-          fields:
+          properties:
            - _key
            - issn_l
            - type
@@ -242,7 +221,7 @@
            - _key
        - name: work
          dbname: works
-          fields:
+          properties:
            - _key
            - doi
            - title
@@ -313,13 +292,13 @@ def vertex_config_collision():
    vertex_config:
        vertices:
            - name: person
-              fields:
+              properties:
                - id
              indexes:
                - fields:
                    - id
            - name: company
-              fields:
+              properties:
                - id
              indexes:
                - fields:
@@ -366,13 +345,13 @@ def vertex_config_cross():
    vertex_config:
        vertices:
            - name: person
-              fields:
+              properties:
                - id
              indexes:
                - fields:
                    - id
            - name: company
-              fields:
+              properties:
                - name
              indexes:
                - fields:
@@ -398,7 +377,7 @@ def vc_openalex():
    vertices:
        - name: author
          dbname: authors
-          fields:
+          properties:
            - _key
            - display_name
          indexes:
@@ -406,7 +385,7 @@
            - _key
        - name: institution
          dbname: institutions
-          fields:
+          properties:
            - _key
            - display_name
            - country
@@ -455,10 +434,9 @@ def resource_openalex_authors():
            - _key
        - source: author
          target: institution
-          weights:
-            direct:
-              - updated_date
-              - created_date
+          properties:
+            - updated_date
+            - created_date
    """)
     return an
 
@@ -508,9 +486,8 @@ def resource_kg_menton_triple():
          target: mention
          match_source: triple_index
          match_target: triple
-          weights:
-            direct:
-              - _role
+          properties:
+            - _role
    """)
     return an
 
@@ -522,7 +499,7 @@ def vertex_config_kg_mention():
    vertices:
        - name: mention
          dbname: mentions
-          fields:
+          properties:
            - text
          identity:
            - _key
@@ -562,7 +539,7 @@ def vertex_key_property():
    vertex_config:
        vertices:
            - name: package
-              fields:
+              properties:
                - name
                - version
              indexes:
@@ -578,21 +555,21 @@ def schema_vc_deb():
     tc = yaml.safe_load("""
    vertices:
        - name: package
-          fields:
+          properties:
            - name
            - version
          indexes:
            - fields:
                - name
        - name: maintainer
-          fields:
+          properties:
            - name
            - email
          indexes:
            - fields:
                - email
        - name: bug
-          fields:
+          properties:
            - id
            - subject
            - severity
@@ -611,7 +588,7 @@ def vc_ticker():
    vertices:
        - name: ticker
          dbname: tickers
-          fields:
+          properties:
            - cusip
            - cname
            - oftic
@@ -622,7 +599,7 @@
            - oftic
        - name: feature
          dbname: features
-          fields:
+          properties:
            - name
            - value
          indexes:
@@ -649,13 +626,8 @@ def ec_ticker():
    edges:
        - source: ticker
          target: feature
-          weights:
-            direct:
-              - t_obs
-            vertices:
-              - name: feature
-                fields:
-                  - name
+          properties:
+            - t_obs
    """
     )
     return EdgeConfig.from_dict(tc)
@@ -668,7 +640,7 @@ def vc_ticker_filtered():
    vertices:
        - name: ticker
          dbname: tickers
-          fields:
+          properties:
            - cusip
            - cname
            - oftic
@@ -679,7 +651,7 @@
            - oftic
        - name: feature
          dbname: features
-          fields:
+          properties:
            - name
            - value
          indexes:
diff --git a/test/architecture/test_actor.py b/test/architecture/test_actor.py
index 2b0af914..6e3a7383 100644
--- a/test/architecture/test_actor.py
+++ b/test/architecture/test_actor.py
@@ -310,7 +310,7 @@ def test_find_descendants_vertex_by_from_doc(
 
 def test_explicit_format_pipeline_vertex_from_create_edge():
     """Pipeline with vertex(from)+create_edge."""
-    vc = VertexConfig.from_dict({"vertices": [{"name": "users", "fields": ["id"]}]})
+    vc = VertexConfig.from_dict({"vertices": [{"name": "users", "properties": ["id"]}]})
     pipeline = [
         {"vertex": "users", "from": {"id": "follower_id"}},
         {"vertex": "users", "from": {"id": "followed_id"}},
@@ -372,7 +372,7 @@ def test_transform_tuple_output_maps_to_vertex_index_fields_in_order():
             "vertices": [
                 {
                     "name": "pair",
-                    "fields": ["left", "right"],
+                    "properties": ["left", "right"],
                     "indexes": [{"fields": ["left", "right"]}],
                 }
             ]
@@ -720,7 +720,7 @@ def test_transform_target_keys_updates_vertex_fields():
             "vertices": [
                 {
                     "name": "entity",
-                    "fields": ["id", "label"],
+                    "properties": ["id", "label"],
                     "identity": ["id"],
                 }
             ]
@@ -766,7 +766,7 @@ def test_transform_target_keys_multiple_steps_compose_for_vertex():
             "vertices": [
                 {
                     "name": "entity",
-                    "fields": ["id", "label"],
+                    "properties": ["id", "label"],
                     "identity": ["id"],
                 }
             ]
@@ -1096,12 +1096,12 @@ def test_transform_payload_consumption_avoids_cross_vertex_self_edge():
             "vertices": [
                 {
                     "name": "author",
-                    "fields": ["id", "full_name", "hindex"],
+                    "properties": ["id", "full_name", "hindex"],
                     "identity": ["id"],
                 },
                 {
                     "name": "researchField",
-                    "fields": ["id", "name", "level"],
+                    "properties": ["id", "name", "level"],
                     "identity": ["id"],
                 },
             ]
@@ -1147,9 +1147,9 @@ def test_infer_edge_only_filters_greedy_edges():
     vc = VertexConfig.from_dict(
         {
             "vertices": [
-                {"name": "a", "fields": ["id"], "identity": ["id"]},
-                {"name": "b", "fields": ["id"], "identity": ["id"]},
-                {"name": "c", "fields": ["id"], "identity": ["id"]},
+                {"name": "a", "properties": ["id"], "identity": ["id"]},
+                {"name": "b", "properties": ["id"], "identity": ["id"]},
+                {"name": "c", "properties": ["id"], "identity": ["id"]},
             ]
         }
     )
@@ -1188,9 +1188,9 @@ def test_infer_edge_except_filters_greedy_edges():
     vc = VertexConfig.from_dict(
         {
             "vertices": [
-                {"name": "a", "fields": ["id"], "identity": ["id"]},
-                {"name": "b", "fields": ["id"], "identity": ["id"]},
-                {"name": "c", "fields": ["id"], "identity": ["id"]},
+                {"name": "a", "properties": ["id"], "identity": ["id"]},
+                {"name": "b", "properties": ["id"], "identity": ["id"]},
+                {"name": "c", "properties": ["id"], "identity": ["id"]},
             ]
         }
     )
diff --git a/test/architecture/test_csv_edge_weights.py b/test/architecture/test_csv_edge_weights.py
index 84970011..c2b176c4 100644
--- a/test/architecture/test_csv_edge_weights.py
+++ b/test/architecture/test_csv_edge_weights.py
@@ -51,3 +51,10 @@ def test_csv_edge_weights_one_edge_per_row(
     assert total_edges == len(relations_data_csv_edge_weights), (
         f"Expected {len(relations_data_csv_edge_weights)} edges (1 per row), got {total_edges}"
     )
+
+    # Direct weights (e.g. date) must be copied onto the edge payload. They can live on
+    # merged vertex docs after VertexActor passthrough, not only on VertexRep.ctx; null
+    # merge keys break Neo4j MERGE on relationship properties.
+    for edocs in graph.edges.values():
+        for _u, _v, w in edocs:
+            assert w.get("date") is not None
diff --git a/test/architecture/test_edge.py b/test/architecture/test_edge.py
index a85f1261..be883c26 100644
--- a/test/architecture/test_edge.py
+++ b/test/architecture/test_edge.py
@@ -1,12 +1,20 @@
 import logging
 
 import pytest
+from pydantic import ValidationError
 
-from graflo.architecture.schema.edge import Edge, EdgeConfig
+from graflo.architecture.schema.edge import (
+    DEFAULT_TIGERGRAPH_RELATION_WEIGHTNAME,
+    Edge,
+    EdgeConfig,
+)
 from graflo.architecture.database_features import DatabaseProfile
 from graflo.architecture.schema import EdgeConfigDBAware, VertexConfigDBAware
 from graflo.architecture.graph_types import Weight
 from graflo.architecture.schema.vertex import VertexConfig
+from graflo.architecture.contract.declarations.edge_derivation_registry import (
+    EdgeDerivationRegistry,
+)
 from graflo.onto import DBType
 
 logger = logging.getLogger(__name__)
@@ -17,10 +25,15 @@ def test_weight_config_b(vertex_helper_b):
     assert len(wc.fields) == 2
 
 
-def test_init_edge(edge_with_weights):
-    edge = Edge.from_dict(edge_with_weights)
-    assert edge.weights is not None and len(edge.weights.vertices) == 2
-    assert edge.identities == []
+def test_schema_edge_rejects_weights_key():
+    with pytest.raises(ValidationError):
+        Edge.from_dict(
+            {
+                "source": "analyst",
+                "target": "agency",
+                "weights": {"direct": ["x"]},
+            }
+        )
 
 
 def test_init_edge_with_explicit_identities():
@@ -29,7 +42,7 @@
             "source": "entity",
             "target": "entity",
             "identities": [["source", "target", "relation", "pub_id"]],
-            "weights": {"direct": ["pub_id"]},
+            "properties": ["pub_id"],
         }
     )
     assert edge.identities == [["source", "target", "relation", "pub_id"]]
@@ -46,7 +59,8 @@ def test_edge_rejects_legacy_indexes_field():
     )
 
 
-def test_edge_identities_require_declared_direct_fields(vertex_config_kg):
+def test_edge_identities_merge_undeclared_tokens_into_properties(vertex_config_kg):
+    """Identity fields not listed under ``properties`` are added like vertex identity."""
     vertex_config = VertexConfig.from_dict(vertex_config_kg)
     edge = Edge.from_dict(
         {
@@ -55,8 +69,97 @@
             "source": "entity",
             "target": "entity",
             "identities": [["source", "target", "relation", "pub_id"]],
         }
     )
-    with pytest.raises(ValueError, match="unknown identity fields"):
-        edge.finish_init(vertex_config)
+    edge.finish_init(vertex_config)
+    assert "pub_id" in edge.property_names
+    assert "relation" in edge.property_names
+
+
+def test_compile_identity_indexes_registers_each_identity_key(vertex_config_kg):
+    vertex_config = VertexConfig.from_dict(vertex_config_kg)
+    edge = Edge.from_dict(
+        {
+            "source": "entity",
+            "target": "entity",
+            "identities": [
+                ["source", "target", "pub_id"],
+                ["source", "target", "kind"],
+            ],
+        }
+    )
+    edge.finish_init(vertex_config)
+    profile = DatabaseProfile(db_flavor=DBType.ARANGO)
+    vc_db = VertexConfigDBAware(vertex_config, profile)
+    ec_db = EdgeConfigDBAware(EdgeConfig(edges=[edge]), vc_db, profile)
+    ec_db.compile_identity_indexes()
+    indexes = profile.edge_secondary_indexes(edge.edge_id)
+    field_sets = {tuple(ix.fields) for ix in indexes}
+    assert field_sets == {("_from", "_to", "pub_id"), ("_from", "_to", "kind")}
+    assert all(ix.unique for ix in indexes)
+
+
+def test_compile_identity_indexes_arango_prepends_from_to_when_identity_omits_endpoints(
+    vertex_config_kg,
+):
+    """Relationship-only identity tokens still get _from/_to on Arango unique indexes."""
+    vertex_config = VertexConfig.from_dict(vertex_config_kg)
+    edge = Edge.from_dict(
+        {
+            "source": "mention",
+            "target": "mention",
+            "identities": [["_role"]],
+        }
+    )
+    edge.finish_init(vertex_config)
+    profile = DatabaseProfile(db_flavor=DBType.ARANGO)
+    vc_db = VertexConfigDBAware(vertex_config, profile)
+    ec_db = EdgeConfigDBAware(EdgeConfig(edges=[edge]), vc_db, profile)
+    ec_db.compile_identity_indexes()
+    indexes = profile.edge_secondary_indexes(edge.edge_id)
+    assert len(indexes) == 1
+    assert tuple(indexes[0].fields) == ("_from", "_to", "_role")
+    assert indexes[0].unique is True
+
+
+def test_compile_identity_indexes_neo4j_property_indexes_not_globally_unique(
+    vertex_config_kg,
+):
+    """LPG backends cannot encode endpoints in rel property constraints; no bogus UNIQUE."""
+    vertex_config = VertexConfig.from_dict(vertex_config_kg)
+    edge = Edge.from_dict(
+        {
+            "source": "mention",
+            "target": "mention",
+            "identities": [["_role"]],
+        }
+    )
+    edge.finish_init(vertex_config)
+    profile = DatabaseProfile(db_flavor=DBType.NEO4J)
+    vc_db = VertexConfigDBAware(vertex_config, profile)
+    ec_db = EdgeConfigDBAware(EdgeConfig(edges=[edge]), vc_db, profile)
+    ec_db.compile_identity_indexes()
+    indexes = profile.edge_secondary_indexes(edge.edge_id)
+    assert len(indexes) == 1
+    assert tuple(indexes[0].fields) == ("_role",)
+    assert indexes[0].unique is False
+
+
+def test_relationship_merge_property_names_uses_first_identity_only(vertex_config_kg):
+    vertex_config = VertexConfig.from_dict(vertex_config_kg)
+    edge = Edge.from_dict(
+        {
+            "source": "entity",
+            "target": "entity",
+            "identities": [
+                ["source", "target", "relation", "pub_id"],
+                ["source", "target", "kind"],
+            ],
+        }
+    )
+    edge.finish_init(vertex_config)
+    profile = DatabaseProfile(db_flavor=DBType.NEO4J)
+    vc_db = VertexConfigDBAware(vertex_config, profile)
+    ec_db = EdgeConfigDBAware(EdgeConfig(edges=[edge]), vc_db, profile)
+    assert ec_db.relationship_merge_property_names(edge) == ["relation", "pub_id"]
 
 
 def test_edge_config(vertex_config_kg, edge_config_kg):
@@ -81,7 +184,7 @@ def test_edge_finish_init_is_idempotent(vertex_config_kg):
             "source": "entity",
             "target": "entity",
             "identities": [["source", "target", "relation", "pub_id"]],
-            "weights": {"direct": ["pub_id"]},
+            "properties": ["pub_id"],
         }
     )
     edge.finish_init(vertex_config)
@@ -101,13 +204,16 @@ def test_edge_finish_init_tigergraph_relation_artifacts_are_not_duplicated(
         {
             "source": "entity",
             "target": "entity",
-            "relation_from_key": True,
         }
     )
+    e.finish_init(vertex_config)
+    ec = EdgeConfig(edges=[e])
     db_features = DatabaseProfile(db_flavor=DBType.TIGERGRAPH)
     vc_db = VertexConfigDBAware(vertex_config, db_features)
-    ec_db = EdgeConfigDBAware(EdgeConfig(edges=[e]), vc_db, db_features)
+    overlay = EdgeDerivationRegistry()
+    overlay.mark_relation_from_key(e.edge_id)
+    ec_db = EdgeConfigDBAware(ec, vc_db, db_features, ingestion_overlay=overlay)
     first_weights = ec_db.effective_weights(e)
     second_weights = ec_db.effective_weights(e)
     first_direct_names = list(
@@ -120,6 +226,60 @@
     assert second_direct_names == first_direct_names
 
 
+def test_schema_edge_rejects_relation_field():
+    with pytest.raises(ValidationError):
+        Edge.from_dict(
+            {
+                "source": "entity",
+                "target": "entity",
+                "relation_field": "rel",
+            }
+        )
+
+
+def test_tigergraph_effective_weights_adds_default_relation_attr(vertex_config_kg):
+    vertex_config = VertexConfig.from_dict(vertex_config_kg)
+    edge = Edge.from_dict(
+        {
+            "source": "entity",
+            "target": "entity",
+            "properties": ["date"],
+        }
+    )
+    edge.finish_init(vertex_config)
+    db_features = DatabaseProfile(db_flavor=DBType.TIGERGRAPH)
+    vc_db = VertexConfigDBAware(vertex_config, db_features)
+    ec_db = EdgeConfigDBAware(EdgeConfig(edges=[edge]), vc_db, db_features)
+    w = ec_db.effective_weights(edge)
+    assert w is not None
+    assert DEFAULT_TIGERGRAPH_RELATION_WEIGHTNAME in w.direct_names
+    assert "date" in w.direct_names
+    rt = ec_db.runtime(edge)
+    assert rt.effective_relation_field == DEFAULT_TIGERGRAPH_RELATION_WEIGHTNAME
+    assert rt.store_extracted_relation_as_weight is True
+
+
+def test_tigergraph_runtime_fixed_relation_has_no_relation_attr(vertex_config_kg):
+    vertex_config = VertexConfig.from_dict(vertex_config_kg)
+    edge = Edge.from_dict(
+        {
+            "source": "entity",
+            "target": "entity",
+            "relation": "KNOWS",
+            "properties": ["date"],
+        }
+    )
+    edge.finish_init(vertex_config)
+    db_features = DatabaseProfile(db_flavor=DBType.TIGERGRAPH)
+    vc_db = VertexConfigDBAware(vertex_config, db_features)
+    ec_db = EdgeConfigDBAware(EdgeConfig(edges=[edge]), vc_db, db_features)
+    rt = ec_db.runtime(edge)
+    assert rt.effective_relation_field is None
+    assert rt.store_extracted_relation_as_weight is False
+    w = ec_db.effective_weights(edge)
+    assert w is not None and w.direct_names == ["date"]
+
+
 def test_relationship_merge_property_names_defaults_to_direct_weights(
     vertex_config_kg,
 ):
@@ -128,7 +288,7 @@
         {
             "source": "entity",
             "target": "entity",
-            "weights": {"direct": ["date", "relation"]},
+            "properties": ["date", "relation"],
         }
     )
     edge.finish_init(vertex_config)
@@ -147,7 +307,7 @@
             "source": "entity",
             "target": "entity",
             "identities": [["source", "target", "relation", "pub_id"]],
-            "weights": {"direct": ["pub_id"]},
+            "properties": ["pub_id"],
         }
     )
     edge.finish_init(vertex_config)
diff --git a/test/architecture/test_graph_target_namespace.py b/test/architecture/test_graph_target_namespace.py
new file mode 100644
index 00000000..c32fa0a3
--- /dev/null
+++ b/test/architecture/test_graph_target_namespace.py
@@ -0,0 +1,74 @@
+"""Tests for GraphEngine target LPG namespace resolution."""
+
+from graflo.architecture.contract.manifest import GraphManifest
+from graflo.db import ArangoConfig, MemgraphConfig, NebulaConfig, TigergraphConfig
+from graflo.hq.graph_engine import (
+    _ensure_graph_target_namespace,
+    _resolve_graph_target_namespace,
+)
+
+
+def _minimal_manifest(metadata_name: str = "logical_schema") -> GraphManifest:
+    return GraphManifest.from_config(
+        {
+            "schema": {
+                "metadata": {"name": metadata_name},
+                "core_schema": {
+                    "vertex_config": {
+                        "vertices": [
+                            {"name": "v", "properties": ["id"], "identity": ["id"]}
+                        ]
+                    },
+                    "edge_config": {"edges": []},
+                },
+            },
+            "ingestion_model": {"resources": []},
+        }
+    )
+
+
+def test_resolve_prefers_call_arg_then_profile_then_metadata() -> None:
+    manifest = _minimal_manifest("meta_only")
+    schema = manifest.require_schema()
+    assert _resolve_graph_target_namespace(schema, "call") == "call"
+    schema.db_profile.target_namespace = "profile"
+    assert _resolve_graph_target_namespace(schema, "call2") == "call2"
+    assert _resolve_graph_target_namespace(schema, None) == "profile"
+    schema.db_profile.target_namespace = None
+    assert _resolve_graph_target_namespace(schema, None) == "meta_only"
+
+
+def test_ensure_sets_arango_database_when_unset() -> None:
+    manifest = _minimal_manifest("mygraph")
+    schema = manifest.require_schema()
+    cfg = ArangoConfig(uri="http://localhost:8529", username="u", password="p")
+    assert cfg.database is None
+    _ensure_graph_target_namespace(schema, cfg, None)
+    assert cfg.database == "mygraph"
+
+
+def test_ensure_sets_tigergraph_schema_name_when_unset() -> None:
+    manifest = _minimal_manifest("tg_graph")
+    schema = manifest.require_schema()
+    cfg = TigergraphConfig(uri="http://localhost:14240")
+    assert cfg.schema_name is None
+    _ensure_graph_target_namespace(schema, cfg, None)
+    assert cfg.schema_name == "tg_graph"
+    assert cfg.database is None
+
+
+def test_ensure_sets_nebula_space_via_schema_name() -> None:
+    manifest = _minimal_manifest("space_a")
+    schema = manifest.require_schema()
+    cfg = NebulaConfig(uri="http://localhost:9669")
+    assert cfg.schema_name is None
+    _ensure_graph_target_namespace(schema, cfg, "space_explicit")
+    assert cfg.schema_name == "space_explicit"
+
+
+def test_ensure_does_not_overwrite_memgraph_database_when_set() -> None:
+    manifest = _minimal_manifest("would_clobber")
+    schema = manifest.require_schema()
+    cfg = MemgraphConfig(uri="bolt://localhost:7687", database="already_set")
+    _ensure_graph_target_namespace(schema, cfg, None)
+    assert cfg.database == "already_set"
diff --git a/test/architecture/test_manifest_canonical_contract.py b/test/architecture/test_manifest_canonical_contract.py
index d5ea3060..1a0ca40d 100644
--- a/test/architecture/test_manifest_canonical_contract.py
+++ b/test/architecture/test_manifest_canonical_contract.py
@@ -13,8 +13,8 @@ def _minimal_schema() -> Schema:
         "core_schema": {
             "vertex_config": {
                 "vertices": [
-                    {"name": "a", "fields": ["id"], "identity": ["id"]},
-                    {"name": "b", "fields": ["id"], "identity": ["id"]},
+                    {"name": "a", "properties": ["id"], "identity": ["id"]},
+                    {"name": "b", "properties": ["id"], "identity": ["id"]},
                 ]
             },
             "edge_config": {"edges": []},
@@ -30,7 +30,7 @@ def test_manifest_minimal_canonical_roundtrip_is_idempotent() -> None:
         "core_schema": {
             "vertex_config": {
                 "vertices": [
-                    {"name": "person", "fields": ["id"], "identity": ["id"]}
+                    {"name": "person", "properties": ["id"], "identity": ["id"]}
                 ]
             },
             "edge_config": {"edges": []},
diff --git a/test/architecture/test_objects_relations.py b/test/architecture/test_objects_relations.py
index 620a6de4..dcf19560 100644
--- a/test/architecture/test_objects_relations.py
+++ b/test/architecture/test_objects_relations.py
@@ -164,17 +164,17 @@ def _build_manifest_for_resource(
             "vertices": [
                 {
                     "name": "person",
-                    "fields": ["id"],
+                    "properties": ["id"],
                     "identity": ["id"],
                 },
                 {
                     "name": "vehicle",
-                    "fields": ["id"],
+                    "properties": ["id"],
                     "identity": ["id"],
                 },
                 {
                     "name": "institution",
-                    "fields": ["id"],
+                    "properties": ["id"],
                     "identity": ["id"],
                 },
             ]
diff --git a/test/architecture/test_onto.py b/test/architecture/test_onto.py
index 1793b508..5490d06b 100644
--- a/test/architecture/test_onto.py
+++ b/test/architecture/test_onto.py
@@ -1,5 +1,5 @@
 from graflo.architecture.pipeline.runtime.actor import ActorInitContext, ActorWrapper
-from graflo.architecture.schema.edge import EdgeConfig
+from graflo.architecture.schema.edge import Edge, EdgeConfig
 from graflo.architecture.pipeline.runtime.executor import ActorExecutor
 from graflo.architecture.graph_types import (
     AssemblyContext,
@@ -30,9 +30,7 @@ def test_extraction_context_record_helpers():
         vertex={"id": "a"},
         ctx={"full_name": "A"},
     )
-    ctx.record_edge_intent(
-        edge={"source": "author", "target": "paper"}, location=lindex
-    )
+    ctx.record_edge_intent(edge=Edge(source="author", target="paper"), location=lindex)
 
     assert len(ctx.transform_observations) == 1
     assert len(ctx.vertex_observations) == 1
@@ -50,7 +48,7 @@ def test_assembly_context_from_extraction_shares_vertex_accumulator():
 
 def test_actor_executor_assemble_result_returns_graph_result():
     vc = VertexConfig.from_dict(
-        {"vertices": [{"name": "author", "fields": ["id"], "identity": ["id"]}]}
+        {"vertices": [{"name": "author", "properties": ["id"], "identity": ["id"]}]}
     )
     ec = EdgeConfig.from_dict({"edges": []})
diff --git a/test/architecture/test_resource.py b/test/architecture/test_resource.py
index 154e06ae..5be48ea7 100644
--- a/test/architecture/test_resource.py
+++ b/test/architecture/test_resource.py
@@ -55,7 +55,7 @@ def test_resource_drop_trivial_input_fields_passes_stripped_doc_to_executor(
             "vertices": [
                 {
                     "name": "person",
-                    "fields": ["id", "note"],
+                    "properties": ["id", "note"],
                     "identity": ["id"],
                 }
             ]
@@ -93,7 +93,7 @@ def test_resource_drop_trivial_input_fields_false_passes_doc_unchanged(
             "vertices": [
                 {
                     "name": "person",
-                    "fields": ["id"],
+                    "properties": ["id"],
                     "identity": ["id"],
                 }
             ]
@@ -150,7 +150,7 @@ def test_resource_infer_edge_selector_references_unknown_edge():
         }
     )
     vc = VertexConfig.from_dict(
-        {"vertices": [{"name": "person", "fields": ["id"], "identity": ["id"]}]}
+        {"vertices": [{"name": "person", "properties": ["id"], "identity": ["id"]}]}
    )
     ec = EdgeConfig.from_dict({"edges": [{"source": "person", "target": "person"}]})
     with pytest.raises(ValueError, match="undefined vertices"):
@@ -168,7 +168,7 @@ def test_resource_dynamic_edge_vertices_must_be_declared():
         }
     )
     vc = VertexConfig.from_dict(
-        {"vertices": [{"name": "person", "fields": ["id"], "identity": ["id"]}]}
+        {"vertices": [{"name": "person", "properties": ["id"], "identity": ["id"]}]}
     )
     ec = EdgeConfig.from_dict({"edges": []})
 
@@ -200,9 +200,9 @@ def test_resource_infer_edge_except_excludes_edges_handled_by_edge_actors():
     vc = VertexConfig.from_dict(
         {
             "vertices": [
-                {"name": "a", "fields": ["id"], "identity": ["id"]},
-                {"name": "b", "fields": ["id"], "identity": ["id"]},
-                {"name": "c", "fields": ["id"], "identity": ["id"]},
+                {"name": "a", "properties": ["id"], "identity": ["id"]},
+                {"name": "b", "properties": ["id"], "identity": ["id"]},
+                {"name": "c", "properties": ["id"], "identity": ["id"]},
             ]
         }
     )
diff --git a/test/architecture/test_resource_dynamic.py b/test/architecture/test_resource_dynamic.py
index a3ce3349..157f7081 100644
--- a/test/architecture/test_resource_dynamic.py
+++ b/test/architecture/test_resource_dynamic.py
@@ -223,15 +223,15 @@ def _build_schema(self) -> Schema:
             "vertices": [
                 {
                     "name": "server",
-                    "fields": ["id", "class_name", "description"],
+                    "properties": ["id", "class_name", "description"],
                 },
                 {
                     "name": "database",
-                    "fields": ["id", "class_name", "description"],
+                    "properties": ["id", "class_name", "description"],
                 },
                 {
                     "name": "network",
-                    "fields": ["id", "class_name", "description"],
+                    "properties": ["id", "class_name", "description"],
                 },
             ],
         },
@@ -538,10 +538,10 @@ def test_vertex_router_with_vertex_from_map_maps_doc_fields_to_vertex_fields(
             "vertices": [
                 {
                     "name": "person",
-                    "fields": ["id", "name"],
+                    "properties": ["id", "name"],
                     "identity": ["id"],
                 },
-                {"name": "org", "fields": ["id", "name"], "identity": ["id"]},
+                {"name": "org", "properties": ["id", "name"], "identity": ["id"]},
             ],
         },
         edge_config={"edges": []},
@@ -593,7 +593,7 @@ def test_vertex_router_with_transform_consumes_transform_output(self):
             "vertices": [
                 {
                     "name": "item",
-                    "fields": ["id", "label"],
+                    "properties": ["id", "label"],
                     "identity": ["id"],
                 },
             ],
diff --git a/test/architecture/test_resource_filters.py
b/test/architecture/test_resource_filters.py index 0e21ad0c..67ca5112 100644 --- a/test/architecture/test_resource_filters.py +++ b/test/architecture/test_resource_filters.py @@ -289,8 +289,8 @@ def _make_schema_and_patterns(self): "core_schema": { "vertex_config": { "vertices": [ - {"name": "server", "fields": ["id", "class_name"]}, - {"name": "database", "fields": ["id", "class_name"]}, + {"name": "server", "properties": ["id", "class_name"]}, + {"name": "database", "properties": ["id", "class_name"]}, ], }, "edge_config": { diff --git a/test/architecture/test_schema_config_contract.py b/test/architecture/test_schema_config_contract.py index 5d0f5984..2f5801ac 100644 --- a/test/architecture/test_schema_config_contract.py +++ b/test/architecture/test_schema_config_contract.py @@ -8,7 +8,7 @@ def _minimal_graph() -> dict: return { "vertex_config": { - "vertices": [{"name": "person", "fields": ["id"], "identity": ["id"]}] + "vertices": [{"name": "person", "properties": ["id"], "identity": ["id"]}] }, "edge_config": {"edges": []}, } @@ -117,7 +117,7 @@ def test_schema_rejects_edges_with_undefined_vertices(): "core_schema": { "vertex_config": { "vertices": [ - {"name": "user", "fields": ["id"], "identity": ["id"]}, + {"name": "user", "properties": ["id"], "identity": ["id"]}, ] }, "edge_config": { diff --git a/test/architecture/test_vertex.py b/test/architecture/test_vertex.py index e030b46f..20493a75 100644 --- a/test/architecture/test_vertex.py +++ b/test/architecture/test_vertex.py @@ -83,31 +83,28 @@ def test_field_dict_membership(): def test_vertex_with_string_fields_backward_compatible(): """Test Vertex creation with list of strings (backward compatible).""" - vertex = Vertex(name="user", fields=["id", "name", "email"]) # type: ignore[arg-type] + vertex = Vertex(name="user", properties=["id", "name", "email"]) # type: ignore[arg-type] - assert len(vertex.fields) == 3 - assert all(isinstance(f, Field) for f in vertex.fields) - assert vertex.fields[0].name == "id" 
- assert vertex.fields[0].type is None # Defaults to None - assert vertex.fields[1].name == "name" - assert vertex.fields[2].name == "email" + assert len(vertex.properties) == 3 + assert all(isinstance(f, Field) for f in vertex.properties) + assert vertex.properties[0].name == "id" + assert vertex.properties[0].type is None # Defaults to None + assert vertex.properties[1].name == "name" + assert vertex.properties[2].name == "email" - # field_names property - assert vertex.field_names == ["id", "name", "email"] + assert vertex.property_names == ["id", "name", "email"] # Fields work in sets - fields_set = set(vertex.fields) + fields_set = set(vertex.properties) assert len(fields_set) == 3 def test_vertex_with_string_fields_dict_compatibility(): - """Test that field_names property works for dict lookups (critical for backward compatibility).""" - vertex = Vertex(name="user", fields=["id", "name"]) # type: ignore[arg-type] + """Test that property_names works for dict lookups.""" + vertex = Vertex(name="user", properties=["id", "name"]) # type: ignore[arg-type] test_dict = {"id": 1, "name": "John", "other": "ignored"} - # This is the clean usage pattern from actor_util.py - # Use field_names property directly - much cleaner than str(f) - result = {f: test_dict[f] for f in vertex.field_names if f in test_dict} + result = {f: test_dict[f] for f in vertex.property_names if f in test_dict} assert result == {"id": 1, "name": "John"} @@ -119,13 +116,13 @@ def test_vertex_with_field_objects(): Field(name="age", type=FieldType.INT), Field(name="active", type=FieldType.BOOL), ] - vertex = Vertex(name="user", fields=fields) + vertex = Vertex(name="user", properties=fields) - assert len(vertex.fields) == 4 - assert vertex.fields[0].name == "id" - assert vertex.fields[0].type == FieldType.INT - assert vertex.fields[1].type == FieldType.STRING - assert vertex.fields[3].type == FieldType.BOOL + assert len(vertex.properties) == 4 + assert vertex.properties[0].name == "id" + assert 
vertex.properties[0].type == FieldType.INT + assert vertex.properties[1].type == FieldType.STRING + assert vertex.properties[3].type == FieldType.BOOL def test_vertex_with_dict_fields(): @@ -135,14 +132,14 @@ def test_vertex_with_dict_fields(): {"name": "name", "type": "STRING"}, {"name": "email"}, # No type specified, defaults to None ] - vertex = Vertex(name="user", fields=fields) # type: ignore[arg-type] + vertex = Vertex(name="user", properties=fields) # type: ignore[arg-type] - assert len(vertex.fields) == 3 - assert vertex.fields[0].name == "id" - assert vertex.fields[0].type == FieldType.INT - assert vertex.fields[1].type == FieldType.STRING - assert vertex.fields[2].name == "email" - assert vertex.fields[2].type is None + assert len(vertex.properties) == 3 + assert vertex.properties[0].name == "id" + assert vertex.properties[0].type == FieldType.INT + assert vertex.properties[1].type == FieldType.STRING + assert vertex.properties[2].name == "email" + assert vertex.properties[2].type is None def test_vertex_mixed_field_inputs(): @@ -152,72 +149,66 @@ def test_vertex_mixed_field_inputs(): Field(name="name", type=FieldType.STRING), # Field object {"name": "email", "type": "STRING"}, # dict ] - vertex = Vertex(name="user", fields=fields) # type: ignore[arg-type] + vertex = Vertex(name="user", properties=fields) # type: ignore[arg-type] - assert len(vertex.fields) == 3 - assert all(isinstance(f, Field) for f in vertex.fields) - assert vertex.fields[0].name == "id" - assert vertex.fields[0].type is None - assert vertex.fields[1].name == "name" - assert vertex.fields[1].type == FieldType.STRING - assert vertex.fields[2].name == "email" - assert vertex.fields[2].type == FieldType.STRING + assert len(vertex.properties) == 3 + assert all(isinstance(f, Field) for f in vertex.properties) + assert vertex.properties[0].name == "id" + assert vertex.properties[0].type is None + assert vertex.properties[1].name == "name" + assert vertex.properties[1].type == 
FieldType.STRING + assert vertex.properties[2].name == "email" + assert vertex.properties[2].type == FieldType.STRING -def test_vertex_config_fields_backward_compatible(): - """Test VertexConfig.fields_names() method returns names (backward compatible).""" - vertex = Vertex(name="user", fields=["id", "name", "email"]) # type: ignore[arg-type] +def test_vertex_config_property_names(): + """Test VertexConfig.property_names() returns string names.""" + vertex = Vertex(name="user", properties=["id", "name", "email"]) # type: ignore[arg-type] config = VertexConfig(vertices=[vertex]) - # fields_names() returns names (strings) for backward compatibility - # Order may vary, so check membership and length - fields = config.fields_names("user") - assert len(fields) == 3 - assert all(isinstance(f, str) for f in fields) - assert set(fields) == {"id", "name", "email"} - # Check that order is preserved from original fields - assert fields == ["id", "name", "email"] + names = config.property_names("user") + assert len(names) == 3 + assert all(isinstance(f, str) for f in names) + assert set(names) == {"id", "name", "email"} + assert names == ["id", "name", "email"] -def test_vertex_config_fields_with_objects(): - """Test VertexConfig.fields() returns Field objects, fields_names() returns strings.""" +def test_vertex_config_properties_with_objects(): + """Test VertexConfig.properties() returns Field objects; property_names() returns strings.""" vertex = Vertex( name="user", - fields=[ + properties=[ Field(name="id", type=FieldType.INT), Field(name="name", type=FieldType.STRING), ], ) config = VertexConfig(vertices=[vertex]) - # fields() returns Field objects - fields = config.fields("user") - assert len(fields) == 2 - assert all(isinstance(f, Field) for f in fields) - assert fields[0].type == FieldType.INT - assert fields[1].type == FieldType.STRING + props = config.properties("user") + assert len(props) == 2 + assert all(isinstance(f, Field) for f in props) + assert props[0].type 
== FieldType.INT + assert props[1].type == FieldType.STRING - # fields_names() returns strings - field_names = config.fields_names("user") - assert field_names == ["id", "name"] + assert config.property_names("user") == ["id", "name"] def test_vertex_from_dict_with_string_fields(): """Test Vertex.from_dict() with string fields (backward compatible).""" - vertex_dict = {"name": "user", "fields": ["id", "name", "email"]} + vertex_dict = {"name": "user", "properties": ["id", "name", "email"]} vertex = Vertex.from_dict(vertex_dict) assert vertex.name == "user" - assert len(vertex.fields) == 3 - assert all(isinstance(f, Field) for f in vertex.fields) - assert all(f.type is None for f in vertex.fields) + assert len(vertex.properties) == 3 + assert all(isinstance(f, Field) for f in vertex.properties) + assert all(f.type is None for f in vertex.properties) def test_vertex_from_dict_with_typed_fields(): """Test Vertex.from_dict() with typed fields in dict.""" vertex_dict = { "name": "user", - "fields": [ + "properties": [ {"name": "id", "type": "INT"}, {"name": "name", "type": "STRING"}, {"name": "email"}, @@ -226,17 +217,17 @@ def test_vertex_from_dict_with_typed_fields(): vertex = Vertex.from_dict(vertex_dict) assert vertex.name == "user" - assert len(vertex.fields) == 3 - assert vertex.fields[0].type == FieldType.INT - assert vertex.fields[1].type == FieldType.STRING - assert vertex.fields[2].type is None + assert len(vertex.properties) == 3 + assert vertex.properties[0].type == FieldType.INT + assert vertex.properties[1].type == FieldType.STRING + assert vertex.properties[2].type is None def test_vertex_identity_defaults_to_fields(): """Test that identity defaults to all fields when not specified.""" vertex = Vertex( name="user", - fields=[ + properties=[ Field(name="id", type=FieldType.INT), Field(name="email", type=FieldType.STRING), ], @@ -245,23 +236,23 @@ def test_vertex_identity_defaults_to_fields(): assert vertex.identity == ["id", "email"] # Field objects should 
still be accessible - assert len(vertex.fields) == 2 - assert vertex.fields[0].type == FieldType.INT + assert len(vertex.properties) == 2 + assert vertex.properties[0].type == FieldType.INT def test_vertex_with_explicit_identity(): """Test vertex with explicit identity fields.""" vertex = Vertex( name="user", - fields=["id", "name", "email"], # type: ignore[arg-type] + properties=["id", "name", "email"], # type: ignore[arg-type] identity=["id", "email"], ) assert vertex.identity == ["id", "email"] - field_names = vertex.field_names - assert "id" in field_names - assert "name" in field_names - assert "email" in field_names + names = vertex.property_names + assert "id" in names + assert "name" in names + assert "email" in names def test_field_all_types(): @@ -276,7 +267,7 @@ def test_invalid_field_type_in_dict(): with pytest.raises(ValueError, match="not allowed"): Vertex( name="user", - fields=[{"name": "test", "type": "INVALID"}], # type: ignore[arg-type] + properties=[{"name": "test", "type": "INVALID"}], # type: ignore[arg-type] ) @@ -285,17 +276,17 @@ def test_init(vertex_pub): vc = Vertex.from_dict(vertex_pub) assert vc.identity == ["arxiv", "doi", "created", "data_source"] # Fields are now Field objects, so check count - assert len(vc.fields) == 4 + assert len(vc.properties) == 4 # Verify they're Field objects - assert all(isinstance(f, Field) for f in vc.fields) + assert all(isinstance(f, Field) for f in vc.properties) -def test_get_fields_with_defaults_tigergraph(): - """Test DB-aware vertex fields default None types to STRING for TigerGraph.""" +def test_get_properties_with_defaults_tigergraph(): + """DB-aware vertex properties default None types to STRING for TigerGraph.""" # Create vertex with some fields that have None type vertex = Vertex( name="user", - fields=[ # type: ignore[arg-type] + properties=[ # type: ignore[arg-type] Field(name="id", type=FieldType.INT), # Already has type Field(name="name"), # None type Field(name="email", 
type=FieldType.STRING), # Already has type @@ -308,23 +299,23 @@ def test_get_fields_with_defaults_tigergraph(): logical=config, database_features=DatabaseProfile(db_flavor=DBType.TIGERGRAPH), ) - fields = db_cfg.fields("user") - assert len(fields) == 4 - assert fields[0].name == "id" - assert fields[0].type == "INT" - assert fields[1].name == "name" - assert fields[1].type == "STRING" # Default applied - assert fields[2].name == "email" - assert fields[2].type == "STRING" - assert fields[3].name == "address" - assert fields[3].type == "STRING" # Default applied - - -def test_get_fields_with_defaults_other_db(): - """Test DB-aware vertex fields preserve None types for non-TigerGraph DBs.""" + props = db_cfg.properties("user") + assert len(props) == 4 + assert props[0].name == "id" + assert props[0].type == "INT" + assert props[1].name == "name" + assert props[1].type == "STRING" # Default applied + assert props[2].name == "email" + assert props[2].type == "STRING" + assert props[3].name == "address" + assert props[3].type == "STRING" # Default applied + + +def test_get_properties_with_defaults_other_db(): + """DB-aware vertex properties preserve None types for non-TigerGraph DBs.""" vertex = Vertex( name="user", - fields=[ + properties=[ Field(name="id", type=FieldType.INT), Field(name="name"), # None type ], @@ -335,70 +326,67 @@ def test_get_fields_with_defaults_other_db(): logical=config, database_features=DatabaseProfile(db_flavor=DBType.ARANGO), ) - fields = db_cfg.fields("user") - assert len(fields) == 2 - assert fields[0].type == "INT" - assert fields[1].name == "name" - assert fields[1].type is None # Preserved + props = db_cfg.properties("user") + assert len(props) == 2 + assert props[0].type == "INT" + assert props[1].name == "name" + assert props[1].type is None # Preserved db_cfg = VertexConfigDBAware( logical=config, database_features=DatabaseProfile(db_flavor=DBType.NEO4J), ) - fields = db_cfg.fields("user") - assert fields[1].type is None # Preserved 
+ props = db_cfg.properties("user") + assert props[1].type is None # Preserved -def test_get_fields_with_defaults_none(): - """Test logical vertex fields are preserved by default.""" +def test_get_properties_with_defaults_none(): + """Logical vertex properties are preserved by default.""" vertex = Vertex( name="user", - fields=[ + properties=[ Field(name="id", type=FieldType.INT), Field(name="name"), # None type ], ) - fields = vertex.get_fields() - assert len(fields) == 2 - assert fields[0].type == "INT" - assert fields[1].type is None # Preserved + props = vertex.get_properties() + assert len(props) == 2 + assert props[0].type == "INT" + assert props[1].type is None # Preserved -def test_vertex_config_fields_with_db_flavor(): - """Test DB-aware VertexConfig wrapper applies DB-specific defaults.""" +def test_vertex_config_properties_with_db_flavor(): + """DB-aware VertexConfig wrapper applies DB-specific defaults.""" vertex = Vertex( name="user", - fields=[ + properties=[ Field(name="id", type=FieldType.INT), Field(name="name"), # None type ], ) config = VertexConfig(vertices=[vertex]) - # With ArangoDB, None types should remain None - fields = config.fields("user") - assert fields[1].type is None # Preserved + props = config.properties("user") + assert props[1].type is None # Preserved db_cfg = VertexConfigDBAware( logical=config, database_features=DatabaseProfile(db_flavor=DBType.TIGERGRAPH), ) - fields = db_cfg.fields("user") - assert len(fields) == 2 - assert fields[0].type == "INT" - assert fields[1].type == "STRING" # Default applied + props = db_cfg.properties("user") + assert len(props) == 2 + assert props[0].type == "INT" + assert props[1].type == "STRING" # Default applied - # fields_names() returns strings - field_names = config.fields_names("user") - assert field_names == ["id", "name"] + assert config.property_names("user") == ["id", "name"] def test_vertex_config_remove_vertices(): """Test VertexConfig.remove_vertices removes vertices and updates 
blank_vertices.""" - v1 = Vertex.from_dict({"name": "a", "fields": ["id"]}) - v2 = Vertex.from_dict({"name": "b", "fields": ["id"]}) - v3 = Vertex.from_dict({"name": "c", "fields": ["id"]}) + v1 = Vertex.from_dict({"name": "a", "properties": ["id"]}) + v2 = Vertex.from_dict({"name": "b", "properties": ["id"]}) + v3 = Vertex.from_dict({"name": "c", "properties": ["id"]}) config = VertexConfig( vertices=[v1, v2, v3], blank_vertices=["b"], diff --git a/test/architecture/test_weights.py b/test/architecture/test_weights.py index 867080e0..844f0604 100644 --- a/test/architecture/test_weights.py +++ b/test/architecture/test_weights.py @@ -3,7 +3,7 @@ from graflo.architecture.pipeline.runtime.actor import ActorInitContext, ActorWrapper from graflo.architecture.schema.edge import EdgeConfig -from graflo.architecture.schema.edge import WeightConfig +from graflo.architecture.schema.db_aware import WeightConfig from graflo.architecture.graph_types import ActionContext from graflo.architecture.schema.vertex import Field, FieldType @@ -135,8 +135,8 @@ def test_weight_config_direct_names_property(): assert all(isinstance(n, str) for n in names) -def test_weight_config_direct_backward_compatibility(): - """Test that Field objects in direct behave like strings for backward compatibility.""" +def test_weight_config_direct_field_string_like_behavior(): + """Field objects in WeightConfig.direct support iteration and str-like use.""" wc = WeightConfig(direct=["date", "weight"]) # type: ignore[arg-type] # Test iteration (used in actor_util.py) diff --git a/test/config/schema/csv-edge-weights.yaml b/test/config/schema/csv-edge-weights.yaml index 96b783fa..387c2227 100644 --- a/test/config/schema/csv-edge-weights.yaml +++ b/test/config/schema/csv-edge-weights.yaml @@ -5,7 +5,7 @@ schema: vertex_config: vertices: - name: company - fields: + properties: - name identity: - name @@ -13,10 +13,8 @@ schema: edges: - source: company target: company - relation_field: relation - weights: - direct: 
- - date + properties: + - date db_profile: {} ingestion_model: resources: @@ -28,4 +26,7 @@ ingestion_model: - vertex: company from: name: company_b + - source: company + target: company + relation_field: relation bindings: {} diff --git a/test/config/schema/debian-eco.yaml b/test/config/schema/debian-eco.yaml index 32a5a138..74a0102b 100644 --- a/test/config/schema/debian-eco.yaml +++ b/test/config/schema/debian-eco.yaml @@ -6,19 +6,19 @@ schema: vertex_config: vertices: - name: package - fields: + properties: - name - version identity: - name - name: maintainer - fields: + properties: - name - email identity: - email - name: bug - fields: + properties: - id - subject - severity diff --git a/test/config/schema/ibes.yaml b/test/config/schema/ibes.yaml index d399739c..0bbcefd1 100644 --- a/test/config/schema/ibes.yaml +++ b/test/config/schema/ibes.yaml @@ -7,13 +7,13 @@ schema: - publication vertices: - name: publication - fields: + properties: - datetime_review - datetime_announce identity: - _key - name: ticker - fields: + properties: - cusip - cname - oftic @@ -22,19 +22,19 @@ schema: - cname - oftic - name: agency - fields: + properties: - aname identity: - aname - name: analyst - fields: + properties: - last_name - initial identity: - last_name - initial - name: recommendation - fields: + properties: - erec - etext - irec @@ -50,13 +50,9 @@ schema: target: ticker - source: analyst target: agency - weights: - vertices: - - name: publication - keep_vertex_name: false - fields: - - datetime_review - - datetime_announce + properties: + - datetime_review + - datetime_announce - source: analyst target: publication - source: publication @@ -129,4 +125,12 @@ ingestion_model: IRECCD: irec ITEXT: itext - vertex: publication + - source: analyst + target: agency + vertex_weights: + - name: publication + keep_vertex_name: false + fields: + - datetime_review + - datetime_announce bindings: {} diff --git a/test/config/schema/kg.yaml b/test/config/schema/kg.yaml index 
192b392b..5e2b071a 100644 --- a/test/config/schema/kg.yaml +++ b/test/config/schema/kg.yaml @@ -6,16 +6,14 @@ schema: vertex_config: vertices: - name: publication - fields: - - arxiv - - doi + properties: - created - data_source identity: - arxiv - doi - name: entity - fields: + properties: - linker_type - ent_db_type - id @@ -25,12 +23,12 @@ schema: identity: - _key - name: mention - fields: + properties: - text identity: - _key - name: community - fields: + properties: - obs_date - t_window - comm_id @@ -47,13 +45,16 @@ schema: - - source - target - publication@_id - weights: - direct: - - publication@_id - - obs_date - - t_window + properties: + - publication@_id + - obs_date + - t_window - source: mention target: entity + - source: mention + target: mention + identities: + - - _role - source: entity target: community - source: community @@ -126,14 +127,6 @@ schema: - fields: - publication@_id unique: false - - source: mention - target: mention - purpose: - indexes: - - fields: - - _from - - _to - - _role ingestion_model: resources: - name: kg @@ -190,9 +183,8 @@ ingestion_model: match_source: triple_index target: mention match_target: triple - weights: - direct: - - _role + properties: + - _role - name: communities apply: - source: entity diff --git a/test/config/schema/oa-institution.yaml b/test/config/schema/oa-institution.yaml index 49d2048e..05234a17 100644 --- a/test/config/schema/oa-institution.yaml +++ b/test/config/schema/oa-institution.yaml @@ -5,7 +5,7 @@ schema: vertex_config: vertices: - name: institution - fields: + properties: - _key - display_name - country diff --git a/test/config/schema/oa.yaml b/test/config/schema/oa.yaml index a4f032aa..2ac0508e 100644 --- a/test/config/schema/oa.yaml +++ b/test/config/schema/oa.yaml @@ -5,17 +5,17 @@ schema: vertex_config: vertices: - name: author - fields: + properties: - _key identity: - _key - name: institution - fields: + properties: - _key identity: - _key - name: work - fields: + properties: - _key - 
doi identity: @@ -50,10 +50,9 @@ ingestion_model: edge: source: author target: institution - weights: - direct: - - updated_date - - created_date + properties: + - updated_date + - created_date - name: works root: children: diff --git a/test/config/schema/objects-relations.yaml b/test/config/schema/objects-relations.yaml index 2cff620c..e5535dea 100644 --- a/test/config/schema/objects-relations.yaml +++ b/test/config/schema/objects-relations.yaml @@ -5,7 +5,7 @@ schema: vertex_config: vertices: - name: person - fields: + properties: - id - name - age @@ -16,7 +16,7 @@ schema: identity: - id - name: vehicle - fields: + properties: - id - name - license_plate @@ -27,7 +27,7 @@ schema: identity: - id - name: institution - fields: + properties: - id - name - email diff --git a/test/config/schema/review-tigergraph-edges.yaml b/test/config/schema/review-tigergraph-edges.yaml index 991b90ae..1f5573b5 100644 --- a/test/config/schema/review-tigergraph-edges.yaml +++ b/test/config/schema/review-tigergraph-edges.yaml @@ -5,14 +5,14 @@ schema: vertex_config: vertices: - name: author - fields: + properties: - id - full_name - hindex identity: - id - name: researchField - fields: + properties: - id - name - level diff --git a/test/config/schema/review-tigergraph.yaml b/test/config/schema/review-tigergraph.yaml index f7a54820..0bcfb0f4 100644 --- a/test/config/schema/review-tigergraph.yaml +++ b/test/config/schema/review-tigergraph.yaml @@ -5,14 +5,14 @@ schema: vertex_config: vertices: - name: author - fields: + properties: - id - full_name - hindex identity: - id - name: researchField - fields: + properties: - id - name - level diff --git a/test/config/schema/review.yaml b/test/config/schema/review.yaml index 30966d64..e59fb8a5 100644 --- a/test/config/schema/review.yaml +++ b/test/config/schema/review.yaml @@ -5,7 +5,7 @@ schema: vertex_config: vertices: - name: author - fields: + properties: - id - full_name - hindex @@ -13,7 +13,7 @@ schema: - id - full_name - name: 
researchField - fields: + properties: - id - name - level diff --git a/test/config/schema/ticker.yaml b/test/config/schema/ticker.yaml index 40b8c90a..89ef7283 100644 --- a/test/config/schema/ticker.yaml +++ b/test/config/schema/ticker.yaml @@ -5,7 +5,7 @@ schema: vertex_config: vertices: - name: ticker - fields: + properties: - cusip - cname - oftic @@ -14,7 +14,7 @@ schema: - cname - oftic - name: feature - fields: + properties: - name - value identity: @@ -43,13 +43,8 @@ schema: edges: - source: ticker target: feature - weights: - direct: - - t_obs - vertices: - - name: feature - fields: - - name + properties: + - t_obs db_profile: vertex_storage_names: ticker: tickers @@ -120,4 +115,14 @@ ingestion_model: - transform: rename: __ticker: oftic + - vertex: ticker + - vertex: feature + - source: ticker + target: feature + properties: + - t_obs + vertex_weights: + - name: feature + fields: + - name bindings: {} diff --git a/test/config/schema/tigergraph-sanitize-edges.yaml b/test/config/schema/tigergraph-sanitize-edges.yaml index 8cdb4eb6..ec416f67 100644 --- a/test/config/schema/tigergraph-sanitize-edges.yaml +++ b/test/config/schema/tigergraph-sanitize-edges.yaml @@ -5,19 +5,19 @@ schema: vertex_config: vertices: - name: parcel - fields: + properties: - name: id type: INT identity: - id - name: box - fields: + properties: - name: id type: INT identity: - id - name: container - fields: + properties: - name identity: - name diff --git a/test/config/schema/tigergraph-sanitize.yaml b/test/config/schema/tigergraph-sanitize.yaml index 3b073e6d..e5618ecc 100644 --- a/test/config/schema/tigergraph-sanitize.yaml +++ b/test/config/schema/tigergraph-sanitize.yaml @@ -5,7 +5,7 @@ schema: vertex_config: vertices: - name: package - fields: + properties: - name: id type: INT - name: SELECT @@ -19,7 +19,7 @@ schema: identity: - id - name: users - fields: + properties: - name: id type: INT - name: name diff --git a/test/conftest.py b/test/conftest.py index cd8b1c85..58b364c2 
100644 --- a/test/conftest.py +++ b/test/conftest.py @@ -262,100 +262,19 @@ def vertex_config_transform_collision(): - name: person dbname: people - fields: + properties: - id - name - name: pet dbname: pets - fields: + properties: - name """ ) return vc -@pytest.fixture() -def resource_with_dynamic_relations(): - vc = yaml.safe_load( - """ - general: - name: openalex - resources: - tree_likes: - - name: institutions - root: - children: - - type: vertex - name: institution - transforms: - - transform: - call: - use: keep_suffix_id - - transform: - call: - use: keep_suffix_id - input: - - ror - output: - - ror - - key: associated_institutions - children: - - type: vertex - name: institution - transforms: - - transform: - call: - use: keep_suffix_id - - type: edge - edge: - source: institution - target: institution - # relation: - vertex_config: - vertices: - - name: institution - dbname: institutions - fields: - - _key - - display_name - - country - - type - - ror - - grid - - wikidata - - mag - - created_date - - updated_date - indexes: - - fields: - - _key - - unique: false - type: fulltext - fields: - - display_name - - unique: false - fields: - - type - edge_config: - edges: [] - transforms: - - name: keep_suffix_id - foo: split_keep_part - module: graflo.util.transform - params: - sep: "/" - keep: -1 - input: - - id - output: - - _key - - """ - ) - return vc - - @pytest.fixture() def resource_openalex_works(): return yaml.safe_load(""" @@ -528,7 +447,15 @@ def resource_ticker(): rename: ticker: oftic - vertex: ticker - - vertex: feature + - vertex: feature + - source: ticker + target: feature + properties: + - t_obs + vertex_weights: + - name: feature + fields: + - name """) diff --git a/test/data_source/sparql/test_rdf_inference.py b/test/data_source/sparql/test_rdf_inference.py index 27ca8628..0809e875 100644 --- a/test/data_source/sparql/test_rdf_inference.py +++ b/test/data_source/sparql/test_rdf_inference.py @@ -25,13 +25,13 @@ def 
test_infer_schema_fields(self, sample_ontology_path: Path): mgr = RdfInferenceManager() schema, _ = mgr.infer_schema(sample_ontology_path, schema_name="test_rdf") - person_fields = schema.core_schema.vertex_config.fields_names("Person") + person_fields = schema.core_schema.vertex_config.property_names("Person") assert "name" in person_fields assert "age" in person_fields assert "_key" in person_fields assert "_uri" in person_fields - org_fields = schema.core_schema.vertex_config.fields_names("Organization") + org_fields = schema.core_schema.vertex_config.property_names("Organization") assert "orgName" in org_fields assert "founded" in org_fields diff --git a/test/db/connection/test_onto.py b/test/db/connection/test_onto.py index c056e8e6..29c578f6 100644 --- a/test/db/connection/test_onto.py +++ b/test/db/connection/test_onto.py @@ -176,7 +176,7 @@ def test_from_env_with_bolt_port(self, monkeypatch): monkeypatch.setenv("USER_NEO4J_USERNAME", "admin") # Load config with prefix - # Note: from_env with prefix returns a dynamically created subclass, but it has all Neo4jConfig attributes + # Note: from_env with prefix returns a dynamically created subclass that still exposes all Neo4jConfig attributes config = Neo4jConfig.from_env(prefix="USER") # Verify BOLT_PORT and username are read correctly diff --git a/test/db/nebulas/conftest.py b/test/db/nebulas/conftest.py index d3ebcda0..e25050b2 100644 --- a/test/db/nebulas/conftest.py +++ b/test/db/nebulas/conftest.py @@ -29,14 +29,14 @@ "vertices": [ { "name": "Person", - "fields": [ + "properties": [ {"name": "name", "type": "STRING"}, {"name": "age", "type": "INT"}, ], }, { "name": "City", - "fields": [ + "properties": [ {"name": "name", "type": "STRING"}, {"name": "population", "type": "INT"}, ], @@ -49,15 +49,11 @@ "source": "Person", "target": "City", "relation": "lives_in", - "match_source": "name", - "match_target": "name", }, { "source": "Person", "target": "Person", "relation": "knows", - "match_source": "name", -
"match_target": "name", }, ], }, diff --git a/test/db/postgres/test_resource_mapping.py b/test/db/postgres/test_resource_mapping.py index fd19a58a..d74738e8 100644 --- a/test/db/postgres/test_resource_mapping.py +++ b/test/db/postgres/test_resource_mapping.py @@ -19,12 +19,16 @@ def _build_vertex_config() -> VertexConfig: vertices=[ Vertex( name="users", - fields=[Field(name="id"), Field(name="name"), Field(name="user_name")], + properties=[ + Field(name="id"), + Field(name="name"), + Field(name="user_name"), + ], identity=["id"], ), Vertex( name="products", - fields=[ + properties=[ Field(name="product_code"), Field(name="name"), Field(name="product_name"), diff --git a/test/db/postgres/test_schema_inference.py b/test/db/postgres/test_schema_inference.py index 65f9594d..efbcfc28 100644 --- a/test/db/postgres/test_schema_inference.py +++ b/test/db/postgres/test_schema_inference.py @@ -59,7 +59,7 @@ def test_infer_schema_from_postgres(conn_conf, load_mock_schema): users_vertex = next( v for v in schema.core_schema.vertex_config.vertices if v.name == "users" ) - field_names = [f.name for f in users_vertex.fields] + field_names = [f.name for f in users_vertex.properties] assert "id" in field_names, f"Expected 'id' in users fields, got {field_names}" assert "name" in field_names, f"Expected 'name' in users fields, got {field_names}" assert "email" in field_names, ( @@ -67,11 +67,11 @@ def test_infer_schema_from_postgres(conn_conf, load_mock_schema): ) # Verify field types (id should be INT, name/email should be STRING) - id_field = next(f for f in users_vertex.fields if f.name == "id") + id_field = next(f for f in users_vertex.properties if f.name == "id") assert id_field.type is not None, "id field should have a type" assert id_field.type == "INT", f"Expected id type to be INT, got {id_field.type}" - name_field = next(f for f in users_vertex.fields if f.name == "name") + name_field = next(f for f in users_vertex.properties if f.name == "name") assert name_field.type 
is not None, "name field should have a type" assert name_field.type == "STRING", ( f"Expected name type to be STRING, got {name_field.type}" @@ -79,7 +79,7 @@ def test_infer_schema_from_postgres(conn_conf, load_mock_schema): # Verify datetime field type (created_at should be DATETIME) created_at_field = next( - (f for f in users_vertex.fields if f.name == "created_at"), None + (f for f in users_vertex.properties if f.name == "created_at"), None ) if created_at_field: assert created_at_field.type is not None, "created_at field should have a type" @@ -106,12 +106,10 @@ def test_infer_schema_from_postgres(conn_conf, load_mock_schema): # Verify edge has weight configuration if applicable # (purchases might have quantity or price as weight) - if purchases_edge.weights: - # WeightConfig has 'direct' list for direct weights - assert ( - len(purchases_edge.weights.direct) > 0 - or len(purchases_edge.weights.vertices) > 0 - ), "purchases edge should have weights" + if purchases_edge.properties: + assert len(purchases_edge.properties) > 0, ( + "purchases edge should have attribute fields" + ) # Verify resources were created assert len(ingestion_model.resources) > 0, "IngestionModel should have resources" @@ -150,16 +148,16 @@ def test_infer_schema_from_postgres(conn_conf, load_mock_schema): print(f"\nVertices ({len(schema.core_schema.vertex_config.vertices)}):") for v in schema.core_schema.vertex_config.vertices: field_types = ", ".join( - [f"{f.name}:{f.type if f.type else 'None'}" for f in v.fields[:5]] + [f"{f.name}:{f.type if f.type else 'None'}" for f in v.properties[:5]] ) print(f" - {v.name}: {field_types}...") print(f"\nEdges ({len(schema.core_schema.edge_config._edges_map)}):") for edge_id, e in schema.core_schema.edge_config._edges_map.items(): weights_info = "" - if e.weights: - weight_count = len(e.weights.direct) + len(e.weights.vertices) - weights_info = f" (weights: {weight_count})" + if e.properties: + weight_count = len(e.properties) + weights_info = f" 
(properties: {weight_count})" relation_info = f" [{e.relation}]" if e.relation else "" print(f" - {edge_id}: {e.source} -> {e.target}{relation_info}{weights_info}") @@ -261,7 +259,7 @@ def test_infer_schema_with_pg_catalog_fallback(conn_conf, load_mock_schema): users_vertex = next( v for v in schema.core_schema.vertex_config.vertices if v.name == "users" ) - field_names = [f.name for f in users_vertex.fields] + field_names = [f.name for f in users_vertex.properties] assert "id" in field_names, ( f"Expected 'id' in users fields when using pg_catalog, got {field_names}" ) @@ -273,7 +271,7 @@ def test_infer_schema_with_pg_catalog_fallback(conn_conf, load_mock_schema): ) # Verify field types - should be correctly mapped via pg_catalog - id_field = next(f for f in users_vertex.fields if f.name == "id") + id_field = next(f for f in users_vertex.properties if f.name == "id") assert id_field.type is not None, ( "id field should have a type when using pg_catalog" ) @@ -281,7 +279,7 @@ def test_infer_schema_with_pg_catalog_fallback(conn_conf, load_mock_schema): f"Expected id type to be INT when using pg_catalog, got {id_field.type}" ) - name_field = next(f for f in users_vertex.fields if f.name == "name") + name_field = next(f for f in users_vertex.properties if f.name == "name") assert name_field.type is not None, ( "name field should have a type when using pg_catalog" ) @@ -358,16 +356,16 @@ def test_infer_schema_with_pg_catalog_fallback(conn_conf, load_mock_schema): print(f"\nVertices ({len(schema.core_schema.vertex_config.vertices)}):") for v in schema.core_schema.vertex_config.vertices: field_types = ", ".join( - [f"{f.name}:{f.type if f.type else 'None'}" for f in v.fields[:5]] + [f"{f.name}:{f.type if f.type else 'None'}" for f in v.properties[:5]] ) print(f" - {v.name}: {field_types}...") print(f"\nEdges ({len(schema.core_schema.edge_config._edges_map)}):") for edge_id, e in schema.core_schema.edge_config._edges_map.items(): weights_info = "" - if e.weights: - 
weight_count = len(e.weights.direct) + len(e.weights.vertices) - weights_info = f" (weights: {weight_count})" + if e.properties: + weight_count = len(e.properties) + weights_info = f" (properties: {weight_count})" relation_info = f" [{e.relation}]" if e.relation else "" print( f" - {edge_id}: {e.source} -> {e.target}{relation_info}{weights_info}" diff --git a/test/db/tigergraphs/test_reserved_words.py b/test/db/tigergraphs/test_reserved_words.py index 234f5c37..ca76e3ed 100644 --- a/test/db/tigergraphs/test_reserved_words.py +++ b/test/db/tigergraphs/test_reserved_words.py @@ -83,7 +83,8 @@ def test_edges_sanitization_for_tigergraph(schema_with_incompatible_edges): } assert ( - sanitized_schema.core_schema.vertex_config.vertices[-1].fields[0].name == "id" + sanitized_schema.core_schema.vertex_config.vertices[-1].properties[0].name + == "id" ) assert sanitized_schema.core_schema.vertex_config.vertices[-1].identity[0] == "id" edge_a = sanitized_schema.core_schema.edge_config.edges[-2] diff --git a/test/hq/test_db_writer.py b/test/hq/test_db_writer.py index 7b345751..95a69c03 100644 --- a/test/hq/test_db_writer.py +++ b/test/hq/test_db_writer.py @@ -44,8 +44,8 @@ def __exit__(self, exc_type, exc, tb): def _build_schema() -> Schema: vertex_config = VertexConfig( vertices=[ - Vertex(name="blank_v", fields=[], identity=[]), - Vertex(name="target_v", fields=[Field(name="id")], identity=["id"]), + Vertex(name="blank_v", properties=[], identity=[]), + Vertex(name="target_v", properties=[Field(name="id")], identity=["id"]), ], blank_vertices=["blank_v"], ) @@ -113,11 +113,11 @@ def test_resolve_blank_edges_prefers_identity_join_over_zip(): def test_blank_vertex_default_identity_depends_on_db_flavor(): arango_cfg = VertexConfig( - vertices=[Vertex(name="blank_v", fields=[], identity=[])], + vertices=[Vertex(name="blank_v", properties=[], identity=[])], blank_vertices=["blank_v"], ) neo4j_cfg = VertexConfig( - vertices=[Vertex(name="blank_v", fields=[], identity=[])], + 
vertices=[Vertex(name="blank_v", properties=[], identity=[])], blank_vertices=["blank_v"], ) arango_cfg.finish_init() diff --git a/test/hq/test_ingestion_subset.py b/test/hq/test_ingestion_subset.py index 7750e1a4..4883f9c8 100644 --- a/test/hq/test_ingestion_subset.py +++ b/test/hq/test_ingestion_subset.py @@ -58,9 +58,9 @@ def _vertex_config_a_b_c() -> VertexConfig: return VertexConfig.from_dict( { "vertices": [ - {"name": "A", "fields": ["id"], "identity": ["id"]}, - {"name": "B", "fields": ["id"], "identity": ["id"]}, - {"name": "C", "fields": ["id"], "identity": ["id"]}, + {"name": "A", "properties": ["id"], "identity": ["id"]}, + {"name": "B", "properties": ["id"], "identity": ["id"]}, + {"name": "C", "properties": ["id"], "identity": ["id"]}, ] } ) diff --git a/test/migrate/test_diff.py b/test/migrate/test_diff.py index 8747d941..6236db06 100644 --- a/test/migrate/test_diff.py +++ b/test/migrate/test_diff.py @@ -12,12 +12,12 @@ def _schema_v1() -> Schema: "vertices": [ { "name": "person", - "fields": [{"name": "id", "type": "STRING"}, "name"], + "properties": [{"name": "id", "type": "STRING"}, "name"], "identity": ["id"], }, { "name": "company", - "fields": [{"name": "id", "type": "STRING"}, "name"], + "properties": [{"name": "id", "type": "STRING"}, "name"], "identity": ["id"], }, ] @@ -48,7 +48,7 @@ def _schema_v2() -> Schema: "vertices": [ { "name": "person", - "fields": [ + "properties": [ {"name": "id", "type": "STRING"}, {"name": "full_name", "type": "STRING"}, {"name": "age", "type": "INT"}, @@ -57,12 +57,12 @@ def _schema_v2() -> Schema: }, { "name": "company", - "fields": [{"name": "id", "type": "STRING"}, "name"], + "properties": [{"name": "id", "type": "STRING"}, "name"], "identity": ["id"], }, { "name": "country", - "fields": [{"name": "code", "type": "STRING"}], + "properties": [{"name": "code", "type": "STRING"}], "identity": ["code"], }, ] diff --git a/test/migrate/test_store_executor.py b/test/migrate/test_store_executor.py index 
2c3bb81b..e769a0be 100644 --- a/test/migrate/test_store_executor.py +++ b/test/migrate/test_store_executor.py @@ -25,7 +25,7 @@ def _schema() -> Schema: "vertices": [ { "name": "person", - "fields": ["id", "name"], + "properties": ["id", "name"], "identity": ["id"], }, ] diff --git a/test/plot/test_plotter.py b/test/plot/test_plotter.py index ee227ea1..bed55237 100644 --- a/test/plot/test_plotter.py +++ b/test/plot/test_plotter.py @@ -3,6 +3,8 @@ import networkx as nx import pytest +from graflo.architecture.pipeline.runtime.actor import EdgeActor +from graflo.architecture.pipeline.runtime.actor.config import EdgeActorConfig from graflo.architecture.schema.edge import Edge from graflo.plot.plotter import ManifestPlotter @@ -20,17 +22,17 @@ def __init__( self, vertex_set: set[str], identity_by_vertex: dict[str, list[str]] | None = None, - fields_by_vertex: dict[str, list[str]] | None = None, + property_names_by_vertex: dict[str, list[str]] | None = None, ): self.vertex_set = vertex_set self._identity_by_vertex = identity_by_vertex or {} - self._fields_by_vertex = fields_by_vertex or {} + self._property_names_by_vertex = property_names_by_vertex or {} def identity_fields(self, vertex_name: str) -> list[str]: return self._identity_by_vertex.get(vertex_name, []) - def fields_names(self, vertex_name: str) -> list[str]: - return self._fields_by_vertex.get(vertex_name, []) + def property_names(self, vertex_name: str) -> list[str]: + return self._property_names_by_vertex.get(vertex_name, []) class _AgraphStub: @@ -64,7 +66,7 @@ def _build_plotter( configured_edges: dict, vertex_set: set[str], identity_by_vertex: dict[str, list[str]] | None = None, - fields_by_vertex: dict[str, list[str]] | None = None, + property_names_by_vertex: dict[str, list[str]] | None = None, ) -> ManifestPlotter: plotter = ManifestPlotter.__new__(ManifestPlotter) plotter.output_format = "pdf" @@ -79,7 +81,7 @@ def _build_plotter( vertex_config=_VertexConfigStub( vertex_set=vertex_set, 
identity_by_vertex=identity_by_vertex, - fields_by_vertex=fields_by_vertex, + property_names_by_vertex=property_names_by_vertex, ), ), ) @@ -98,7 +100,11 @@ def test_plot_vc2vc_filters_unknown_endpoints_and_logs_error(monkeypatch, caplog monkeypatch.setattr( plotter, "_discover_edges_from_resources", - lambda: {discovered_invalid_edge.edge_id: discovered_invalid_edge}, + lambda: ( + {discovered_invalid_edge.edge_id: discovered_invalid_edge}, + {}, + {}, + ), ) captured = {} @@ -119,21 +125,26 @@ def _fake_to_agraph(graph): def test_plot_vc2vc_preserves_labels_and_partition_grouping(monkeypatch): - edge_relation_field = Edge.from_dict( - {"source": "a", "target": "b", "relation_field": "edge_kind"} + edge_ab_actor = EdgeActor( + EdgeActorConfig.model_validate( + {"from": "a", "to": "b", "relation_field": "edge_kind"} + ) ) - edge_relation_key = Edge.from_dict( - {"source": "b", "target": "c", "relation_from_key": True} + edge_bc_actor = EdgeActor( + EdgeActorConfig.model_validate( + {"from": "b", "to": "c", "relation_from_key": True} + ) + ) + resource = SimpleNamespace() + resource.root = SimpleNamespace( + collect_actors=lambda: [edge_ab_actor, edge_bc_actor] ) plotter = _build_plotter( - configured_edges={ - edge_relation_field.edge_id: edge_relation_field, - edge_relation_key.edge_id: edge_relation_key, - }, + configured_edges={}, vertex_set={"a", "b", "c"}, ) - monkeypatch.setattr(plotter, "_discover_edges_from_resources", lambda: {}) + plotter.ingestion_model = SimpleNamespace(resources=[resource]) captured = {} @@ -173,7 +184,11 @@ def test_plot_vc2vc_appends_schema_version_to_stem(monkeypatch): vertex_set={"a", "b"}, ) plotter.schema.metadata.version = "2.3.4" - monkeypatch.setattr(plotter, "_discover_edges_from_resources", lambda: {}) + monkeypatch.setattr( + plotter, + "_discover_edges_from_resources", + lambda: ({}, {}, {}), + ) captured = {} def _fake_to_agraph(graph): @@ -191,7 +206,7 @@ def 
test_plot_vc2fields_appends_schema_version_to_stem(monkeypatch): configured_edges={}, vertex_set={"a"}, identity_by_vertex={"a": ["id"]}, - fields_by_vertex={"a": ["id", "name"]}, + property_names_by_vertex={"a": ["id", "name"]}, ) plotter.schema.metadata.version = "2.3.4" captured = {} diff --git a/test/test_filter_view.py b/test/test_filter_view.py index 46a9657c..4215adc1 100644 --- a/test/test_filter_view.py +++ b/test/test_filter_view.py @@ -99,9 +99,9 @@ def _vertex_config_fixture() -> VertexConfig: id_field = VertexField(name="id") return VertexConfig( vertices=[ - Vertex(name="project", fields=[id_field], identity=["id"]), - Vertex(name="task", fields=[id_field], identity=["id"]), - Vertex(name="milestone", fields=[id_field], identity=["id"]), + Vertex(name="project", properties=[id_field], identity=["id"]), + Vertex(name="task", properties=[id_field], identity=["id"]), + Vertex(name="milestone", properties=[id_field], identity=["id"]), ] ) diff --git a/test/test_filters_python.py b/test/test_filters_python.py index 2672353b..0c7558df 100644 --- a/test/test_filters_python.py +++ b/test/test_filters_python.py @@ -385,7 +385,7 @@ def vertex_config_with_filters(): """ vertices: - name: feature - fields: + properties: - name - value filters: @@ -477,7 +477,7 @@ def test_vertex_filter_no_filters_passes_all(sample_vertex_docs): from graflo.architecture.schema.vertex import VertexConfig vc = VertexConfig.model_validate( - {"vertices": [{"name": "raw", "fields": ["name", "value"]}]} + {"vertices": [{"name": "raw", "properties": ["name", "value"]}]} ) result = _apply_vertex_filters(vc, "raw", sample_vertex_docs) assert len(result) == len(sample_vertex_docs) diff --git a/uv.lock b/uv.lock index 02e230dc..9bc70a1c 100644 --- a/uv.lock +++ b/uv.lock @@ -396,7 +396,7 @@ dependencies = [ ] name = "graflo" source = {editable = "."} -version = "1.7.9" +version = "1.7.10" [package.metadata] provides-extras = ["dev", "docs", "plot"]
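The test updates above all follow one mechanical migration: vertex payloads rename `fields` to `properties`, and edge attributes move from the nested `weights.direct` list onto a flat `properties` list. A stdlib-only sketch of that rewrite over plain schema dicts (mirroring the shapes passed to `VertexConfig.from_dict` in the tests; `migrate_schema_dict` is a hypothetical helper, not part of the graflo API):

```python
def migrate_schema_dict(schema: dict) -> dict:
    """Rewrite a pre-1.7.10 schema dict into the 1.7.10 vocabulary:
    vertex 'fields' becomes 'properties', and edge 'weights.direct'
    entries move onto the edge's flat 'properties' list."""
    migrated = {"vertices": [], "edges": []}
    for vertex in schema.get("vertices", []):
        vertex = dict(vertex)  # shallow copy; leave the input untouched
        if "fields" in vertex:
            vertex["properties"] = vertex.pop("fields")
        migrated["vertices"].append(vertex)
    for edge in schema.get("edges", []):
        edge = dict(edge)
        weights = edge.pop("weights", None)
        if weights:
            # Only 'direct' weights map 1:1; vertex-sourced edge payloads
            # now belong in ingestion (EdgeActor / EdgeDerivation), not on
            # the logical Edge model.
            edge["properties"] = list(weights.get("direct", []))
        migrated["edges"].append(edge)
    return migrated


old_style = {
    "vertices": [
        {
            "name": "person",
            "fields": [{"name": "id", "type": "STRING"}, "name"],
            "identity": ["id"],
        },
    ],
    "edges": [
        {"source": "person", "target": "company", "weights": {"direct": ["since"]}},
    ],
}

new_style = migrate_schema_dict(old_style)
# The vertex now carries "properties" instead of "fields", and the edge
# carries "properties": ["since"] instead of a nested "weights" block.
```

Note the asymmetry the changelog calls out: direct weights translate cleanly, but anything under `weights.vertices` has no slot on the logical `Edge` anymore and must be re-expressed in the ingestion pipeline.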