diff --git a/CHANGELOG.md b/CHANGELOG.md
index 3970fee6..12caa0cc 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,36 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).


+## [1.7.9] - 2026-04-01
+
+### Added
+
+- **`Bindings.get_connectors_for_resource(name)`** returns an ordered list of connectors (unique by hash) for an ingestion resource, supporting **1→n** resource–connector wiring.
+- **`BoundSourceKind`** enum (`file`, `sql_table`, `sparql`) and **`ResourceConnector.bound_source_kind()`** describe the physical source modality of a connector (replacing the old “resource type” wording).
+- **`Resource.drop_trivial_input_fields`** (default `false`): when `true`, removes **top-level** keys whose value is `null` or `""` from each input record before the actor pipeline runs—useful for wide, sparse rows without custom transforms. Does not recurse into nested objects.
+
+### Changed
+
+- **`DBWriter`**: No longer calls `Schema.finish_init()` or `IngestionModel.finish_init()` on every `write()`. The orchestrator (e.g. **`Caster.ingest`**) is responsible for initializing schema and ingestion model for the target DB before writes. This avoids redundant work on each batch and prevents the writer from resetting ingestion flags (`strict_references`, `allowed_vertex_names`) that **`Caster`** had already applied.
+- **`DBWriter`**: Reuses a cached **`SchemaDBAware`** projection for a given connection DB type instead of rebuilding it on every `write()`.
+- **Ingestion caps**: `IngestionParams.max_items` is documented and validated (`>= 1` when set). **`SparqlEndpointDataSource.iter_batches`** paginates without loading the full endpoint result into memory, uses **`ORDER BY ?s`** when the query has no `ORDER BY`, and honors **`limit`** as a subject count.
**`SQLDataSource`** and offset/page **API** pagination pass a tighter per-request page size when a total cap is close (fewer over-fetched rows/items).
+- **`RegistryBuilder`** registers **every** connector bound to each resource and dispatches on **`connector.bound_source_kind()`**; SQL registration uses the connector’s own table/schema fields instead of a resource-level table lookup.
+- **Auto-join** (`_vertex_table_info`) resolves table metadata via the list API and **raises** if more than one `TableConnector` is bound to the same vertex/resource key used for disambiguation.
+
+### Breaking
+
+
+- **`DBWriter`**: The **`dynamic_edges`** constructor argument was removed (it only drove the redundant `finish_init` call). Configure dynamic edge behavior via **`Caster`** / **`IngestionParams.dynamic_edges`** and ingestion **`finish_init`** as before.
+- **`ResourceType`** removed in favor of **`BoundSourceKind`**; **`get_resource_type()`** removed in favor of **`bound_source_kind()`** on connectors (update imports and call sites).
+- **`Bindings`**: **`get_connector_for_resource`**, **`get_resource_type`**, and **`get_table_info`** removed; use **`get_connectors_for_resource`** and connector fields / `bound_source_kind()` instead.
+- **`connector_connection` / internal connector refs**: resolution allows only **connector `name`** or **canonical `hash`**. Using an ingestion **resource name** as a `connector` reference is no longer supported (resource names are no longer 1:1 with connectors).
+- **`bind_resource`** and manifest **`resource_connector`** validation: additional rows for the same `resource` append connectors instead of replacing or conflicting.
+
+### Documentation
+
+- **Examples / docs**: `examples/9-connector-connection-proxy` and manifest guides updated for explicit connector names in `connector_connection`. Concepts and README clarify 1→n bindings and proxy wiring.
+- **`Resource.drop_trivial_input_fields`**: described in [Concepts](docs/concepts/index.md) (DataSources vs Resources) and [Documentation home — Resource](docs/index.md#resource).
+
 ## [1.7.7] - 2026-03-27

 ### Changed
@@ -20,8 +50,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0
 - **`Bindings.connector_connection_bindings`** (typed view), **`get_conn_proxy_for_connector`**, and **`bind_connector_to_conn_proxy`**: API aligned with HQ loaders (`ResourceMapper`, `GraphEngine`) for proxy-based source wiring.

 ### Changed
-- **Connector reference resolution**: `connector_connection` entries may reference a connector by canonical **hash**, declared **`name`**, or a **`resource` name** when that resource is already mapped to the connector (mirrors validation in `Bindings`).
-- **`Bindings` validation**: duplicate connector `name` values, conflicting resource→connector mappings, and conflicting `conn_proxy` for the same connector hash now fail fast with explicit errors.
+- **Connector reference resolution**: `connector_connection` entries may reference a connector by canonical **hash**, declared **`name`**, or a **`resource` name** when that resource is already mapped to the connector (mirrors validation in `Bindings`). **Update (1.7.8):** resource-name aliasing for `connector` refs was removed; use **connector `name` or `hash`** only.
+- **`Bindings` validation**: duplicate connector `name` values and conflicting `conn_proxy` for the same connector hash now fail fast with explicit errors. **Update (1.7.8):** many connectors may attach to the same ingestion resource (1→n); overlapping resource rows no longer raise “conflicting resource binding” for distinct connectors.

 ### Breaking
 - **`Bindings.from_dict` / manifest validation**: legacy top-level keys `postgres_connections`, `table_connectors`, `file_connectors`, and `sparql_connectors` are rejected.
Migrate to the unified `connectors` + `resource_connector` (+ optional `connector_connection`) shape.
diff --git a/README.md b/README.md
index 8ed877df..c5c65915 100644
--- a/README.md
+++ b/README.md
@@ -76,7 +76,7 @@ ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, NebulaGraph — same API for al
 - **Schema inference** — Generate graph schemas from PostgreSQL 3NF databases (PK/FK heuristics) or from OWL/RDFS ontologies (`owl:Class` → vertices, `owl:ObjectProperty` → edges, `owl:DatatypeProperty` → vertex fields).
 - **Typed fields** — Vertex fields and edge weights carry types (`INT`, `FLOAT`, `STRING`, `DATETIME`, `BOOL`) for validation and database-specific optimisation.
 - **Parallel batch processing** — Configurable batch sizes and multi-core execution.
-- **Credential-free source contracts** — `Bindings.connector_connection` maps each `TableConnector` / `SparqlConnector` (by name, hash, or resource alias) to a `conn_proxy` label. Manifests stay free of secrets; a runtime `ConnectionProvider` resolves each proxy to concrete `GeneralizedConnConfig` (for example PostgreSQL or SPARQL endpoint settings).
+- **Credential-free source contracts** — `Bindings.connector_connection` maps each `TableConnector` / `SparqlConnector` (by **connector name** or **hash**) to a `conn_proxy` label. Manifests stay free of secrets; a runtime `ConnectionProvider` resolves each proxy to concrete `GeneralizedConnConfig` (for example PostgreSQL or SPARQL endpoint settings). Ingestion resource names are separate and may map to multiple connectors.

 ## Documentation
 Full documentation is available at: [growgraph.github.io/graflo](https://growgraph.github.io/graflo)
diff --git a/docs/concepts/index.md b/docs/concepts/index.md
index 5a066352..af334b38 100644
--- a/docs/concepts/index.md
+++ b/docs/concepts/index.md
@@ -46,7 +46,7 @@ flowchart LR
 - **GraphManifest** — the canonical top-level contract that composes `schema`, `ingestion_model`, and `bindings`.
 - **Schema** — the declarative logical graph model (`Schema`): vertex/edge definitions, identities, typed fields, and DB profile.
 - **IngestionModel** — reusable resources and transforms used to map records into graph entities.
-- **Bindings** — named `FileConnector` / `TableConnector` / `SparqlConnector` list plus `resource_connector` (resource→connector) and optional `connector_connection` (connector→`conn_proxy` for runtime `ConnectionProvider` resolution without secrets in the manifest).
+- **Bindings** — named `FileConnector` / `TableConnector` / `SparqlConnector` list plus `resource_connector` (many rows per resource allowed: resource→0..n connectors) and optional `connector_connection` (connector **name** or **hash**→`conn_proxy` for runtime `ConnectionProvider` resolution without secrets in the manifest). Each connector exposes a **bound source modality** (`BoundSourceKind`: file, SQL table, SPARQL) for dispatch, distinct from the abstract ingestion **Resource**.
 - **Database-Independent Graph Representation** — a `GraphContainer` of vertices and edges, independent of any target database.
 - **Graph DB** — the target LPG store (ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, NebulaGraph).

@@ -94,9 +94,9 @@ flowchart LR
     Res --> Ex --> Asm --> GC --> DBW
 ```

-- **Bindings** (`FileConnector`, `TableConnector`, `SparqlConnector`) describe *where* data comes from (file paths, SQL tables, SPARQL endpoints). Optional **`connector_connection`** entries assign each SQL/SPARQL connector a **`conn_proxy`** label; the `ConnectionProvider` turns that label into real connection config at runtime so manifests stay credential-free.
+- **Bindings** (`FileConnector`, `TableConnector`, `SparqlConnector`) describe *where* data comes from (file paths, SQL tables, SPARQL endpoints).
Multiple connectors may attach to the same ingestion resource name; optional **`connector_connection`** entries assign each SQL/SPARQL connector a **`conn_proxy`** by **connector `name` or `hash`** (not by resource name). The `ConnectionProvider` turns that label into real connection config at runtime so manifests stay credential-free.
 - **DataSources** (`AbstractDataSource` subclasses) handle *how* to read data in batches. Each carries a `DataSourceType` and is registered in the `DataSourceRegistry`.
-- **Resources** define *what* to extract — each `Resource` is a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements.
+- **Resources** define *what* to extract — each `Resource` is a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements. Set **`drop_trivial_input_fields`: `true`** on a resource to strip top-level `null` / `""` fields from each row before the pipeline (optional, default `false`).
 - **GraphContainer** (covariant graph representation) collects the resulting vertices and edges in a database-independent format.
 - **DBWriter** pushes the graph data into the target LPG store (ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, NebulaGraph).

@@ -176,6 +176,7 @@ classDiagram
         +connectors: list~ResourceConnector~
         +resource_connector: list~ResourceConnectorBinding~
         +connector_connection: list~ConnectorConnectionBinding~
+        +get_connectors_for_resource(name) list
         +get_conn_proxy_for_connector(connector) str?
         +bind_connector_to_conn_proxy(connector, conn_proxy)
     }
@@ -479,6 +480,7 @@ These are the two key abstractions that decouple *data retrieval* from *graph tr

 - **DataSources** (`AbstractDataSource` subclasses) — handle *where* and *how* data is read. Each carries a `DataSourceType` (`FILE`, `SQL`, `SPARQL`, `API`, `IN_MEMORY`). Many DataSources can bind to the same Resource by name via the `DataSourceRegistry`.
 - **Resources** (`Resource`) — handle *what* the data becomes in the LPG. Each Resource is a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements. Because DataSources bind to Resources by name, the same transformation logic applies regardless of whether data arrives from a file, an API, or a SPARQL endpoint.
+  - Optional **`drop_trivial_input_fields`** (default `false` on the model): when `true`, each record is preprocessed by dropping **top-level** keys whose value is `null` or the empty string `""` before actors run. This trims sparse wide rows (many unused columns) without extra transforms; nested dicts and lists are not walked.

 ## Core Components

diff --git a/docs/examples/example-9.md b/docs/examples/example-9.md
index 8c6f5997..c8e90705 100644
--- a/docs/examples/example-9.md
+++ b/docs/examples/example-9.md
@@ -8,7 +8,7 @@ The manifest stays credential-free: `bindings.connector_connection` only contain

 ## Manifest: what `connector_connection` looks like

-Inside `bindings` you explicitly map each connector to a proxy label:
+Inside `bindings` you explicitly map each connector to a proxy label. The `connector` field must be a **connector `name`** or **canonical hash**, not an ingestion resource name (a resource may be bound to several connectors).

 ```yaml
 bindings:
@@ -23,7 +23,7 @@ bindings:
       conn_proxy: postgres_source
 ```

-In the code, connectors omit `connector.name` and use `connector.resource_name` (so the manifest references are stable and human-readable).
+In the companion script, each `TableConnector` sets `name` to match those references (here they match the table/resource names only for readability).

 ## Runtime: how the proxy label becomes a real DB config

diff --git a/docs/getting_started/creating_manifest.md b/docs/getting_started/creating_manifest.md
index 04512145..ef687ce9 100644
--- a/docs/getting_started/creating_manifest.md
+++ b/docs/getting_started/creating_manifest.md
@@ -74,6 +74,7 @@ Defines ingestion behavior.

 - `resources`: named pipelines (`name`) with ordered actor steps
 - `transforms`: reusable named transforms as a **list** (each entry must define `name`) and referenced from resources via `transform.call.use`
+- Optional per-resource flags include **`drop_trivial_input_fields`** (default `false`): when `true`, top-level `null` or `""` fields are removed from each row before the pipeline—handy for sparse wide tables without extra transforms (shallow only; nested objects are unchanged).

 Use `ingestion_model` for **how source records become vertices/edges**.

@@ -82,10 +83,10 @@ Use `ingestion_model` for **how source records become vertices/edges**.
 Defines source wiring (`Bindings`).

 - **`connectors`**: list of `FileConnector`, `TableConnector`, or `SparqlConnector` entries (where each row points at paths, tables, or RDF/SPARQL sources).
-- **`resource_connector`**: list of `{"resource": "", "connector": ""}` rows linking `IngestionModel.resources[*].name` to a connector.
-- **`connector_connection`** (optional): list of `{"connector": "", "conn_proxy": "