Merged
34 changes: 32 additions & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -6,6 +6,36 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).


## [1.7.9] - 2026-04-01

### Added

- **`Bindings.get_connectors_for_resource(name)`** returns an ordered list of connectors (unique by hash) for an ingestion resource, supporting **1→n** resource–connector wiring.
- **`BoundSourceKind`** enum (`file`, `sql_table`, `sparql`) and **`ResourceConnector.bound_source_kind()`** describe the physical source modality of a connector (replacing the old “resource type” wording).
- **`Resource.drop_trivial_input_fields`** (default `false`): when `true`, removes **top-level** keys whose value is `null` or `""` from each input record before the actor pipeline runs—useful for wide, sparse rows without custom transforms. Does not recurse into nested objects.
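The shallow-drop behavior described above can be sketched as a plain dict filter (illustrative only, not the library's implementation):

```python
def drop_trivial_input_fields(record: dict) -> dict:
    """Shallow filter: remove top-level keys whose value is None or ""."""
    return {k: v for k, v in record.items() if v is not None and v != ""}

row = {"id": 7, "name": "", "note": None, "tags": {"a": None}}
# Nested dicts are left untouched; only top-level trivial keys are dropped.
# Falsy-but-meaningful values (0, False) survive the filter.
assert drop_trivial_input_fields(row) == {"id": 7, "tags": {"a": None}}
```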

### Changed

- **`DBWriter`**: No longer calls `Schema.finish_init()` or `IngestionModel.finish_init()` on every `write()`. The orchestrator (e.g. **`Caster.ingest`**) is responsible for initializing schema and ingestion model for the target DB before writes. This avoids redundant work on each batch and prevents the writer from resetting ingestion flags (`strict_references`, `allowed_vertex_names`) that **`Caster`** had already applied.
- **`DBWriter`**: Reuses a cached **`SchemaDBAware`** projection for a given connection DB type instead of rebuilding it on every `write()`.
- **Ingestion caps**: `IngestionParams.max_items` is documented and validated (`>= 1` when set). **`SparqlEndpointDataSource.iter_batches`** paginates without loading the full endpoint result into memory, adds **`ORDER BY ?s`** when the query has no `ORDER BY`, and honors **`limit`** as a subject count. **`SQLDataSource`** and offset/page **API** pagination shrink the per-request page size once the remaining cap is smaller than a full page, so fewer rows/items are over-fetched.
- **`RegistryBuilder`** registers **every** connector bound to each resource and dispatches on **`connector.bound_source_kind()`**; SQL registration uses the connector’s own table/schema fields instead of a resource-level table lookup.
- **Auto-join** (`_vertex_table_info`) resolves table metadata via the list API and **raises** if more than one `TableConnector` is bound to the same vertex/resource key used for disambiguation.

### Breaking


- **`DBWriter`**: The **`dynamic_edges`** constructor argument was removed (it only drove the redundant `finish_init` call). Configure dynamic edge behavior via **`Caster`** / **`IngestionParams.dynamic_edges`** and ingestion **`finish_init`** as before.
- **`ResourceType`** removed in favor of **`BoundSourceKind`**; **`get_resource_type()`** removed in favor of **`bound_source_kind()`** on connectors (update imports and call sites).
- **`Bindings`**: **`get_connector_for_resource`**, **`get_resource_type`**, and **`get_table_info`** removed; use **`get_connectors_for_resource`** and connector fields / `bound_source_kind()` instead.
- **`connector_connection` / internal connector refs**: resolution allows only **connector `name`** or **canonical `hash`**. Using an ingestion **resource name** as a `connector` reference is no longer supported (resource names are no longer 1:1 with connectors).
- **`bind_resource`** and manifest **`resource_connector`** validation: additional rows for the same `resource` now append connectors instead of replacing the existing binding or raising a conflict.
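The name-or-hash-only resolution rule above can be sketched as follows (field names are illustrative, not the library's internals):

```python
def resolve_connector(ref: str, connectors: list[dict]) -> dict:
    """A connector ref must be a declared name or canonical hash.

    Resource names are deliberately NOT consulted: one resource may now
    map to several connectors, so a resource alias would be ambiguous.
    """
    for c in connectors:
        if ref in (c.get("name"), c["hash"]):
            return c
    raise KeyError(f"no connector with name or hash {ref!r}")

conns = [
    {"name": "users", "hash": "abc123", "resource_name": "users_resource"},
    {"name": "follows", "hash": "def456", "resource_name": "users_resource"},
]
assert resolve_connector("abc123", conns)["name"] == "users"
try:
    resolve_connector("users_resource", conns)  # resource name: rejected
except KeyError:
    pass
```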

### Documentation

- **Examples / docs**: `examples/9-connector-connection-proxy` and manifest guides updated for explicit connector names in `connector_connection`. Concepts and README clarify 1→n bindings and proxy wiring.
- **`Resource.drop_trivial_input_fields`**: described in [Concepts](docs/concepts/index.md) (DataSources vs Resources) and [Documentation home — Resource](docs/index.md#resource).

## [1.7.7] - 2026-03-27

### Changed
@@ -20,8 +50,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **`Bindings.connector_connection_bindings`** (typed view), **`get_conn_proxy_for_connector`**, and **`bind_connector_to_conn_proxy`**: API aligned with HQ loaders (`ResourceMapper`, `GraphEngine`) for proxy-based source wiring.

### Changed
- **Connector reference resolution**: `connector_connection` entries may reference a connector by canonical **hash**, declared **`name`**, or a **`resource` name** when that resource is already mapped to the connector (mirrors validation in `Bindings`).
- **`Bindings` validation**: duplicate connector `name` values, conflicting resource→connector mappings, and conflicting `conn_proxy` for the same connector hash now fail fast with explicit errors.
- **Connector reference resolution**: `connector_connection` entries may reference a connector by canonical **hash**, declared **`name`**, or a **`resource` name** when that resource is already mapped to the connector (mirrors validation in `Bindings`). **Update (1.7.8):** resource-name aliasing for `connector` refs was removed; use **connector `name` or `hash`** only.
- **`Bindings` validation**: duplicate connector `name` values and conflicting `conn_proxy` for the same connector hash now fail fast with explicit errors. **Update (1.7.8):** many connectors may attach to the same ingestion resource (1→n); overlapping resource rows no longer raise “conflicting resource binding” for distinct connectors.

### Breaking
- **`Bindings.from_dict` / manifest validation**: legacy top-level keys `postgres_connections`, `table_connectors`, `file_connectors`, and `sparql_connectors` are rejected. Migrate to the unified `connectors` + `resource_connector` (+ optional `connector_connection`) shape.
2 changes: 1 addition & 1 deletion README.md
@@ -76,7 +76,7 @@ ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, NebulaGraph — same API for al
- **Schema inference** — Generate graph schemas from PostgreSQL 3NF databases (PK/FK heuristics) or from OWL/RDFS ontologies (`owl:Class` → vertices, `owl:ObjectProperty` → edges, `owl:DatatypeProperty` → vertex fields).
- **Typed fields** — Vertex fields and edge weights carry types (`INT`, `FLOAT`, `STRING`, `DATETIME`, `BOOL`) for validation and database-specific optimisation.
- **Parallel batch processing** — Configurable batch sizes and multi-core execution.
- **Credential-free source contracts** — `Bindings.connector_connection` maps each `TableConnector` / `SparqlConnector` (by name, hash, or resource alias) to a `conn_proxy` label. Manifests stay free of secrets; a runtime `ConnectionProvider` resolves each proxy to concrete `GeneralizedConnConfig` (for example PostgreSQL or SPARQL endpoint settings).
- **Credential-free source contracts** — `Bindings.connector_connection` maps each `TableConnector` / `SparqlConnector` (by **connector name** or **hash**) to a `conn_proxy` label. Manifests stay free of secrets; a runtime `ConnectionProvider` resolves each proxy to concrete `GeneralizedConnConfig` (for example PostgreSQL or SPARQL endpoint settings). Ingestion resource names are separate and may map to multiple connectors.

## Documentation
Full documentation is available at: [growgraph.github.io/graflo](https://growgraph.github.io/graflo)
8 changes: 5 additions & 3 deletions docs/concepts/index.md
@@ -46,7 +46,7 @@ flowchart LR
- **GraphManifest** — the canonical top-level contract that composes `schema`, `ingestion_model`, and `bindings`.
- **Schema** — the declarative logical graph model (`Schema`): vertex/edge definitions, identities, typed fields, and DB profile.
- **IngestionModel** — reusable resources and transforms used to map records into graph entities.
- **Bindings** — named `FileConnector` / `TableConnector` / `SparqlConnector` list plus `resource_connector` (resource→connector) and optional `connector_connection` (connector→`conn_proxy` for runtime `ConnectionProvider` resolution without secrets in the manifest).
- **Bindings** — named `FileConnector` / `TableConnector` / `SparqlConnector` list plus `resource_connector` (many rows per resource allowed: resource→0..n connectors) and optional `connector_connection` (connector **name** or **hash**→`conn_proxy` for runtime `ConnectionProvider` resolution without secrets in the manifest). Each connector exposes a **bound source modality** (`BoundSourceKind`: file, SQL table, SPARQL) for dispatch, distinct from the abstract ingestion **Resource**.
- **Database-Independent Graph Representation** — a `GraphContainer` of vertices and edges, independent of any target database.
- **Graph DB** — the target LPG store (ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, NebulaGraph).

@@ -94,9 +94,9 @@ flowchart LR
Res --> Ex --> Asm --> GC --> DBW
```

- **Bindings** (`FileConnector`, `TableConnector`, `SparqlConnector`) describe *where* data comes from (file paths, SQL tables, SPARQL endpoints). Optional **`connector_connection`** entries assign each SQL/SPARQL connector a **`conn_proxy`** label; the `ConnectionProvider` turns that label into real connection config at runtime so manifests stay credential-free.
- **Bindings** (`FileConnector`, `TableConnector`, `SparqlConnector`) describe *where* data comes from (file paths, SQL tables, SPARQL endpoints). Multiple connectors may attach to the same ingestion resource name; optional **`connector_connection`** entries assign each SQL/SPARQL connector a **`conn_proxy`** by **connector `name` or `hash`** (not by resource name). The `ConnectionProvider` turns that label into real connection config at runtime so manifests stay credential-free.
- **DataSources** (`AbstractDataSource` subclasses) handle *how* to read data in batches. Each carries a `DataSourceType` and is registered in the `DataSourceRegistry`.
- **Resources** define *what* to extract — each `Resource` is a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements.
- **Resources** define *what* to extract — each `Resource` is a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements. Set **`drop_trivial_input_fields`: `true`** on a resource to strip top-level `null` / `""` fields from each row before the pipeline (optional, default `false`).
- **GraphContainer** (covariant graph representation) collects the resulting vertices and edges in a database-independent format.
- **DBWriter** pushes the graph data into the target LPG store (ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, NebulaGraph).

@@ -176,6 +176,7 @@ classDiagram
+connectors: list~ResourceConnector~
+resource_connector: list~ResourceConnectorBinding~
+connector_connection: list~ConnectorConnectionBinding~
+get_connectors_for_resource(name) list
+get_conn_proxy_for_connector(connector) str?
+bind_connector_to_conn_proxy(connector, conn_proxy)
}
@@ -479,6 +480,7 @@ These are the two key abstractions that decouple *data retrieval* from *graph tr
- **DataSources** (`AbstractDataSource` subclasses) — handle *where* and *how* data is read. Each carries a `DataSourceType` (`FILE`, `SQL`, `SPARQL`, `API`, `IN_MEMORY`). Many DataSources can bind to the same Resource by name via the `DataSourceRegistry`.

- **Resources** (`Resource`) — handle *what* the data becomes in the LPG. Each Resource is a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements. Because DataSources bind to Resources by name, the same transformation logic applies regardless of whether data arrives from a file, an API, or a SPARQL endpoint.
- Optional **`drop_trivial_input_fields`** (default `false` on the model): when `true`, each record is preprocessed by dropping **top-level** keys whose value is `null` or the empty string `""` before actors run. This trims sparse wide rows (many unused columns) without extra transforms; nested dicts and lists are not walked.

## Core Components

4 changes: 2 additions & 2 deletions docs/examples/example-9.md
@@ -8,7 +8,7 @@ The manifest stays credential-free: `bindings.connector_connection` only contain

## Manifest: what `connector_connection` looks like

Inside `bindings` you explicitly map each connector to a proxy label:
Inside `bindings` you explicitly map each connector to a proxy label. The `connector` field must be a **connector `name`** or **canonical hash**, not an ingestion resource name (a resource may be bound to several connectors).

```yaml
bindings:
@@ -23,7 +23,7 @@ bindings:
conn_proxy: postgres_source
```

In the code, connectors omit `connector.name` and use `connector.resource_name` (so the manifest references are stable and human-readable).
In the companion script, each `TableConnector` sets `name` to match those references (here they match the table/resource names only for readability).

## Runtime: how the proxy label becomes a real DB config

7 changes: 4 additions & 3 deletions docs/getting_started/creating_manifest.md
@@ -74,6 +74,7 @@ Defines ingestion behavior.

- `resources`: named pipelines (`name`) with ordered actor steps
- `transforms`: reusable named transforms as a **list** (each entry must define `name`) and referenced from resources via `transform.call.use`
- Optional per-resource flags include **`drop_trivial_input_fields`** (default `false`): when `true`, top-level `null` or `""` fields are removed from each row before the pipeline—handy for sparse wide tables without extra transforms (shallow only; nested objects are unchanged).

Use `ingestion_model` for **how source records become vertices/edges**.

@@ -82,10 +83,10 @@ Use `ingestion_model` for **how source records become vertices/edges**.
Defines source wiring (`Bindings`).

- **`connectors`**: list of `FileConnector`, `TableConnector`, or `SparqlConnector` entries (where each row points at paths, tables, or RDF/SPARQL sources).
- **`resource_connector`**: list of `{"resource": "<ingestion resource name>", "connector": "<connector name or reference>"}` rows linking `IngestionModel.resources[*].name` to a connector.
- **`connector_connection`** (optional): list of `{"connector": "<name|hash|resource alias>", "conn_proxy": "<label>"}` rows. This keeps manifests **non-secret**: only proxy *names* appear in YAML; runtime code registers each `conn_proxy` on a `ConnectionProvider` with the real `GeneralizedConnConfig` (PostgreSQL, SPARQL, etc.).
- **`resource_connector`**: list of `{"resource": "<ingestion resource name>", "connector": "<connector name or hash>"}` rows linking `IngestionModel.resources[*].name` to a connector. The same `resource` may appear on **multiple rows** with different `connector` values (several physical sources for one pipeline).
- **`connector_connection`** (optional): list of `{"connector": "<connector name or hash>", "conn_proxy": "<label>"}` rows. This keeps manifests **non-secret**: only proxy *names* appear in YAML; runtime code registers each `conn_proxy` on a `ConnectionProvider` with the real `GeneralizedConnConfig` (PostgreSQL, SPARQL, etc.).

Connector references in `resource_connector` / `connector_connection` must match a connector `name` (or resolve via hash / resource alias as documented in `Bindings`). Duplicate connector names and conflicting resource or proxy mappings are rejected at validation time.
Connector references in `resource_connector` / `connector_connection` must match a connector’s declared **`name`** or canonical **`hash`**. Ingestion **resource names** are not connector references (they can map 1→*n*). Duplicate connector `name` values and conflicting `conn_proxy` mappings for the same connector hash are rejected at validation time.
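The 1→*n* lookup this implies can be sketched as follows (illustrative; mirrors `Bindings.get_connectors_for_resource`, with assumed field names):

```python
def get_connectors_for_resource(rows, connectors_by_ref, resource):
    """Ordered connectors for a resource, de-duplicated by canonical hash."""
    seen, out = set(), []
    for row in rows:
        if row["resource"] != resource:
            continue
        conn = connectors_by_ref[row["connector"]]
        if conn["hash"] not in seen:   # unique by hash, first occurrence wins
            seen.add(conn["hash"])
            out.append(conn)
    return out

rows = [
    {"resource": "users", "connector": "users_main"},
    {"resource": "users", "connector": "users_backfill"},
    {"resource": "users", "connector": "users_main"},  # duplicate row: ignored
]
by_ref = {
    "users_main": {"name": "users_main", "hash": "h1"},
    "users_backfill": {"name": "users_backfill", "hash": "h2"},
}
result = get_connectors_for_resource(rows, by_ref, "users")
assert [c["hash"] for c in result] == ["h1", "h2"]
```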

The block can be left empty in-file (`bindings: {}`) and supplied at runtime for env-specific deployments.

2 changes: 1 addition & 1 deletion docs/getting_started/quickstart.md
@@ -128,7 +128,7 @@ engine.define_and_ingest(

Here `schema` defines the logical graph, while `ingestion_model` defines resources/transforms and `bindings` maps resources to physical data sources. See [Creating a Manifest](creating_manifest.md) and [Concepts — Schema](../concepts/index.md#schema) for details.

`Bindings` maps resource names (from `IngestionModel`) to their physical data sources:
`Bindings` maps resource names (from `IngestionModel`) to one or more physical data sources (the same resource may list several connectors):
- **FileConnector**: For file-based resources with `regex` for matching filenames and `sub_path` for the directory to search
- **TableConnector**: For PostgreSQL table resources (table/schema/view metadata on the connector; connection URLs and secrets are **not** stored in the manifest when using **`connector_connection`** — see below)
- **SparqlConnector**: RDF class / SPARQL endpoint wiring (same proxy pattern as SQL when needed)
2 changes: 2 additions & 0 deletions docs/index.md
@@ -57,6 +57,8 @@ Resources and transforms are part of `IngestionModel`, not `Schema`.

A `Resource` is the central abstraction that bridges data sources and the graph schema. Each Resource defines a reusable pipeline of actors (descend, transform, vertex, edge) that maps raw records to graph elements. Data sources bind to Resources by name via the `DataSourceRegistry`, so the same transformation logic applies regardless of whether data arrives from a file, an API, or a SPARQL endpoint.

For wide rows with many empty or null columns, **`drop_trivial_input_fields`** (default `false`) removes only **top-level** keys whose value is `null` or `""` before the pipeline runs—no recursion into nested structures.

### DataSourceRegistry

The `DataSourceRegistry` manages `AbstractDataSource` adapters, each carrying a `DataSourceType`:
1 change: 1 addition & 0 deletions examples/9-connector-connection-proxy/README.md
@@ -6,6 +6,7 @@ This example demonstrates the non-secret runtime indirection:

Key points:
- The manifest stores only `conn_proxy` labels inside `bindings.connector_connection`.
- Each `connector` row references a connector by **`name` or `hash`** (not by ingestion resource name).
- The runtime script registers the real `PostgresConfig` under that proxy label
via `InMemoryConnectionProvider`.
- `provider.bind_from_bindings(bindings=...)` connects manifest connectors
@@ -48,25 +48,30 @@ def _load_mock_postgres_schema(*, postgres_conf: PostgresConfig) -> None:

def make_explicit_postgres_bindings(conn_proxy: str) -> Bindings:
"""Create manifest bindings with explicit connector_connection proxy labels."""
# In this example we keep `connector.name` omitted and rely on
# connector.resource_name as the stable manifest alias.
# Each connector has an explicit `name` so `connector_connection.connector`
# can reference it. Ingestion resource names still come from `resource_name`
# (or `resource_connector`); those names are not valid connector refs.
connectors = [
TableConnector(
name="users",
table_name="users",
schema_name="public",
resource_name="users",
),
TableConnector(
name="products",
table_name="products",
schema_name="public",
resource_name="products",
),
TableConnector(
name="purchases",
table_name="purchases",
schema_name="public",
resource_name="purchases",
),
TableConnector(
name="follows",
table_name="follows",
schema_name="public",
resource_name="follows",
4 changes: 2 additions & 2 deletions graflo/__init__.py
@@ -47,8 +47,8 @@
GraphModel,
Index,
IngestionModel,
BoundSourceKind,
ResourceConnector,
ResourceType,
Resource,
SparqlConnector,
Schema,
@@ -135,8 +135,8 @@
"FileConnector",
"Bindings",
"JoinClause",
"BoundSourceKind",
"ResourceConnector",
"ResourceType",
"SparqlConnector",
"TableConnector",
]
4 changes: 2 additions & 2 deletions graflo/architecture/__init__.py
@@ -17,8 +17,8 @@
JoinClause,
ProtoTransform,
Resource,
BoundSourceKind,
ResourceConnector,
ResourceType,
SparqlConnector,
TableConnector,
Transform,
@@ -54,8 +54,8 @@
"JoinClause",
"ProtoTransform",
"Resource",
"BoundSourceKind",
"ResourceConnector",
"ResourceType",
"Schema",
"SchemaDBAware",
"SparqlConnector",