vgi-java

Serve Haybarn catalogs, tables, and functions from a Java process over Apache Arrow IPC — the Java implementation of the VGI (Vector Gateway Interface) protocol.
Built by 🚜 Query.Farm

VGI lets Haybarn — Query Farm's independent derived distribution of DuckDB — ATTACH a catalog whose schemas, tables, and functions live in an external worker process. The vgi extension speaks an Arrow-IPC RPC protocol to that worker; this library is everything you need to write the worker side in Java. Your code registers functions and tables against a Worker builder — the library handles the wire protocol, schema negotiation, batch streaming, pushdown, and transports.

Wire-compatible with the Python reference implementation and the Go port: all three serve the same integration suite against the same C++ extension.

What you can serve

  • Catalog tables — named tables with inline schemas, comments, tags, constraints, foreign keys, and per-column statistics that feed the engine's optimizer.
  • Scalar functions — annotation-driven (ScalarFn): declare a compute() method and the parameter annotations generate the spec, bind-time validation, and dispatch.
  • Table functions — streaming producers with projection pushdown, filter pushdown, row-id, sampling, and time-travel (AT) support.
  • Table-in/out functions — exchange-style streaming transforms over input batches.
  • Table buffering functions — sink/source functions that buffer all input before emitting (distributed-aggregation style lifecycles: process → combine → finalize).
  • Aggregate functions — partial aggregation with cross-process state combine.
  • Catalog versioning — semver data/implementation version negotiation, release manifests, multi-branch tables, transactions, and attach options.
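The process → combine → finalize lifecycle mentioned for buffering and aggregate functions can be pictured with a plain-Java sketch. It uses no vgi types at all: AvgState and its method names are hypothetical, chosen only for illustration, and the library's real aggregate API may look quite different.

```java
import java.util.List;

/** Hypothetical partial-aggregation state for an average; not a vgi-java type. */
final class AvgState {
    private long sum;
    private long count;

    /** process: fold one batch of input values into this partial state. */
    void process(List<Long> batch) {
        for (Long v : batch) {
            if (v != null) { // nulls are skipped, as SQL aggregates usually do
                sum += v;
                count++;
            }
        }
    }

    /** combine: merge a partial state that was built elsewhere (another thread or process). */
    void combine(AvgState other) {
        sum += other.sum;
        count += other.count;
    }

    /** finalize: turn the merged state into the result; null when no rows were seen. */
    Double finish() {
        return count == 0 ? null : (double) sum / count;
    }
}
```

Because combine() only adds two partial states together, the order in which partial states arrive does not change the result, which is what makes this shape suitable for combining state across processes.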

Requirements

  • Java 21+ at runtime. The shared-memory side-channel (zero-copy batch transfer with a co-located engine) additionally requires JDK 22+; on 21 it transparently falls back to pipe transport.
  • Haybarn with the vgi extension installed on the client side (it's in Haybarn's signed community channel: INSTALL vgi FROM community).
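The library performs the shared-memory fallback on its own, so no user code is needed for it. If you want a worker to log which side of the JDK 22 boundary it is running on, a minimal sketch could look like this. TransportCheck and preferredTransport are hypothetical names, not part of vgi-java, and the sketch ignores the second condition (a co-located engine).

```java
/** Hypothetical startup helper; vgi-java handles the real fallback internally. */
final class TransportCheck {

    /** Maps a JDK feature version to the batch transport the text above describes. */
    static String preferredTransport(int jdkFeatureVersion) {
        return jdkFeatureVersion >= 22 ? "shared-memory" : "pipe";
    }

    public static void main(String[] args) {
        // Runtime.version().feature() is the major version, for example 21 or 22.
        int feature = Runtime.version().feature();
        System.out.println("JDK " + feature + " -> " + preferredTransport(feature));
    }
}
```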

Installation

Artifacts are published to Maven Central under the farm.query group.

Gradle (Kotlin DSL):

dependencies {
    implementation("farm.query:vgi:0.1.0")
}

Maven:

<dependency>
  <groupId>farm.query</groupId>
  <artifactId>vgi</artifactId>
  <version>0.1.0</version>
</dependency>

The RPC layer (farm.query:vgirpc) comes in transitively.

Quickstart

A worker with one scalar function:

import farm.query.vgi.Worker;
import farm.query.vgi.scalar.Const;
import farm.query.vgi.scalar.ScalarFn;
import farm.query.vgi.scalar.Vector;
import org.apache.arrow.vector.BigIntVector;

public final class DemoWorker {

    /** {@code multiply(value INT64, factor INT64 [const]) -> INT64} */
    static final class Multiply extends ScalarFn {
        @Override public String name() { return "multiply"; }
        @Override public String description() { return "Multiplies a value by a constant factor"; }

        public void compute(@Vector BigIntVector value, @Const long factor, BigIntVector result) {
            int rows = value.getValueCount();
            for (int i = 0; i < rows; i++) {
                if (value.isNull(i)) {
                    result.setNull(i);
                } else {
                    result.set(i, value.get(i) * factor);
                }
            }
        }
    }

    public static void main(String[] args) {
        Worker worker = Worker.builder()
                .catalogName("demo")
                .registerScalar(new Multiply());
        worker.runFromArgs(args); // stdio by default; --unix / --http via flags
    }
}

The compute() signature drives everything: @Vector parameters are per-row input columns, @Const parameters are bind-time constants, @Setting parameters read session settings, and the last unannotated Arrow vector is the framework-allocated output.
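To make the idea of a signature-driven spec concrete, here is a self-contained sketch of how parameter annotations can be read with reflection. It is an illustration under assumptions, not the library's implementation: DemoVector, DemoConst, DemoMultiply, and SignatureInspector are hypothetical stand-ins, and plain long[] arrays take the place of Arrow vectors.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Parameter;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a per-row input marker. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.PARAMETER)
@interface DemoVector {}

/** Hypothetical stand-in for a bind-time constant marker. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.PARAMETER)
@interface DemoConst {}

/** Same shape as the quickstart function, with arrays instead of Arrow vectors. */
final class DemoMultiply {
    public void compute(@DemoVector long[] value, @DemoConst long factor, long[] result) {
        for (int i = 0; i < value.length; i++) {
            result[i] = value[i] * factor;
        }
    }
}

final class SignatureInspector {
    /** Lists each compute() parameter as "role:type", in declaration order. */
    static List<String> roles(Class<?> fnClass) {
        for (Method m : fnClass.getDeclaredMethods()) {
            if (!m.getName().equals("compute")) {
                continue;
            }
            List<String> roles = new ArrayList<>();
            for (Parameter p : m.getParameters()) {
                String type = p.getType().getSimpleName();
                if (p.isAnnotationPresent(DemoVector.class)) {
                    roles.add("vector:" + type);
                } else if (p.isAnnotationPresent(DemoConst.class)) {
                    roles.add("const:" + type);
                } else {
                    roles.add("output:" + type); // unannotated parameter treated as the output
                }
            }
            return roles;
        }
        throw new IllegalArgumentException("no compute() method on " + fnClass.getName());
    }
}
```

SignatureInspector.roles(DemoMultiply.class) returns [vector:long[], const:long, output:long[]]. A real framework would go further and validate the types and build a function spec from them, but the discovery step is the same kind of reflection.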

The worker JVM needs two flags — Apache Arrow requires access to java.nio internals, and the shared-memory transport uses the FFM API:

--add-opens=java.base/java.nio=org.apache.arrow.memory.core,ALL-UNNAMED
--enable-native-access=ALL-UNNAMED

With the Gradle application plugin, bake them into the start script so the worker binary is self-contained:

application {
    mainClass.set("DemoWorker")
    applicationDefaultJvmArgs = listOf(
        "--add-opens=java.base/java.nio=org.apache.arrow.memory.core,ALL-UNNAMED",
        "--enable-native-access=ALL-UNNAMED",
    )
}

Without the --add-opens flag, the worker fails at the first query with the error "Failed to initialize MemoryUtil".
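Since the failure only shows up at the first query, it can help to check the flag at startup. The sketch below uses the standard Module API to ask whether java.nio is open to the module the check runs in. NioOpenCheck is a hypothetical helper, not part of vgi-java or Arrow, and it assumes the worker runs from the classpath (the unnamed module, which is what ALL-UNNAMED covers); a modular deployment would need to test Arrow's own module instead.

```java
import java.nio.ByteBuffer;

/** Hypothetical preflight check; not part of vgi-java or Apache Arrow. */
final class NioOpenCheck {

    /** True when package java.nio in java.base is open to this class's module. */
    static boolean javaNioIsOpen() {
        Module javaBase = ByteBuffer.class.getModule();
        return javaBase.isOpen("java.nio", NioOpenCheck.class.getModule());
    }

    /** Human-readable result, suitable for a startup log line. */
    static String describe(boolean open) {
        return open
                ? "java.nio is open to this module"
                : "java.nio is not open; start the JVM with "
                        + "--add-opens=java.base/java.nio=org.apache.arrow.memory.core,ALL-UNNAMED";
    }

    public static void main(String[] args) {
        System.out.println(describe(javaNioIsOpen()));
    }
}
```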

Attach and query it from Haybarn:

INSTALL vgi FROM community;
LOAD vgi;
ATTACH 'demo' AS demo (TYPE vgi, LOCATION 'launch:/path/to/demo-worker');
SELECT demo.multiply(21, 2);  -- 42

The launch: location scheme starts the worker once behind a flock-coordinated Unix socket and reuses it across queries and engine processes — essential for JVM workers, which are expensive to cold-start. Plain subprocess (/path/to/worker) and http(s):// locations also work.

Example worker

The vgi-example-worker module (not published) is a complete worker with 90+ functions — scalar, table, aggregate, table-in/out, buffering, partitioned, multi-branch, transactional — that serves the canonical VGI integration suite. It is the best place to look for working patterns of any feature.

Related projects

Repository                    What it is
Query-farm-haybarn/haybarn    Haybarn, the independent derived distribution of DuckDB by Query Farm
Query-farm/vgi                The vgi engine extension (C++), the client side of the protocol
Query-farm/vgi-python         Python reference implementation of the worker side
Query-farm/vgi-go             Go implementation of the worker side
Query-farm/vgi-rpc-java       The transport-agnostic Arrow RPC framework this library builds on

License

Query Farm Source-Available License, Version 1.0.