Merged
1 change: 1 addition & 0 deletions .github/workflows/java-integ-glue.yml
@@ -36,6 +36,7 @@ concurrency:

jobs:
integration-test:
if: github.event_name != 'pull_request_target' || github.event.pull_request.head.repo.full_name == github.repository
runs-on: ubuntu-24.04
timeout-minutes: 30
steps:
5 changes: 3 additions & 2 deletions .github/workflows/java-integ-unity.yml
@@ -55,15 +55,16 @@ jobs:
run: |
echo "Waiting for Unity Catalog to be ready..."
timeout 120 bash -c '
until curl -sf http://localhost:8080/api/2.1/unity-catalog/catalogs > /dev/null 2>&1; do
echo "Waiting for Unity Catalog API..."
until [ "$(docker inspect --format="{{.State.Health.Status}}" unity-catalog 2>/dev/null)" = "healthy" ]; do
echo "Waiting for Unity Catalog container health..."
sleep 5
done
' || {
echo "Timeout waiting for Unity Catalog"
docker compose -f docker/unity/docker-compose.yml logs
exit 1
}
curl -sf http://localhost:8080/api/2.1/unity-catalog/catalogs > /dev/null
echo "Unity Catalog is ready"
- name: Create test catalog
run: |
22 changes: 22 additions & 0 deletions .github/workflows/java-publish.yml
@@ -96,6 +96,17 @@ jobs:
"lance-namespace-glue"
"lance-namespace-hive2"
"lance-namespace-hive3"
"lance-namespace-iceberg"
"lance-namespace-unity"
"lance-namespace-polaris"
)

# Implementation artifacts also publish an attached shaded bundle classifier.
BUNDLE_ARTIFACTS=(
"lance-namespace-glue"
"lance-namespace-hive2"
"lance-namespace-hive3"
"lance-namespace-iceberg"
"lance-namespace-unity"
"lance-namespace-polaris"
)
@@ -122,6 +133,17 @@ jobs:
fi
done

for ARTIFACT_ID in "${BUNDLE_ARTIFACTS[@]}"; do
URL="https://repo1.maven.org/maven2/org/lance/${ARTIFACT_ID}/${VERSION}/${ARTIFACT_ID}-${VERSION}-bundle.jar"

if curl --head --silent --fail "$URL" > /dev/null 2>&1; then
echo "OK ${ARTIFACT_ID} bundle is available"
else
echo "X ${ARTIFACT_ID} bundle is not yet available"
ALL_AVAILABLE=false
fi
done

if [ "$ALL_AVAILABLE" = true ]; then
echo ""
echo "All artifacts are now available in Maven Central!"
1 change: 1 addition & 0 deletions .github/workflows/python-integ-glue.yml
@@ -40,6 +40,7 @@ concurrency:

jobs:
integration-test:
if: github.event_name != 'pull_request_target' || github.event.pull_request.head.repo.full_name == github.repository
runs-on: ubuntu-24.04
timeout-minutes: 30
steps:
2 changes: 1 addition & 1 deletion docker/unity/docker-compose.yml
@@ -15,7 +15,7 @@ services:
networks:
- unity-network
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/api/2.1/unity-catalog/catalogs"]
test: ["CMD", "wget", "-q", "-O", "/dev/null", "http://localhost:8080/api/2.1/unity-catalog/catalogs"]
interval: 10s
timeout: 10s
retries: 10
16 changes: 10 additions & 6 deletions docs/src/glue.md
@@ -78,7 +78,7 @@ The **table location** is stored in the [`StorageDescriptor.Location`](https://d

## Lance Table Identification

A table in AWS Glue is identified as a Lance table when it meets the following criteria: the `TableType` is `EXTERNAL_TABLE`, and the `Parameters` map contains a key `table_type` with value `lance` (case insensitive). The `StorageDescriptor.Location` must point to a valid Lance table root directory.
A table in AWS Glue is identified as a Lance table when it meets the following criteria: the `TableType` is `EXTERNAL_TABLE`, and the `Parameters` map contains a key `table_type` with value `lance` (case insensitive). The `StorageDescriptor.Location` may be declared before a Lance dataset exists; storage is checked only for `include_declared=false` listing or `check_declared=true` describe requests.
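
The identification rule above can be sketched as a small predicate. This is a minimal illustration rather than the project's implementation: `is_lance_table` is a hypothetical helper, and it assumes the table is given as a dict shaped like a Glue `Table` object (`TableType`, `Parameters`). Storage is deliberately not touched, since the location may only be declared.

```python
from typing import Any, Mapping


def is_lance_table(table: Mapping[str, Any]) -> bool:
    """Return True when a Glue-style table dict carries the Lance markers.

    Hypothetical helper: checks catalog metadata only and never opens
    `StorageDescriptor.Location`.
    """
    if table.get("TableType") != "EXTERNAL_TABLE":
        return False
    params = table.get("Parameters") or {}
    # The value comparison is case insensitive; the key is matched exactly.
    return str(params.get("table_type", "")).lower() == "lance"
```

A table with `Parameters={"table_type": "LANCE"}` passes this check, while a table without the `table_type` key does not.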

## Basic Operations

@@ -191,9 +191,10 @@ The implementation:
- `DatabaseName`: the database name
- `TableInput.Name`: the table name
- `TableInput.TableType`: `EXTERNAL_TABLE`
- `TableInput.Parameters`: include `table_type=lance` and other properties
- `TableInput.Parameters`: request `properties` merged with implementation markers such as `table_type=lance`
- `TableInput.StorageDescriptor.Location`: the specified table location
4. POST the CreateTable request to Glue
5. Return the declared table location, catalog table properties, optional storage options, and `managed_versioning=false`
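
As a rough sketch of how the `TableInput` fields listed above fit together, the helper below builds the payload as a plain dict. `build_table_input` and `LANCE_MARKERS` are hypothetical names, and the merge order (implementation markers override request `properties`) is an assumption; the text only says the two are merged.

```python
from typing import Any, Dict, Mapping, Optional

# Assumed marker set; the text names `table_type=lance` as one such marker.
LANCE_MARKERS = {"table_type": "lance"}


def build_table_input(
    name: str,
    location: str,
    properties: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Build a Glue-style TableInput dict for a declared Lance table."""
    params = dict(properties or {})
    # Assumption: markers win over caller-supplied keys of the same name.
    params.update(LANCE_MARKERS)
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": params,
        "StorageDescriptor": {"Location": location},
    }
```

The resulting dict has the shape that the Glue CreateTable request expects for `TableInput`, alongside `CatalogId` and `DatabaseName`.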

**Error Handling:**

@@ -215,7 +216,8 @@ The implementation:
2. Verify the namespace exists using [GetDatabase](https://docs.aws.amazon.com/glue/latest/webapi/API_GetDatabase.html)
3. Use [GetTables](https://docs.aws.amazon.com/glue/latest/webapi/API_GetTables.html) with `CatalogId` and `DatabaseName`
4. Filter tables where `Parameters.table_type=lance` (case insensitive)
5. Sort the results and apply pagination using `NextToken`
5. If `include_declared=false`, only include catalog entries whose `StorageDescriptor.Location` can be opened as a Lance dataset
6. Sort the results and apply pagination using `NextToken`
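
The filtering steps above might look like the following sketch. It assumes the tables have already been fetched as Glue-style dicts; `list_lance_tables` is a hypothetical helper, and `can_open` is a hypothetical callback standing in for whatever check decides that a location can be opened as a Lance dataset. Pagination is left out.

```python
from typing import Any, Callable, Iterable, List, Mapping


def list_lance_tables(
    tables: Iterable[Mapping[str, Any]],
    include_declared: bool = True,
    can_open: Callable[[str], bool] = lambda location: True,
) -> List[str]:
    """Return sorted names of Lance tables from Glue-style table dicts."""
    names = []
    for table in tables:
        params = table.get("Parameters") or {}
        if str(params.get("table_type", "")).lower() != "lance":
            continue
        if not include_declared:
            # Storage is only consulted when declared-only tables are excluded.
            location = (table.get("StorageDescriptor") or {}).get("Location")
            if not location or not can_open(location):
                continue
        names.append(table["Name"])
    return sorted(names)
```

With the default `include_declared=True` the callback is never invoked, so listing stays a pure catalog operation.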

**Error Handling:**

@@ -227,14 +229,15 @@ If the Glue service is unavailable, return error code `17` (ServiceUnavailable).

### DescribeTable

Retrieves metadata for a Lance table. Only `load_detailed_metadata=false` is supported. When `load_detailed_metadata=false`, only the table location and storage_options are returned; other fields (version, table_uri, schema, stats) are null.
Retrieves metadata for a Lance table. Only `load_detailed_metadata=false` is supported. The response includes the table location, catalog table properties, `managed_versioning=false`, and any implementation storage options that should be returned to the caller.

The implementation:

1. Parse the table identifier to extract catalog, database, and table name
2. Use [GetTable](https://docs.aws.amazon.com/glue/latest/webapi/API_GetTable.html) with `CatalogId`, `DatabaseName`, and `Name`
3. Validate that the table is a Lance table (check `Parameters.table_type=lance`)
4. Return the table location from `StorageDescriptor.Location` and storage_options from `Parameters`
4. Return the table location from `StorageDescriptor.Location` and catalog properties from `Parameters`
5. If `check_declared=true`, set `is_only_declared=true` when the location cannot be opened as a Lance dataset
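
A minimal sketch of how the response described above could be assembled from a Glue-style table dict. `describe_response` and `can_open` are hypothetical names, and the response is modelled as a plain dict whose keys mirror the field names in the text. Leaving `is_only_declared` unset when the dataset can be opened is an assumption; the text only specifies the `true` case.

```python
from typing import Any, Callable, Dict, Mapping


def describe_response(
    table: Mapping[str, Any],
    check_declared: bool = False,
    can_open: Callable[[str], bool] = lambda location: True,
) -> Dict[str, Any]:
    """Build a DescribeTable-style response from catalog metadata."""
    params = table.get("Parameters") or {}
    if str(params.get("table_type", "")).lower() != "lance":
        raise ValueError("not a Lance table")
    location = table["StorageDescriptor"]["Location"]
    response: Dict[str, Any] = {
        "location": location,
        "properties": dict(params),
        "managed_versioning": False,
    }
    # Storage is probed only when the caller asks for it.
    if check_declared and not can_open(location):
        response["is_only_declared"] = True
    return response
```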

**Error Handling:**

@@ -255,7 +258,8 @@ The implementation:
1. Parse the table identifier to extract catalog, database, and table name
2. Use [GetTable](https://docs.aws.amazon.com/glue/latest/webapi/API_GetTable.html) to retrieve and validate the table is a Lance table
3. Use [DeleteTable](https://docs.aws.amazon.com/glue/latest/webapi/API_DeleteTable.html) with `CatalogId`, `DatabaseName`, and `Name`
4. The underlying Lance table data at `StorageDescriptor.Location` is not deleted
4. Return the table id, location, and catalog properties
5. The underlying Lance table data at `StorageDescriptor.Location` is not deleted

**Error Handling:**

14 changes: 9 additions & 5 deletions docs/src/hive2.md
@@ -38,7 +38,7 @@ The **table location** is stored in the `location` field of the table's `storage

## Lance Table Identification

A table in HMS is identified as a Lance table when it meets the following criteria: the `tableType` is `EXTERNAL_TABLE`, and the `parameters` map contains a key `table_type` with value `lance` (case insensitive). The `location` in `storageDescriptor` must point to a valid Lance table root directory.
A table in HMS is identified as a Lance table when it meets the following criteria: the `tableType` is `EXTERNAL_TABLE`, and the `parameters` map contains a key `table_type` with value `lance` (case insensitive). The `location` in `storageDescriptor` may be declared before a Lance dataset exists; storage is checked only for `include_declared=false` listing or `check_declared=true` describe requests.

## Basic Operations

@@ -116,8 +116,9 @@ The implementation:
2. Verify the parent namespace exists
3. Create an HMS Table object with `tableType=EXTERNAL_TABLE`
4. Set the storage descriptor with the specified or default location. When location is not specified, it defaults to `{root}/{database}.db/{table}`
5. Add `table_type=lance` to the table parameters
5. Merge request `properties` with required table parameters such as `table_type=lance` and `managed_by=storage`
6. Register the table in HMS
7. Return the declared table location, table parameters, and `managed_versioning=false`

**Error Handling:**

@@ -137,7 +138,8 @@ The implementation:
2. Verify the namespace exists
3. Retrieve all tables in the database
4. Filter tables where `parameters.table_type=lance`
5. Sort the results and apply pagination
5. If `include_declared=false`, only include catalog entries whose storage descriptor location can be opened as a Lance dataset
6. Sort the results and apply pagination

**Error Handling:**

@@ -147,14 +149,15 @@ If the HMS connection fails, return error code `17` (ServiceUnavailable).

### DescribeTable

Retrieves metadata for a Lance table. Only `load_detailed_metadata=false` is supported. When `load_detailed_metadata=false`, only the table location is returned; other fields (version, table_uri, schema, stats) are null.
Retrieves metadata for a Lance table. Only `load_detailed_metadata=false` is supported. The response includes the table location, HMS table parameters as `properties`, and `managed_versioning=false`.

The implementation:

1. Parse the table identifier
2. Retrieve the Table object from HMS
3. Validate that it is a Lance table (check `table_type=lance`)
4. Return the table location from `storageDescriptor.location`
4. Return the table location from `storageDescriptor.location` and the table parameters as `properties`
5. If `check_declared=true`, set `is_only_declared=true` when the location cannot be opened as a Lance dataset

**Error Handling:**

@@ -191,6 +194,7 @@ The implementation:
1. Parse the table identifier
2. Retrieve the Table object and validate it is a Lance table
3. Drop the table from HMS with `deleteData=false`
4. Return the table id, location, and table parameters

**Error Handling:**

14 changes: 9 additions & 5 deletions docs/src/hive3.md
@@ -38,7 +38,7 @@ The **table location** is stored in the [`location`](https://github.com/apache/h

## Lance Table Identification

A table in HMS is identified as a Lance table when it meets the following criteria: the `tableType` is `EXTERNAL_TABLE`, and the `parameters` map contains a key `table_type` with value `lance` (case insensitive). The `location` in `storageDescriptor` must point to a valid Lance table root directory.
A table in HMS is identified as a Lance table when it meets the following criteria: the `tableType` is `EXTERNAL_TABLE`, and the `parameters` map contains a key `table_type` with value `lance` (case insensitive). The `location` in `storageDescriptor` may be declared before a Lance dataset exists; storage is checked only for `include_declared=false` listing or `check_declared=true` describe requests.

## Basic Operations

@@ -123,8 +123,9 @@ The implementation:
2. Verify the parent namespace exists
3. Create an HMS Table object with `tableType=EXTERNAL_TABLE`
4. Set the storage descriptor with the specified or default location. When location is not specified, it defaults to `{root}/{database}.db/{table}` for the default `hive` catalog (hive2-compatible), or `{root}/{catalog}/{database}.db/{table}` for other catalogs
5. Add `table_type=lance` to the table parameters
5. Merge request `properties` with required table parameters such as `table_type=lance` and `managed_by=storage`
6. Register the table in HMS
7. Return the declared table location, table parameters, and `managed_versioning=false`
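
The default-location rule in step 4 can be illustrated with a short function. `default_table_location` is a hypothetical helper; trimming a trailing slash from the root and comparing the catalog name case-sensitively are assumptions not stated in the text.

```python
def default_table_location(
    root: str, database: str, table: str, catalog: str = "hive"
) -> str:
    """Derive the default table location when the request gives none."""
    root = root.rstrip("/")  # assumption: avoid a doubled slash
    if catalog == "hive":
        # Default catalog keeps the hive2-compatible layout.
        return f"{root}/{database}.db/{table}"
    return f"{root}/{catalog}/{database}.db/{table}"
```

For example, root `s3://bucket/warehouse`, database `sales`, and table `orders` give `s3://bucket/warehouse/sales.db/orders` in the default catalog.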

**Error Handling:**

@@ -144,7 +145,8 @@ The implementation:
2. Verify the namespace exists
3. Retrieve all tables in the database
4. Filter tables where `parameters.table_type=lance`
5. Sort the results and apply pagination
5. If `include_declared=false`, only include catalog entries whose storage descriptor location can be opened as a Lance dataset
6. Sort the results and apply pagination

**Error Handling:**

@@ -154,14 +156,15 @@ If the HMS connection fails, return error code `17` (ServiceUnavailable).

### DescribeTable

Retrieves metadata for a Lance table. Only `load_detailed_metadata=false` is supported. When `load_detailed_metadata=false`, only the table location is returned; other fields (version, table_uri, schema, stats) are null.
Retrieves metadata for a Lance table. Only `load_detailed_metadata=false` is supported. The response includes the table location, HMS table parameters as `properties`, and `managed_versioning=false`.

The implementation:

1. Parse the table identifier
2. Retrieve the Table object from HMS
3. Validate that it is a Lance table (check `table_type=lance`)
4. Return the table location from `storageDescriptor.location`
4. Return the table location from `storageDescriptor.location` and the table parameters as `properties`
5. If `check_declared=true`, set `is_only_declared=true` when the location cannot be opened as a Lance dataset

**Error Handling:**

@@ -198,6 +201,7 @@ The implementation:
1. Parse the table identifier
2. Retrieve the Table object and validate it is a Lance table
3. Drop the table from HMS with `deleteData=false`
4. Return the table id, location, and table parameters

**Error Handling:**

15 changes: 9 additions & 6 deletions docs/src/iceberg.md
@@ -50,7 +50,7 @@ The **table location** is stored in the `location` field of the Iceberg table me

## Lance Table Identification

A table in Iceberg REST Catalog is identified as a Lance table when the `properties` map contains a key `table_type` with value `lance` (case insensitive). The `location` must point to a valid Lance table root directory. The Iceberg table itself serves as a metadata wrapper, with the actual data stored in Lance format.
A table in Iceberg REST Catalog is identified as a Lance table when the `properties` map contains a key `table_type` with value `lance` (case insensitive). The `location` may be declared before a Lance dataset exists. The Iceberg table itself serves as a metadata wrapper, with the actual data stored in Lance format once the table is materialized.

## Basic Operations

@@ -146,7 +146,7 @@ The implementation:
- `schema`: a dummy Iceberg schema with a single nullable string column `dummy`
- `properties`: table properties including `table_type=lance`
6. POST to `/v1/{prefix}/namespaces/{namespace}/tables`
7. Return the declared table location
7. Return the declared table location, catalog table properties, and `managed_versioning=false`

**Error Handling:**

@@ -167,7 +167,8 @@ The implementation:
3. Extract the namespace path from the remaining elements
4. GET `/v1/{prefix}/namespaces/{namespace}/tables`
5. For each table, load its metadata and filter tables where `properties.table_type=lance`
6. Extract table names from the response identifiers
6. If `include_declared=false`, only include catalog entries whose Iceberg metadata location can be opened as a Lance dataset
7. Extract table names from the response identifiers

**Error Handling:**

@@ -177,7 +178,7 @@ If the server returns an error, return error code `18` (Internal).

### DescribeTable

Retrieves metadata for a Lance table. Only `load_detailed_metadata=false` is supported. When `load_detailed_metadata=false`, only the table location and storage_options are returned; other fields (version, table_uri, schema, stats) are null.
Retrieves metadata for a Lance table. Only `load_detailed_metadata=false` is supported. The response includes the table location, Iceberg table properties as `properties`, and `managed_versioning=false`.

The implementation:

@@ -187,7 +188,8 @@ The implementation:
4. Extract the table name from the last element
5. GET `/v1/{prefix}/namespaces/{namespace}/tables/{table}`
6. Verify the table has `table_type=lance` property
7. Return the table location and storage_options from `properties`
7. Return the table location and Iceberg table properties
8. If `check_declared=true`, set `is_only_declared=true` when the location cannot be opened as a Lance dataset

**Error Handling:**

@@ -207,7 +209,8 @@ The implementation:
2. Resolve the API prefix from the warehouse config cache
3. Extract the namespace path from the middle elements
4. Extract the table name from the last element
5. DELETE `/v1/{prefix}/namespaces/{namespace}/tables/{table}?purgeRequested=false`
5. Load the table metadata, then DELETE `/v1/{prefix}/namespaces/{namespace}/tables/{table}?purgeRequested=false`
6. Return the table id, location, and Iceberg table properties

**Error Handling:**

15 changes: 9 additions & 6 deletions docs/src/polaris.md
@@ -48,7 +48,7 @@ The **table location** is stored in the `base-location` field of the Generic Tab

## Lance Table Identification

A table in Polaris is identified as a Lance table when it is a Generic Table with `format` set to `lance`. The `base-location` must point to a valid Lance table root directory. The table `properties` should contain `table_type=lance` for consistency with other catalog implementations.
A table in Polaris is identified as a Lance table when it is a Generic Table with `format` set to `lance`. The `base-location` may be declared before a Lance dataset exists. The table `properties` should contain `table_type=lance` for consistency with other catalog implementations.

## Basic Operations

@@ -140,7 +140,7 @@ The implementation:
- `doc`: optional description from properties
- `properties`: table properties including `table_type=lance`
4. POST to `/api/catalog/polaris/v1/{catalog}/namespaces/{namespace}/generic-tables`
5. Return the created table location and properties
5. Return the created table location, table properties, and `managed_versioning=false`

**Error Handling:**

@@ -159,7 +159,8 @@ The implementation:
1. Parse the namespace identifier to extract the catalog (first level) and namespace path
2. Validate that at least 2 levels are provided (catalog + namespace)
3. GET `/api/catalog/polaris/v1/{catalog}/namespaces/{namespace}/generic-tables`
4. Extract table names from the response identifiers
4. When `include_declared=true` or unset, extract table names from the response identifiers
5. When `include_declared=false`, load each generic table and only include entries whose `base-location` can be opened as a Lance dataset

**Error Handling:**

@@ -169,15 +170,16 @@ If the server returns an error, return error code `18` (Internal).

### DescribeTable

Retrieves metadata for a Lance table. Only `load_detailed_metadata=false` is supported. When `load_detailed_metadata=false`, only the table location and storage_options are returned; other fields (version, table_uri, schema, stats) are null.
Retrieves metadata for a Lance table. Only `load_detailed_metadata=false` is supported. The response includes the table location, Polaris table properties as `properties`, and `managed_versioning=false`.

The implementation:

1. Parse the table identifier to extract catalog (first level), namespace (middle levels), and table name (last level)
2. Validate that at least 3 levels are provided (catalog + namespace + table)
3. GET `/api/catalog/polaris/v1/{catalog}/namespaces/{namespace}/generic-tables/{table}`
4. Verify the table format is `lance`
5. Return the table location from `base-location` and storage_options from `properties`
5. Return the table location from `base-location` and Polaris table properties
6. If `check_declared=true`, set `is_only_declared=true` when the location cannot be opened as a Lance dataset
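
Steps 1 and 2 above amount to splitting a multi-level identifier. The sketch below shows one way to do that; `parse_table_id` is a hypothetical helper, and representing the identifier as a sequence of strings is an assumption. It does not build the request URL, because the text does not say how multi-level namespaces are encoded in the path.

```python
from typing import List, Sequence, Tuple


def parse_table_id(levels: Sequence[str]) -> Tuple[str, List[str], str]:
    """Split a table identifier into (catalog, namespace levels, table)."""
    if len(levels) < 3:
        # At least catalog + namespace + table are required.
        raise ValueError("expected at least 3 levels: catalog, namespace, table")
    return levels[0], list(levels[1:-1]), levels[-1]
```

With this shape, `["cat", "ns1", "ns2", "tbl"]` yields catalog `cat`, namespace `["ns1", "ns2"]`, and table `tbl`.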

**Error Handling:**

@@ -195,7 +197,8 @@ The implementation:

1. Parse the table identifier to extract catalog (first level), namespace (middle levels), and table name (last level)
2. Validate that at least 3 levels are provided (catalog + namespace + table)
3. DELETE `/api/catalog/polaris/v1/{catalog}/namespaces/{namespace}/generic-tables/{table}`
3. Load the generic table, then DELETE `/api/catalog/polaris/v1/{catalog}/namespaces/{namespace}/generic-tables/{table}`
4. Return the table id, location, and Polaris table properties

**Error Handling:**
