Support large tables #13

@vicaya

The current backend implementation converts the entire table to a Python list before paging, which makes it useful only for small/demo tables.

Here's a proposal to fix the pagination performance issue by implementing native LanceDB pagination:

Proposed Solution: Native LanceDB Pagination

Replace the current full-table scan approach with LanceDB's built-in pagination methods to read only the required rows from disk.

Implementation Changes

Current problematic code in backend/app.py:

# Lines 301-307: Loads entire table then paginates in memory
data_list = table.to_arrow().to_pylist()
total_count = len(data_list)
start_idx = offset
end_idx = min(offset + limit, total_count)
paginated_data = data_list[start_idx:end_idx]

Proposed replacement:

# Use LanceDB's native take() method for pagination
# NOTE: assumes this table object exposes take(); availability may depend on the LanceDB version
try:
    # Get total count efficiently without loading data
    total_count = table.count_rows()

    # Apply pagination at the LanceDB level
    if offset >= total_count:
        # Past the last row: return an empty page that still carries the table's columns
        result_table = table.schema.empty_table()
    else:
        # Request only the row positions that belong to this page
        indices = list(range(offset, min(offset + limit, total_count)))
        result_table = table.take(indices).to_arrow()

except Exception as e:
    logger.error(f"Native pagination failed for {dataset_name}: {e}")
    # Fallback to current method if native pagination fails
    # (this still loads the whole table, so it stays slow on large datasets)
    data_list = table.to_arrow().to_pylist()
    total_count = len(data_list)
    start_idx = offset
    end_idx = min(offset + limit, total_count)
    paginated_data = data_list[start_idx:end_idx]
    # Convert back to Arrow table...
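
The index computation in the proposal can be pulled out and checked on its own. The following is a minimal sketch, not code from backend/app.py: the helper name page_indices is hypothetical, and the rejection of negative values is an added assumption (the snippet above does not validate offset or limit, so a negative offset would produce negative indices).

```python
from typing import List


def page_indices(offset: int, limit: int, total_count: int) -> List[int]:
    """Return the row positions for one page, or [] when the page is out of range.

    Hypothetical helper: mirrors the range() expression in the proposal and
    adds a guard against negative inputs.
    """
    if offset < 0 or limit < 0:
        raise ValueError("offset and limit must be non-negative")
    if offset >= total_count:
        # Requested page starts past the last row
        return []
    # Clamp the end of the page to the table size
    return list(range(offset, min(offset + limit, total_count)))


if __name__ == "__main__":
    print(page_indices(0, 3, 10))   # [0, 1, 2]
    print(page_indices(8, 5, 10))   # [8, 9]
    print(page_indices(20, 5, 10))  # []
```

Keeping this logic in a pure function makes the edge cases (last partial page, offset past the end) testable without a LanceDB table.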

Key Benefits

  1. Memory Efficiency: Loads only the requested rows into RAM instead of the entire dataset
  2. Disk I/O Reduction: Reads only the data pages needed for the requested rows
  3. Faster Response Times: Eliminates the full-table scan on each pagination request
  4. Scalability: Works efficiently with datasets containing millions of rows
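
To put a rough number on these benefits, here is a back-of-the-envelope sketch. It is a simplified model, not a measurement: it assumes the current code materializes every row on every request and that native pagination materializes only the rows of the requested page, and it ignores storage-level page granularity. The function name rows_materialized is hypothetical.

```python
import math


def rows_materialized(total_rows: int, limit: int, full_scan: bool) -> int:
    """Rows converted to Python objects while a client pages through the whole table."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    pages = math.ceil(total_rows / limit)
    if full_scan:
        # Every request loads the full table before slicing out one page
        return pages * total_rows
    # Each request loads only its own page; the pages add up to the table size
    return total_rows


if __name__ == "__main__":
    n, limit = 1_000_000, 100
    print(rows_materialized(n, limit, full_scan=True))   # 10000000000
    print(rows_materialized(n, limit, full_scan=False))  # 1000000
```

Under this model, browsing a million-row table 100 rows at a time touches 10,000 times more rows with the in-memory slicing approach than with page-sized reads.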

Implementation Strategy

  1. Primary Method: Use table.count_rows() for total count and table.take(indices) for paginated data
  2. Fallback: Keep the current implementation as a backup for compatibility with older Lance versions
  3. Column Filtering: Apply column selection after pagination to minimize data transfer
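
The three steps above can be sketched end to end without a real database. Everything in this example is illustrative: InMemoryTable is a hypothetical stand-in that returns lists of dicts rather than Arrow tables, fetch_page is a hypothetical function name, and the to_pylist() fallback on the stand-in is an assumption used to mimic the current full-load path. Only count_rows() and take(indices) follow the names used in this proposal.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Any]


class InMemoryTable:
    """Hypothetical stand-in for a table handle; not part of LanceDB."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows
        self.full_reads = 0  # counts how often the whole table was loaded

    def count_rows(self) -> int:
        return len(self._rows)

    def take(self, indices: Sequence[int]) -> List[Row]:
        return [self._rows[i] for i in indices]

    def to_pylist(self) -> List[Row]:
        self.full_reads += 1
        return list(self._rows)


def fetch_page(
    table: Any, offset: int, limit: int, columns: Optional[List[str]] = None
) -> Tuple[List[Row], int]:
    """Return (rows for the page, total row count)."""
    if offset < 0 or limit < 0:
        raise ValueError("offset and limit must be non-negative")
    try:
        # 1. Primary method: count, then read only the requested positions
        total = table.count_rows()
        end = min(offset + limit, total)
        rows = table.take(range(offset, end)) if offset < total else []
    except AttributeError:
        # 2. Fallback: the handle lacks count_rows()/take(), so load everything
        all_rows = table.to_pylist()
        total = len(all_rows)
        rows = all_rows[offset:offset + limit]
    if columns is not None:
        # 3. Column filtering, applied after pagination
        rows = [{name: row[name] for name in columns} for row in rows]
    return rows, total


if __name__ == "__main__":
    table = InMemoryTable([{"id": i, "name": f"row{i}"} for i in range(10)])
    print(fetch_page(table, offset=8, limit=5, columns=["id"]))
    print(table.full_reads)
```

One design choice worth considering: this sketch falls back only on AttributeError (a missing method) rather than on any Exception, so an unrelated failure does not silently trigger a full-table load on a large dataset.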

Frontend Compatibility

No changes needed - the frontend already sends correct pagination parameters and will benefit from faster response times.
