Add Sparse Vectors Decoding #97

Open
devsimiyu wants to merge 4 commits into irfanghat:branch-4.1-stable from devsimiyu:feature/ml-vectors
@devsimiyu
Collaborator

Description

Adds initial support for Spark ML linalg vectors to the C++ client. Introduces DenseVector and SparseVector classes under src/ml/linalg/vectors/ and teaches Row::get<T>() to decode Spark's vector UDT struct ({type, size, indices, values}) returned by the server into a typed SparseVector. Also extends the value printer in dataframe.cpp to handle INT8 and STRUCT Arrow types so that vector columns render in show().

Key Implementation Details

  • SparseVector / DenseVector (src/ml/linalg/vectors/sparse_vector.h, src/ml/linalg/vectors/dense_vector.h)
  • Vector UDT decoding (src/types.h): Added a Row::get<SparseVector>() specialization.
  • Arrow rendering (src/dataframe.cpp): Added INT8 and STRUCT cases to arrayValueToString.
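
For reviewers unfamiliar with the vector UDT layout, the decode step can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `VectorUdtFields`, `SketchSparseVector`, and `decodeSparse` are hypothetical names, and the field meanings (type 0 = sparse, 1 = dense; `size` only set for sparse vectors) follow Spark's VectorUDT schema. The `type` field is a byte, which is why INT8 rendering is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the struct returned by the server.
// Field names mirror Spark's VectorUDT; this type is not part of the PR.
struct VectorUdtFields {
    std::int8_t type;                  // 0 = sparse, 1 = dense
    std::optional<std::int32_t> size;  // logical length; set for sparse vectors
    std::vector<std::int32_t> indices; // positions of the stored entries
    std::vector<double> values;        // stored entries, parallel to indices
};

// Hypothetical sparse vector; the real SparseVector in the PR may differ.
struct SketchSparseVector {
    std::int32_t size;
    std::vector<std::int32_t> indices;
    std::vector<double> values;

    // Counts stored entries that are not exactly zero.
    std::size_t numNonzeros() const {
        return static_cast<std::size_t>(
            std::count_if(values.begin(), values.end(),
                          [](double v) { return v != 0.0; }));
    }

    // p-norm over the stored entries; implicit zeros contribute nothing.
    double norm(double p) const {
        assert(p >= 1.0 && "this sketch only handles p >= 1");
        if (std::isinf(p)) {
            double m = 0.0;
            for (double v : values) m = std::max(m, std::fabs(v));
            return m;
        }
        if (p == 1.0) {
            double s = 0.0;
            for (double v : values) s += std::fabs(v);
            return s;
        }
        if (p == 2.0) {
            double s = 0.0;
            for (double v : values) s += v * v;
            return std::sqrt(s);
        }
        double s = 0.0;
        for (double v : values) s += std::pow(std::fabs(v), p);
        return std::pow(s, 1.0 / p);
    }
};

// Validates the UDT fields and builds the sparse vector.
inline SketchSparseVector decodeSparse(const VectorUdtFields& f) {
    if (f.type != 0) throw std::invalid_argument("not a sparse vector");
    if (!f.size) throw std::invalid_argument("sparse vector without size");
    if (f.indices.size() != f.values.size())
        throw std::invalid_argument("indices/values length mismatch");
    return SketchSparseVector{*f.size, f.indices, f.values};
}
```

In this sketch the L-infinity norm is requested by passing `std::numeric_limits<double>::infinity()` as `p`, mirroring how Spark's `Vectors.norm` treats an infinite `p`; whether the C++ client adopts the same convention is up to the implementation.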

API Changes

  • New public types: SparseVector, DenseVector.
  • New Row::get<SparseVector>(column_name) template instantiation.

Testing

  • New integration test: SparkIntegrationTest.SparkVector (tests/spark/ml/linalg/vectors/sparse_vector.cpp). Verifies the decode + norm path by constructing a Row that mirrors Spark's vector UDT and asserting the norms.

Why is this change necessary?

This change lets the C++ client consume DataFrames that contain ML feature columns (e.g. the output of a HashingTF transformer). Sparse vectors are the first piece; DenseVector decoding and other linalg ops will follow.

User-Facing Changes

Does this introduce a user-facing change? Yes.

#include <iostream>

#include "ml/linalg/vectors/sparse_vector.h"

// Given a DataFrame with a Spark ML vector column "features":
auto rows = df.collect();
SparseVector v = rows[0]->get<SparseVector>("features");

std::cout << "nnz="   << v.numNonzeros() << "\n"
          << "L1="    << v.norm(1)       << "\n"
          << "L2="    << v.norm(2)       << "\n"
          << "L3="    << v.norm(3)       << "\n"
          << "argmax="<< v.argmax()      << "\n";
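
One subtlety behind `argmax()` on a sparse vector is that entries missing from `indices` are implicit zeros, so they can win when every stored value is negative. The sketch below shows one way to handle that. It is an assumption about the intended semantics (modelled on Spark's convention of returning the first maximal index, or -1 for an empty vector), and `sparseArgmax` is a hypothetical free function, not the method added in this PR. It assumes `indices` is strictly ascending and within `[0, size)`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index of the first maximal element of a sparse vector,
// treating positions absent from `indices` as 0.0. Returns -1 if size <= 0.
// Hypothetical helper; assumes `indices` is strictly ascending and in range.
inline std::int32_t sparseArgmax(std::int32_t size,
                                 const std::vector<std::int32_t>& indices,
                                 const std::vector<double>& values) {
    if (size <= 0) return -1;

    // Find the first position that is not stored explicitly, if any.
    std::int32_t firstImplicit = -1;
    if (indices.size() < static_cast<std::size_t>(size)) {
        std::int32_t expected = 0;
        for (std::int32_t idx : indices) {
            if (idx != expected) break;
            ++expected;
        }
        firstImplicit = expected;
    }

    // Start from the implicit zero (value 0.0) when one exists.
    bool have = firstImplicit >= 0;
    std::int32_t best = firstImplicit;
    double bestVal = 0.0;

    for (std::size_t k = 0; k < indices.size(); ++k) {
        const bool better = !have || values[k] > bestVal ||
                            (values[k] == bestVal && indices[k] < best);
        if (better) {
            best = indices[k];
            bestVal = values[k];
            have = true;
        }
    }
    return best;
}
```

With this convention, a vector of size 4 with stored entries -1.0 at index 0 and -2.0 at index 2 has its maximum at index 1, the first implicit zero.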

@devsimiyu devsimiyu requested a review from irfanghat as a code owner May 23, 2026 23:36