Word2Vec estimator and ParamMap foundation #95

Open
devsimiyu wants to merge 2 commits into irfanghat:branch-4.1-stable from devsimiyu:feature/ml-feature

@devsimiyu
Collaborator

Description

Adds initial support for the ML feature surface: the Word2Vec estimator (fit) and the Word2VecModel transformer (transform, getVectors), built on a new variant-based ParamMap and a templated Param<T> that together mirror Spark's Scala ml.param API. Also wires up the Spark Connect ML protobufs and moves the existing ML feature tests under the tests/spark/ integration suite.

Key Implementation Details

  • Param<T> (src/ml/param/param.h): Templated descriptor carrying (parent, name, doc, validator). key() = parent + "__" + name is used as the internal storage key in ParamMap. Ships two reusable validators (param_is_valid_number, param_is_valid_string).
  • ParamMap (src/ml/param/param_map.{h,cpp}): Non-templated std::map<std::string, ParamValue> where ParamValue = std::variant<std::string, int, int64_t, double, bool>. Exposes both a typed Param<T>-keyed API (put/get/getOrElse/contains/remove) and a string-keyed API for ad-hoc access. Replaces the prior param_map_proto.h shim.
  • to_ml_params wire serialization: Converts a ParamMap into a spark.connect.MlParams proto. The internal key is parent__name, but the wire map is keyed by the bare param name, because the parent uid is implicit in the estimator the params attach to. The parent__ prefix is stripped at serialization time so the server can resolve each param against the estimator's definitions.
  • Word2Vec (src/ml/feature/word_2_vec.{h,cpp}): Estimator with the full Spark param surface (inputCol, outputCol, vectorSize, minCount, numPartitions, maxIter, windowSize, maxSentenceLength, stepSize, seed). fit() builds a Plan → Command → MlCommand → Fit request, sends the estimator's uid_ (used as the param parent prefix at construction) as the operator uid, and returns a Word2VecModel carrying the server-side ObjectRef.
  • Word2VecModel (src/ml/feature/word_2_vec_model.{h,cpp}): Transformer with transform(DataFrame) (builds an MlRelation.Transform referencing the model ObjectRef) and getVectors() (builds an MlRelation.Fetch with method getVectors).
  • Tokenizer (src/ml/feature/tokenizer.{h,cpp}): Updated to use the new Param<T> / ParamMap types instead of the prior proto-backed map.
  • Test layout: Moves tests/databricks/ml/feature/{tokenizer,word_2_vec,word_2_vec_model}.cpp to tests/spark/ml/feature/, since these are Spark Connect integration tests rather than Databricks-specific. Adds tests/spark/ml/feature/params_clear.cpp to cover clear(name) on Word2Vec.
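
To make the key scheme above concrete, here is a minimal, self-contained sketch of a variant-backed param map and the prefix stripping done at serialization time. It is not the code from this PR: the member layout of Param<T>, the names get_or_else, entries and to_wire_params, and the use of a plain std::map in place of the spark.connect.MlParams proto are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

// Same alternative set as described in the PR.
using ParamValue = std::variant<std::string, int, int64_t, double, bool>;

// Hypothetical descriptor: (parent, name, doc, validator).
template <typename T>
struct Param {
    std::string parent;                        // uid of the owning estimator
    std::string name;                          // bare param name, e.g. "inputCol"
    std::string doc;
    std::function<bool(const T&)> validator;   // empty means "no validation"

    // Internal storage key: parent + "__" + name.
    std::string key() const { return parent + "__" + name; }
};

// Hypothetical, reduced ParamMap: typed put/get over a string-keyed map.
class ParamMap {
public:
    template <typename T>
    void put(const Param<T>& p, const T& value) {
        if (p.validator && !p.validator(value)) {
            throw std::invalid_argument("invalid value for param " + p.name);
        }
        // in_place_type pins the stored alternative to T exactly.
        values_.insert_or_assign(p.key(), ParamValue{std::in_place_type<T>, value});
    }

    template <typename T>
    T get_or_else(const Param<T>& p, const T& fallback) const {
        auto it = values_.find(p.key());
        return it == values_.end() ? fallback : std::get<T>(it->second);
    }

    const std::map<std::string, ParamValue>& entries() const { return values_; }

private:
    std::map<std::string, ParamValue> values_;
};

// Stand-in for the wire conversion: returns a map keyed by bare param name.
// Keys that do not carry the "parent__" prefix are passed through unchanged.
std::map<std::string, ParamValue> to_wire_params(const ParamMap& params,
                                                 const std::string& parent) {
    const std::string prefix = parent + "__";
    std::map<std::string, ParamValue> wire;
    for (const auto& [key, value] : params.entries()) {
        const bool has_prefix = key.compare(0, prefix.size(), prefix) == 0;
        wire.emplace(has_prefix ? key.substr(prefix.size()) : key, value);
    }
    return wire;
}
```

With a parent uid such as "Word2Vec_1234" (a made-up value), a param named inputCol is stored under Word2Vec_1234__inputCol and appears in the wire map as inputCol, which is the form the server is described as expecting.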

Testing

  • New Integration Test: SparkIntegrationTest.Word2VecFit
  • New Integration Test: SparkIntegrationTest.Word2VecTransform
  • New Integration Test: SparkIntegrationTest.Word2VecGetVectors
  • New Integration Test: SparkIntegrationTest.TokenizerTransform
  • New Integration Test: SparkIntegrationTest.Word2VecParamsClear
  • Manual verification via Spark Connect server logs (confirmed estimator uid matches param parent, and that MlParams arrives keyed by bare param names — fixes "Param Word2Vec___inputCol does not exist" server-side rejection)
  • Memory leak check (Valgrind/ASAN) — not yet run

Why is this change necessary?

To reach feature parity with the PySpark / Scala ML APIs for Word2Vec via Spark Connect, and to establish the Param<T> / ParamMap primitives that subsequent estimators and transformers in spark-connect-cpp will reuse. A typed param system is a prerequisite for any non-trivial ML operator, since each estimator needs to declare its hyperparameters and serialize them safely onto an MlParams proto.

User-Facing Changes

Does this introduce a user-facing change? Yes

#include "dataframe.h"
#include "ml/feature/word_2_vec.h"
#include "ml/feature/word_2_vec_model.h"

auto df = spark->sql(R"(
    SELECT *
    FROM VALUES
        (ARRAY('spark', 'connect', 'is', 'fast')),
        (ARRAY('word', 'to', 'vec', 'is', 'cool')),
        (ARRAY('cpp', 'client', 'for', 'spark'))
    AS sentences(words)
)");

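// Configure the estimator's input/output columns and hyperparameters.
Word2Vec word2vec;
word2vec.set_input_col("words");
word2vec.set_output_col("features");
word2vec.set_vector_size(3);
word2vec.set_min_count(1);

// fit() sends a Fit request to the Spark Connect server and returns a model
// that carries the server-side ObjectRef.
Word2VecModel model = word2vec.fit(df);

// Fetch the learned word vectors, then apply the model to the input DataFrame.
DataFrame vectors = model.getVectors();
DataFrame transformed = model.transform(df);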

…sformer, with support for Params and ParamMap to represent ML Params
@devsimiyu devsimiyu requested a review from irfanghat as a code owner May 12, 2026 13:44