Word2Vec estimator and ParamMap foundation #95
Open
devsimiyu wants to merge 2 commits into
…ansform initial implementation
…sformer, with support for Params and ParamMap to represent ML Params
Description
Adds initial support for the ML feature surface: the `Word2Vec` estimator (`fit`) and the `Word2VecModel` transformer (`transform`, `getVectors`), built on a new variant-based `ParamMap` and a templated `Param<T>` that together mirror Spark's Scala `ml.param` API. Also wires up the Spark Connect ML protobufs and moves the existing ML feature tests under the `tests/spark/` integration suite.

Key Implementation Details
- `Param<T>` (`src/ml/param/param.h`): Templated descriptor carrying `(parent, name, doc, validator)`. `key() = parent + "__" + name` is used as the internal storage key in `ParamMap`. Ships two reusable validators (`param_is_valid_number`, `param_is_valid_string`).
- `ParamMap` (`src/ml/param/param_map.{h,cpp}`): Non-templated `std::map<std::string, ParamValue>` where `ParamValue = std::variant<std::string, int, int64_t, double, bool>`. Exposes both a typed `Param<T>`-keyed API (`put` / `get` / `getOrElse` / `contains` / `remove`) and a string-keyed API for ad-hoc access. Replaces the prior `param_map_proto.h` shim.
- `to_ml_params` wire serialization: Converts a `ParamMap` into a `spark.connect.MlParams` proto. The internal key is `parent__name`, but the wire map is keyed by the bare param name, because the parent uid is implicit in the estimator the params attach to. The `parent__` prefix is stripped at serialization time so the server can resolve each param against the estimator's definitions.
- `Word2Vec` (`src/ml/feature/word_2_vec.{h,cpp}`): Estimator with the full Spark param surface (`inputCol`, `outputCol`, `vectorSize`, `minCount`, `numPartitions`, `maxIter`, `windowSize`, `maxSentenceLength`, `stepSize`, `seed`). `fit()` builds a `Plan → Command → MlCommand → Fit` request, sends the estimator's `uid_` (used as the param parent prefix at construction) as the operator uid, and returns a `Word2VecModel` carrying the server-side `ObjectRef`.
- `Word2VecModel` (`src/ml/feature/word_2_vec_model.{h,cpp}`): Transformer with `transform(DataFrame)` (builds an `MlRelation.Transform` referencing the model `ObjectRef`) and `getVectors()` (builds an `MlRelation.Fetch` with method `getVectors`).
- `Tokenizer` (`src/ml/feature/tokenizer.{h,cpp}`): Updated to use the new `Param<T>` / `ParamMap` types instead of the prior proto-backed map.
- Moves `tests/databricks/ml/feature/{tokenizer,word_2_vec,word_2_vec_model}.cpp` to `tests/spark/ml/feature/`, since these are Spark Connect integration tests rather than Databricks-specific. Adds `tests/spark/ml/feature/params_clear.cpp` to cover `clear(name)` on `Word2Vec`.

Testing
- `SparkIntegrationTest.Word2VecFit`
- `SparkIntegrationTest.Word2VecTransform`
- `SparkIntegrationTest.Word2VecGetVectors`
- `SparkIntegrationTest.TokenizerTransform`
- `SparkIntegrationTest.Word2VecParamsClear`
- (… `MlParams` arrives keyed by bare param names; fixes the "Param Word2Vec___inputCol does not exist" server-side rejection)

Why is this change necessary?
To reach feature parity with the PySpark / Scala ML APIs for Word2Vec via Spark Connect, and to establish the `Param<T>` / `ParamMap` primitives that subsequent estimators and transformers in `spark-connect-cpp` will reuse. A typed param system is a prerequisite for any non-trivial ML operator, since each estimator needs to declare its hyperparameters and serialize them safely onto an `MlParams` proto.

User-Facing Changes
Does this introduce a user-facing change? Yes
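
For reviewers who want a quick mental model of the param design described under Key Implementation Details, here is a minimal, self-contained sketch (C++17). It is not the code in this PR: the member layout of `Param<T>`, the `entries()` accessor, the `WireParams` alias, and the `to_wire_params` helper are all assumptions made for illustration, validators are left out, and a plain `std::map` stands in for the `spark.connect.MlParams` proto. It shows only the two ideas from the description: values are stored under `parent + "__" + name`, and the `parent__` prefix is dropped when producing the wire map.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <variant>

// Variant of supported value types, as listed in the PR description.
using ParamValue = std::variant<std::string, int, int64_t, double, bool>;

// Hypothetical simplified descriptor (the real Param<T> also carries a validator).
template <typename T>
struct Param {
    using value_type = T;
    std::string parent;  // uid of the owning estimator or transformer
    std::string name;    // bare param name, e.g. "vectorSize"
    std::string doc;

    // Internal storage key: parent + "__" + name.
    std::string key() const { return parent + "__" + name; }
};

// Hypothetical simplified map; only the typed put/get pair is sketched.
class ParamMap {
public:
    template <typename T>
    void put(const Param<T>& param, typename Param<T>::value_type value) {
        // Select the variant alternative explicitly so T is stored as-is.
        values_[param.key()] = ParamValue(std::in_place_type<T>, std::move(value));
    }

    template <typename T>
    std::optional<T> get(const Param<T>& param) const {
        auto it = values_.find(param.key());
        if (it == values_.end()) return std::nullopt;
        if (const T* v = std::get_if<T>(&it->second)) return *v;
        return std::nullopt;  // key present but stored with a different type
    }

    const std::map<std::string, ParamValue>& entries() const { return values_; }

private:
    std::map<std::string, ParamValue> values_;
};

// Stand-in for the MlParams proto: a map keyed by bare param name.
using WireParams = std::map<std::string, ParamValue>;

// Assumed helper: strips "<parent_uid>__" from each key that carries it.
WireParams to_wire_params(const ParamMap& params, const std::string& parent_uid) {
    const std::string prefix = parent_uid + "__";
    WireParams wire;
    for (const auto& [key, value] : params.entries()) {
        if (key.compare(0, prefix.size(), prefix) == 0) {
            wire[key.substr(prefix.size())] = value;
        } else {
            wire[key] = value;  // keys without the prefix pass through unchanged
        }
    }
    return wire;
}

// Usage example with a made-up uid.
void example_usage() {
    const std::string uid = "Word2Vec_1234";
    Param<int> vector_size{uid, "vectorSize", "dimension of the word vectors"};
    Param<std::string> input_col{uid, "inputCol", "input column name"};

    ParamMap params;
    params.put(vector_size, 100);
    params.put(input_col, "text");

    assert(vector_size.key() == "Word2Vec_1234__vectorSize");
    assert(params.get(vector_size).value() == 100);

    WireParams wire = to_wire_params(params, uid);
    assert(wire.count("inputCol") == 1);
    assert(wire.count("Word2Vec_1234__inputCol") == 0);
}
```

Keeping the prefixed key internal lets one `ParamMap` hold params from several operators without collisions, while the server only ever sees names it can match against the estimator's own definitions.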