Feature/pkg config module by Marcos-dev41 · Pull Request #86 · irfanghat/spark-connect-cpp

Marcos-dev41 · 2026-04-16T20:27:27Z

[SPARK-CONNECT]CPP

Description

Added a RegexTokenizer that tokenizes ML text data according to a configurable regex pattern.
Fixed the inherits value of the configure preset in CMakeUserPresets.template.json ("inherits": "vcpkg" -> "inherits": "default") so that it matches the configure preset defined in CMakePresets.json.

Key Implementation Details

Feature/Component Name: RegexTokenizer
Added a regex tokenizer for ML data that must be split on configurable patterns, such as whitespace and special characters.
Internal Logic:
Updated the inherits value to "default" in CMakeUserPresets.template.json (used for local environments) so that it matches the preset defined in the configure presets section of CMakePresets.json.
API Changes: none
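
To illustrate what splitting on a pattern such as \W+ means for the test input used below, here is a minimal local sketch using std::regex. It is not the code in this PR: the actual transform runs on the Spark Connect server, and the function name regex_tokenize_local and its parameters are hypothetical. The sketch assumes Spark's documented RegexTokenizer defaults (the pattern is treated as a delimiter, input is lowercased, minimum token length is 1).

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper for illustration only; not part of spark-connect-cpp.
// Lowercases `text`, splits it wherever `pattern` matches, and keeps the
// pieces that are at least `min_token_length` characters long.
std::vector<std::string> regex_tokenize_local(std::string text,
                                              const std::string& pattern,
                                              std::size_t min_token_length = 1) {
    // An empty pattern would match at every position, so treat it as a
    // precondition violation in this sketch.
    assert(!pattern.empty());

    // Lowercase the input (ASCII only in this sketch).
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::regex delimiter(pattern);
    std::vector<std::string> tokens;

    // Submatch index -1 iterates over the text between delimiter matches.
    std::sregex_token_iterator it(text.begin(), text.end(), delimiter, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        const std::string token = it->str();
        // Drops empty pieces, e.g. when the text starts with a delimiter.
        if (token.size() >= min_token_length) {
            tokens.push_back(token);
        }
    }
    return tokens;
}
```

Calling regex_tokenize_local("Hello Wor!ld$ foo_bar", R"(\W+)") splits on runs of non-word characters, which is why "Wor!ld$" becomes two tokens while "foo_bar" stays whole (the underscore counts as a word character). Passing R"(\s+)" instead splits on whitespace only and leaves punctuation attached to the tokens.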

Testing

Added a local test file (test/ml/tokenizertest.cpp) that exercises both the Tokenizer and the RegexTokenizer.

New Integration Test: test/ml/tokenizertest.cpp
Manual verification via Spark Connect server logs
Running main() from /home/mark/vcpkg/buildtrees/gtest/src/v1.17.0-0c449efaff.clean/googletest/src/gtest_main.cc
Note: Google Test filter = TokenizerTest.BasicTransform
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TokenizerTest
[ RUN ] TokenizerTest.BasicTransform
============== GRPC LOGICAL PLAN ==============
root {
common {
plan_id: 0
}
ml_relation {
transform {
transformer {
name: "org.apache.spark.ml.feature.Tokenizer"
uid: "Tokenizer_54ae7692-094a-43b0-bde7-27a6a0257c8d"
type: OPERATOR_TYPE_TRANSFORMER
}
input {
common {
plan_id: 1
}
sql {
query: "SELECT 1 AS id, 'Hello World, foo bar' AS text UNION ALL SELECT 2, 'Apache Spark is great' UNION ALL SELECT 3, 'regex tokenizer test'"
}
}
params {
params {
key: "inputCol"
value {
string: "text"
}
}
params {
key: "outputCol"
value {
string: "tokens"
}
}
}
}
}
}
============== GRPC LOGICAL PLAN ==============
============== GRPC Response Schema ==============
struct {
fields {
name: "id"
data_type {
integer {
}
}
}
fields {
name: "text"
data_type {
string {
collation: "UTF8_BINARY"
}
}
}
fields {
name: "tokens"
data_type {
array {
element_type {
string {
collation: "UTF8_BINARY"
}
}
contains_null: true
}
}
nullable: true
}
}
============== GRPC Response Schema ==============
+----------------------+----------------------+----------------------+
| id | text | tokens |
+----------------------+----------------------+----------------------+
| 1 | Hello World, foo bar | [hello, world,, f... |
| 2 | Apache Spark is g... | [apache, spark, i... |
| 3 | regex tokenizer test | [regex, tokenizer... |
+----------------------+----------------------+----------------------+
[ OK ] TokenizerTest.BasicTransform (1551 ms)
[----------] 1 test from TokenizerTest (1551 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1551 ms total)
[ PASSED ] 1 test.

Test time = 1.63 sec

Test Passed.
"TokenizerTest.BasicTransform" end time: Apr 11 15:45 EAT
"TokenizerTest.BasicTransform" time elapsed: 00:00:01

5/168 Testing: RegexTokenizerTest.BasicTransform
5/168 Test: RegexTokenizerTest.BasicTransform
Command: "/home/mark/ApacheSpark/spark-connect-cpp/build/spark_connect_cpp_test" "--gtest_filter=RegexTokenizerTest.BasicTransform" "--gtest_also_run_disabled_tests"
Directory: /home/mark/ApacheSpark/spark-connect-cpp/build
"RegexTokenizerTest.BasicTransform" start time: Apr 11 15:45 EAT
Output:

Running main() from /home/mark/vcpkg/buildtrees/gtest/src/v1.17.0-0c449efaff.clean/googletest/src/gtest_main.cc
Note: Google Test filter = RegexTokenizerTest.BasicTransform
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from RegexTokenizerTest
[ RUN ] RegexTokenizerTest.BasicTransform
============== GRPC LOGICAL PLAN ==============
root {
common {
plan_id: 0
}
ml_relation {
transform {
transformer {
name: "org.apache.spark.ml.feature.RegexTokenizer"
uid: "Tokenizer_f8ccc70a-5ba2-468b-9737-4d5b0e89d846"
type: OPERATOR_TYPE_TRANSFORMER
}
input {
common {
plan_id: 1
}
sql {
query: "SELECT 1 AS id, 'Hello Wor!ld$ foo_bar' AS text UNION ALL SELECT 2, 'Apache Spark is great' UNION ALL SELECT 3, 'regex tokenizer test'"
}
}
params {
params {
key: "inputCol"
value {
string: "text"
}
}
params {
key: "outputCol"
value {
string: "tokens"
}
}
params {
key: "pattern"
value {
string: "\W+"
}
}
}
}
}
}
============== GRPC LOGICAL PLAN ==============
============== GRPC Response Schema ==============
struct {
fields {
name: "id"
data_type {
integer {
}
}
}
fields {
name: "text"
data_type {
string {
collation: "UTF8_BINARY"
}
}
}
fields {
name: "tokens"
data_type {
array {
element_type {
string {
collation: "UTF8_BINARY"
}
}
contains_null: true
}
}
nullable: true
}
}
============== GRPC Response Schema ==============
+----------------------+----------------------+----------------------+
| id | text | tokens |
+----------------------+----------------------+----------------------+
| 1 | Hello Wor!ld$ foo... | [hello, wor, ld, ... |
| 2 | Apache Spark is g... | [apache, spark, i... |
| 3 | regex tokenizer test | [regex, tokenizer... |
+----------------------+----------------------+----------------------+
[ OK ] RegexTokenizerTest.BasicTransform (712 ms)
[----------] 1 test from RegexTokenizerTest (712 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (712 ms total)
[ PASSED ] 1 test.

Test time = 0.80 sec

Test Passed.

Memory leak check (Valgrind/ASAN)

Why is this change necessary?

ML workflows need to tokenize text using configurable patterns, which the plain Tokenizer does not support.

User-Facing Changes

Does this introduce a user-facing change? (Yes/No)
If yes, provide a code snippet of the new functionality:

// Example usage here


Marcos-dev41 added 7 commits April 16, 2026 22:08

updated spark and delta spark versions

89e1e53

Configured debugger using CMake

79fcb9a

Created Regex Tokenizer for ml data

d75283e

Library file for Regextokenizer cpp file

80f39fa

Local test for the tokenizer and regextokenizer

1e8f131

Removed 'launch.json', a configured debugger made with cmake

e57a1e9

Fixed the configure preset in CMakeUserPresets.template.json ["inheri…

bbaa2a0

…ts" : "vcpkg" -> "inherits" : "default"] as defined in the cmakepresets.json

Marcos-dev41 marked this pull request as ready for review April 16, 2026 20:27

Marcos-dev41 wants to merge 7 commits into irfanghat:feature/pkg-config-module from Marcos-dev41:feature/pkg-config-module
