Feature/pkg config module #86

Open
Marcos-dev41 wants to merge 7 commits into irfanghat:feature/pkg-config-module from Marcos-dev41:feature/pkg-config-module

@Marcos-dev41

[SPARK-CONNECT]CPP

Description

  1. Added a RegexTokenizer for ML data that must be tokenized using a configurable pattern.
  2. Fixed the inherited configure preset in CMakeUserPresets.template.json ("inherits": "vcpkg" -> "inherits": "default") so that it matches the preset defined in CMakePresets.json.

Key Implementation Details

  • Feature/Component Name: RegexTokenizer
    Added a regex tokenizer for ML data that needs to be split on a configurable pattern, for example on whitespace or on special characters.

  • Internal Logic:
    Updated "inherits" to "default" in CMakeUserPresets.template.json (the template for local environments) so that it matches the preset defined in the configure presets section of CMakePresets.json.

  • API Changes: none
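
As a rough local illustration of the splitting behaviour described above, the following standalone sketch lowercases a string and splits it on a delimiter pattern using std::regex. It does not use this PR's API or Spark itself: the helper name split_on_pattern is hypothetical, and lowercasing plus dropping empty tokens are assumptions chosen to mirror the lowercase tokens shown in the test output.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper for illustration only; not part of this PR or of Spark.
// Lowercases the input, then treats every match of `pattern` as a delimiter.
std::vector<std::string> split_on_pattern(std::string text,
                                          const std::string& pattern) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });

    const std::regex delimiter(pattern);
    std::vector<std::string> tokens;

    // Submatch index -1 selects the text between delimiter matches.
    std::sregex_token_iterator it(text.begin(), text.end(), delimiter, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        if (it->length() > 0) {  // assumption: empty tokens are dropped
            tokens.push_back(it->str());
        }
    }
    return tokens;
}
```

Calling split_on_pattern("Hello Wor!ld$ foo_bar", R"(\W+)") gives hello, wor, ld, foo_bar. The underscore counts as a word character, so foo_bar stays in one piece.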

Testing

Added a local test file (test/ml/tokenizertest.cpp) that covers both the Tokenizer and the RegexTokenizer.

  • New Integration Test: test/ml/tokenizertest.cpp
  • Manual verification via Spark Connect server logs
    Running main() from /home/mark/vcpkg/buildtrees/gtest/src/v1.17.0-0c449efaff.clean/googletest/src/gtest_main.cc
    Note: Google Test filter = TokenizerTest.BasicTransform
    [==========] Running 1 test from 1 test suite.
    [----------] Global test environment set-up.
    [----------] 1 test from TokenizerTest
    [ RUN ] TokenizerTest.BasicTransform
    ============== GRPC LOGICAL PLAN ==============
    root {
    common {
    plan_id: 0
    }
    ml_relation {
    transform {
    transformer {
    name: "org.apache.spark.ml.feature.Tokenizer"
    uid: "Tokenizer_54ae7692-094a-43b0-bde7-27a6a0257c8d"
    type: OPERATOR_TYPE_TRANSFORMER
    }
    input {
    common {
    plan_id: 1
    }
    sql {
    query: "SELECT 1 AS id, 'Hello World, foo bar' AS text UNION ALL SELECT 2, 'Apache Spark is great' UNION ALL SELECT 3, 'regex tokenizer test'"
    }
    }
    params {
    params {
    key: "inputCol"
    value {
    string: "text"
    }
    }
    params {
    key: "outputCol"
    value {
    string: "tokens"
    }
    }
    }
    }
    }
    }
    ============== GRPC LOGICAL PLAN ==============
    ============== GRPC Response Schema ==============
    struct {
    fields {
    name: "id"
    data_type {
    integer {
    }
    }
    }
    fields {
    name: "text"
    data_type {
    string {
    collation: "UTF8_BINARY"
    }
    }
    }
    fields {
    name: "tokens"
    data_type {
    array {
    element_type {
    string {
    collation: "UTF8_BINARY"
    }
    }
    contains_null: true
    }
    }
    nullable: true
    }
    }
    ============== GRPC Response Schema ==============
    +----------------------+----------------------+----------------------+
    | id | text | tokens |
    +----------------------+----------------------+----------------------+
    | 1 | Hello World, foo bar | [hello, world,, f... |
    | 2 | Apache Spark is g... | [apache, spark, i... |
    | 3 | regex tokenizer test | [regex, tokenizer... |
    +----------------------+----------------------+----------------------+
    [ OK ] TokenizerTest.BasicTransform (1551 ms)
    [----------] 1 test from TokenizerTest (1551 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1551 ms total)
[ PASSED ] 1 test.

Test time = 1.63 sec

Test Passed.
"TokenizerTest.BasicTransform" end time: Apr 11 15:45 EAT
"TokenizerTest.BasicTransform" time elapsed: 00:00:01

5/168 Testing: RegexTokenizerTest.BasicTransform
5/168 Test: RegexTokenizerTest.BasicTransform
Command: "/home/mark/ApacheSpark/spark-connect-cpp/build/spark_connect_cpp_test" "--gtest_filter=RegexTokenizerTest.BasicTransform" "--gtest_also_run_disabled_tests"
Directory: /home/mark/ApacheSpark/spark-connect-cpp/build
"RegexTokenizerTest.BasicTransform" start time: Apr 11 15:45 EAT
Output:

Running main() from /home/mark/vcpkg/buildtrees/gtest/src/v1.17.0-0c449efaff.clean/googletest/src/gtest_main.cc
Note: Google Test filter = RegexTokenizerTest.BasicTransform
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from RegexTokenizerTest
[ RUN ] RegexTokenizerTest.BasicTransform
============== GRPC LOGICAL PLAN ==============
root {
common {
plan_id: 0
}
ml_relation {
transform {
transformer {
name: "org.apache.spark.ml.feature.RegexTokenizer"
uid: "Tokenizer_f8ccc70a-5ba2-468b-9737-4d5b0e89d846"
type: OPERATOR_TYPE_TRANSFORMER
}
input {
common {
plan_id: 1
}
sql {
query: "SELECT 1 AS id, 'Hello Wor!ld$ foo_bar' AS text UNION ALL SELECT 2, 'Apache Spark is great' UNION ALL SELECT 3, 'regex tokenizer test'"
}
}
params {
params {
key: "inputCol"
value {
string: "text"
}
}
params {
key: "outputCol"
value {
string: "tokens"
}
}
params {
key: "pattern"
value {
string: "\W+"
}
}
}
}
}
}
============== GRPC LOGICAL PLAN ==============
============== GRPC Response Schema ==============
struct {
fields {
name: "id"
data_type {
integer {
}
}
}
fields {
name: "text"
data_type {
string {
collation: "UTF8_BINARY"
}
}
}
fields {
name: "tokens"
data_type {
array {
element_type {
string {
collation: "UTF8_BINARY"
}
}
contains_null: true
}
}
nullable: true
}
}
============== GRPC Response Schema ==============
+----------------------+----------------------+----------------------+
| id | text | tokens |
+----------------------+----------------------+----------------------+
| 1 | Hello Wor!ld$ foo... | [hello, wor, ld, ... |
| 2 | Apache Spark is g... | [apache, spark, i... |
| 3 | regex tokenizer test | [regex, tokenizer... |
+----------------------+----------------------+----------------------+
[ OK ] RegexTokenizerTest.BasicTransform (712 ms)
[----------] 1 test from RegexTokenizerTest (712 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (712 ms total)
[ PASSED ] 1 test.

Test time = 0.80 sec

Test Passed.

  • Memory leak check (Valgrind/ASAN)
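
The tokens column in the logs above is truncated, so a plain whitespace split can be sketched locally to see why the first Tokenizer row contains "world,," in its printed array. The helper name split_on_whitespace is hypothetical. The sketch collapses runs of whitespace, which is a simplification and may differ from the server-side transformer when the input has consecutive spaces.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper for illustration only; not part of this PR or of Spark.
// Lowercases the input and splits it on whitespace, leaving punctuation
// attached to the neighbouring word.
std::vector<std::string> split_on_whitespace(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });

    std::istringstream stream(text);
    std::vector<std::string> tokens;
    std::string token;
    // operator>> skips leading whitespace and reads up to the next whitespace.
    while (stream >> token) {
        tokens.push_back(token);
    }
    return tokens;
}
```

For "Hello World, foo bar" this gives the four tokens hello, world,, foo and bar: the comma stays attached to "world", which is what a pattern such as \W+ in the RegexTokenizer test strips out.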

Why is this change necessary?

ML workflows need pattern-based tokenization, which the existing Tokenizer does not cover.

User-Facing Changes

Does this introduce a user-facing change? (Yes/No)
If yes, provide a code snippet of the new functionality:

// Example usage here

@Marcos-dev41 Marcos-dev41 marked this pull request as ready for review April 16, 2026 20:27