Feature/pkg config module #86
Open
Marcos-dev41 wants to merge 7 commits into
…ts" : "vcpkg" -> "inherits" : "default"] as defined in the cmakepresets.json
[SPARK-CONNECT]CPP
Description
Key Implementation Details
Feature/Component Name: Regex Tokenizer
Added regex tokenizer support for ML data that needs to be split on various patterns, such as whitespace and special characters.
Internal Logic:
Fixed the "inherits" entry in CMakeUserPresets.template.json (used for local environments): it now points to "default" instead of "vcpkg", matching the preset defined in the configure presets section of CMakePresets.json.
API Changes: none
Testing
Added a local test file (test/ml/tokenizertest.cpp) that covers both the Tokenizer and the RegexTokenizer.
test/ml/tokenizertest.cpp
Running main() from /home/mark/vcpkg/buildtrees/gtest/src/v1.17.0-0c449efaff.clean/googletest/src/gtest_main.cc
Note: Google Test filter = TokenizerTest.BasicTransform
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TokenizerTest
[ RUN ] TokenizerTest.BasicTransform
============== GRPC LOGICAL PLAN ==============
root {
  common {
    plan_id: 0
  }
  ml_relation {
    transform {
      transformer {
        name: "org.apache.spark.ml.feature.Tokenizer"
        uid: "Tokenizer_54ae7692-094a-43b0-bde7-27a6a0257c8d"
        type: OPERATOR_TYPE_TRANSFORMER
      }
      input {
        common {
          plan_id: 1
        }
        sql {
          query: "SELECT 1 AS id, 'Hello World, foo bar' AS text UNION ALL SELECT 2, 'Apache Spark is great' UNION ALL SELECT 3, 'regex tokenizer test'"
        }
      }
      params {
        params {
          key: "inputCol"
          value {
            string: "text"
          }
        }
        params {
          key: "outputCol"
          value {
            string: "tokens"
          }
        }
      }
    }
  }
}
============== GRPC LOGICAL PLAN ==============
============== GRPC Response Schema ==============
struct {
  fields {
    name: "id"
    data_type {
      integer {
      }
    }
  }
  fields {
    name: "text"
    data_type {
      string {
        collation: "UTF8_BINARY"
      }
    }
  }
  fields {
    name: "tokens"
    data_type {
      array {
        element_type {
          string {
            collation: "UTF8_BINARY"
          }
        }
        contains_null: true
      }
    }
    nullable: true
  }
}
============== GRPC Response Schema ==============
+----------------------+----------------------+----------------------+
| id | text | tokens |
+----------------------+----------------------+----------------------+
| 1 | Hello World, foo bar | [hello, world,, f... |
| 2 | Apache Spark is g... | [apache, spark, i... |
| 3 | regex tokenizer test | [regex, tokenizer... |
+----------------------+----------------------+----------------------+
[ OK ] TokenizerTest.BasicTransform (1551 ms)
[----------] 1 test from TokenizerTest (1551 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1551 ms total)
[ PASSED ] 1 test.
Test time = 1.63 sec
Test Passed.
"TokenizerTest.BasicTransform" end time: Apr 11 15:45 EAT
"TokenizerTest.BasicTransform" time elapsed: 00:00:01
5/168 Testing: RegexTokenizerTest.BasicTransform
5/168 Test: RegexTokenizerTest.BasicTransform
Command: "/home/mark/ApacheSpark/spark-connect-cpp/build/spark_connect_cpp_test" "--gtest_filter=RegexTokenizerTest.BasicTransform" "--gtest_also_run_disabled_tests"
Directory: /home/mark/ApacheSpark/spark-connect-cpp/build
"RegexTokenizerTest.BasicTransform" start time: Apr 11 15:45 EAT
Output:
Running main() from /home/mark/vcpkg/buildtrees/gtest/src/v1.17.0-0c449efaff.clean/googletest/src/gtest_main.cc
Note: Google Test filter = RegexTokenizerTest.BasicTransform
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from RegexTokenizerTest
[ RUN ] RegexTokenizerTest.BasicTransform
============== GRPC LOGICAL PLAN ==============
root {
  common {
    plan_id: 0
  }
  ml_relation {
    transform {
      transformer {
        name: "org.apache.spark.ml.feature.RegexTokenizer"
        uid: "Tokenizer_f8ccc70a-5ba2-468b-9737-4d5b0e89d846"
        type: OPERATOR_TYPE_TRANSFORMER
      }
      input {
        common {
          plan_id: 1
        }
        sql {
          query: "SELECT 1 AS id, 'Hello Wor!ld$ foo_bar' AS text UNION ALL SELECT 2, 'Apache Spark is great' UNION ALL SELECT 3, 'regex tokenizer test'"
        }
      }
      params {
        params {
          key: "inputCol"
          value {
            string: "text"
          }
        }
        params {
          key: "outputCol"
          value {
            string: "tokens"
          }
        }
        params {
          key: "pattern"
          value {
            string: "\W+"
          }
        }
      }
    }
  }
}
============== GRPC LOGICAL PLAN ==============
============== GRPC Response Schema ==============
struct {
  fields {
    name: "id"
    data_type {
      integer {
      }
    }
  }
  fields {
    name: "text"
    data_type {
      string {
        collation: "UTF8_BINARY"
      }
    }
  }
  fields {
    name: "tokens"
    data_type {
      array {
        element_type {
          string {
            collation: "UTF8_BINARY"
          }
        }
        contains_null: true
      }
    }
    nullable: true
  }
}
============== GRPC Response Schema ==============
+----------------------+----------------------+----------------------+
| id | text | tokens |
+----------------------+----------------------+----------------------+
| 1 | Hello Wor!ld$ foo... | [hello, wor, ld, ... |
| 2 | Apache Spark is g... | [apache, spark, i... |
| 3 | regex tokenizer test | [regex, tokenizer... |
+----------------------+----------------------+----------------------+
[ OK ] RegexTokenizerTest.BasicTransform (712 ms)
[----------] 1 test from RegexTokenizerTest (712 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (712 ms total)
[ PASSED ] 1 test.
Test time = 0.80 sec
Test Passed.
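As a rough local cross-check of the token columns in the output above, the sketch below mimics what the server-side RegexTokenizer is expected to do under Spark's default settings: lowercase the text, split on the pattern, and drop empty tokens. It is not part of this PR and does not use the spark-connect-cpp API. The helper name regex_tokenize is hypothetical, and std::regex (ECMAScript syntax) only approximates the Java regex engine that Spark uses, so treat it as an illustration rather than a reference implementation.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// Hypothetical local helper (not part of spark-connect-cpp).
// Assumes Spark's RegexTokenizer defaults: the pattern matches the gaps
// between tokens, input is lowercased, and empty tokens are dropped.
std::vector<std::string> regex_tokenize(const std::string& text,
                                        const std::string& pattern = R"(\s+)") {
    // Lowercase the input before applying the pattern.
    std::string lowered(text);
    std::transform(lowered.begin(), lowered.end(), lowered.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::regex gap(pattern);
    // Submatch index -1 selects the text between matches, i.e. the tokens.
    std::sregex_token_iterator it(lowered.begin(), lowered.end(), gap, -1);
    const std::sregex_token_iterator end;

    std::vector<std::string> tokens;
    for (; it != end; ++it) {
        // Skip the empty token produced when the text starts with a delimiter.
        if (it->length() > 0) {
            tokens.push_back(it->str());
        }
    }
    return tokens;
}
```

Called with the pattern \W+ on 'Hello Wor!ld$ foo_bar', this yields hello, wor, ld, foo_bar, which is consistent with the truncated first row of the RegexTokenizer table. The underscore survives because \W treats it as a word character. With the default whitespace pattern, 'Hello World, foo bar' yields hello, world,, foo, bar, consistent with the plain Tokenizer row.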
Why is this change necessary?
ML workflows need pattern-based tokenization, for example splitting text on special characters as well as whitespace.
User-Facing Changes
Does this introduce a user-facing change? (Yes/No)
If yes, provide a code snippet of the new functionality:
// Example usage here