Replace StandardisePostprocessor with Transform #1288
padam-prakash wants to merge 1 commit into zinggAI:main
Remove docs/StandardisePostprocessor.md and add docs/Transform.md, which relocates and reworks the standardisation documentation into the Transform phase. Update docs/SUMMARY.md to point to Transform.md. The new document updates examples and configuration keys (e.g. setTransformers / StandardiseTransformerType, JSON "transformers" entry), expands the Python and JSON examples, and adds CLI usage for running the Transform phase.
Pull request overview
This PR updates the documentation to move/rename “Standardise Postprocessor” guidance into a new “Transform” phase doc and updates the docs table of contents accordingly.
Changes:
- Added docs/Transform.md with updated standardisation/transform documentation, examples, and CLI usage.
- Updated docs/SUMMARY.md to link to the new Transform documentation.
- Removed docs/StandardisePostprocessor.md.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| docs/Transform.md | New documentation for the Transform phase and standardisation usage, including Python/JSON/CLI examples. |
| docs/SUMMARY.md | Updates navigation entry from Standardise Postprocessor to Transform. |
| docs/StandardisePostprocessor.md | Removes the old Standardise Postprocessor document in favor of the new Transform doc. |

    inputPipe = CsvPipe("input", "examples/febrl/input.csv")
    outputPipe = CsvPipe("output", "examples/febrl/transformed_output.csv")

The example constructs CsvPipe(...) even though it imports Enterprise pipes (zinggEC.enterprise.common.epipes), where the documented CSV pipe is ECsvPipe. This mismatch is likely to confuse users or fail at runtime; please switch to the correct pipe class for the chosen API.

Suggested change:

    - inputPipe = CsvPipe("input", "examples/febrl/input.csv")
    - outputPipe = CsvPipe("output", "examples/febrl/transformed_output.csv")
    + inputPipe = ECsvPipe("input", "examples/febrl/input.csv")
    + outputPipe = ECsvPipe("output", "examples/febrl/transformed_output.csv")


    ```python
    fname = EFieldDefinition("fname", "string", MatchType.FUZZY)
    fname.setTransformers([StandardiseTransformerType("STANDARDISE", "nicknames_test")])
    ```

This doc introduces setTransformers(...) and StandardiseTransformerType(...), but there are no other references to these APIs in the repo, and the existing Enterprise API docs for EFieldDefinition document setPostProcessors(...)/StandardisePostprocessorType(...) instead. Please align the naming with the actual API (or update the Enterprise API docs in the same PR if the API truly changed).
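
One way to check the "no other references in the repo" claim is a small search script. The sketch below is only illustrative: the function name `find_references` and the list of file suffixes are assumptions, not part of Zingg, and it does plain substring matching rather than parsing any language.

```python
from pathlib import Path


def find_references(root, names, suffixes=(".py", ".java", ".md", ".json")):
    """Map each name to the files under root whose text contains it."""
    root = Path(root)
    hits = {name: [] for name in names}
    for path in sorted(root.rglob("*")):
        # Skip directories and file types we did not ask for.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for name in names:
            if name in text:
                # Store paths relative to root, with forward slashes.
                hits[name].append(path.relative_to(root).as_posix())
    return hits
```

Calling `find_references(".", ["setTransformers", "setPostProcessors"])` from the repository root would show which spelling the code and the existing docs actually use. An empty list for a name means no scanned file mentions it.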

    from zinggEC.enterprise.common.EFieldDefinition import EFieldDefinition
    from zinggEC.enterprise.common.StandardiseTransformerType import StandardiseTransformerType
    from zinggEC.enterprise.common.MappingMatchType import MappingMatchType
    from zinggEC.enterprise.common.epipes import *
    from zinggES.enterprise.spark.ESparkClient import *

The imports in this example mix Enterprise modules (zinggEC.enterprise.common.*, zinggES.enterprise.spark.*) with later usage of CsvPipe(...) (OSS naming). Please keep the example consistent with one API surface (Enterprise: ECsvPipe, OSS: zingg.pipes.CsvPipe) to avoid copy/paste failures.
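
A mismatch like this can be caught mechanically before a doc example ships. The following is a minimal sketch, assuming that Enterprise imports always start with `zinggEC.` or `zinggES.` and that an unprefixed `CsvPipe(` call is the OSS class. The helper name `find_mixed_pipes` is hypothetical, and the check is a text heuristic, not an import resolver.

```python
import re

# Matches lines such as "from zinggEC.enterprise.common.epipes import *".
ENTERPRISE_IMPORT = re.compile(r"\s*from\s+zingg(?:EC|ES)\.")
# \b prevents a match inside "ECsvPipe(", since "E" and "C" are both word characters.
OSS_CSV_PIPE = re.compile(r"\bCsvPipe\(")


def find_mixed_pipes(snippet):
    """Return (line number, line) pairs that call CsvPipe in a snippet with Enterprise imports."""
    lines = snippet.splitlines()
    if not any(ENTERPRISE_IMPORT.match(line) for line in lines):
        return []
    return [
        (number, line.strip())
        for number, line in enumerate(lines, start=1)
        if OSS_CSV_PIPE.search(line)
    ]
```

Run against the example in docs/Transform.md, this would flag both pipe constructions. A snippet that uses only OSS imports, or only `ECsvPipe`, returns an empty list.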

    inputPipe = CsvPipe("input", "examples/febrl/input.csv")
    outputPipe = CsvPipe("output", "examples/febrl/transformed_output.csv")

The example constructs CsvPipe(...) even though it imports Enterprise pipes (zinggEC.enterprise.common.epipes), where the documented CSV pipe is ECsvPipe. This mismatch is likely to confuse users or fail at runtime; please switch to the correct pipe class for the chosen API.

Suggested change:

    - inputPipe = CsvPipe("input", "examples/febrl/input.csv")
    - outputPipe = CsvPipe("output", "examples/febrl/transformed_output.csv")
    + inputPipe = ECsvPipe("input", "examples/febrl/input.csv")
    + outputPipe = ECsvPipe("output", "examples/febrl/transformed_output.csv")


      "fieldName": "job_title",
      "matchType": "fuzzy",
      "dataType": "string",
      "transformers": "STANDARDISE_jobtitles"
    }

The JSON example uses a transformers key on fieldDefinition, but there are no other references to this config key in the repo and the existing Enterprise docs reference postProcessors for standardisation. Please confirm the actual JSON config key supported by this repo’s CLI and align the example accordingly (or document edition/version differences explicitly).
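
To see which key a given config file actually uses, a quick report per field can help. This sketch assumes the config keeps its field entries in a top-level `fieldDefinition` list, as the comment above implies. The function name `standardisation_keys` is hypothetical, and the script only reports which key is present. It does not establish which key the CLI accepts.

```python
import json

# The two spellings under discussion in this review.
CANDIDATE_KEYS = ("transformers", "postProcessors")


def standardisation_keys(config_text):
    """Return {fieldName: [candidate keys present]} for each field definition."""
    config = json.loads(config_text)
    report = {}
    for field in config.get("fieldDefinition", []):
        report[field.get("fieldName")] = [key for key in CANDIDATE_KEYS if key in field]
    return report
```

A field that maps to an empty list has no standardisation entry under either name.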

    Run the Transform phase with the following command:

    ```bash

The CLI example uses --phase transform, but the CLI’s phase registry in this repo (ZinggOptions) does not include a transform phase, so this command will fail validation. Please update the docs to use a valid phase name for this codebase’s CLI, or clearly call out that transform is only available in a different edition/version.

Suggested change:

    - Run the Transform phase with the following command:
    - ```bash
    + In editions/versions of Zingg that support a separate Transform phase, you can run it with the following command:
    + ```bash
    + # Note: the 'transform' phase is not registered in this repository's CLI (ZinggOptions) and may only be available in other editions/versions.

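
The same kind of check could be applied to every documented command. The sketch below pulls the value of each `--phase` flag out of a document and compares it with a caller-supplied set of registered names. The helper `unknown_phases` is hypothetical, and the registered set has to be taken from ZinggOptions for the edition in question, since nothing here hard-codes it.

```python
import re

# Captures the token after "--phase " or "--phase=".
PHASE_FLAG = re.compile(r"--phase[ =](\S+)")


def unknown_phases(doc_text, registered):
    """Return the sorted phase names used with --phase that are not in registered."""
    used = PHASE_FLAG.findall(doc_text)
    return sorted({phase for phase in used if phase not in registered})
```

An empty result means every documented `--phase` value appears in the set that was passed in.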