Feature/chamap#4

Open
Watthana65022128 wants to merge 3 commits into main from feature/chamap

Conversation

@Watthana65022128
Collaborator

  • REPLACE SQL Query
  • Connection String
  • charmap error


Copilot AI left a comment


Pull request overview

This PR addresses charset encoding issues (specifically "charmap error") in the migration tool by adding support for different character encodings and cleaning problematic characters at the source for SQL Server databases.

Changes:

  • Modified SQL query generation to apply REPLACE operations for non-breaking spaces (CHAR(160)) and null bytes (CHAR(0)) when reading from SQL Server
  • Added dynamic charset selection UI based on source database type (MySQL: tis620, SQL Server: cp874, PostgreSQL: utf8)
  • Updated SQL Server connection string to support legacy Thai databases with TDS version 7.0 for better compatibility

Reviewed changes

Copilot reviewed 2 out of 8 changed files in this pull request and generated 8 comments.

File | Description
views/migration_engine.py | Added REPLACE operations in SELECT queries for SQL Server and charset selection UI with database-specific options
services/db_connector.py | Added TDS version parameter for SQL Server legacy charset support
views/__pycache__/migration_engine.cpython-313.pyc | Compiled Python bytecode (should not be in version control)
services/__pycache__/transformers.cpython-313.pyc | Compiled Python bytecode (should not be in version control)
services/__pycache__/ml_mapper.cpython-313.pyc | Compiled Python bytecode (should not be in version control)
services/__pycache__/db_connector.cpython-313.pyc | Compiled Python bytecode (should not be in version control)
__pycache__/database.cpython-313.pyc | Compiled Python bytecode (should not be in version control)
__pycache__/config.cpython-313.pyc | Compiled Python bytecode (should not be in version control)
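The compiled .pyc files flagged above are normally kept out of the repository with an ignore rule; a minimal sketch of the relevant `.gitignore` entries:

```gitignore
# Ignore compiled Python bytecode anywhere in the tree
__pycache__/
*.pyc
```

Files already committed would also need `git rm --cached` before the rule takes effect.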


# REPLACE(col, CHAR(0), '') -> remove null bytes
col_expr = f'REPLACE(REPLACE({col_expr}, CHAR(160), \' \'), CHAR(0), \'\')'

selected_cols.append(f'{col_expr} AS "{source_col}"')

Copilot AI Feb 4, 2026


SQL injection vulnerability in the AS clause. The source column name is used directly in quotes without sanitization. If a malicious column name contains double quotes or other SQL special characters, this could be exploited. Consider using SQLAlchemy's identifier quoting mechanism or validating/sanitizing column names.
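One way to close this gap is a small validation helper in front of identifier interpolation. This is a minimal sketch — the helper name and the allowed-character policy are assumptions, not part of the PR; SQLAlchemy users could instead use the dialect's `identifier_preparer.quote()`:

```python
import re

# Hypothetical helper, not part of this PR: permit word characters
# (including Thai letters, which \w matches in Python 3) and spaces,
# and reject quotes, semicolons, and other SQL metacharacters.
_SAFE_IDENT = re.compile(r'^[\w ]+$')

def quote_identifier(name: str) -> str:
    """Wrap a column name in double quotes, or raise if it looks unsafe."""
    if not _SAFE_IDENT.fullmatch(name):
        raise ValueError(f"Unsafe column name: {name!r}")
    return f'"{name}"'
```

A name such as `x" OR "1"="1` would then be rejected before it ever reaches the generated SELECT.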

Copilot uses AI. Check for mistakes.
query_params = {"charset": mssql_charset}

# For legacy Thai databases, add TDS version for better compatibility
if charset in ['tis620', 'cp874', 'latin1']:

Copilot AI Feb 4, 2026


The condition checks if 'charset' is in the list, but 'charset' could be None when default charsets are used (line 59). This means the TDS version will never be set when using the default charset, even though it might be beneficial for older SQL Server versions. Consider checking the actual value of 'mssql_charset' instead of 'charset', or explicitly handling the None case.

Suggested change:
- if charset in ['tis620', 'cp874', 'latin1']:
+ if mssql_charset in ['tis620', 'cp874', 'latin1']:
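The None issue can be avoided entirely by deriving the TDS setting from the resolved charset when the connection URL is built. A hedged sketch — the function and parameter names here are illustrative, not the PR's actual code, and it assumes a pymssql-style driver that accepts `charset` and `tds_version` query parameters:

```python
from urllib.parse import quote_plus, urlencode

def build_mssql_url(user, password, host, database, mssql_charset=None):
    """Build a pymssql-style SQLAlchemy URL, adding tds_version=7.0
    only when the *resolved* charset is a legacy single-byte one."""
    query = {}
    if mssql_charset:
        query["charset"] = mssql_charset
        # Legacy Thai/Latin charsets tend to need the older TDS protocol.
        if mssql_charset in ("tis620", "cp874", "latin1"):
            query["tds_version"] = "7.0"
    base = f"mssql+pymssql://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"
    return base + ("?" + urlencode(query) if query else "")
```

Because the check runs on the value actually used for the connection, a None default simply produces a URL without charset or TDS overrides.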

"""
Generate a SELECT query based on configuration.
Applies TRIM at source for MSSQL CHAR columns to prevent padding.
For SQL Server: Also cleans non-breaking spaces and control characters at source.

Copilot AI Feb 4, 2026


The comment mentions cleaning control characters at source, but the implementation only removes CHAR(0) (null bytes) and CHAR(160) (non-breaking space). Other control characters (CHAR(1)-CHAR(31) excluding newlines/tabs) are not handled. Consider if additional control character cleaning is needed or if the comment should be updated to accurately reflect what is cleaned.

Suggested change:
- For SQL Server: Also cleans non-breaking spaces and control characters at source.
+ For SQL Server: Also replaces non-breaking spaces (CHAR(160)) with regular spaces and removes null bytes (CHAR(0)) at source.
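If broader control-character cleaning is ever wanted, the nested REPLACE expression can be generated instead of hand-written. A sketch — this helper is hypothetical and not in the PR, but its defaults reproduce the PR's exact expression:

```python
def build_clean_expr(col_expr, replacements=None):
    """Wrap a T-SQL column expression in nested REPLACE calls.

    Defaults match the PR: CHAR(160) -> regular space, CHAR(0) -> removed.
    Pass extra (code, replacement) pairs to strip other control characters.
    """
    if replacements is None:
        replacements = [(160, " "), (0, "")]
    for code, repl in replacements:
        col_expr = f"REPLACE({col_expr}, CHAR({code}), '{repl}')"
    return col_expr
```

For example, `build_clean_expr('"name"')` yields `REPLACE(REPLACE("name", CHAR(160), ' '), CHAR(0), '')`, and extending `replacements` with codes 1-8 would also remove those control characters at source.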

Comment on lines +334 to +339
else:
charset_options = [
"utf8 (Default)",
"latin1 (Raw Bytes)"
]
help_text = "เลือก charset ตามฐานข้อมูลต้นทาง"
(Thai help text: "Select the charset matching the source database.")

Copilot AI Feb 4, 2026


When no source profile is selected (src_db_type is None), the charset selection defaults to "utf8 (Default)" and "latin1 (Raw Bytes)" options. However, the charset_map on line 352 maps "utf8 (Default)" to None, which might not be appropriate for all database types. When a user later selects a source profile (e.g., MySQL), they won't see MySQL-specific options unless they change the selection again. Consider disabling charset selection until a source profile is selected or updating the options dynamically.
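The reviewer's point can be addressed by centralizing the per-database options and treating "no profile selected" explicitly. A sketch with assumed names — the option labels follow the PR description (MySQL: tis620, SQL Server: cp874, PostgreSQL: utf8), while the dict and function names are illustrative:

```python
# Assumed structure, not the PR's actual code: map each source database
# type to its charset options, with the Thai legacy charset listed first.
DB_CHARSET_OPTIONS = {
    "MySQL": ["tis620 (Thai)", "utf8 (Default)", "latin1 (Raw Bytes)"],
    "Microsoft SQL Server": ["cp874 (Thai)", "utf8 (Default)", "latin1 (Raw Bytes)"],
    "PostgreSQL": ["utf8 (Default)", "latin1 (Raw Bytes)"],
}

FALLBACK_OPTIONS = ["utf8 (Default)", "latin1 (Raw Bytes)"]

def charset_options_for(src_db_type):
    """Return db-specific charset options, or None when no source
    profile is selected so the UI can disable the selectbox."""
    if src_db_type is None:
        return None
    return DB_CHARSET_OPTIONS.get(src_db_type, FALLBACK_OPTIONS)
```

Returning None (rather than a generic list) lets the caller render a disabled control until a source profile exists, so the options always match the chosen database type.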

Comment on lines +73 to +81

# Apply TRIM if specified in transformers
if 'TRIM' in mapping.get('transformers', []):
col_expr = f'TRIM({col_expr})'

# Clean non-breaking spaces and problematic characters for VARCHAR/NVARCHAR/TEXT columns
# REPLACE(col, CHAR(160), ' ') -> replace nbsp with regular space
# REPLACE(col, CHAR(0), '') -> remove null bytes
col_expr = f'REPLACE(REPLACE({col_expr}, CHAR(160), \' \'), CHAR(0), \'\')'

Copilot AI Feb 4, 2026


The REPLACE operations are unconditionally applied to all columns for SQL Server, regardless of the column's data type. This could cause issues with numeric, date, or binary columns where REPLACE operations on character data are not appropriate. Consider checking the column data type or restricting this transformation to text/varchar column types only.

Suggested change:
-            # Apply TRIM if specified in transformers
-            if 'TRIM' in mapping.get('transformers', []):
-                col_expr = f'TRIM({col_expr})'
-            # Clean non-breaking spaces and problematic characters for VARCHAR/NVARCHAR/TEXT columns
-            # REPLACE(col, CHAR(160), ' ') -> replace nbsp with regular space
-            # REPLACE(col, CHAR(0), '') -> remove null bytes
-            col_expr = f'REPLACE(REPLACE({col_expr}, CHAR(160), \' \'), CHAR(0), \'\')'
+            # Determine if the source column is a text-like type where string REPLACE is appropriate.
+            # This relies on an optional 'source_type' field in the mapping (e.g. 'VARCHAR(50)', 'NVARCHAR', 'TEXT').
+            source_type_raw = str(mapping.get('source_type', '') or '')
+            source_type_upper = source_type_raw.upper()
+            text_type_prefixes = ('CHAR', 'NCHAR', 'VARCHAR', 'NVARCHAR')
+            is_text_column = (
+                any(source_type_upper.startswith(prefix) for prefix in text_type_prefixes)
+                or source_type_upper in ('TEXT', 'NTEXT')
+            )
+            # Apply TRIM if specified in transformers
+            if 'TRIM' in mapping.get('transformers', []):
+                col_expr = f'TRIM({col_expr})'
+            # Clean non-breaking spaces and problematic characters only for text-like columns.
+            # REPLACE(col, CHAR(160), ' ') -> replace nbsp with regular space
+            # REPLACE(col, CHAR(0), '') -> remove null bytes
+            if is_text_column:
+                col_expr = f'REPLACE(REPLACE({col_expr}, CHAR(160), \' \'), CHAR(0), \'\')'


# Special handling for SQL Server text columns
if db_type == 'Microsoft SQL Server':
col_expr = f'"{source_col}"'

Copilot AI Feb 4, 2026


The column identifier uses double quotes directly from the source column name without sanitization. If the source column name contains malicious content (e.g., special characters like double quotes), this could lead to SQL injection. Consider using SQLAlchemy's identifier quoting mechanism or validating/sanitizing column names before using them in queries.

# Other databases: just apply TRIM if needed
selected_cols.append(f'TRIM("{source_col}") AS "{source_col}"')
else:
selected_cols.append(f'"{source_col}"')

Copilot AI Feb 4, 2026


Similar to line 72, this column identifier uses double quotes directly from the source column name without sanitization, which could lead to SQL injection if the source column name contains malicious content. Consider using SQLAlchemy's identifier quoting mechanism or validating/sanitizing column names.

selected_cols.append(f'{col_expr} AS "{source_col}"')
elif 'TRIM' in mapping.get('transformers', []):
# Other databases: just apply TRIM if needed
selected_cols.append(f'TRIM("{source_col}") AS "{source_col}"')

Copilot AI Feb 4, 2026


Similar SQL injection vulnerability as line 72. The column identifier uses double quotes directly from the source column name in both the TRIM function and the AS clause. Consider using SQLAlchemy's identifier quoting mechanism or validating/sanitizing column names.
