Conversation
Watthana65022128
commented
Feb 4, 2026
- REPLACE SQL Query
- Connection String
- charmap error
There was a problem hiding this comment.
Pull request overview
This PR addresses charset encoding issues (specifically "charmap error") in the migration tool by adding support for different character encodings and cleaning problematic characters at the source for SQL Server databases.
Changes:
- Modified SQL query generation to apply REPLACE operations for non-breaking spaces (CHAR(160)) and null bytes (CHAR(0)) when reading from SQL Server
- Added dynamic charset selection UI based on source database type (MySQL: tis620, SQL Server: cp874, PostgreSQL: utf8)
- Updated SQL Server connection string to support legacy Thai databases with TDS version 7.0 for better compatibility
Reviewed changes
Copilot reviewed 2 out of 8 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| views/migration_engine.py | Added REPLACE operations in SELECT queries for SQL Server and charset selection UI with database-specific options |
| services/db_connector.py | Added TDS version parameter for SQL Server legacy charset support |
| views/pycache/migration_engine.cpython-313.pyc | Compiled Python bytecode (should not be in version control) |
| services/pycache/transformers.cpython-313.pyc | Compiled Python bytecode (should not be in version control) |
| services/pycache/ml_mapper.cpython-313.pyc | Compiled Python bytecode (should not be in version control) |
| services/pycache/db_connector.cpython-313.pyc | Compiled Python bytecode (should not be in version control) |
| pycache/database.cpython-313.pyc | Compiled Python bytecode (should not be in version control) |
| pycache/config.cpython-313.pyc | Compiled Python bytecode (should not be in version control) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # REPLACE(col, CHAR(0), '') -> remove null bytes | ||
| col_expr = f'REPLACE(REPLACE({col_expr}, CHAR(160), \' \'), CHAR(0), \'\')' | ||
|
|
||
| selected_cols.append(f'{col_expr} AS "{source_col}"') |
There was a problem hiding this comment.
SQL injection vulnerability in the AS clause. The source column name is used directly in quotes without sanitization. If a malicious column name contains double quotes or other SQL special characters, this could be exploited. Consider using SQLAlchemy's identifier quoting mechanism or validating/sanitizing column names.
| query_params = {"charset": mssql_charset} | ||
|
|
||
| # For legacy Thai databases, add TDS version for better compatibility | ||
| if charset in ['tis620', 'cp874', 'latin1']: |
There was a problem hiding this comment.
The condition checks if 'charset' is in the list, but 'charset' could be None when default charsets are used (lines 59). This means the TDS version will never be set when using default charset, even if it might be beneficial for older SQL Server versions. Consider checking the actual value of 'mssql_charset' instead of 'charset', or explicitly handle the None case.
| if charset in ['tis620', 'cp874', 'latin1']: | |
| if mssql_charset in ['tis620', 'cp874', 'latin1']: |
| """ | ||
| Generate a SELECT query based on configuration. | ||
| Applies TRIM at source for MSSQL CHAR columns to prevent padding. | ||
| For SQL Server: Also cleans non-breaking spaces and control characters at source. |
There was a problem hiding this comment.
The comment mentions cleaning control characters at source, but the implementation only removes CHAR(0) (null bytes) and CHAR(160) (non-breaking space). Other control characters (CHAR(1)-CHAR(31) excluding newlines/tabs) are not handled. Consider if additional control character cleaning is needed or if the comment should be updated to accurately reflect what is cleaned.
| For SQL Server: Also cleans non-breaking spaces and control characters at source. | |
| For SQL Server: Also replaces non-breaking spaces (CHAR(160)) with regular spaces and removes null bytes (CHAR(0)) at source. |
| else: | ||
| charset_options = [ | ||
| "utf8 (Default)", | ||
| "latin1 (Raw Bytes)" | ||
| ] | ||
| help_text = "เลือก charset ตามฐานข้อมูลต้นทาง" |
There was a problem hiding this comment.
When no source profile is selected (src_db_type is None), the charset selection defaults to "utf8 (Default)" and "latin1 (Raw Bytes)" options. However, the charset_map on line 352 maps "utf8 (Default)" to None, which might not be appropriate for all database types. When a user later selects a source profile (e.g., MySQL), they won't see MySQL-specific options unless they change the selection again. Consider disabling charset selection until a source profile is selected or updating the options dynamically.
|
|
||
| # Apply TRIM if specified in transformers | ||
| if 'TRIM' in mapping.get('transformers', []): | ||
| col_expr = f'TRIM({col_expr})' | ||
|
|
||
| # Clean non-breaking spaces and problematic characters for VARCHAR/NVARCHAR/TEXT columns | ||
| # REPLACE(col, CHAR(160), ' ') -> replace nbsp with regular space | ||
| # REPLACE(col, CHAR(0), '') -> remove null bytes | ||
| col_expr = f'REPLACE(REPLACE({col_expr}, CHAR(160), \' \'), CHAR(0), \'\')' |
There was a problem hiding this comment.
The REPLACE operations are unconditionally applied to all columns for SQL Server, regardless of the column's data type. This could cause issues with numeric, date, or binary columns where REPLACE operations on character data are not appropriate. Consider checking the column data type or restricting this transformation to text/varchar column types only.
| # Apply TRIM if specified in transformers | |
| if 'TRIM' in mapping.get('transformers', []): | |
| col_expr = f'TRIM({col_expr})' | |
| # Clean non-breaking spaces and problematic characters for VARCHAR/NVARCHAR/TEXT columns | |
| # REPLACE(col, CHAR(160), ' ') -> replace nbsp with regular space | |
| # REPLACE(col, CHAR(0), '') -> remove null bytes | |
| col_expr = f'REPLACE(REPLACE({col_expr}, CHAR(160), \' \'), CHAR(0), \'\')' | |
| # Determine if the source column is a text-like type where string REPLACE is appropriate. | |
| # This relies on an optional 'source_type' field in the mapping (e.g. 'VARCHAR(50)', 'NVARCHAR', 'TEXT'). | |
| source_type_raw = str(mapping.get('source_type', '') or '') | |
| source_type_upper = source_type_raw.upper() | |
| text_type_prefixes = ('CHAR', 'NCHAR', 'VARCHAR', 'NVARCHAR') | |
| is_text_column = ( | |
| any(source_type_upper.startswith(prefix) for prefix in text_type_prefixes) | |
| or source_type_upper in ('TEXT', 'NTEXT') | |
| ) | |
| # Apply TRIM if specified in transformers | |
| if 'TRIM' in mapping.get('transformers', []): | |
| col_expr = f'TRIM({col_expr})' | |
| # Clean non-breaking spaces and problematic characters only for text-like columns. | |
| # REPLACE(col, CHAR(160), ' ') -> replace nbsp with regular space | |
| # REPLACE(col, CHAR(0), '') -> remove null bytes | |
| if is_text_column: | |
| col_expr = f'REPLACE(REPLACE({col_expr}, CHAR(160), \' \'), CHAR(0), \'\')' |
|
|
||
| # Special handling for SQL Server text columns | ||
| if db_type == 'Microsoft SQL Server': | ||
| col_expr = f'"{source_col}"' |
There was a problem hiding this comment.
The column identifier uses double quotes directly from the source column name without sanitization. If the source column name contains malicious content (e.g., special characters like double quotes), this could lead to SQL injection. Consider using SQLAlchemy's identifier quoting mechanism or validating/sanitizing column names before using them in queries.
| # Other databases: just apply TRIM if needed | ||
| selected_cols.append(f'TRIM("{source_col}") AS "{source_col}"') | ||
| else: | ||
| selected_cols.append(f'"{source_col}"') |
There was a problem hiding this comment.
Similar to line 72, this column identifier uses double quotes directly from the source column name without sanitization, which could lead to SQL injection if the source column name contains malicious content. Consider using SQLAlchemy's identifier quoting mechanism or validating/sanitizing column names.
| selected_cols.append(f'{col_expr} AS "{source_col}"') | ||
| elif 'TRIM' in mapping.get('transformers', []): | ||
| # Other databases: just apply TRIM if needed | ||
| selected_cols.append(f'TRIM("{source_col}") AS "{source_col}"') |
There was a problem hiding this comment.
Similar SQL injection vulnerability as line 72. The column identifier uses double quotes directly from the source column name in both the TRIM function and the AS clause. Consider using SQLAlchemy's identifier quoting mechanism or validating/sanitizing column names.