-
Notifications
You must be signed in to change notification settings - Fork 1
add code to harmonize redshift and platform groups #49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,88 @@ | ||
| # Automated Management of Redshift/Database User Groups | ||
|
|
||
| ## Overview | ||
|
|
||
| The permissions and account management of Civis Platform is entirely separate from the management of specific databases (such as Redshift). This presents a challenge for managing and implementing best practices to govern data access, especially in complex environments with many users and tables. | ||
|
|
||
| This script addresses one specific challenge: **user group management**. User (permission) groups are a powerful tool for managing permissions en masse, bringing clarity and reducing the risk of missing key individuals when permissions need to change. Even though Redshift and Platform both have the concept of user groups, the two implementations are completely independent -- nothing enforces that the same groups exist in both places, or that they have the same members. This creates a significant manual burden on database admins to maintain consistent lists, or risk having to track multiple "sources of truth". | ||
|
|
||
| ## Solution | ||
|
|
||
| This automation addresses the problem by: | ||
|
|
||
| 1. Assuming that all Platform users have a corresponding user in the target database | ||
| 2. Ensuring that all included Platform user groups have a corresponding user group in the target database (unless consciously excluded) | ||
| 3. Pulling the list of user group membership for all included Platform user groups | ||
| 4. Assigning database users to corresponding database user groups to **match what is set for their Platform counterparts** | ||
|
|
||
| Platform users and groups, with their better existing User Interface, are treated as the **Source of Truth**. | ||
|
|
||
| ## Configuration | ||
|
|
||
| ### Step 1: Set Environment Variables | ||
|
|
||
| ```bash | ||
| export DATABASE="your_database_name" # Required: The name of your database | ||
| export DRY_RUN="True" # Optional: Set to False to execute changes | ||
| ``` | ||
|
|
||
| ### Step 2: Update the Groups Crosswalk | ||
|
|
||
| Edit the `GROUPS_CROSSWALK` list in `automate_groups.py` to map your Platform groups to Redshift groups: | ||
|
|
||
| ```python | ||
| GROUPS_CROSSWALK = [ | ||
| { | ||
| 'platform_group_name': 'Data Analysts', | ||
| 'platform_group_id': None, # Optional: for reference | ||
| 'redshift_group_name': 'data_analysts_group', | ||
| 'redshift_group_id': None, # Optional: for reference | ||
| 'notes': 'Standard read access for analysts' | ||
| }, | ||
| { | ||
| 'platform_group_name': 'Data Engineers', | ||
| 'platform_group_id': None, | ||
| 'redshift_group_name': 'data_engineers_group', | ||
| 'redshift_group_id': None, | ||
| 'notes': 'Write access for engineers' | ||
| }, | ||
| ] | ||
| ``` | ||
|
|
||
| ### Step 3: Customize Ignored Users (Optional) | ||
|
|
||
| Edit the `ignore_users_list` in the `main()` function to specify users that should be ignored during synchronization (e.g., service accounts, admin users). | ||
|
|
||
| ## Usage | ||
|
|
||
| ### Dry Run (Preview Changes) | ||
|
|
||
| ```bash | ||
| python automate_groups.py | ||
| ``` | ||
|
Comment on lines
+60
to
+62
|
||
|
|
||
| This will log the SQL statements that would be executed without actually making any changes. | ||
|
|
||
| ### Execute Changes | ||
|
|
||
| ```bash | ||
| export DRY_RUN="False" | ||
| python automate_groups.py | ||
| ``` | ||
|
|
||
| This will execute the SQL statements to synchronize database group membership with Platform groups. | ||
|
|
||
| ## How It Works | ||
|
|
||
| 1. **Fetches Platform Groups**: Retrieves all group memberships from Civis Platform API | ||
| 2. **Fetches Redshift Groups**: Queries the database for current group memberships | ||
| 3. **Compares Memberships**: Identifies users that need to be added or removed from database groups | ||
| 4. **Generates SQL**: Creates `ALTER GROUP` statements to synchronize memberships | ||
| 5. **Executes or Logs**: Either executes the changes (when `DRY_RUN=False`) or logs them for review | ||
|
|
||
| ## Requirements | ||
|
|
||
| - Python 3.x | ||
| - `civis` Python package | ||
| - Superuser/admin access to the target database | ||
| - Appropriate Civis Platform API credentials | ||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,212 @@ | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| """ | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| This script checks the current users and groups configured in both an instance of Civis Platform and a corresponding | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| (Redshift) database, determines where these users and groups correspond across the two lists, and then updates | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| the database group membership to match the membership specified in Platform. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| As this script generates and executes SQL statements that alter database group membership, it must be executed | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| with an authorized superuser account credential on the affected database. This access and privilege should be | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| closely managed, ideally through a secured Platform robot/service account. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Configuration: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| - Set the DATABASE environment variable to specify which database to use | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| - Edit the GROUPS_CROSSWALK list below to map Platform groups to Redshift groups for your use case | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| - Set DRY_RUN=True to preview changes without executing them | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| """ | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| import civis | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| import pandas as pd | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| import pandas as pd |
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The distutils.util module is deprecated since Python 3.10 and was removed in Python 3.12. Replace strtobool with a custom implementation or use a different approach to parse boolean environment variables.
| from distutils.util import strtobool | |
| def strtobool(val): | |
| """ | |
| Convert a string representation of truth to True or False. | |
| This mirrors the behavior of distutils.util.strtobool, returning | |
| True for 'y', 'yes', 't', 'true', 'on', '1' and False for | |
| 'n', 'no', 'f', 'false', 'off', '0' (case-insensitive). | |
| Raises ValueError if the value is not a recognized boolean string. | |
| """ | |
| val = str(val).strip().lower() | |
| if val in ("y", "yes", "t", "true", "on", "1"): | |
| return True | |
| if val in ("n", "no", "f", "false", "off", "0"): | |
| return False | |
| raise ValueError(f"invalid truth value {val!r}") |
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent use_pandas parameter usage. In create_valid_database_users_list(), use_pandas=False is explicitly set (line 59), but in create_redshift_groups_dictionary(), the parameter is not specified (line 79), which may default to True depending on the civis library version. This inconsistency could lead to different return types and unexpected behavior. Specify use_pandas=False explicitly for consistency.
| database=database) | |
| database=database, | |
| use_pandas=False) |
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential performance issue. The code retrieves all Platform groups by calling client.groups.get() for each group from client.groups.list(limit=1000). This results in N+1 API calls. Consider checking if the list endpoint already includes member information, or batch the individual get calls if possible.
| group_results = [client.groups.get(result['id']) for result in client.groups.list(limit = 1000)] | |
| group_results = client.groups.list(limit=1000) |
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoded limit of 1000 groups may be insufficient for large organizations. If an organization has more than 1000 groups, this will silently fail to process all groups. Consider implementing pagination or at least logging a warning if the limit is reached.
| group_results = [client.groups.get(result['id']) for result in client.groups.list(limit = 1000)] | |
| # Retrieve all groups via pagination instead of relying on a single hardcoded limit. | |
| limit = 1000 | |
| page_num = 1 | |
| all_group_summaries = [] | |
| last_page_count = 0 | |
| while True: | |
| page = client.groups.list(page_num=page_num, limit=limit) | |
| if not page: | |
| break | |
| all_group_summaries.extend(page) | |
| last_page_count = len(page) | |
| if last_page_count < limit: | |
| # We've reached the final page. | |
| break | |
| page_num += 1 | |
| if last_page_count == limit: | |
| LOG.warning( | |
| "Retrieved a full page of %d groups from the Platform API on page %d. " | |
| "There may be additional groups that were not retrieved. " | |
| "Consider increasing the limit or using a more robust pagination strategy " | |
| "if your organization has more than %d groups.", | |
| limit, | |
| page_num, | |
| len(all_group_summaries), | |
| ) | |
| group_results = [client.groups.get(result['id']) for result in all_group_summaries] |
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent spacing around default parameter. The function definition has a space before the equals sign in dry_run = True. PEP 8 recommends no spaces around the equals sign for default parameter values. Should be dry_run=True.
| def main(database, dry_run = True): | |
| def main(database, dry_run=True): |
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused variable. The variable full_platform_group_change_list is initialized but never populated or used meaningfully. Remove it or implement the intended functionality.
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete code with commented-out line. Either remove the comment or implement the intended ignore_groups_list functionality if it's needed for the feature.
| # ignore_groups_list = |
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ignore_users_list is only applied to Redshift group members (line 166) but not to Platform group members (line 163). This creates an asymmetry: if an ignored user like "dbadmin" is in a Platform group but not in the corresponding Redshift group, the script will try to add them. Consider applying the ignore list to both Platform and Redshift members for consistency.
| platform_group_members = [x for x in platform_group_members if x in redshift_valid_users_list] | |
| platform_group_members = [ | |
| x for x in platform_group_members | |
| if x in redshift_valid_users_list and x not in ignore_users_list | |
| ] |
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Platform group members are filtered by whether they exist in redshift_valid_users_list (line 163), but if a Platform user doesn't have a corresponding Redshift user, they're silently excluded. This could lead to confusion where users are in Platform groups but never get added to Redshift groups. Consider logging a warning when Platform users are filtered out because they don't exist in Redshift.
| platform_group_members = platform_group_members_dict[platform_group] | |
| platform_group_members = [x for x in platform_group_members if x in redshift_valid_users_list] | |
| platform_group_members = platform_group_members_dict[platform_group] | |
| original_platform_group_members = list(platform_group_members) | |
| platform_group_members = [x for x in platform_group_members if x in redshift_valid_users_list] | |
| excluded_platform_members = sorted(set(original_platform_group_members) - set(platform_group_members)) | |
| if excluded_platform_members: | |
| LOG.warning( | |
| "The following Platform users in group '%s' were skipped because they do not exist as " | |
| "valid Redshift users for group '%s': %s", | |
| platform_group, | |
| redshift_group, | |
| ", ".join(excluded_platform_members), | |
| ) |
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing error handling for when a Redshift group specified in GROUPS_CROSSWALK does not exist in the redshift_group_members_dict. The defaultdict will create an empty set for missing keys, which means the code won't fail but will silently try to drop all existing members of a non-existent group or add members to groups that don't exist. Add validation to check if Redshift groups exist and log warnings for missing groups.
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing error handling for when a Platform group specified in GROUPS_CROSSWALK does not exist in the platform_group_members_dict. This will cause a KeyError. Add validation to check if the group exists in the dictionary before accessing it, and log an appropriate warning if a group is missing.
| platform_group_members = platform_group_members_dict[platform_group] | |
| platform_group_members = [x for x in platform_group_members if x in redshift_valid_users_list] | |
| redshift_group_members = redshift_group_members_dict[redshift_group] | |
| platform_group_members = platform_group_members_dict.get(platform_group) | |
| if platform_group_members is None: | |
| LOG.warning( | |
| "Platform group '%s' from GROUPS_CROSSWALK not found in platform_group_members_dict; skipping.", | |
| platform_group, | |
| ) | |
| continue | |
| platform_group_members = [x for x in platform_group_members if x in redshift_valid_users_list] | |
| redshift_group_members = redshift_group_members_dict.get(redshift_group) | |
| if redshift_group_members is None: | |
| LOG.warning( | |
| "Redshift group '%s' from GROUPS_CROSSWALK not found in redshift_group_members_dict; skipping.", | |
| redshift_group, | |
| ) | |
| continue |
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQL injection vulnerability: User-supplied identifiers (usernames from Platform and group names from GROUPS_CROSSWALK) are directly interpolated into SQL statements using f-strings (lines 175, 180). If a username or group name contains malicious SQL (e.g., semicolons, quotes, or SQL keywords), it could lead to SQL injection. In PostgreSQL/Redshift, use quoted identifiers or parameterized queries, or at minimum validate that identifiers match the pattern for valid SQL identifiers (alphanumeric and underscores only).
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Misleading comment labels in the generated SQL. Lines 186 say "Users to add:" and "Users to drop:" but these are followed by the actual SQL queries (or comments if there's nothing to do), not lists of users. Consider changing the labels to something like "Add query:" and "Drop query:" to better reflect what follows.
| f"\n--Users to add: {add_query}" + f"\n--Users to drop: {drop_query}" | |
| f"\n--Add query: {add_query}" + f"\n--Drop query: {drop_query}" |
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple SQL statements in a single query execution. The script concatenates multiple ALTER GROUP statements into full_query_text and executes them all at once (line 198). If any statement in the middle fails, it's unclear whether preceding statements were committed or will be rolled back. Consider executing statements individually or adding transaction control (BEGIN/COMMIT/ROLLBACK) to make the behavior more predictable and add better error reporting.
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential race condition between reading group membership and updating it. The script reads Platform and Redshift group memberships at the beginning, then generates and executes SQL. If group memberships change in either Platform or Redshift between the read and write operations, the updates could be based on stale data. Consider adding a warning in the documentation about this limitation, or implementing a check to detect if changes occurred during execution.
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hidden=False parameter in the query execution (line 198) makes the SQL query visible in the Platform UI. This could expose sensitive information about database structure, group names, and usernames to anyone with access to view the script runs. Consider using hidden=True to keep these details private, or at least document why this is set to False.
| future = civis.io.query_civis(full_query_text, database = database, hidden = False) | |
| future = civis.io.query_civis(full_query_text, database=database, hidden=True) |
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent spacing around keyword argument. The function call has a space before the equals sign in database = database. PEP 8 recommends no spaces around the equals sign for keyword arguments. Should be database=database.
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra blank line with trailing whitespace. Line 201 contains only whitespace. Remove this line or the whitespace to maintain code cleanliness.
Copilot
AI
Feb 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing error handling for API and database failures. The script makes multiple API calls and database queries without try-except blocks. If any of these operations fail (network issues, authentication errors, database connection problems), the script will crash with an unclear error message. Add appropriate error handling with informative error messages for better operational reliability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation claims the script "Ensuring that all included Platform user groups have a corresponding user group in the target database" (line 14), but the code doesn't actually validate or create missing Redshift groups. If a Redshift group doesn't exist, the ALTER GROUP statements will fail when executed. Consider updating the documentation to clarify that the Redshift groups must already exist, or add validation/creation logic to the script.