diff --git a/README.fr.md b/README.fr.md
index 9cf0d55..e88b440 100644
--- a/README.fr.md
+++ b/README.fr.md
@@ -193,7 +193,7 @@ Pas encore de compte cloud ? `cleancloud demo` affiche un exemple de sortie sans
 - **Détection du gaspillage IA/ML sur les 3 clouds :** endpoints, notebooks, Studio apps et training jobs SageMaker ; clusters AML Compute et instances ML ; endpoints en ligne Azure ML et services Azure AI Search ; endpoints, instances Workbench et training jobs Vertex AI. Les ressources GPU sont mises en avant comme candidats de revue à risque plus élevé. Les outils natifs n'indiquent pas toujours quoi examiner — CleanCloud le fait. Opt-in via `--category ai`
 - **Gouvernance policy-as-code :** `cleancloud.yaml` pour la configuration par règle, les exceptions avec dates d'expiration, les seuils de coût et de confiance, les exclusions par tag — versionné aux côtés de votre infrastructure. Chaque exception est une approbation auditée dans git.
 - **Application de politique (opt-in) :** `--fail-on-confidence HIGH` ou `--fail-on-cost 500` — appliquer des seuils de gaspillage en CI/CD sur un planning, géré par les équipes platform ou FinOps
-- **45 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe
+- **46 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe
 - **Scan multi-comptes (AWS) :** scannez des AWS Organizations entières en une exécution — fichier de config, IDs inline, ou auto-découverte via `--org`
 - **Scan multi-abonnements (Azure) :** scannez tous les abonnements Azure en parallèle — auto-découverte via Management Group, détail des coûts par
abonnement inclus
 - **Scan multi-projets (GCP) :** scannez tous les projets GCP accessibles en parallèle — auto-découverte via Application Default Credentials, détail des coûts par projet inclus
@@ -278,7 +278,7 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l
 | Cluster AML Compute Azure (GPU) | 600 – 15 000 $ / mois |
 | Instance de calcul Azure ML (GPU) | 600 – 15 000+ $ / mois |
 | Endpoint en ligne Azure ML (GPU) | 200 – 2 600+ $ / mois |
-| Azure AI Search (Standard+) | 261 – 4 028+ $ / mois |
+| Azure AI Search (Basic+) | 261 – 4 028+ $ / mois |
 | Déploiement Azure OpenAI Provisionné (PTU) | 1 460+ $ / PTU / mois |
 | Endpoint Vertex AI Online Prediction (GPU) | 449 – 23 000+ $ / mois |
 | Instance Vertex AI Workbench (GPU) | 449 – 8 000+ $ / mois |
@@ -528,7 +528,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud
 ## Ce que CleanCloud détecte
-45 règles pour AWS, Azure et GCP — conservatrices, haut signal, conçues pour éviter les faux positifs en environnements IaC.
+46 règles pour AWS, Azure et GCP — conservatrices, haut signal, conçues pour éviter les faux positifs en environnements IaC.
 **AWS :**
 - Compute : instances arrêtées 30+ jours (charges EBS continuent)
@@ -545,7 +545,7 @@ Oui.
CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud
 - Réseau : adresses IP publiques inutilisées, Load Balancers vides (HIGH), App Gateways vides (HIGH), VNet Gateways inactives
 - Plateforme : App Service Plans vides (HIGH), bases de données SQL inactives (HIGH), App Services inactifs, Container Registries inutilisés
 - Gouvernance : ressources sans tags
-- IA/ML *(opt-in : `--category ai`)* : clusters de calcul AML avec capacité baseline non nulle et aucune activité depuis 14+ jours — clusters GPU flaggés risque HIGH ($600–$15K/mois) ; instances de calcul Azure ML Running sans activité depuis 14+ jours — instances GPU flaggées risque CRITICAL ($600–$15K+/mois) ; endpoints en ligne ML managés sans requête de scoring depuis 7+ jours — endpoints GPU flaggés HIGH/CRITICAL (200–2 600+$/mois) ; services AI Search (Standard+) sans requête depuis 30+ jours — facturés par SKU × réplicas × partitions (261–4 028+$/mois) ; déploiements Azure OpenAI provisionnés (PTUs) sans requête API depuis 7+ jours — facturés ~1 460 $/PTU/mois en on-demand quel que soit le trafic
+- IA/ML *(opt-in : `--category ai`)* : clusters de calcul AML avec capacité baseline non nulle et aucune activité depuis 14+ jours — clusters GPU flaggés risque HIGH ($600–$15K/mois) ; instances de calcul Azure ML Running sans activité depuis 14+ jours — instances GPU flaggées risque CRITICAL ($600–$15K+/mois) ; endpoints en ligne ML managés sans requête de scoring depuis 7+ jours — endpoints GPU flaggés HIGH/CRITICAL (200–2 600+$/mois) ; services AI Search (Basic+) sans requête depuis 90+ jours — facturés par SKU × réplicas × partitions (261–4 028+$/mois) ; déploiements Azure OpenAI provisionnés (PTUs) sans requête API depuis 7+ jours — facturés ~1 460 $/PTU/mois en on-demand quel que soit le trafic
 **GCP :**
 - Compute : instances VM arrêtées 30+ jours (charges disque continuent) (HIGH)
diff --git a/README.md b/README.md
index 7fbb704..923a86f 100644
--- a/README.md
+++ b/README.md
@@ -193,7
+193,7 @@ No cloud account yet? `cleancloud demo` shows sample output without any credenti
 - **AI/ML waste detection across all 3 clouds:** idle SageMaker endpoints, notebook instances, Studio apps, and long-running training jobs; AML compute clusters and instances; Azure ML online endpoints and AI Search services; Vertex AI endpoints, Workbench instances, and training jobs. GPU-backed resources are highlighted as higher-risk review candidates. Native cost tools don't surface these — CleanCloud does. Opt-in via `--category ai`
 - **Policy-as-code governance:** `cleancloud.yaml` for per-rule config, exceptions with expiry dates, cost and confidence thresholds, tag-based exclusions — version-controlled alongside your infrastructure. Every exception is a git-reviewable approval.
 - **Governance enforcement (opt-in):** `--fail-on-confidence HIGH` or `--fail-on-cost 500` — enforce waste thresholds in CI/CD on a schedule, owned by platform or FinOps teams
-- **45 curated, high-signal detection rules:** orphaned volumes, idle databases, stopped instances, unused registries, and more — designed to avoid false positives in IaC environments, each with a deterministic cost estimate
+- **46 curated, high-signal detection rules:** orphaned volumes, idle databases, stopped instances, unused registries, and more — designed to avoid false positives in IaC environments, each with a deterministic cost estimate
 - **Multi-account scanning (AWS):** scan entire AWS Organizations in one run — config file, inline IDs, or auto-discovery via `--org`
 - **Multi-subscription scanning (Azure):** scan all Azure subscriptions in parallel — auto-discovery via Management Group, per-subscription cost breakdown included
 - **Multi-project scanning (GCP):** scan all accessible GCP projects in parallel — auto-discovery via Application Default Credentials, per-project cost breakdown included
@@ -278,7 +278,7 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend
 | Azure AML
compute cluster (GPU) | $600 – $15,000 / month |
 | Azure ML Compute Instance (GPU) | $600 – $15,000+ / month |
 | Azure ML Online Endpoint (GPU-backed) | $200 – $2,600+ / month |
-| Azure AI Search (Standard+) | $261 – $4,028+ / month |
+| Azure AI Search (Basic+) | $261 – $4,028+ / month |
 | Azure OpenAI Provisioned Deployment (PTU) | $1,460+ / PTU / month |
 | Vertex AI Online Prediction endpoint (GPU) | $449 – $23,000+ / month |
 | Vertex AI Workbench instance (GPU) | $449 – $8,000+ / month |
@@ -528,7 +528,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints
 ## What CleanCloud Detects
-45 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.
+46 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.
 **AWS:**
 - Compute: stopped instances 30+ days (EBS charges continue)
@@ -545,7 +545,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints
 - Network: unused public IPs, empty load balancers (HIGH), empty App Gateways (HIGH), idle VNet Gateways
 - Platform: empty App Service Plans (HIGH), idle SQL databases (HIGH), idle App Services, unused Container Registries
 - Governance: untagged resources
-- AI/ML *(opt-in: `--category ai`)*: idle AML compute clusters with non-zero baseline capacity and no workload activity 14+ days — GPU clusters flagged HIGH risk ($600–$15K/month); idle Compute Instances with no control-plane activity 14+ days — GPU instances CRITICAL risk ($600–$15K+/month); idle ML managed online endpoints with zero scoring requests 7+ days — GPU-backed endpoints flagged HIGH/CRITICAL ($200–$2,600+/month); idle AI Search services (Standard+) with zero queries 30+ days — billed per SKU × replicas × partitions ($261–$4,028+/month); idle Azure OpenAI provisioned deployments (PTUs) with zero API requests 7+ days — bills ~$1,460/PTU/month on-demand regardless of traffic
+- AI/ML
*(opt-in: `--category ai`)*: idle AML compute clusters with non-zero baseline capacity and no workload activity 14+ days — GPU clusters flagged HIGH risk ($600–$15K/month); idle Compute Instances with no control-plane activity 14+ days — GPU instances CRITICAL risk ($600–$15K+/month); idle ML managed online endpoints with zero scoring requests 7+ days — GPU-backed endpoints flagged HIGH/CRITICAL ($200–$2,600+/month); idle AI Search services (Basic+) with zero queries 90+ days — billed per SKU × replicas × partitions ($261–$4,028+/month); idle Azure OpenAI provisioned deployments (PTUs) with zero API requests 7+ days — bills ~$1,460/PTU/month on-demand regardless of traffic
 **GCP:**
 - Compute: stopped instances 30+ days (disk charges continue) (HIGH)
diff --git a/cleancloud/doctor/aws.py b/cleancloud/doctor/aws.py
index af973a6..69e9ba8 100644
--- a/cleancloud/doctor/aws.py
+++ b/cleancloud/doctor/aws.py
@@ -230,6 +230,7 @@ def run_aws_doctor(profile: Optional[str], region: Optional[str] = None) -> None
     info("Permissions required (attach to your IAM role or user):")
     info(" ec2:DescribeVolumes")
     info(" ec2:DescribeSnapshots")
+    info(" ec2:DescribeSnapshotAttribute")
     info(" ec2:DescribeRegions")
     info(" ec2:DescribeAddresses")
     info(" ec2:DescribeNetworkInterfaces")
@@ -239,6 +240,8 @@ def run_aws_doctor(profile: Optional[str], region: Optional[str] = None) -> None
     info(" ec2:DescribeSecurityGroups")
     info(" rds:DescribeDBInstances")
     info(" rds:DescribeDBSnapshots")
+    info(" rds:DescribeDBSnapshotAttributes")
+    info(" cloudtrail:LookupEvents")
     info(" elasticloadbalancing:DescribeLoadBalancers")
     info(" elasticloadbalancing:DescribeTargetGroups")
     info(" logs:DescribeLogGroups")
@@ -409,6 +412,22 @@ def run_aws_doctor(profile: Optional[str], region: Optional[str] = None) -> None
             permissions_failed.append(("ec2:DescribeSnapshots", str(e)))
             warn(f"ec2:DescribeSnapshots - {e}")
+        try:
+            _snaps = ec2.describe_snapshots(OwnerIds=["self"], MaxResults=5).get("Snapshots", [])
+            if
_snaps:
+                ec2.describe_snapshot_attribute(
+                    SnapshotId=_snaps[0]["SnapshotId"], Attribute="createVolumePermission"
+                )
+            permissions_tested.append("ec2:DescribeSnapshotAttribute")
+            success("ec2:DescribeSnapshotAttribute")
+        except Exception as e:
+            if "AccessDenied" in str(e) or "not authorized" in str(e).lower():
+                permissions_failed.append(("ec2:DescribeSnapshotAttribute", str(e)))
+                warn(f"ec2:DescribeSnapshotAttribute - {e}")
+            else:
+                permissions_tested.append("ec2:DescribeSnapshotAttribute")
+                success("ec2:DescribeSnapshotAttribute")
+
         try:
             ec2.describe_regions()
             permissions_tested.append("ec2:DescribeRegions")
@@ -483,6 +502,24 @@ def run_aws_doctor(profile: Optional[str], region: Optional[str] = None) -> None
             permissions_failed.append(("rds:DescribeDBSnapshots", str(e)))
             warn(f"rds:DescribeDBSnapshots - {e}")
+        try:
+            _rds_snaps = rds.describe_db_snapshots(MaxRecords=20, SnapshotType="manual").get(
+                "DBSnapshots", []
+            )
+            if _rds_snaps:
+                rds.describe_db_snapshot_attributes(
+                    DBSnapshotIdentifier=_rds_snaps[0]["DBSnapshotIdentifier"]
+                )
+            permissions_tested.append("rds:DescribeDBSnapshotAttributes")
+            success("rds:DescribeDBSnapshotAttributes")
+        except Exception as e:
+            if "AccessDenied" in str(e) or "not authorized" in str(e).lower():
+                permissions_failed.append(("rds:DescribeDBSnapshotAttributes", str(e)))
+                warn(f"rds:DescribeDBSnapshotAttributes - {e}")
+            else:
+                permissions_tested.append("rds:DescribeDBSnapshotAttributes")
+                success("rds:DescribeDBSnapshotAttributes")
+
         # Test ELB permissions
         try:
             elbv2 = session.client("elbv2", region_name=region)
@@ -563,6 +600,24 @@ def run_aws_doctor(profile: Optional[str], region: Optional[str] = None) -> None
             permissions_failed.append(("s3:GetBucketTagging", str(e)))
             warn(f"s3:GetBucketTagging - {e}")
+        # Test CloudTrail permissions (aws.ec2.instance.stopped — stopped-duration probe)
+        try:
+            from datetime import datetime, timedelta
+            from datetime import timezone as _tz
+
+            cloudtrail =
session.client("cloudtrail", region_name=region)
+            _now = datetime.now(_tz.utc)
+            cloudtrail.lookup_events(
+                StartTime=_now - timedelta(hours=1),
+                EndTime=_now,
+                MaxResults=1,
+            )
+            permissions_tested.append("cloudtrail:LookupEvents")
+            success("cloudtrail:LookupEvents")
+        except Exception as e:
+            permissions_failed.append(("cloudtrail:LookupEvents", str(e)))
+            warn(f"cloudtrail:LookupEvents - {e}")
+
     except Exception:
         fail("CleanCloud cannot run safely with missing read-only permissions")
diff --git a/cleancloud/doctor/azure.py b/cleancloud/doctor/azure.py
index 34f1b7d..9b9f86d 100644
--- a/cleancloud/doctor/azure.py
+++ b/cleancloud/doctor/azure.py
@@ -223,6 +223,7 @@ def run_azure_doctor() -> None:
     info(" Microsoft.Compute/disks/read")
     info(" Microsoft.Compute/snapshots/read")
     info(" Microsoft.Compute/virtualMachines/read")
+    info(" Microsoft.Compute/virtualMachines/instanceView/action")
     info(" Microsoft.Network/publicIPAddresses/read")
     info(" Microsoft.Network/loadBalancers/read")
     info(" Microsoft.Network/applicationGateways/read")
@@ -231,6 +232,7 @@ def run_azure_doctor() -> None:
     info(" Microsoft.Web/serverfarms/read")
     info(" Microsoft.Web/serverfarms/sites/read")
     info(" Microsoft.Web/sites/read")
+    info(" Microsoft.Web/sites/webJobs/read")
     info(" Microsoft.ContainerRegistry/registries/read")
     info(" Microsoft.Sql/servers/read")
     info(" Microsoft.Sql/servers/databases/read")
@@ -296,6 +298,7 @@ def run_azure_doctor() -> None:
     info(" - Microsoft.Compute/disks/read")
     info(" - Microsoft.Compute/snapshots/read")
     info(" - Microsoft.Compute/virtualMachines/read")
+    info(" - Microsoft.Compute/virtualMachines/instanceView/action")
     info(" - Microsoft.Network/publicIPAddresses/read")
     info(" - Microsoft.Network/loadBalancers/read")
     info(" - Microsoft.Network/applicationGateways/read")
@@ -304,6 +307,7 @@ def run_azure_doctor() -> None:
     info(" - Microsoft.Web/serverfarms/read")
     info(" - Microsoft.Web/serverfarms/sites/read")
     info(" - Microsoft.Web/sites/read")
+
info(" - Microsoft.Web/sites/webJobs/read")
     info(" - Microsoft.ContainerRegistry/registries/read")
     info(" - Microsoft.Sql/servers/read")
     info(" - Microsoft.Sql/servers/databases/read")
@@ -320,6 +324,64 @@ def run_azure_doctor() -> None:
     info(" - Microsoft.Search/searchServices/read")
     info(" - Microsoft.Insights/metrics/read")
+    # Probe the two permissions that custom CleanCloudReadOnly roles historically omitted.
+    # Reader (built-in) includes these; custom least-privilege roles may not.
+    info("")
+    info("Step 5: Runtime Permission Probes (custom-role gap check)")
+    info("-" * 70)
+
+    from azure.mgmt.compute import ComputeManagementClient
+    from azure.mgmt.web import WebSiteManagementClient
+
+    compute_client = ComputeManagementClient(credential, subscriptions[0].subscription_id)
+    web_client = WebSiteManagementClient(credential, subscriptions[0].subscription_id)
+
+    # Probe: Microsoft.Compute/virtualMachines/instanceView/action
+    # Required by azure.vm.stopped_not_deallocated — reads PowerState from instance view statuses.
+    try:
+        _vms = list(compute_client.virtual_machines.list_all())
+        _first_vm = next(iter(_vms), None)
+        if _first_vm:
+            _rg = _first_vm.id.split("/")[
+                _first_vm.id.lower().split("/").index("resourcegroups") + 1
+            ]
+            compute_client.virtual_machines.get(_rg, _first_vm.name, expand="instanceView")
+            success("Microsoft.Compute/virtualMachines/instanceView/action")
+        else:
+            info(
+                "Microsoft.Compute/virtualMachines/instanceView/action — not tested (no VMs found to probe)"
+            )
+    except Exception as e:
+        if "AuthorizationFailed" in str(e) or "403" in str(e):
+            warn(f"Microsoft.Compute/virtualMachines/instanceView/action — DENIED: {e}")
+            warn(
+                " azure.vm.stopped_not_deallocated will skip all VMs — add this action to your custom role"
+            )
+        else:
+            info(f"Microsoft.Compute/virtualMachines/instanceView/action — could not probe: {e}")
+
+    # Probe: Microsoft.Web/sites/webJobs/read
+    # Required by azure.app_service.idle — enumerates WebJobs to avoid false positives.
+    try:
+        _sites = list(web_client.web_apps.list())
+        _first_site = next(iter(_sites), None)
+        if _first_site:
+            _rg = _first_site.id.split("/")[
+                _first_site.id.lower().split("/").index("resourcegroups") + 1
+            ]
+            list(web_client.web_apps.list_web_jobs(_rg, _first_site.name))
+            success("Microsoft.Web/sites/webJobs/read")
+        else:
+            info("Microsoft.Web/sites/webJobs/read — not tested (no App Services found to probe)")
+    except Exception as e:
+        if "AuthorizationFailed" in str(e) or "403" in str(e):
+            warn(f"Microsoft.Web/sites/webJobs/read — DENIED: {e}")
+            warn(
+                " azure.app_service.idle will skip all App Services — add this action to your custom role"
+            )
+        else:
+            info(f"Microsoft.Web/sites/webJobs/read — could not probe: {e}")
+
     # Summary
     info("")
     info("=" * 70)
@@ -417,12 +479,13 @@ def run_azure_ai_doctor(subscription_id: str = None) -> None:
         )
     # Check: Microsoft.MachineLearningServices/workspaces/onlineEndpoints/read (list endpoints)
+    _endpoints = []
     if workspaces:
         try:
             ws = workspaces[0]
             rg =
ws.id.split("/")[ws.id.lower().split("/").index("resourcegroups") + 1]
             # Attempt to list online endpoints to validate permission
-            list(ml_client.online_endpoints.list(rg, ws.name))
+            _endpoints = list(ml_client.online_endpoints.list(rg, ws.name))
             permissions_tested.append(
                 "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/read"
             )
@@ -441,6 +504,39 @@ def run_azure_ai_doctor(subscription_id: str = None) -> None:
             "(permission may still be present)"
         )
+    # Check: Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments/read
+    # Required by azure.ml.online_endpoint.idle to read instance SKU and replica counts.
+    if workspaces and _endpoints:
+        try:
+            ws = workspaces[0]
+            rg = ws.id.split("/")[ws.id.lower().split("/").index("resourcegroups") + 1]
+            ep_name = _endpoints[0].name
+            list(ml_client.online_deployments.list(rg, ws.name, ep_name))
+            permissions_tested.append(
+                "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments/read"
+            )
+            success("Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments/read")
+        except Exception as e:
+            permissions_failed.append(
+                (
+                    "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments/read",
+                    str(e),
+                )
+            )
+            warn(
+                f"Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments/read — {e}"
+            )
+    elif workspaces:
+        info(
+            " Skipping onlineEndpoints/deployments/read check — no endpoints found to test against "
+            "(permission may still be present)"
+        )
+    else:
+        info(
+            " Skipping onlineEndpoints/deployments/read check — no workspaces found to test against "
+            "(permission may still be present)"
+        )
+
     # Check: Microsoft.CognitiveServices/accounts/read (for Azure OpenAI provisioned deployments)
     try:
         from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
@@ -491,6 +587,27 @@ def run_azure_ai_doctor(subscription_id: str = None) -> None:
         permissions_failed.append(("Microsoft.Search/searchServices/read", str(e)))
         warn(f"Microsoft.Search/searchServices/read — {e}")
+    # Data-plane RBAC warning for azure.ai_search.idle
+    # The rule calls Azure AI Search data-plane APIs (list indexes, indexers, etc.) using
+    # keyless RBAC auth. Management-plane Reader is not sufficient — the identity also needs
+    # Search Index Data Reader (or Search Service Contributor) assigned on each Search service.
+    # This cannot be probed here without knowing a service endpoint, so we always emit a notice.
+    info("")
+    warn("azure.ai_search.idle — data-plane RBAC required (not verified here)")
+    info(" The AI Search idle rule calls data-plane APIs to check structural emptiness.")
+    info(" Management-plane Reader alone is not sufficient. Assign one of:")
+    info(" - Search Index Data Reader (read-only, recommended)")
+    info(" - Search Service Contributor (broader access)")
+    info(" Scope: each Azure AI Search service (or resource group / subscription).")
+    info(" Without this, the rule skips all Search services silently.")
+    info(" Assign with:")
+    info(" az role assignment create \\")
+    info(' --role "Search Index Data Reader" \\')
+    info(" --assignee \\")
+    info(
+        " --scope /subscriptions//resourceGroups//providers/Microsoft.Search/searchServices/"
+    )
+
     # Check: Microsoft.Insights/metrics/read (already required by hygiene rules)
     try:
         from azure.mgmt.monitor import MonitorManagementClient
diff --git a/cleancloud/providers/aws/rules/elb_idle.py b/cleancloud/providers/aws/rules/elb_idle.py
index 4d290be..45c68b7 100644
--- a/cleancloud/providers/aws/rules/elb_idle.py
+++ b/cleancloud/providers/aws/rules/elb_idle.py
@@ -44,7 +44,7 @@
 APIs:
     - elbv2:DescribeLoadBalancers
-    - elb:DescribeLoadBalancers
+    - elasticloadbalancing:DescribeLoadBalancers (CLB)
     - cloudwatch:GetMetricStatistics
     - elbv2:DescribeTargetGroups (contextual)
     - elbv2:DescribeTargetHealth (contextual)
@@ -581,7 +581,7 @@ def _scan_clb(
         code = exc.response["Error"]["Code"]
         if code in ("AccessDenied", "UnauthorizedOperation"):
             raise
PermissionError(
-                "Missing required IAM permission: elb:DescribeLoadBalancers"
+                "Missing required IAM permission: elasticloadbalancing:DescribeLoadBalancers"
             ) from exc
         raise
     except BotoCoreError:
diff --git a/cleancloud/providers/aws/scan.py b/cleancloud/providers/aws/scan.py
index 600e459..f54b925 100644
--- a/cleancloud/providers/aws/scan.py
+++ b/cleancloud/providers/aws/scan.py
@@ -56,7 +56,10 @@
     "aws.ec2.ami.old": find_old_amis,
     "aws.ec2.nat_gateway.idle": find_idle_nat_gateways,
     "aws.rds.instance.idle": find_idle_rds_instances,
-    "aws.elbv2.load_balancer.idle": find_idle_load_balancers,
+    "aws.elbv2.load_balancer.idle": find_idle_load_balancers,  # aggregate key — params/enabled only
+    "aws.elbv2.alb.idle": find_idle_load_balancers,  # split IDs for exceptions/filters
+    "aws.elbv2.nlb.idle": find_idle_load_balancers,
+    "aws.elb.clb.idle": find_idle_load_balancers,
     "aws.ec2.instance.stopped": find_stopped_ec2_instances,
     "aws.ec2.security_group.unused": find_unused_security_groups,
     "aws.rds.snapshot.old": find_old_rds_snapshots,
@@ -71,7 +74,7 @@
     "aws.sagemaker.training_job.long_running": find_long_running_sagemaker_training_jobs,
 }
-AWS_RULES: List[Callable] = list(AWS_RULE_MAP.values())
+AWS_RULES: List[Callable] = list(dict.fromkeys(AWS_RULE_MAP.values()))
 # AI/ML waste rules — not run by default; use --category ai or --category all
 AWS_AI_RULES: List[Callable] = list(AWS_RULE_MAP_AI.values())
diff --git a/deploy/cloudformation/cleancloud-role.yaml b/deploy/cloudformation/cleancloud-role.yaml
index 2aeb9b5..1bc9236 100644
--- a/deploy/cloudformation/cleancloud-role.yaml
+++ b/deploy/cloudformation/cleancloud-role.yaml
@@ -63,6 +63,7 @@ Resources:
                 Action:
                   - ec2:DescribeVolumes
                   - ec2:DescribeSnapshots
+                  - ec2:DescribeSnapshotAttribute
                   - ec2:DescribeImages
                   - ec2:DescribeAddresses
                   - ec2:DescribeNetworkInterfaces
@@ -83,6 +84,12 @@ Resources:
                 Action:
                   - rds:DescribeDBInstances
                   - rds:DescribeDBSnapshots
+                  - rds:DescribeDBSnapshotAttributes
+                Resource: "*"
+              - Sid:
CloudTrailReadOnly
+                Effect: Allow
+                Action:
+                  - cloudtrail:LookupEvents
                 Resource: "*"
               - Sid: CloudWatchReadOnly
                 Effect: Allow
diff --git a/deploy/terraform/aws/main.tf b/deploy/terraform/aws/main.tf
index 7536edc..6b769f6 100644
--- a/deploy/terraform/aws/main.tf
+++ b/deploy/terraform/aws/main.tf
@@ -100,6 +100,7 @@ resource "aws_iam_role_policy" "cleancloud" {
         Action = [
           "ec2:DescribeVolumes",
           "ec2:DescribeSnapshots",
+          "ec2:DescribeSnapshotAttribute",
           "ec2:DescribeImages",
           "ec2:DescribeAddresses",
           "ec2:DescribeNetworkInterfaces",
@@ -126,6 +127,15 @@ resource "aws_iam_role_policy" "cleancloud" {
         Action = [
           "rds:DescribeDBInstances",
           "rds:DescribeDBSnapshots",
+          "rds:DescribeDBSnapshotAttributes",
+        ]
+        Resource = "*"
+      },
+      {
+        Sid = "CloudTrailReadOnly"
+        Effect = "Allow"
+        Action = [
+          "cloudtrail:LookupEvents",
         ]
         Resource = "*"
       },
diff --git a/docs/azure.md b/docs/azure.md
index 1c7f229..2b77a70 100644
--- a/docs/azure.md
+++ b/docs/azure.md
@@ -129,17 +129,24 @@ No `AZURE_CLIENT_SECRET` needed — OIDC uses federated credentials.
 ## AI/ML rules (opt-in)
-CleanCloud includes additional AI/ML waste detectors that run only when you pass `--category ai` (or `--category all`). Two new Azure rules were added:
+CleanCloud includes additional AI/ML waste detectors that run only when you pass `--category ai` (or `--category all`). Five Azure AI/ML rules are available:
-- `azure.ml.online_endpoint.idle` — Detects Azure ML managed online endpoints in `Succeeded` provisioning state that have received zero scoring requests for 7+ days. These endpoints bill per-instance (minimum replica count) regardless of traffic; signals are confirmed via per-endpoint Azure Monitor metrics (RequestCount, fallback ModelEndpointRequests). Age-only fallback applies only when metric data is unavailable and endpoint age >= 2× idle window (MEDIUM confidence).
+- `azure.aml.compute.idle` — Detects Azure ML compute clusters with non-zero minimum node count (baseline capacity always billed) and no workload activity over a fixed 14-day window. GPU clusters flagged HIGH risk. Fixed window — `idle_days` is not configurable.
-- `azure.ai_search.idle` — Detects Azure AI Search services on Standard tier or above with effectively zero search queries (SearchQueriesPerSecond average == 0) over a 30-day window. Cost is computed per SKU × replicas × partitions. If metrics are unavailable, age-only fallback (age >= 2× idle window) yields MEDIUM confidence.
+- `azure.ml.compute_instance.idle` — Detects Azure ML Compute Instances with no control-plane activity for `idle_days` (default 14). GPU instances flagged CRITICAL risk ($600–$15K+/month).
+
+- `azure.ml.online_endpoint.idle` — Detects Azure ML managed online endpoints in `Succeeded` provisioning state with zero `RequestsPerMinute` over a rolling `idle_days` window (default 7). These endpoints bill per-instance (minimum replica count) regardless of traffic. Metric result must resolve to ZERO with ≥80% minute-bucket coverage — insufficient coverage or query failure causes the endpoint to be skipped (fail-closed, no age-only fallback).
+
+- `azure.ai_search.idle` — Detects Azure AI Search services on Standard tier or above with effectively zero search queries (SearchQueriesPerSecond average == 0) over a fixed 90-day window. Requires both structural emptiness (no indexes, indexers, data sources, skillsets, synonym maps) AND confirmed metric silence. Cost model: None (SKU pricing too variable). Risk: MEDIUM. Confidence: HIGH when all conditions met. **Data-plane RBAC required** (Search Index Data Reader or equivalent) — management-plane Reader alone is not sufficient.
+
+- `azure.openai.provisioned_deployment.idle` — Detects Azure OpenAI provisioned deployments (ProvisionedManaged, GlobalProvisionedManaged, DataZoneProvisionedManaged SKUs) with zero AzureOpenAIRequests over a rolling `idle_days` window (default 7, max 30). PTU deployments bill ~$1,460/PTU/month on-demand regardless of traffic. Risk: HIGH always. Cost: None (no fixed PTU price constant).
 Permissions required for AI/ML scans
 The following actions are required by the AI/ML rules (add these to a custom role such as `security/azure/ai-readonly-role.json`):
 - Microsoft.MachineLearningServices/workspaces/read
+- Microsoft.MachineLearningServices/workspaces/computes/read
 - Microsoft.MachineLearningServices/workspaces/onlineEndpoints/read
 - Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments/read
 - Microsoft.CognitiveServices/accounts/read
@@ -286,6 +293,7 @@ az role assignment create \
 | `Microsoft.Compute/disks/read` | Unattached managed disks |
 | `Microsoft.Compute/snapshots/read` | Old snapshots |
 | `Microsoft.Compute/virtualMachines/read` | Stopped (not deallocated) VMs |
+| `Microsoft.Compute/virtualMachines/instanceView/action` | Stopped VM power state (instance view) |
 | `Microsoft.Network/publicIPAddresses/read` | Unused public IPs |
 | `Microsoft.Network/loadBalancers/read` | Empty load balancers |
 | `Microsoft.Network/applicationGateways/read` | Empty app gateways |
@@ -294,6 +302,7 @@ az role assignment create \
 | `Microsoft.Web/serverfarms/read` | Empty App Service Plans |
 | `Microsoft.Web/serverfarms/sites/read` | Empty App Service Plans (app count) |
 | `Microsoft.Web/sites/read` | Idle App Services |
+| `Microsoft.Web/sites/webJobs/read` | Idle App Services (WebJobs enumeration) |
 | `Microsoft.ContainerRegistry/registries/read` | Unused Container Registries |
 | `Microsoft.Sql/servers/read` | SQL server discovery |
 | `Microsoft.Sql/servers/databases/read` | Idle SQL databases |
diff --git a/docs/configuration.md b/docs/configuration.md
index 65cd3b4..05291b2 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -166,14 +166,15 @@ See [rules.md](rules.md) for the full list of rule IDs and their supported param
 | Param | Rule ID | Default | Description |
 |---|---|---|---|
-| `idle_days_threshold` | `aws.elbv2.load_balancer.idle` | 14 | Days of zero traffic before flagging |
+| `idle_days_threshold` | `aws.elbv2.alb.idle` | 14 | Days of zero traffic before flagging (ALB) |
+| `idle_days_threshold` | `aws.elbv2.nlb.idle` | 14 | Days of zero traffic before flagging (NLB) |
+| `idle_days_threshold` | `aws.elb.clb.idle` | 14 | Days of zero traffic before flagging (CLB) |
 | `idle_days_threshold` | `aws.ec2.nat_gateway.idle` | 14 | Days of zero traffic before flagging |
 | `idle_days_threshold` | `aws.rds.instance.idle` | 14 | Days of no connections before flagging |
 | `idle_days_threshold` | `aws.sagemaker.endpoint.idle` | 14 | Days of no observed `InvokeEndpoint` traffic before flagging |
 | `idle_days_threshold` | `aws.sagemaker.notebook.idle` | 14 | Days of stale control-plane timestamp state before flagging |
 | `idle_days_threshold` | `aws.sagemaker.studio_app.idle` | 7 | Days since the last usable Studio app activity timestamp before flagging |
 | `long_running_hours_threshold` | `aws.sagemaker.training_job.long_running` | 24 | Hours before an `InProgress` SageMaker training job is flagged |
-| `idle_days` | `azure.aml.compute.idle` | 14 | Days of no runs before flagging |
 | `idle_days` | `azure.ml.compute_instance.idle` | 14 | Days since last control-plane activity before flagging |
 | `idle_days` | `azure.sql.database.idle` | 14 | Days of no connections before flagging |
 | `idle_days` | `azure.app_service.idle` | 14 | Days of zero requests before flagging |
diff --git a/docs/rules.md b/docs/rules.md
index 430294d..2e20540 100644
--- a/docs/rules.md
+++ b/docs/rules.md
@@ -1,6 +1,6 @@
 # CleanCloud Rules
-45 rules across three providers (30 hygiene + 15 AI/ML).
+46 rules across three providers (30 hygiene + 16 AI/ML).
 | Provider | Hygiene | AI/ML | Total | Catalog |
 |---|---|---|---|---|
diff --git a/docs/rules/azure.md b/docs/rules/azure.md
index 912e6f3..0a7e7f8 100644
--- a/docs/rules/azure.md
+++ b/docs/rules/azure.md
@@ -149,7 +149,7 @@
 **Confidence / Risk:** HIGH (zero HTTP traffic confirmed) / MEDIUM
-**Permissions:** `Microsoft.Web/sites/read`, `Microsoft.Web/serverfarms/read`, `Microsoft.Insights/metrics/read`
+**Permissions:** `Microsoft.Web/sites/read`, `Microsoft.Web/sites/webJobs/read`, `Microsoft.Web/serverfarms/read`, `Microsoft.Insights/metrics/read`
 **Params:** `days_idle` (default: 14)
diff --git a/pyproject.toml b/pyproject.toml
index 4b05acc..0b9d670 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "cleancloud"
-version = "1.26.0"
+version = "1.27.0"
 description = "Read-only cloud hygiene for AWS, Azure, and GCP. Multi-account org scanning, CI/CD enforcement, and deterministic cost modeling. No agents, no telemetry."
readme = "README.md" requires-python = ">=3.10" diff --git a/security/aws/hygiene-readonly.json b/security/aws/hygiene-readonly.json index 21f55c7..a1c7107 100644 --- a/security/aws/hygiene-readonly.json +++ b/security/aws/hygiene-readonly.json @@ -7,6 +7,7 @@ "Action": [ "ec2:DescribeVolumes", "ec2:DescribeSnapshots", + "ec2:DescribeSnapshotAttribute", "ec2:DescribeImages", "ec2:DescribeAddresses", "ec2:DescribeNetworkInterfaces", @@ -32,7 +33,16 @@ "Effect": "Allow", "Action": [ "rds:DescribeDBInstances", - "rds:DescribeDBSnapshots" + "rds:DescribeDBSnapshots", + "rds:DescribeDBSnapshotAttributes" + ], + "Resource": "*" + }, + { + "Sid": "CloudTrailReadOnly", + "Effect": "Allow", + "Action": [ + "cloudtrail:LookupEvents" ], "Resource": "*" }, diff --git a/security/azure/hygiene-readonly-role.json b/security/azure/hygiene-readonly-role.json index 144d8f9..2d8897f 100644 --- a/security/azure/hygiene-readonly-role.json +++ b/security/azure/hygiene-readonly-role.json @@ -11,6 +11,7 @@ "Microsoft.Compute/disks/read", "Microsoft.Compute/snapshots/read", "Microsoft.Compute/virtualMachines/read", + "Microsoft.Compute/virtualMachines/instanceView/action", "Microsoft.Network/publicIPAddresses/read", "Microsoft.Network/loadBalancers/read", "Microsoft.Network/applicationGateways/read", @@ -19,6 +20,7 @@ "Microsoft.Web/serverfarms/read", "Microsoft.Web/serverfarms/sites/read", "Microsoft.Web/sites/read", + "Microsoft.Web/sites/webJobs/read", "Microsoft.ContainerRegistry/registries/read", "Microsoft.Sql/servers/read", "Microsoft.Sql/servers/databases/read", diff --git a/tests/cleancloud/safety/aws/test_aws_iam_policy_parity.py b/tests/cleancloud/safety/aws/test_aws_iam_policy_parity.py new file mode 100644 index 0000000..faf66cd --- /dev/null +++ b/tests/cleancloud/safety/aws/test_aws_iam_policy_parity.py @@ -0,0 +1,117 @@ +""" +Parity test: assert AWS IAM policy files contain every action required +by the corresponding rule implementations. 
+ +Rationale: the existing read-only safety test (`test_aws_iam_policy_readonly.py`) +ensures no mutating actions slip in, but does NOT verify coverage. This test catches +the complementary failure mode — a required action silently omitted from the shipped +policy, leaving users with an "official" policy that produces coverage gaps at runtime. +""" + +import json +from pathlib import Path + +import pytest + +# --------------------------------------------------------------------------- +# Required actions per policy file — derived from rule implementations +# --------------------------------------------------------------------------- + +HYGIENE_REQUIRED_ACTIONS = { + # aws.ebs.unattached + "ec2:DescribeVolumes", + # aws.ebs.snapshot.old — list + public-snapshot attribute check + "ec2:DescribeSnapshots", + "ec2:DescribeSnapshotAttribute", + # aws.ec2.ami.old + "ec2:DescribeImages", + # aws.ec2.elastic_ip.unattached + "ec2:DescribeAddresses", + # aws.ec2.eni.detached + "ec2:DescribeNetworkInterfaces", + # aws.ec2.nat_gateway.idle + "ec2:DescribeNatGateways", + # region discovery + "ec2:DescribeRegions", + # aws.ec2.instance.stopped, aws.ec2.security_group.unused + "ec2:DescribeInstances", + "ec2:DescribeSecurityGroups", + # aws.elbv2.alb.idle / aws.elbv2.nlb.idle / aws.elb.clb.idle + "elasticloadbalancing:DescribeLoadBalancers", + "elasticloadbalancing:DescribeTargetGroups", + "elasticloadbalancing:DescribeTargetHealth", + # aws.rds.instance.idle + aws.rds.snapshot.old + public-snapshot attribute check + "rds:DescribeDBInstances", + "rds:DescribeDBSnapshots", + "rds:DescribeDBSnapshotAttributes", + # aws.ec2.instance.stopped — stopped-duration CloudTrail probe + "cloudtrail:LookupEvents", + # metrics (NAT gateway, RDS, ELB idle detection) + "cloudwatch:GetMetricStatistics", + # aws.cloudwatch.logs.infinite_retention + "logs:DescribeLogGroups", + # aws.resource.untagged + "s3:ListAllMyBuckets", + "s3:GetBucketTagging", +} + +AI_REQUIRED_ACTIONS = { + # 
aws.sagemaker.endpoint.idle + "sagemaker:ListEndpoints", + "sagemaker:DescribeEndpoint", + "sagemaker:DescribeEndpointConfig", + # aws.sagemaker.notebook.idle + "sagemaker:ListNotebookInstances", + "sagemaker:DescribeNotebookInstance", + # aws.sagemaker.studio_app.idle + "sagemaker:ListApps", + "sagemaker:DescribeApp", + # aws.sagemaker.training_job.long_running + "sagemaker:ListTrainingJobs", + "sagemaker:DescribeTrainingJob", + # aws.bedrock.provisioned_throughput.idle + "bedrock:ListProvisionedModelThroughputs", + # aws.ec2.gpu.idle + "ec2:DescribeInstances", + "cloudwatch:GetMetricStatistics", + "cloudwatch:ListMetrics", +} + +POLICY_PARITY: list[tuple[Path, set[str]]] = [ + (Path("security/aws/hygiene-readonly.json"), HYGIENE_REQUIRED_ACTIONS), + (Path("security/aws/ai-readonly.json"), AI_REQUIRED_ACTIONS), +] + + +def _actions_in_policy(policy_path: Path) -> set[str]: + policy = json.loads(policy_path.read_text()) + actions: set[str] = set() + for statement in policy.get("Statement", []): + raw = statement.get("Action", []) + if isinstance(raw, str): + raw = [raw] + for action in raw: + actions.add(action) + return actions + + +@pytest.mark.safety +@pytest.mark.aws +@pytest.mark.parametrize( + "policy_path,required", + POLICY_PARITY, + ids=lambda x: x.name if isinstance(x, Path) else "required", +) +def test_aws_iam_policy_contains_required_actions(policy_path, required): + """ + Assert that every runtime-required action is present in the shipped IAM policy. + Missing actions cause silent coverage gaps at runtime — rules skip resources + without any error when the required permission is absent. + """ + actual = _actions_in_policy(policy_path) + missing = required - actual + assert not missing, ( + f"{policy_path.name} is missing {len(missing)} required action(s):\n" + + "\n".join(f" - {a}" for a in sorted(missing)) + + "\nAdd them to the IAM policy to prevent silent coverage gaps at runtime." 
+ ) diff --git a/tests/cleancloud/safety/azure/test_azure_role_parity.py b/tests/cleancloud/safety/azure/test_azure_role_parity.py new file mode 100644 index 0000000..a77cf3f --- /dev/null +++ b/tests/cleancloud/safety/azure/test_azure_role_parity.py @@ -0,0 +1,103 @@ +""" +Parity test: assert Azure role template files contain every action required +by the corresponding rule implementations. + +Rationale: the existing read-only safety test (`test_azure_role_definition_readonly.py`) +ensures no mutating actions slip in, but does NOT verify coverage. This test catches +the complementary failure mode — a required action silently omitted from the shipped role, +leaving users with an "official" role that produces coverage gaps at runtime. +""" + +import json +from pathlib import Path + +import pytest + +# --------------------------------------------------------------------------- +# Required actions per role file — derived from rule headers and doctor probes +# --------------------------------------------------------------------------- + +HYGIENE_REQUIRED_ACTIONS = { + # azure.compute.managed_disk.unattached + "Microsoft.Compute/disks/read", + # azure.compute.snapshot.old + "Microsoft.Compute/snapshots/read", + # azure.vm.stopped_not_deallocated — list + instance view (PowerState) + "Microsoft.Compute/virtualMachines/read", + "Microsoft.Compute/virtualMachines/instanceView/action", + # azure.network.public_ip.unused + "Microsoft.Network/publicIPAddresses/read", + # azure.load_balancer.no_backends + "Microsoft.Network/loadBalancers/read", + # azure.app_gateway.no_backends + "Microsoft.Network/applicationGateways/read", + # azure.vnet_gateway.idle + "Microsoft.Network/virtualNetworkGateways/read", + "Microsoft.Network/connections/read", + # azure.app_service_plan.empty + "Microsoft.Web/serverfarms/read", + "Microsoft.Web/serverfarms/sites/read", + # azure.app_service.idle — includes WebJobs enumeration + "Microsoft.Web/sites/read", + "Microsoft.Web/sites/webJobs/read", + 
# azure.container_registry.unused + "Microsoft.ContainerRegistry/registries/read", + # azure.sql.database.idle + "Microsoft.Sql/servers/read", + "Microsoft.Sql/servers/databases/read", + # metrics (sql, app service, vnet gateway, container registry) + "Microsoft.Insights/metrics/read", + # subscription + resource discovery + "Microsoft.Resources/subscriptions/read", + "Microsoft.Resources/resources/read", +} + +AI_REQUIRED_ACTIONS = { + # azure.aml.compute.idle, azure.ml.compute_instance.idle + "Microsoft.MachineLearningServices/workspaces/read", + "Microsoft.MachineLearningServices/workspaces/computes/read", + # azure.ml.online_endpoint.idle — endpoint list + deployment reads + "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/read", + "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments/read", + # azure.openai.provisioned_deployment.idle + "Microsoft.CognitiveServices/accounts/read", + "Microsoft.CognitiveServices/accounts/deployments/read", + # azure.ai_search.idle (management-plane; data-plane RBAC is separate) + "Microsoft.Search/searchServices/read", + # metrics for all AI rules + "Microsoft.Insights/metrics/read", +} + +ROLE_PARITY: list[tuple[Path, set[str]]] = [ + (Path("security/azure/hygiene-readonly-role.json"), HYGIENE_REQUIRED_ACTIONS), + (Path("security/azure/ai-readonly-role.json"), AI_REQUIRED_ACTIONS), +] + + +def _actions_in_role(role_path: Path) -> set[str]: + role = json.loads(role_path.read_text()) + actions: set[str] = set() + for perm in role.get("Permissions", []): + for action in perm.get("Actions", []): + actions.add(action) + return actions + + +@pytest.mark.safety +@pytest.mark.azure +@pytest.mark.parametrize( + "role_path,required", ROLE_PARITY, ids=lambda x: x.name if isinstance(x, Path) else "required" +) +def test_azure_role_contains_required_actions(role_path, required): + """ + Assert that every runtime-required action is present in the shipped role template. 
+ Missing actions cause silent coverage gaps at runtime — rules skip resources + without any error when the required permission is absent. + """ + actual = _actions_in_role(role_path) + missing = required - actual + assert not missing, ( + f"{role_path.name} is missing {len(missing)} required action(s):\n" + + "\n".join(f" - {a}" for a in sorted(missing)) + + "\nAdd them to the role template to prevent silent coverage gaps at runtime." + )
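Reviewer note: the two parity tests above reduce to a set difference between the actions a rule needs and the actions the shipped policy grants. The sketch below reproduces that check against an in-memory policy so it can be run without the repository files. The names `missing_actions`, `SAMPLE_POLICY`, and `SAMPLE_REQUIRED` are hypothetical and exist only for illustration; the sample policy is modeled on the pre-change RDS statement in this diff, not copied from the shipped file. Like the tests, it compares literal action strings and assumes the policy contains no wildcards such as `rds:Describe*`.

```python
import json


def actions_in_policy(policy: dict) -> set[str]:
    """Collect every action string listed in the policy's statements.

    Mirrors the helper in the parity test: "Action" may be a single
    string or a list of strings, and both forms are flattened into one set.
    """
    actions: set[str] = set()
    for statement in policy.get("Statement", []):
        raw = statement.get("Action", [])
        if isinstance(raw, str):
            raw = [raw]
        actions.update(raw)
    return actions


def missing_actions(policy: dict, required: set[str]) -> list[str]:
    """Return the required actions the policy does not grant, sorted."""
    return sorted(required - actions_in_policy(policy))


# Hypothetical policy: the RDS statement as it looked before this change.
SAMPLE_POLICY = json.loads(
    """
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["rds:DescribeDBInstances", "rds:DescribeDBSnapshots"],
          "Resource": "*"
        }
      ]
    }
    """
)

# Hypothetical requirement set for the RDS rules only.
SAMPLE_REQUIRED = {
    "rds:DescribeDBInstances",
    "rds:DescribeDBSnapshots",
    "rds:DescribeDBSnapshotAttributes",
}

if __name__ == "__main__":
    for action in missing_actions(SAMPLE_POLICY, SAMPLE_REQUIRED):
        print(f"missing: {action}")
```

Because the comparison is on exact strings, a policy that grants an action through a wildcard would be reported as missing it. That is a reasonable trade-off if the shipped policies are meant to list every action explicitly, which the files in this diff appear to do.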