Config recorder stopped delivering snapshot
| Service | AWS Config |
|---|---|
| Cloud | Amazon Web Services (AWS) |
| Guide type | Procedure |
| Skill level | Intermediate to advanced |
| Time | 15 - 60 minutes depending on account size |
Config recorder stopped delivering snapshot on AWS Config sits in the most-reported issues list across r/aws, AWS re:Post, and StackOverflow. The recovery path is mostly known, the AWS docs just bury it under three layers of conceptual material.
What config recorder stopped delivering snapshot actually involves on AWS Config
This task on AWS Config is one of the more searched operational topics on AWS in the last 12 months. The procedure below is the path that works in a current AWS account with default IAM and standard VPC config.
The rest of this page is the structured fix path. Start with diagnose, then remediation, then the automation options so you do not have to do this by hand the next time it surfaces. Verify and safety sections at the end are the discipline that keeps the fix from regressing in production.
Spot the symptom
Diff against last known good. The last config change you made is the cause about three quarters of the time, even when the change should not have mattered. Use AWS Config history (or your Terraform / CloudFormation drift report) to see the actual delta between the resource state when it worked and when it broke. The change you remember is often not the only change that happened.
Check CloudWatch Logs for the calling service. Lambda, ECS, EKS, Step Functions, API Gateway, and most managed services write detailed traces to CloudWatch Logs under predictable log group names. Use CloudWatch Logs Insights with fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50 to surface the most recent failures.
Check the AWS Health Dashboard at health.aws.amazon.com for ongoing service events in your region. About one in ten user-reported outages turn out to be region-scoped AWS service degradation already being tracked. AWS Health also exposes an API and EventBridge events, so you can wire a Lambda hook that pages on-call only when the failure correlates with an active AWS Health event in the same region and service.
Solution-focused remediation path
If networking is suspect, use VPC Reachability Analyzer. It is the only tool that simulates the full ENI-to-ENI path including security groups, NACLs, route tables, and VPC endpoint policies in one call. Manual trace is slower and misses transitive issues. The analyzer charges $0.10 per analysis - cheaper than a 30-minute call with your network team.
When the failure happens in production but not in dev, do not just compare the IAM policy. Compare the SCP / RCP at the OU level, the permission boundary on the role, and the resource-based policy on the target. One of those is almost always different between accounts. AWS Config conformance packs make this comparison routine.
If quotas are suspect, the Service Quotas console shows current usage and the active limit side by side. Request increases through Service Quotas, not through Support tickets - quota dashboard requests usually approve faster (often within minutes for soft limits) and they are auditable in CloudTrail. Set up Service Quotas + CloudWatch alarms at 80 percent usage so you get notified before you hit the wall.
Automate this fix so you do not do it twice
Wire the fix into EventBridge for self-healing
If the failure mode is recurring, automate the remediation instead of the diagnosis. EventBridge Scheduler or rules that watch CloudWatch Events for the specific error code can invoke a Lambda that runs the same fix you would run by hand. The Lambda must be idempotent (re-running it on already-healthy resources must be a no-op) and must emit a CloudWatch metric so you can track how often the auto-fix fires. A spike in auto-fix invocations is itself a signal worth alerting on.
# EventBridge rule pattern (JSON)
{ "source": ["aws.config"], "detail-type": ["AWS API Call via CloudTrail"], "detail": { "errorCode": ["AccessDenied", "ThrottlingException"] }
}Automate the fix with Python and boto3
For anything you do more than twice, write a small Python script. The boto3 pattern below uses paginators (so it does not blow up on accounts with thousands of resources), explicit region binding, and a dry-run flag that defaults to True. Keep the script under 100 lines; if it grows beyond that, you are building a tool and should put it behind a Lambda with proper logging.
import boto3, sys
DRY_RUN = '--apply' not in sys.argv
client = boto3.client('config', region_name='us-east-1')
paginator = client.get_paginator('describe_...')
for page in paginator.paginate(): for item in page.get('Items', []): if item.get('Status') == 'FAILED': if DRY_RUN: print(f'[dry-run] would fix {item["Id"]}') else: client.modify_...(ResourceId=item['Id']) print(f'fixed {item["Id"]}')Automate the fix with the AWS CLI
The CLI one-liner pattern for AWS Config operations is roughly: aws config describe-... --query ... to read state, aws config modify-... --no-dry-run to apply the change, and aws config describe-... --query ... again to verify. Wrap it in a shell script that sets a region variable at the top and exits on first error with set -euo pipefail so a partial run does not leave the account in a half-fixed state.
# Template - replace placeholders with your account specifics
export AWS_REGION=us-east-1
export AWS_PROFILE=prod
aws config describe-... --query 'Resources[?Status==`FAILED`].[Id,Reason]' --output table
aws config modify-... --resource-id RESOURCE_ID --no-dry-run
aws config describe-... --resource-id RESOURCE_ID --query 'Status'
Pitfalls
The pitfall most teams hit on AWS Config is moving too fast and skipping the read-only validation step. Before any write, list the current state and save it. AWS APIs are eventually consistent for many resource types, so the validation snapshot is your only reliable reference if you need to undo. Save the output of the describe call to S3, not to your laptop.
Second pitfall: confusing IAM permission errors with networking errors. AccessDenied can be IAM (policy missing), networking (VPC endpoint policy blocking the call), or KMS (key policy missing). The error string looks identical for all three. Distinguish by looking at the CloudTrail event's errorCode and the encoded authorization message; do not assume IAM is the culprit just because the message says AccessDenied.
Full fix path
- Reproduce the original symptom path. If it still surfaces in any account or region or IAM role, you have not fixed it.
- Watch for 24 to 48 hours. AWS metrics and policy systems can mask issues with cached health for 6 to 12 hours, especially CloudFront and Route 53.
- Run a smoke test under realistic load. Happy-path tests miss race conditions and IAM session-cache issues.
- Capture the new state in a runbook so the next person on call does not have to rediscover this. Push it to Confluence or your team wiki, not into Slack.
- If the fix involved a permission change, run IAM Access Analyzer one more time to confirm you did not open a separate hole while closing this one.
Safety, rollback, blast radius
- Test in a non-production account if your environment has Control Tower or AWS Organizations. The cost of one sandbox account is cheaper than one rollback meeting.
- Export the existing config before changing it. Most AWS Config resources support describe + export to JSON via CLI - capture that to source control before you start.
- Know your rollback path. Some AWS Config operations are one-way (region migration, account-level feature opt-in, KMS key deletion past pending window). Confirm reversibility on the AWS doc before you commit.
- Be aware of cross-service impact. IAM role changes ripple to every service trusting that role. KMS key changes break every workload depending on that key. VPC endpoint changes affect every VPC consumer of that endpoint.
- Maintenance window discipline: if the change touches DNS, certificate rotation, or anything that emits TLS handshakes, line up a window with stakeholder notification, not a heroic mid-day swap.
FAQ
aws config describe-... first, then commit it before you change anything. A few operations are one-way (KMS key deletion past the pending window, region migration, account closure). Check the AWS doc for the specific API before you commit.aws CLI or SDK calls - those almost always still work.References
- docs.aws.amazon.com - official documentation for AWS Config
- AWS re:Post (formerly forums) - community Q&A with AWS-staff-verified answers
- AWS Health Dashboard at health.aws.amazon.com
- AWS Service Quotas console and AWS Well-Architected Tool
Related fixes
Related guides worth a look while you sort this one out:
- OAC for Lambda function URL AWS_IAM auth-type and SigV4 signing config
- Audit Manager Evidence collection automated from CloudTrail Config Security Hub
- CloudTrail trail status logging stopped
- Config advanced query SQL parsing error
- Config aggregator authorization failed source account
- Config CloudFormation intrinsic function not supported pack