Inject controlled failures, latency spikes, and throttling so your retry, timeout, and graceful-degradation paths get exercised in tests

Chaos Engineering

cloudemu can deliberately fail or slow down services in controlled, time-bounded windows. Retry logic, timeouts, circuit breakers, and graceful-degradation paths in your code stay dead until something exercises them — chaos lets you exercise them deterministically in unit tests, no real cloud needed.

import (
    "time"

    "github.com/stackshy/cloudemu/v2"
    "github.com/stackshy/cloudemu/v2/config"
    "github.com/stackshy/cloudemu/v2/features/chaos"
)

cloud := cloudemu.NewAWS()
engine := chaos.New(config.RealClock{})

// Wrap any driver — same wrapper works for the Portable API or the
// SDK-compat HTTP server.
chaosS3 := chaos.WrapBucket(cloud.S3, engine)

// Run your app against chaosS3...
engine.Apply(chaos.ServiceOutage("storage", 5*time.Second))
// ...calls fail with the chaos-configured error for 5 seconds, then
// recover automatically.

How it works

The chaos engine is an in-memory bus that tracks active scenarios. A wrapper sits in front of a driver and consults the engine on every call:

1. The call arrives at the wrapper.
2. The wrapper asks the engine "is anything affecting this service / operation right now?"
3. If yes, the chaos behavior runs: return an error, sleep, or throttle.
4. Otherwise, the call passes through to the underlying driver.

Scenarios are time-bounded. Once their window elapses, the engine drops them and the service recovers automatically. No teardown step required.

Available scenarios

Scenario	Effect
`ServiceOutage(svc, duration)`	Every call to `svc` returns `Unavailable` for `duration`.
`LatencySpike(svc, extra, duration)`	Every call to `svc` is delayed by an additional `extra` for `duration`.
`ProbabilisticFailure(svc, op, err, p, duration)`	Each call to `svc.op` fails with `err` with probability `p` for `duration`.
`Throttle(svc, op, qps, duration)`	`svc.op` is rate-limited to `qps` for `duration`; excess calls return `Throttled`.
`Composite(scenarios...)`	Apply multiple scenarios at once — useful for "outage + latency on dependent service" patterns.

// Outage on storage for 5s.
engine.Apply(chaos.ServiceOutage("storage", 5*time.Second))

// 50% failure rate on PutObject for 10s
// (the cerrors import is not shown in this snippet).
engine.Apply(chaos.ProbabilisticFailure(
    "storage", "PutObject",
    cerrors.New(cerrors.Unavailable, "simulated"),
    0.5,
    10*time.Second,
))

// Add 200ms latency to every database call for 30s.
engine.Apply(chaos.LatencySpike("database", 200*time.Millisecond, 30*time.Second))

// Throttle messagequeue.SendMessage to 10 QPS for 1 minute.
engine.Apply(chaos.Throttle("messagequeue", "SendMessage", 10, time.Minute))

Wrapping every service

Each service category has a Wrap* helper that returns a driver-typed value, so the wrapper is a drop-in replacement for the underlying driver:

cloud := cloudemu.NewAWS()
engine := chaos.New(config.RealClock{})

s3      := chaos.WrapBucket(cloud.S3, engine)
ec2     := chaos.WrapCompute(cloud.EC2, engine)
ddb     := chaos.WrapDatabase(cloud.DynamoDB, engine)
fns     := chaos.WrapServerless(cloud.Lambda, engine)
sqs     := chaos.WrapMessageQueue(cloud.SQS, engine)
mon     := chaos.WrapMonitoring(cloud.CloudWatch, engine)
iam     := chaos.WrapIAM(cloud.IAM, engine)
dns     := chaos.WrapDNS(cloud.Route53, engine)
lb      := chaos.WrapLoadBalancer(cloud.ELB, engine)
sec     := chaos.WrapSecrets(cloud.SecretsManager, engine)
logs    := chaos.WrapLogging(cloud.CloudWatchLogs, engine) // named logs to avoid shadowing the standard log package
notif   := chaos.WrapNotification(cloud.SNS, engine)
ebus    := chaos.WrapEventBus(cloud.EventBridge, engine)
acr     := chaos.WrapContainerRegistry(cloud.ECR, engine)
cache   := chaos.WrapCache(cloud.ElastiCache, engine)
network := chaos.WrapNetworking(cloud.VPC, engine)
smaker  := chaos.WrapSageMaker(cloud.SageMaker, engine)
// GCP: vai := chaos.WrapVertexAI(cloud.VertexAI, engine)

The wrappers preserve the underlying driver interface, so existing code (or the Portable API, or the SDK-compat server) sees no API change; only the runtime behavior changes while scenarios are active.

Composing with other features

Chaos works underneath the Portable API:

bucket := storage.NewBucket(
    chaos.WrapBucket(cloud.S3, engine), // chaos applies first
    storage.WithRecorder(rec),
    storage.WithMetrics(mc),
)

And under the SDK-compat server:

srv := awsserver.New(awsserver.Drivers{
    S3: chaos.WrapBucket(cloud.S3, engine),
})
ts := httptest.NewServer(srv)

// Real aws-sdk-go-v2 client now sees chaos failures end-to-end.

This is the main reason chaos sits between the driver and its consumers: every code path that goes through the driver inherits the chaos behavior, including paths that reach it from the real cloud SDKs via the SDK-compat server.

Why time-bounded

Untimed chaos sticks. A test that injects a permanent failure has to remember to clear it on teardown, or it leaks into the next test. Every chaos scenario has an explicit duration, so the system always recovers within a known window. If you want a failure that is effectively indefinite, pass a duration longer than the test; to approximate a single-call failure, use ProbabilisticFailure with p=1.0 and a duration short enough to cover only that call. Note that every matching call inside that window fails, not just the first.
