Defect Cause Catalog
A catalog of where defects typically originate in the value stream, how to detect them automatically, where adding AI to the flow adds value, and how to correct them systemically.
| Category | Defect Cause | Earliest Detection Stage | Automated Detection Stage | Detection Method | AI Improvement | Systemic Correction |
|---|---|---|---|---|---|---|
| Product & Discovery | Building the wrong thing | Discovery | Production | Adoption dashboards (Amplitude; Mixpanel); funnel drop-off alerts; threshold alerts when adoption is below X% after N days | Synthesize user feedback, support tickets, and usage data to surface misalignment signals earlier than production metrics | Validated user research before backlog entry; dual-track agile; kill features that miss adoption thresholds |
| Product & Discovery | Solving a problem nobody has | Discovery | Production | Ticket topic clustering vs. feature investment; survey correlation with releases; feature request tracking | Semantic analysis of interview transcripts, forums, and tickets to identify actual vs. assumed pain | Problem validation as a stage gate; publish problem brief before solution brief; quantify user pain |
| Product & Discovery | Correct problem, wrong solution | Discovery | Production | A/B testing frameworks; feature flag cohort comparison; statistical significance calculators; task completion tracking | Evaluate prototypes against problem definitions; generate alternative solution approaches | Prototype multiple approaches; measurable success criteria before building; experiment with flags first |
| Product & Discovery | Meets spec but misses user intent | Requirements | Production | UX analytics (FullStory; Hotjar); rage-click and error-loop detection; task completion rate tracking | Review acceptance criteria for missing user outcomes; analyze session replays at scale for frustration patterns | Acceptance criteria as user outcomes, not functional checklists; regular usability testing cadence |
| Product & Discovery | Over-engineering beyond need | Design | CI | Static analysis for dead code and unused abstractions (SonarQube; ESLint); complexity scoring; LOC per feature | Flag unnecessary abstraction layers, premature optimization, and complexity that exceeds actual requirements in code review | YAGNI as team norm; time-box spikes; justify every abstraction layer; review architecture vs. actual scale quarterly |
| Product & Discovery | Prioritizing wrong work | Discovery | Production | DORA metrics vs. business outcomes; WSJF scoring; cost of delay dashboards; capacity vs. outcome tracking | Synthesize roadmap, customer data, and market signals to surface opportunity cost | WSJF for prioritization; regular portfolio reviews with outcome data; publish what you chose NOT to do |
| Integration & Boundaries | Interface mismatches | CI | CI | Consumer-driven contract tests (Pact); schema validation (OpenAPI; protobuf/buf); API compat checks per PR | Predict which consumers break from API changes based on usage patterns when formal contracts don't exist | Contract tests mandatory per boundary; API-first with generated clients; version + migration plan before merge |
| Integration & Boundaries | Wrong assumptions about upstream/downstream | Design | Staging | Chaos engineering (Gremlin; Litmus); synthetic transactions; fault injection; circuit breaker monitoring | Review code/docs to identify undocumented behavioral assumptions (timeouts; retries; error semantics) | Document behavioral contracts, not just schemas; defensive coding at boundaries; circuit breakers as default |
| Integration & Boundaries | Race conditions | Pre-Commit | Pre-Commit | Thread sanitizers (TSan); race detectors; TLA+; fuzz testing; load testing with concurrency | Limited: Can flag concurrency anti-patterns in review but cannot replace formal detection tools | Design for idempotency; queues over shared mutable state; lock ordering conventions; concurrency review checklist |
| Integration & Boundaries | Inconsistent distributed state | Design | Staging | Distributed tracing (Jaeger/Zipkin); reconciliation jobs; saga completion monitoring; anomaly detection | Review designs for missing compensation logic or consistency model mismatches | Choose consistency model deliberately per use case; saga with compensating transactions; event sourcing |
| Knowledge & Communication | Implicit domain knowledge not in code | Coding | CI | Magic number detection; git knowledge-concentration metrics (CodeScene; git-fame); onboarding time tracking | High Value: Identify undocumented business rules, missing 'why' in code, and knowledge gaps a new dev would hit | DDD with ubiquitous language; embed rules in code not wikis; pair across experience levels; rotate ownership |
| Knowledge & Communication | Ambiguous requirements | Requirements | CI | Flag stories without acceptance criteria; BDD spec coverage tracking; defect classification for 'requirements gap' | High Value: Review requirements for ambiguity; missing edge cases; contradictions; generate test scenarios from specs | Three Amigos before work starts; example mapping; executable specs as source of truth; given/when/then required |
| Knowledge & Communication | Tribal knowledge loss | Coding | CI | Bus factor from git history (CodeScene; git-fame); single-author concentration alerts; doc freshness checks | Generate documentation from code/tests; flag where docs have drifted from implementation | Pair/mob programming as default; rotate on-call; codify tribal knowledge in automation; living docs from code |
| Knowledge & Communication | Divergent mental models across teams | Design | CI | Divergent naming detection across codebases; contract test failures; integration defect tracking | High Value: Compare terminology and domain models across codebases to detect semantic mismatches | Shared domain model; explicit bounded contexts; regular cross-team syncs; shared glossary enforced via linting |
| Change & Complexity | Unintended side effects | CI | CI | Automated test suites; mutation testing (Stryker; PIT); change impact analysis flagging downstream consumers | Reason about semantic change impact beyond syntactic dependencies | Small focused commits; trunk-based development; feature flags to decouple deploy from release |
| Change & Complexity | Accumulated technical debt | CI | CI | Complexity trends; duplication scoring; dependency cycles; SonarQube quality gates; TODO/HACK counts | Identify architectural drift, abstraction decay, and calcified workarounds that static analysis misses | Refactoring as part of every story; dedicated debt budget; boy scout rule; treat rising complexity as leading indicator |
| Change & Complexity | Unanticipated feature interactions | Staging | Staging | Combinatorial/pairwise testing; feature flag interaction matrix (LaunchDarkly); canary with auto-rollback; regression suites | Reason about feature interactions semantically; flag conflicts that testing matrices miss | Feature flags with controlled rollout; modular design; canary deployments with auto-rollback on anomaly |
| Change & Complexity | Configuration drift | CI | CI | IaC drift detection (Terraform plan; Pulumi preview; AWS Config); environment comparison; smoke tests per environment | Not needed | All infrastructure as code; immutable infrastructure; GitOps; identical provisioning from same source |
| Testing & Observability Gaps | Untested edge cases and error paths | CI | CI | Mutation testing (Stryker; PIT); branch coverage thresholds; property-based testing (Hypothesis; fast-check) | Analyze code paths; generate tests for untested boundaries and error paths humans overlook | Tests required for every bug fix; property-based testing as standard; boundary value analysis; mutation scores as gate |
| Testing & Observability Gaps | Missing contract tests at boundaries | CI | CI | Boundary inventory vs. contract test inventory; CI fails if new endpoint lacks tests; Pact broker | Identify boundaries lacking tests by understanding semantic service relationships | Contract tests mandatory per new boundary; type varies by control level (see Contract Testing Strategies tab) |
| Testing & Observability Gaps | Insufficient monitoring | Design | CI | Observability coverage scoring; health endpoint checks; structured logging verification; SLO burn rate alerting | Review architectures; flag observability gaps against production readiness checklists | Observability as NFR on every service; production readiness checklist enforced; SLOs for every user-facing path |
| Testing & Observability Gaps | Test environments don't reflect production | CI | CI | Automated environment parity checks; synthetic transaction comparison; IaC diff tools; config comparison | Not needed | Same provisioning for all environments; production-like data in staging; containerization; test in prod with flags |
| Process & Deployment | Long-lived branches | Pre-Commit | Pre-Commit | Branch age alerts; merge conflict frequency; CI dashboard for branch count and divergence | Not needed | Trunk-based development; merge at least daily; CI rejects stale branches; feature flags eliminate branch need |
| Process & Deployment | Manual pipeline steps | CI | CI | Pipeline audit for manual gates; deployment lead time analysis; pipeline topology analysis | Not needed | Automate every step commit-to-production; manual approvals only for regulatory; treat pipeline as first-class product |
| Process & Deployment | Batching too many changes per release | CI | CI | Changes-per-deploy metrics; deployment frequency (DORA); batch size threshold alerts | Not needed | Continuous delivery — every commit is a candidate; single-piece flow; decouple deploy from release with flags |
| Process & Deployment | Inadequate rollback capability | CI | CI | Automated rollback testing in CI; mean time to rollback; migration reversibility checks | Not needed | Blue/green or canary as default; backward-compatible migrations only; auto-rollback on health failure; practice regularly |
| Process & Deployment | Reliance on human review to catch preventable defects | Coding | CI | Linters, SAST, type systems, and complexity scoring catch syntax and known patterns but miss semantic issues | High Value: Semantic code review for logic errors, implicit assumptions, missing edge cases, and design violations; shift human review from gate to mentoring | Automated quality gates in CI; AI review for correctness; reserve human review for knowledge transfer and design decisions; pair programming over async gatekeeping |
| Process & Deployment | Manual review of risks and compliance (CAB) | Design | Manual | Manual — risk checklists; change documentation review; impact assessment meetings; approval workflows | High Value: Automated risk scoring from change diff and deployment history; blast radius analysis; auto-approve low-risk; flag high-risk with evidence — eliminates delay without reducing safety | Replace CAB with automated pipelines; progressive delivery (canary/blue-green); auto-rollback on anomaly; every commit deployable; CAB adds delay and false confidence without improving safety |
| Data & State | Schema migration / backward compat failures | CI | CI | Schema compatibility checks (Avro; protobuf/buf); migration dry-runs against production-like data | Predict downstream impact by understanding how consumers actually use data beyond formal compatibility | Expand-then-contract always; never deploy breaking schema changes; schema registry with compat enforcement |
| Data & State | Null/missing data assumptions | Pre-Commit | Pre-Commit | Null safety static analysis (NullAway; TypeScript strict); null-input test generation; NPE monitoring | Flag code where optional fields used without null checks; suggest fixes even in non-strict languages | Enforce null-safe type systems; Option/Maybe as default; validate at boundaries; assert and handle explicitly |
| Data & State | Concurrency and ordering issues | CI | CI | Thread sanitizers (TSan); load tests with randomized timing; idempotency verification; model checkers (TLA+) | Not needed | Design for out-of-order delivery; idempotent consumers; version vectors on events; prefer immutable data |
| Data & State | Cache invalidation errors | Staging | Staging | Cache consistency monitoring; TTL verification; stale data detection; hit rate anomaly alerts | Review cache invalidation logic for incomplete paths or TTL/change-frequency mismatches | Short TTLs over complex invalidation; event-driven invalidation; cache-aside with explicit invalidation; alert on staleness |
| Dependency & Infrastructure | Third-party library breaking changes | CI | CI | Automated upgrade PRs (Dependabot; Renovate); SCA for breaking versions; lock file drift detection | Review changelogs to assess breaking change risk; predict compatibility issues from actual usage | Pin dependencies; automated upgrade PRs with test gates; abstract over volatile dependencies; evaluate stability before adoption |
| Dependency & Infrastructure | Infrastructure differences across environments | CI | CI | IaC drift detection; config comparison across environments; environment parity scoring; GitOps reconciliation | Not needed | Single source of truth for all environments; immutable infrastructure; containerization; GitOps with promotion |
| Dependency & Infrastructure | Network partitions / partial failures handled wrong | Staging | Staging | Chaos engineering (Gremlin; Litmus); synthetic transaction monitoring; circuit breaker state monitoring | Review architectures for missing failure patterns (no circuit breaker; no retry/backoff; no bulkhead) | Design for failure as default; circuit breakers; retries; bulkheads; test failure modes explicitly; game days |
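
The "single-author concentration alerts" detection method listed under Tribal knowledge loss can be approximated with very little code. The sketch below is a minimal illustration, not how CodeScene or git-fame work: it assumes commit history has already been parsed into `(file_path, author)` pairs (for example from `git log --name-only` output), and the function names, the `threshold` of 0.8, and the `min_commits` of 5 are hypothetical choices made for this example.

```python
from collections import Counter, defaultdict

# Assumed input shape: one (file_path, author) pair per file touched by a commit.
# Parsing real git history into this shape is left out of the sketch.
Commit = tuple[str, str]


def author_concentration(commits: list[Commit]) -> dict[str, tuple[str, float]]:
    """Return, per file, the top author and their share of that file's commits."""
    per_file: dict[str, Counter] = defaultdict(Counter)
    for path, author in commits:
        per_file[path][author] += 1

    result = {}
    for path, counts in per_file.items():
        top_author, top_count = counts.most_common(1)[0]
        result[path] = (top_author, top_count / sum(counts.values()))
    return result


def concentration_alerts(
    commits: list[Commit], threshold: float = 0.8, min_commits: int = 5
) -> list[str]:
    """List files where a single author made at least `threshold` of the commits.

    Files with fewer than `min_commits` commits are skipped to reduce noise.
    Both defaults are illustrative, not recommended values.
    """
    totals = Counter(path for path, _ in commits)
    return sorted(
        path
        for path, (_, share) in author_concentration(commits).items()
        if share >= threshold and totals[path] >= min_commits
    )
```

A CI job could run a check like this on a schedule and open a ticket, or suggest a pairing rotation, for each flagged file. Treating the result as a prompt for the systemic corrections in the table, rather than as a build failure, keeps the signal useful without blocking delivery.
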
Last update: 2026-02-12