The rise of the AI SDLC

About eight months ago, I gave a talk at AWS Community Day Dublin titled The Rise of the AI SDLC. The reception was warm, but more interesting to me were the conversations afterwards — engineers kept asking variants of the same question: we’ve deployed Copilot, we’ve given everyone ChatGPT access, and we’re not seeing the transformation we expected. What are we doing wrong?

The honest answer is: nothing, if your goal was to help developers type faster. Everything, if your goal was to improve how software is built.

Those are not the same goal, and conflating them is what’s driving most of the disappointment with AI in engineering organizations today.


The wrong frame

The dominant narrative around AI in software development is individual productivity. Autocomplete. Code generation. “10x developer.” The metrics organizations reach for first — lines of code generated, suggestions accepted, time saved per task — all measure the same thing: output velocity for individual contributors.

This frame isn’t wrong. AI tools do make individual developers faster at certain tasks. But it’s incomplete in a way that matters enormously at scale.

The reason most software projects fail isn’t that developers type too slowly. It’s vague requirements. Inconsistent architectural decisions. Tribal knowledge that lives in three people’s heads and walks out the door when they leave. Design patterns that got established two years ago and nobody can tell you why. Integration contracts that are technically documented but practically ignored.

If AI is just autocomplete, it helps individual contributors produce output faster. It does nothing about the organizational problems that cause that output to be wrong, inconsistent, or misaligned. In fact, it can make those problems worse — because you can now produce misaligned output at higher velocity.

The shift I was advocating for in that talk — and the one I still believe in — is treating AI not as a productivity hack but as a quality enforcement system. One that propagates organizational context, enforces architectural standards, and makes the knowledge that currently lives in people’s heads machine-readable and systematically applicable.


Why code generation is the least valuable thing AI can do

This was the most provocative slide in the deck, and the one that got the most pushback.

My argument is simple: the bottleneck in software delivery is rarely in the coding. When you trace the root cause of most post-deployment defects, production incidents, or architectural drift back to its origin, it almost never points to a developer who typed too slowly. It points to a requirement that was never fully specified. A design decision that was made in a meeting and never written down. An API contract that was understood differently by the consuming team than by the producing team.

Code generation, as typically implemented, starts at phase three of a six-phase SDLC and generates output based on the context the developer happens to put in the prompt. That context is whatever the individual developer knows — which is, by definition, a subset of what the organization knows. It doesn’t know about the ADR that ruled out a particular pattern last quarter. It doesn’t know about the security requirement that was added to the API spec after the last incident. It doesn’t know about the naming conventions in the event schema registry.

So you get fast code that may or may not conform to your organizational standards. The speed is real. The conformance is luck.


The organizational context pyramid

The core architectural idea I introduced in the talk is what I called the Organizational Context Pyramid. It has three layers, and the insight is that value flows upward — but only if the foundation is solid.

```mermaid
graph TD
    A["🔺 Output Layer<br/>Auto-generated SDKs · Tests · Runbooks · Docs"]
    B["⚙️ Intelligence Layer<br/>AI Reasoning · Retrieval · Agent Orchestration"]
    C["📚 Foundation Layer<br/>OpenAPI Specs · ADRs · Event Schemas · Pattern Libraries · RFCs"]
    C --> B --> A
    style C fill:#1a1917,stroke:#2e2b27,color:#e8e6e1
    style B fill:#1a1917,stroke:#2e2b27,color:#e8e6e1
    style A fill:#1a1917,stroke:#2e2b27,color:#e8e6e1
```

The foundation layer is machine-readable organizational knowledge. OpenAPI and AsyncAPI specifications. Architecture Decision Records. Event schemas in a registry. A pattern library of approved design solutions. Team topology documentation. Risk catalogues. These are not documentation artifacts — they are the organizational memory that AI systems need in order to produce consistent, contextually correct output.
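To make "machine-readable" concrete, here is a minimal sketch of what an ADR looks like as a structured record rather than a wiki page. The field names and the `rejected_patterns` idea are illustrative, not a standard — the point is that a program (or an AI system) can filter and query these records.

```python
from dataclasses import dataclass, field

# Hypothetical, minimal shape for a machine-readable ADR record.
# Field names are illustrative, not any published standard.
@dataclass
class ADR:
    id: str                  # e.g. "ADR-0042"
    title: str
    status: str              # "proposed" | "accepted" | "superseded"
    decision: str            # the decision itself, one paragraph
    rejected_patterns: list = field(default_factory=list)

def is_active(adr: ADR) -> bool:
    """Only accepted ADRs should constrain new designs."""
    return adr.status == "accepted"

adr = ADR(
    id="ADR-0042",
    title="Use event sourcing for order state",
    status="accepted",
    decision="Order state changes are persisted as events, not row updates.",
    rejected_patterns=["shared-database-writes"],
)
```

Once decisions have this shape, "which patterns did we rule out last quarter?" becomes a query instead of an archaeology exercise.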

The intelligence layer is where AI reasoning happens: retrieval, reasoning over the foundation, orchestration of tasks. This layer is only as good as what’s underneath it. Feed it a complete, well-maintained foundation, and it can enforce consistency across thousands of decisions. Feed it a stale Confluence dump, and you get confident-sounding noise.
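A toy sketch of the retrieval step, assuming the foundation is just a dictionary of named documents. Real systems use embeddings and vector search; crude term overlap is enough to show the layering — the intelligence layer can only surface what the foundation actually contains.

```python
# Minimal retrieval sketch: score foundation documents by term overlap
# with a query. A production system would use embeddings; the point here
# is that retrieval quality is bounded by foundation quality.
def retrieve(query: str, docs: dict[str, str], k: int = 2) -> list[str]:
    terms = set(query.lower().split())
    scored = {
        name: len(terms & set(text.lower().split()))
        for name, text in docs.items()
    }
    ranked = sorted(scored, key=scored.get, reverse=True)
    return [name for name in ranked if scored[name] > 0][:k]

# Invented stand-ins for real foundation documents.
foundation = {
    "ADR-0042": "order state is event sourced rejected shared database writes",
    "api-spec": "payments api requires idempotency keys on all POST endpoints",
    "patterns": "approved retry pattern exponential backoff with jitter",
}
hits = retrieve("which pattern applies to retry and backoff", foundation)
# hits → ["patterns"]
```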

The output layer is what teams actually want: generated code, tests, runbooks, documentation. The output is the exhaust, not the engine. It emerges from the layers below it. If the output is inconsistent or wrong, the fix is almost never in the output layer — it’s in the foundation.

This was the architectural principle I kept returning to in the talk: fix the foundation first. Before deploying AI tools broadly, invest in making your organizational knowledge machine-readable. Ten to fifteen ADRs. A complete inventory of API specs. A pattern library with twenty to thirty code examples. A schema registry. These are not prerequisites for AI adoption — they are prerequisites for meaningful AI adoption.


A six-phase AI SDLC

The second major idea in the talk was that the AI opportunity isn’t concentrated in phase three (writing code) — it’s distributed across the entire software development lifecycle. I laid out six phases and argued that AI adds distinct value at each one.

Requirements gathering is where the most defects originate, and where AI intervention has the highest leverage. When a developer prompts an AI with a product requirement, the AI can check that requirement against organizational terminology standards, identify missing edge cases based on historical incident patterns, and generate a structured specification with pre-populated error codes and security requirements drawn from the foundation layer. The output isn’t just faster — it’s more complete.
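As a sketch of what a requirements-phase check could look like: lint the requirement text against an organizational glossary and a list of required sections before any code exists. The banned-term map and required sections here are invented placeholders for whatever the foundation layer actually defines.

```python
# Hypothetical org glossary and spec checklist — placeholders, not real data.
BANNED_TERMS = {"user": "customer", "delete": "soft-delete"}
REQUIRED_SECTIONS = ("error handling", "security")

def lint_requirement(text: str) -> list[str]:
    """Return findings for a draft requirement: terminology violations
    and missing mandatory sections."""
    findings = []
    lowered = text.lower()
    for bad, preferred in BANNED_TERMS.items():
        if bad in lowered.split():
            findings.append(f"use '{preferred}' instead of '{bad}'")
    for section in REQUIRED_SECTIONS:
        if section not in lowered:
            findings.append(f"missing section: {section}")
    return findings

issues = lint_requirement("The user can delete their account.")
# Four findings: two terminology violations, two missing sections.
```

The same idea scales up: an AI doing this check can also draw on incident history and the API spec, but even this trivial version catches defects at the cheapest possible moment.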

Architecture and design is where AI can enforce constraints that currently depend on someone knowing the right patterns and caring enough to ask. Does this design comply with our approved patterns? What’s the security posture? Does it meet data residency requirements? Does it match our observability standards? These checks can happen at design time, before any code is written, rather than in the architecture review meeting that happens six weeks later.
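A design-time gate can be sketched in a few lines, assuming a design document declares the patterns it uses and the foundation layer supplies an approved set plus the patterns ruled out by accepted ADRs. All names here are illustrative.

```python
# Hypothetical pattern sets sourced from the foundation layer.
APPROVED = {"event-sourcing", "outbox", "circuit-breaker"}
RULED_OUT = {"shared-database-writes"}   # e.g. rejected by an accepted ADR

def check_design(declared_patterns: set[str]) -> dict[str, set[str]]:
    """Flag patterns that were explicitly ruled out, and patterns the
    organization has never reviewed at all."""
    return {
        "violations": declared_patterns & RULED_OUT,
        "unknown": declared_patterns - APPROVED - RULED_OUT,
    }

report = check_design({"outbox", "shared-database-writes", "saga"})
# report["violations"] → {"shared-database-writes"}
# report["unknown"]    → {"saga"}
```

This is the six-weeks-later architecture review meeting, moved to the moment the design is written.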

Development is where most AI energy currently goes, and it’s genuinely valuable — but only when the AI is generating from context rather than from scratch. The difference is whether the AI knows which service it’s working within, what related APIs exist, what security guidelines apply, and which error handling patterns are approved. With that context, code generation produces type-safe clients, validation logic, and logging hooks that conform to organizational standards. Without it, you get plausible-looking code that will fail review.
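The mechanical difference between "from context" and "from scratch" is simply what goes into the prompt. A sketch, with hypothetical stand-ins for real spec, ADR, and pattern stores:

```python
# Sketch: assemble a generation prompt from foundation-layer context rather
# than relying on whatever the developer happens to remember.
def build_prompt(task: str, context: dict[str, str]) -> str:
    sections = [f"## {name}\n{text}" for name, text in sorted(context.items())]
    return "\n\n".join(["# Task", task, "# Organizational context", *sections])

prompt = build_prompt(
    "Add a refund endpoint to the payments service",
    {
        # Invented examples of retrieved foundation documents.
        "api-spec": "POST endpoints require an Idempotency-Key header.",
        "error-handling": "Use RFC 7807 problem+json error bodies.",
    },
)
```

In practice the `context` dictionary would be populated by the retrieval step in the intelligence layer, not hand-written — which is exactly why the foundation has to exist first.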

Testing is where spec-driven generation becomes powerful. When your API specifications are complete and authoritative, test cases can be derived from them automatically. Historical incident data can feed into edge case generation. Contract tests can be generated to verify that integration behavior matches published specs. The testing phase becomes a function of the quality of the foundation, not of the thoroughness of individual test authors.
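A toy version of spec-driven derivation, using a hand-abbreviated OpenAPI-style structure. A real generator would parse the full spec document; the shape of the idea — one contract test case per documented response — survives the simplification.

```python
# Abbreviated, hand-written stand-in for an OpenAPI spec.
spec = {
    "/refunds": {
        "post": {"responses": ["201", "400", "409"]},
        "get": {"responses": ["200", "404"]},
    },
}

def derive_cases(spec: dict) -> list[tuple[str, str, str]]:
    """One (method, path, expected_status) contract case per documented
    response in the spec."""
    return [
        (method.upper(), path, status)
        for path, ops in spec.items()
        for method, op in ops.items()
        for status in op["responses"]
    ]

cases = derive_cases(spec)
# Five cases, including the 409 conflict path a human tester might skip.
```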

Deployment is where AI can eliminate the inconsistency that plagues manually authored runbooks and deployment scripts. When runbooks are generated from the same context as the service being deployed, they are accurate and current by construction. Health checks, monitoring dashboards, and rollback scripts can follow the same pattern.
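"Accurate by construction" just means the runbook is rendered from the service descriptor rather than written beside it. A minimal sketch — the descriptor fields and the kubectl step are invented for illustration:

```python
# Sketch: render a rollback runbook from the same service descriptor that
# drives deployment, so the runbook cannot drift from the service itself.
def rollback_runbook(service: dict) -> str:
    return "\n".join([
        f"# Rollback: {service['name']}",
        f"1. kubectl rollout undo deployment/{service['name']}",
        f"2. Verify health check at {service['health_path']}",
        f"3. Page {service['owner']} if errors persist",
    ])

runbook = rollback_runbook(
    {"name": "payments", "health_path": "/healthz", "owner": "team-payments"}
)
```

Rename the service or move its health endpoint, and the next generated runbook is correct with no human remembering to update it.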

Operations and feedback closes the loop. Incidents generate new test cases. Post-incident reviews update ADRs. Anomaly patterns refine the risk catalogue. The organization learns systematically, not just in the heads of the people who were on-call.

The flywheel this creates is: better organizational context produces better specifications, which produce better AI suggestions, which produce better code and fewer incidents, which generates better organizational context. Each cycle compounds.


What this looked like before agents were a thing

I want to be precise about timing, because the landscape has changed significantly in the eight months since that talk.

When I built out the framework I was presenting, the AI tooling available to engineering teams was predominantly in the individual productivity category: Copilot, Cursor, various chat interfaces. The idea of AI agents that could be given a task and execute it across multiple tools and systems — using structured organizational context to make decisions — was discussed in research contexts but was not a shipped product capability.

The Organizational Context Pyramid wasn’t designed around agent orchestration. It was designed around a simpler observation: even in a world of “intelligent autocomplete,” the quality of suggestions is a direct function of the quality of context. Teams that had invested in machine-readable organizational knowledge were getting dramatically better results from the same AI tools than teams that hadn’t. The pyramid was a way of naming that architecture and making the investment case explicit.

What’s happened since is that the major AI providers have arrived at the same conclusion through a different path. Anthropic has shipped Projects and agent capabilities that explicitly require organizational context to function well. GitHub’s Copilot Workspace is, at its core, a system for giving AI access to repository-level context before it generates anything. The “agent teams” concept — where multiple specialized AI agents collaborate on a task — assumes a shared context layer that all agents can query.

The layered architecture I described at AWS Community Day Dublin maps directly onto what these platforms are building. Foundation layer → retrieval and reasoning → output generation. The pyramid wasn’t prescient; it was describing the only architecture that could actually work. The platforms just needed time to catch up to the requirement.


The metrics that matter

One slide I’ve been asked about repeatedly since the talk is the metrics comparison. The metrics most organizations use to evaluate AI adoption measure activity, not outcomes.

Lines of code generated measures activity. Time saved typing measures activity. Number of AI suggestions accepted measures activity. None of them tell you whether the software being built is getting better or worse.

The metrics that correspond to the actual goals of software development are different. Post-deployment defect rate. Time from requirement to production — the whole cycle, not just the coding phase. Architectural drift from approved patterns, measured over time. Consistency of pattern application across teams. Mean time to recovery when things go wrong. Time for a new engineer to become productive.
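"Architectural drift" is the least obvious of these to operationalize, so here is one possible (and deliberately crude) definition, assuming each service declares its patterns as in the design-gate example: the share of services using any pattern outside the approved set.

```python
# One possible operationalization of architectural drift — illustrative,
# not a standard metric. Pattern names and services are invented.
APPROVED = {"event-sourcing", "outbox", "circuit-breaker"}

def drift_rate(services: dict[str, set[str]]) -> float:
    """Fraction of services declaring at least one unapproved pattern."""
    drifted = sum(1 for pats in services.values() if pats - APPROVED)
    return drifted / len(services)

rate = drift_rate({
    "payments": {"outbox"},
    "orders": {"event-sourcing", "saga"},   # "saga" is not approved
    "inventory": {"circuit-breaker"},
})
# rate → one drifted service out of three
```

Tracked over time, even a metric this blunt tells you whether AI adoption is converging teams on the approved patterns or amplifying divergence.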

These are harder to measure. They require instrumentation that many teams don’t have. But they are the only metrics that tell you whether AI is doing what it should — which is to make the organization’s software better, not just to make developers faster.


What I’d say differently now

The core argument — invest in the foundation, treat AI as an organizational quality multiplier, measure outcomes not activity — I stand behind entirely. If anything, the rapid evolution of agent capabilities has made the foundation more important, not less. An AI agent with access to your organizational context is powerful. An AI agent without it is just confident noise with a bigger blast radius.

What I’d add to the talk today is a warning I didn’t articulate strongly enough: the same flywheel that compounds quality upward can compound drift downward. An organization that uses AI agents without investing in the foundation layer will codify its existing inconsistencies faster than it could manually. Tribal knowledge gets embedded into generated artifacts. Design anti-patterns get propagated at scale. The feedback loop runs in reverse.

The inverse of “organizational context is a superpower” is “without it, AI just autocompletes your existing chaos.” That’s not a new idea. But the speed at which it can happen is new, and underestimated.

The question for engineering organizations in 2026 isn’t whether to adopt AI in the development lifecycle. That question was settled. The question is whether to build the foundation that makes AI adoption worthwhile, or to skip it in the hope that the tools will compensate. They won’t.