Behavior-driven prompting: PRD to BDD to living spec

The most effective AI coding workflow starts with a PRD, decomposes it into behaviors, breaks those into task specs, and maintains a living spec of what the system actually does.

Most AI coding workflows start wrong. You open a chat, describe a feature, and let the agent figure it out. It works for small things. For anything with real complexity, it produces code that technically runs but doesn’t match what you actually needed. Google’s 2025 DORA Report found that near-universal AI adoption (roughly 90% of developers) correlated with a 9% increase in bug rates and 91% more time spent in code review. LinearB data shows 67.3% of AI-generated PRs get rejected, versus 15.6% for manually written code.

The problem isn’t the models. It’s the prompts. Or more precisely, it’s the absence of structured intent between “I want a feature” and “write the code.”

The flow: PRD → BDD → task specs

The workflow that consistently produces better results follows three stages.

Stage 1: Start with a PRD

A Product Requirements Document doesn’t need to be a 40-page enterprise artifact. It’s a short document that answers: what is this feature, who is it for, and what does success look like? The key is capturing outcomes, not implementation.

A PRD for a search feature might read:

Users can search their documents by keyword. Results appear within 500ms. Results are ranked by relevance with the search term highlighted. Queries with no matches show an empty state with suggested alternatives.

This is the what. Nothing about Elasticsearch versus SQLite full-text search. Nothing about React components. Just behavior.

Research backs this up. Addy Osmani’s analysis of effective agent specs points to GitHub’s study of over 2,500 agent configuration files, which found that most fail because they’re too vague, and that structured prompts with explicit technical constraints produce measurably better code than a plain PRD alone.

Stage 2: Decompose into behaviors

This is where BDD earns its keep. Take each requirement from the PRD and express it as a behavior in Given/When/Then format:

Given a user has 50 documents
When they search for "quarterly report"
Then results containing "quarterly report" appear within 500ms
And the search term is highlighted in each result

Given a user searches for "xyznonexistent"
When no documents match
Then an empty state is displayed
And three suggested alternative queries are shown

Why this format? Three reasons.

It’s testable by definition. Each behavior maps directly to an acceptance test. The agent doesn’t have to guess what “working search” means — it has concrete scenarios to satisfy.
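As a sketch of that mapping, the first scenario above might become an acceptance test like this (assuming a Vitest-style runner; `seedDocuments` and `searchDocuments` are hypothetical test helpers, not part of any real library):

```typescript
import { describe, it, expect } from "vitest";
// Hypothetical helpers: seed a test corpus and call the search API under test.
import { seedDocuments, searchDocuments } from "./helpers";

describe("document search", () => {
  it("returns highlighted matches within 500ms", async () => {
    // Given a user has 50 documents
    await seedDocuments(50, { containing: "quarterly report" });

    // When they search for "quarterly report"
    const start = performance.now();
    const results = await searchDocuments("quarterly report");
    const elapsed = performance.now() - start;

    // Then results appear within 500ms...
    expect(elapsed).toBeLessThan(500);
    expect(results.length).toBeGreaterThan(0);

    // ...and the search term is highlighted in each result
    for (const result of results) {
      expect(result.snippet).toContain("<mark>");
    }
  });
});
```

One scenario, one test; the agent’s job is to make it pass.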

It constrains the agent’s scope. Research on the “curse of instructions” shows that model performance degrades significantly as requirements pile up in a single prompt. Breaking a feature into discrete behaviors keeps each task focused. A UC San Diego/Cornell study from 2025 confirmed that professional developers who succeed with AI agents deploy explicit control strategies — structured decomposition being chief among them.

It’s readable by everyone. Product managers, designers, and QA can review behaviors without reading code. Alignment happens before the first line is written, not after the PR is up.

Stage 3: Break behaviors into task specs

Each behavior becomes one or more implementation tasks. These are the actual prompts your agent receives:

## Task: Implement document search endpoint

### Behavior

Given a user has documents, when they search by keyword,
results are returned within 500ms ranked by relevance.

### Constraints

- Use existing PostgreSQL full-text search (no new dependencies)
- Return max 20 results per page
- Include snippet with highlighted match

### Files likely affected

- src/api/search.ts
- src/db/queries.ts

The task spec is narrow, concrete, and references the behavior it implements. The agent knows what success looks like because the behavior definition tells it.
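For illustration, here is one plausible shape of what the agent might produce for that task. It assumes an Express-style router, the node-postgres (`pg`) client, and a `documents` table with a precomputed `search_vector` tsvector column; all of those are assumptions, not requirements of the workflow:

```typescript
import { Router } from "express";
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from PG* environment variables
export const searchRouter = Router();

// GET /search?q=<keyword>&page=<n>
// Satisfies the task constraints: existing PostgreSQL full-text search only,
// max 20 results per page, snippet with the match highlighted.
searchRouter.get("/search", async (req, res) => {
  const q = String(req.query.q ?? "").trim();
  const page = Math.max(0, Number(req.query.page) || 0);
  if (!q) return res.status(400).json({ error: "missing query parameter q" });

  const { rows } = await pool.query(
    `SELECT id,
            title,
            ts_rank(search_vector, query) AS rank,
            ts_headline('english', body, query,
                        'StartSel=<mark>, StopSel=</mark>') AS snippet
       FROM documents, plainto_tsquery('english', $1) AS query
      WHERE search_vector @@ query
      ORDER BY rank DESC
      LIMIT 20 OFFSET $2`,
    [q, page * 20]
  );
  res.json({ results: rows });
});
```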

The living spec: document what is, not what should be

Here’s where this workflow diverges from traditional spec-driven development. Most specs describe what the system should do. They’re aspirational documents that drift from reality the moment code ships.

Instead, maintain a spec file that describes what the system currently does. Every time a behavior is implemented, the spec gets updated — not with the requirement, but with the actual behavior as built. Think of it as a system-of-record, not a system-of-intent.

## Search (implemented 2026-02-15)

- Full-text search via PostgreSQL `ts_vector` on document body and title
- Returns top 20 results ranked by `ts_rank`
- Response time: p95 under 400ms on 10k document corpus
- Empty state shows 3 alternative queries generated from document tags
- Highlighting uses `ts_headline` with `<mark>` tags
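Updating the spec can be as lightweight as appending a dated entry when a behavior ships. A minimal sketch, assuming the spec lives in a system-spec.md file (the helper name is illustrative; a PR template or post-merge hook works just as well):

```typescript
import { appendFileSync } from "node:fs";

// Append a dated entry to the living spec after an implementation lands.
function recordBehavior(feature: string, facts: string[]): void {
  const date = new Date().toISOString().slice(0, 10);
  const entry = [`\n## ${feature} (implemented ${date})\n`]
    .concat(facts.map((fact) => `- ${fact}`))
    .join("\n");
  appendFileSync("system-spec.md", entry + "\n");
}

recordBehavior("Search", [
  "Full-text search via PostgreSQL `ts_vector` on document body and title",
  "Returns top 20 results ranked by `ts_rank`",
]);
```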

The living spec serves three purposes:

Agent context. When the agent works on a related feature later, it knows exactly how search works today — not how someone hoped it would work six months ago. Vercel’s agent eval research showed that persistent context achieves 100% pass rates versus 53% for on-demand retrieval. Your living spec is that persistent context.
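One way to operationalize that, sketched under the assumption that your tooling lets you assemble the agent’s prompt yourself (the prompt wording and file name are illustrative):

```typescript
import { readFileSync } from "node:fs";

// Build the prompt for a new agent session: the living spec goes first,
// so the agent sees current behavior before it sees the task.
function buildAgentPrompt(taskSpec: string): string {
  const livingSpec = readFileSync("system-spec.md", "utf8");
  return [
    "You are implementing one behavior in an existing system.",
    "Current system behavior, as built (authoritative):",
    livingSpec,
    "Task spec:",
    taskSpec,
  ].join("\n\n");
}
```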

Drift detection. When the spec says one thing and the tests say another, something changed without being documented. The spec becomes a changelog of system behavior.
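A blunt way to enforce this in CI, assuming source lives under src/ and the spec is named system-spec.md (both assumptions):

```typescript
import { execSync } from "node:child_process";

// Fail the build when source changed on this branch but the living spec
// did not. Crude, but it surfaces undocumented behavior changes early.
const changed = execSync("git diff --name-only origin/main...HEAD", {
  encoding: "utf8",
})
  .split("\n")
  .filter(Boolean);

const touchedSource = changed.some((file) => file.startsWith("src/"));
const touchedSpec = changed.includes("system-spec.md");

if (touchedSource && !touchedSpec) {
  console.error("Source changed but system-spec.md was not updated.");
  process.exit(1);
}
```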

Onboarding. New developers (and new agent sessions) can read the spec and understand the system as it actually exists. Thoughtworks’ analysis of spec-driven development identifies this as one of the key practices separating teams that scale AI-assisted development from those that generate chaos faster.

Why PRD → BDD → task specs works

The flow succeeds because each stage reduces ambiguity for the next:

| Stage | Ambiguity level | Audience |
| --- | --- | --- |
| PRD | High — outcomes only | Stakeholders, product |
| Behaviors | Medium — testable scenarios | Everyone |
| Task specs | Low — concrete implementation scope | Agent |

Each transition is a forcing function. You can’t write a behavior without understanding the requirement. You can’t write a task spec without understanding the behavior. By the time the agent receives its prompt, most of the hard thinking is done.

The data supports this. Augment Code’s research on multi-agent spec-driven systems found that maintaining alignment between specifications and implementation is the single largest factor in code generation quality. Academic research shows a 34.2% reduction in task completion time when structured agentic workflows replace ad-hoc approaches.

Meanwhile, Stack Overflow’s 2025 survey found that 72% of professional developers say vibe coding is not part of their professional work. The developers shipping real software are already doing some version of this — they just might not be calling it BDD.

Getting started

You don’t need a framework or a new tool. Start with three files:

  1. prd.md — What are we building and why?
  2. behaviors.md — Given/When/Then scenarios for each feature
  3. system-spec.md — What the system currently does (updated after each implementation)
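
A throwaway script like this scaffolds them, if you want a starting point (the placeholder contents are illustrative):

```typescript
import { writeFileSync, existsSync } from "node:fs";

// One-off scaffold for the three-file workflow. Skips existing files so
// it is safe to re-run.
const files: Record<string, string> = {
  "prd.md": "# PRD\n\nWhat are we building, for whom, and what does success look like?\n",
  "behaviors.md": "# Behaviors\n\nGiven/When/Then scenarios, one per feature.\n",
  "system-spec.md": "# System spec\n\nWhat the system currently does. Update after each implementation.\n",
};

for (const [name, content] of Object.entries(files)) {
  if (!existsSync(name)) writeFileSync(name, content);
}
```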

Write the PRD. Decompose it into behaviors. Hand the agent one behavior at a time as a task spec. After each implementation, update the system spec with what was actually built.

The overhead is minimal. The reduction in wasted cycles — rejected PRs, misunderstood requirements, agent hallucinations — is not.