A practical account of prompt chaos, structured workflows, and how we eventually got it under control.
Every engineering team using AI tooling goes through a version of the same arc. It starts with genuine excitement — you give Copilot a task, it does something surprisingly useful, and suddenly everyone's pairing with it. Then, quietly, the cracks appear.
We build MCP Express — infrastructure that connects AI agents to databases, APIs, and services. Getting there meant integrating over thirty upstream services ourselves, and this is the account of how that actually went: how vibe coding worked for us, where it started to break, what we tried, what didn't fit, and what finally held.
Phase One: Vibe Coding With Copilot
Our team adopted GitHub Copilot early. The productivity gains were real — boilerplate vanished, documentation stubs wrote themselves, and junior engineers stopped getting stuck on syntax and started getting stuck on architecture, which is the right problem to be stuck on.
Building MCP Express meant integrating a steadily growing list of upstream services: databases, REST APIs, cloud infrastructure, developer tools. For a while, each engineer would open a chat window with Copilot and just... figure it out: describe the service, describe what they needed, and ask Copilot to help wire it up.
The workflow was iterative and conversational — prompt, review, adjust, ship. And for individual tasks in isolation, it worked.
What we hadn't accounted for was what it looked like across the team.
The Consistency Problem
Each integration worked. But they didn't look like they came from the same codebase. Error handling patterns differed. Some integrations had retry logic, others didn't. Logging was inconsistent: some integrations emitted structured JSON, others plain strings. We'd even ended up with two different HTTP client libraries in the same project, not through disagreement, just through three separate conversations that never referenced each other.
None of this was negligence. It was the natural output of three engineers having three separate conversations with an AI that had no knowledge or memory of how the other integrations were built. Every prompt started from zero. Every engineer inadvertently re-decided questions that had already been decided.
We also noticed a quieter problem: engineers were writing the same contextual prompts over and over. "Here's our error handling pattern, here's the interface shape we expect, here's the logging format we use, now help me integrate X." That context had to be re-established every single time, for every single integration. It was productive, but it was wasteful in a way that compounded.
We needed standardisation. We started looking for ways to get there.
Phase Two: We Tried Spec Kit
The obvious candidate was Spec Kit — an open-source toolkit for what they call Spec-Driven Development (SDD). It's hard to ignore: the repository has accumulated over 92,000 stars and 8,000 forks, making it one of the fastest-growing developer tools of the past year. The pitch is compelling. Rather than treating specifications as throwaway documents, Spec Kit makes them first-class artifacts — a methodology where specifications become the source of truth that drives implementation, rather than scaffolding you discard once coding begins.
The workflow follows a structured sequence: /speckit.specify → /speckit.plan → /speckit.tasks → /speckit.implement. Instead of vibe coding every new feature, teams preemptively outline concrete project requirements, motivations, and technical aspects before handing that off to AI agents. On paper, this was exactly what we needed.
We put it through its paces on several integrations. The results were instructive.
What Spec Kit Gets Right
The output quality was genuinely impressive. When Spec Kit runs well, the implementation is coherent in a way that ad hoc prompting rarely achieves. The specification phase forces ambiguity to the surface before any code is written. The constitutional validation layer meaningfully reduces drift. For teams building novel, complex, interconnected systems — or for greenfield projects where requirements are genuinely unclear — Spec Kit earns every one of those stars.
We want to be clear about that before we get into where it fell short.
Where It Struggled for Our Case
Token consumption at scale. Before a single line of implementation, Spec Kit generates requirements documents, architecture notes, dependency maps, clarification exchanges, and validation reports. For a complex feature, this overhead is justified. For a tightly scoped, repeatable task — which every one of our integrations was — it starts to feel like commissioning a full architectural review every time you want to add a room. Multiply that across thirty-plus integrations and you're burning significant context on ceremony that doesn't move the implementation forward.
Doc-as-code is a feature, not a bug — but it has a cost. Spec Kit's philosophy that specifications are living artifacts and code is their expression is genuinely valuable — particularly for long-lived features, regulated environments, or large distributed teams where the spec needs to outlive the sprint. For our integration work the calculus was different. Each integration had a tight feedback loop: validate, adjust, ship. The implementation itself quickly became the clearest record of what was built. Maintaining full specification documents on top of that added ceremony without adding clarity for that specific type of work.
The full workflow is designed for problems where the plan and tasks are unknowns — ours weren't. Spec Kit's constitution file is actually a good mechanism for encoding team conventions: you define your standards there once, and every subsequent run is bound by them. That part works. The problem for us was structural, not configurational. The /speckit.specify → /speckit.plan → /speckit.tasks → /speckit.implement sequence assumes that each stage is genuinely uncertain and needs to be reasoned through fresh. For our integrations, that wasn't true.
Every integration we built followed the same plan — and that wasn't coincidence, it was architecture. Our system treats integrations as plugins, each one slotting into the same position in the stack. The input format is fixed. The output shape is fixed. An integration doesn't get to decide how it receives a request or how it returns a result — that's determined by the layer above it. Which meant the error handling, the retry logic, the interface contract — all of it was identical across every integration we built. The only thing that actually varied was the technical surface of the upstream service: some exposed a REST API, some GraphQL, some an SDK. That single variable drove all the meaningful differences. Everything else was already decided.
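To make the plugin idea concrete, here is a minimal sketch of what a fixed integration contract could look like. Every name in it (IntegrationRequest, IntegrationResult, Integration, EchoIntegration) is hypothetical and chosen for illustration; it is not the actual MCP Express interface. The point is only that the request and result shapes are owned by the layer above, and a subclass supplies nothing but the upstream call.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class IntegrationRequest:
    """Fixed input shape handed down by the layer above (hypothetical fields)."""
    operation: str
    params: dict = field(default_factory=dict)


@dataclass(frozen=True)
class IntegrationResult:
    """Fixed output shape every integration returns (hypothetical fields)."""
    ok: bool
    data: Any = None
    error: Optional[str] = None


class Integration(ABC):
    """Plugin contract: subclasses only supply the upstream-specific call."""

    name = "unnamed"

    @abstractmethod
    def call_upstream(self, request: IntegrationRequest) -> Any:
        """The one part that varies: a REST, GraphQL, or SDK call."""

    def handle(self, request: IntegrationRequest) -> IntegrationResult:
        # Shared behaviour lives here, so no integration re-decides it.
        try:
            return IntegrationResult(ok=True, data=self.call_upstream(request))
        except Exception as exc:  # deliberately broad for a sketch
            return IntegrationResult(ok=False, error=f"{self.name}: {exc}")


class EchoIntegration(Integration):
    """Stand-in for a real upstream; returns its params unchanged."""

    name = "echo"

    def call_upstream(self, request: IntegrationRequest) -> Any:
        if request.operation != "echo":
            raise ValueError(f"unsupported operation {request.operation!r}")
        return request.params


result = EchoIntegration().handle(IntegrationRequest("echo", {"id": 7}))
```

Because handle is defined once on the base class, an integration cannot change how failures are reported. It can only change what call_upstream does.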
Running the full Spec Kit workflow meant generating a fresh plan, a fresh task breakdown, and a fresh specification on every integration, when those artefacts were going to look nearly identical every time. We weren't discovering anything in those phases. We were re-confirming decisions that had already been made. The workflow was designed for genuine uncertainty; we had a repeating pattern with one variable. The fit was wrong, and the token cost of confirming that mismatch on every run added up.
Phase Three: Building a Custom Copilot Skill
We took a step back. The insight that emerged was simple but clarifying: Spec Kit is a tool for navigating uncertainty. We didn't have uncertainty — we had repetition. The plan was always the same. The tasks were always the same. What we needed wasn't a framework that reasoned through those questions on every run — it was a framework that already had the answers baked in, and only asked about the one thing that actually changed: the technical surface of the upstream service.
GitHub Copilot's agent mode supports custom skills: structured, invocable commands that you define and that Copilot executes within VS Code. They're a relatively underused feature, but they're exactly the right primitive for this problem. A skill lets you encode your team's knowledge (your conventions, your templates, your definition of done) and make it invocable with a single command, with no re-prompting required.
We built a two-command skill around the same specify → implement pattern that makes Spec Kit compelling, but scoped entirely to our problem.
The specify Command
When an engineer starts a new integration, they run specify. The skill doesn't generate a blank canvas — it runs a structured interview that asks only about the things that vary between integrations.
It asks about the upstream API or library: authentication method, rate limits, known failure modes, pagination behaviour, SDK quirks the engineer has already discovered through their research. It asks about prerequisites — which packages are required, what environment variables need to be set, any upstream account configuration steps. The engineer fills in the answers based on their actual research.
The output is a compact specification that captures precisely the integration-specific decisions. Everything else — error handling approach, retry semantics, logging format, the interface contract — is already encoded in the skill. The engineer doesn't specify those things because they're not decisions anymore. They're standards.
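As a rough illustration of how small such a spec can be, the sketch below models it as a Python dataclass. The field names and the example values ("example-crm", EXAMPLE_CRM_API_KEY, and so on) are assumptions invented for this example, not the format the skill actually emits. Note what is absent: nothing about error handling, retries, logging, or the interface contract.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class IntegrationSpec:
    """Only the decisions that vary between integrations (hypothetical fields)."""
    service: str
    surface: str                                  # "rest", "graphql", or "sdk"
    auth_method: str                              # e.g. "api_key" or "oauth2"
    rate_limit_per_minute: Optional[int] = None
    pagination: str = "none"
    known_failure_modes: List[str] = field(default_factory=list)
    required_packages: List[str] = field(default_factory=list)
    required_env_vars: List[str] = field(default_factory=list)
    notes: str = ""


# Answers an engineer might give during the interview (made-up values).
spec = IntegrationSpec(
    service="example-crm",
    surface="rest",
    auth_method="api_key",
    rate_limit_per_minute=120,
    pagination="cursor",
    known_failure_modes=["429 on burst traffic", "stale cursor after 10 minutes"],
    required_packages=["requests"],
    required_env_vars=["EXAMPLE_CRM_API_KEY"],
)

print(json.dumps(asdict(spec), indent=2))
```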
This is structurally similar to Spec Kit's constitutional approach: there's a validation layer that checks the spec against defined criteria before implementation begins. But instead of a general-purpose constitution, it's our constitution, encoding the specific decisions our team has already made and doesn't need to revisit.
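A spec check of this kind can be quite plain. The sketch below is one possible shape for it, assuming the spec arrives as a dictionary. The required fields, the allowed surfaces, and the rule that a spec may not override team standards are illustrative assumptions, not our real criteria.

```python
from typing import List

# Hypothetical criteria for this sketch.
ALLOWED_SURFACES = {"rest", "graphql", "sdk"}
REQUIRED_FIELDS = ("service", "surface", "auth_method")
TEAM_STANDARDS = ("error_handling", "retry", "logging")


def validate_spec(spec: dict) -> List[str]:
    """Return a list of problems; an empty list means the spec passes."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not spec.get(name):
            problems.append(f"missing required field: {name}")
    surface = spec.get("surface")
    if surface and surface not in ALLOWED_SURFACES:
        problems.append(f"unknown surface: {surface}")
    # Standards are owned by the skill, so a spec must not try to override them.
    for name in TEAM_STANDARDS:
        if name in spec:
            problems.append(f"{name} is a team standard and cannot be set per integration")
    return problems


print(validate_spec({"service": "example-crm", "surface": "soap", "retry": 5}))
```

Returning a list of problems rather than raising on the first one lets the engineer fix every gap in a single pass before implementation starts.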
The implement Command
Once the spec passes validation, implement takes over. It has two inputs: the spec just created, and a set of templates that encode our integration standards. These templates aren't suggestions — they're the structural skeleton every integration has to fit.
The AI fills in the integration-specific logic: the actual API calls, the data transformation, the authentication handling. The templates enforce the shape of the result. Error handling is consistent. Logging is consistent. The interface contract is identical across every integration. The only thing that varies is the part that should vary.
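One way to picture a template that enforces shape is a fixed wrapper that owns retries, logging, and the result format, and accepts the integration-specific call as its only variable part. The function below is a hedged sketch under that assumption. run_with_standards, its parameters, and the log fields are hypothetical names, and the exponential backoff policy is an illustrative choice rather than a description of our actual retry semantics.

```python
import json
import time
from typing import Any, Callable


def run_with_standards(
    name: str,
    call_upstream: Callable[[], Any],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> dict:
    """Fixed skeleton: retries, structured logs, and result shape are shared."""
    for attempt in range(1, max_attempts + 1):
        try:
            data = call_upstream()  # the only integration-specific step
            log(json.dumps({"integration": name, "event": "success", "attempt": attempt}))
            return {"ok": True, "data": data, "error": None}
        except Exception as exc:  # broad on purpose in this sketch
            log(json.dumps({"integration": name, "event": "failure",
                            "attempt": attempt, "error": str(exc)}))
            if attempt < max_attempts:
                # Exponential backoff: base_delay, then 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
    return {"ok": False, "data": None,
            "error": f"{name} failed after {max_attempts} attempts"}


outcome = run_with_standards("demo", lambda: {"rows": 3}, sleep=lambda s: None)
```

The sleep and log parameters are injectable so the skeleton can be exercised in tests without real delays or console output.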
Validation
After implementation, a validation pass checks the output against our standards — does it handle the specific error codes this API is known to return? Are the retry semantics correct? Does the interface match the contract the rest of the system expects?
This isn't an AI grading itself on a general rubric. It's checking against a specific, team-authored checklist.
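A team-authored checklist can be expressed as named predicates over the spec and a summary of the generated code. The sketch below assumes both arrive as dictionaries. How the implementation summary is produced (static analysis, a review step, or something else) is left open, and every key and check name here is hypothetical.

```python
from typing import Callable, Dict, List

# Each check receives the spec and a summary of the generated code.
Check = Callable[[dict, dict], bool]

CHECKLIST: Dict[str, Check] = {
    "handles every known error code":
        lambda spec, impl: set(spec["known_error_codes"]) <= set(impl["handled_error_codes"]),
    "uses the shared retry wrapper":
        lambda spec, impl: bool(impl["uses_shared_retry"]),
    "returns the contract fields":
        lambda spec, impl: set(impl["result_fields"]) == {"ok", "data", "error"},
}


def run_checklist(spec: dict, impl: dict) -> List[str]:
    """Return the names of failed checks; an empty list means the output passes."""
    return [name for name, check in CHECKLIST.items() if not check(spec, impl)]


# Made-up example: the upstream is known to return 401 and 429.
example_spec = {"known_error_codes": [401, 429]}
example_impl = {
    "handled_error_codes": [401],
    "uses_shared_retry": True,
    "result_fields": ["ok", "data", "error"],
}
print(run_checklist(example_spec, example_impl))
```

Keeping each check named means a failure reads as a sentence a reviewer can act on, rather than a score.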
What This Actually Changed
The qualitative improvement was immediate. Engineers stopped re-establishing context on every integration. Onboarding a new integration went from a variable, open-ended conversation to a structured process with a predictable output shape. Code review for integrations became faster because reviewers knew exactly what to expect.
On token consumption, the difference was significant. A full Spec Kit run on an integration might consume several thousand tokens across its specify → plan → tasks → implement phases before generating a single line of production code. Our skill's specify phase consumed a fraction of that, and implementation drew only on the templates and the compact spec. For teams doing this at volume, that's not a trivial difference.
The deeper win was standardisation. The codebase now reads like it was built by one careful engineer rather than three engineers in three separate conversations.
The Honest Comparison
Neither approach is universally better. They're optimised for different problems.
Reach for Spec Kit when: requirements are genuinely ambiguous, you're building novel features on greenfield projects, you need thorough documentation as a deliverable in itself, or you're working in a domain where the specification genuinely needs to precede the technical decisions.
Build a custom skill when: you already know your standards, you're doing a family of similar tasks at volume, your team has established conventions that aren't being enforced consistently, or you're paying a generality tax on every run that isn't buying you anything.
The instinct behind both approaches is correct: unstructured AI prompting produces inconsistent results, and structure is a forcing function for quality. What's worth examining is whether someone else's implementation of that structure fits your specific problem.
Spec Kit is the right tool when uncertainty is your enemy — when requirements are ambiguous, when the plan genuinely needs to be reasoned through, when you're building something novel and the specification phase will surface decisions you didn't know you had to make.
A custom skill is the right tool when repetition is your enemy — when you already know the plan, when the tasks are predictable, when you have a family of similar work that shares 90% of its structure and only varies in one or two dimensions. In that case, encoding your knowledge into a skill removes the overhead without losing the discipline.
Spec Kit is a tool for navigating uncertainty. Custom skills are a tool for eliminating redundancy. Both are structured alternatives to vibe coding — the question is which kind of structure your problem actually needs.
Building Your Own
If you're in a similar position — repeatable tasks, established standards, a team that keeps re-prompting the same context — Copilot skills are worth the investment. The VS Code documentation on custom agent skills is a good starting point.
If the AI keeps re-discovering things your team already knows, it might be worth asking whether a general framework is still the right fit — or whether encoding your own standards into a skill would serve you better.
We went through that process ourselves and learned a lot in the specifics — how to structure the templates, what makes good validation criteria, and how to run the interview questions in the specify phase. Happy to share more if you're heading down the same path. Drop us a line at hello@mcp-express.com.