I Tried to Open Source Ideas Instead of Code
Testing whether prose can reliably capture enough intent to reproduce a tool. It can — 95.6% of the time — but the effort doesn't justify itself.
Prompt
Write up my findings from building a pattern spec and automated test harness for open-sourcing intent instead of code. Cover what I built, what I found, and the honest conclusion about whether the format justifies itself.
Response
I had a thought: what if instead of sharing a GitHub repo for a simple web tool, you shared a recipe — a short description of what the tool does and why, written well enough that someone could paste it into an LLM and get a working version? Not a copy of your code. A copy of your intent.
I called it a “pattern.” I built a spec, a website, and a test harness that automated the whole question: can prose reliably capture enough intent to reproduce a tool?
What I built
The format was simple. A pattern file with four fields: intent (what and why), how (interactions described as verifiable sentences), tools (what it’s built with), and optional fields for composition and derivation. The key idea was that every sentence in how is a test case — paste the pattern into a fresh LLM session, generate from it, check each sentence against the output.
I built an automated harness to test this at scale: 10 single-page web apps, each with pattern files at different levels of detail, tested across multiple models. The pipeline would generate from a pattern, judge the output sentence by sentence, automatically revise failed patterns, and analyze recurring failures across the whole corpus.
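The scoring step of that pipeline reduces to a per-sentence verdict and a pass rate. A minimal sketch, with the LLM generation and judging stubbed out (the names here are mine, not the harness's):

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    sentence: str   # one test-case sentence from the pattern's `how` field
    passed: bool    # did the generated tool exhibit this behavior?

def pass_rate(verdicts: list[Verdict]) -> float:
    """Fraction of a pattern's sentences reproduced in the generated tool."""
    if not verdicts:
        return 0.0
    return sum(v.passed for v in verdicts) / len(verdicts)

# 19 of 20 sentences holding up is roughly the neighborhood the
# hand-written patterns landed in (95.6% across the whole corpus).
verdicts = [Verdict(f"sentence {i}", i != 0) for i in range(20)]
print(f"{pass_rate(verdicts):.0%}")
```

In the real harness the `passed` flag came from an LLM judge comparing each sentence against the generated output, and failing patterns were fed back for automatic revision.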
What I found
Hand-written patterns hit a 95.6% pass rate on behavioral equivalence. That's actually good: on average, 19 out of 20 sentences in a pattern were successfully reproduced in the generated tool.
The remaining ~4.4% clustered around visual specifications: exact positioning, viewport percentages, complex state transitions. These aren't fixable with better writing; they mark the boundary of what prose can convey about visual design.
The more interesting finding came from trying to improve the spec itself. I ran six iterations of an automated loop: test patterns, analyze failures, update the spec’s writing guidance, rewrite all patterns following the new guidance, retest. The result was counterintuitive — every automated rewrite made things worse. Patterns ballooned from ~300 sentences to 500+, and pass rates dropped. The spec’s guidance kept pushing toward over-specification, which created more test cases that failed on reasonable LLM defaults (hover states, subtle animations, standard affordances).
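The shape of that loop, with the LLM steps replaced by a stub that models the observed failure mode, looks roughly like this. The growth factor, the rate formula, and the function names are all my invention; the sketch only shows why "keep the best-scoring spec version" never advanced past the starting point:

```python
# Sketch of the automated spec-improvement loop. The stub models the
# observed failure mode: each automated rewrite grows the patterns and
# loses pass rate, so the "best so far" never beats the manual baseline.
def rewrite_and_retest(sentence_count: int) -> tuple[int, float]:
    grown = sentence_count * 23 // 20            # patterns balloon ~15%/iteration
    rate = 0.956 - 0.01 * (grown - 300) / 30     # more sentences, more misses
    return grown, rate

best_rate, best_iteration = 0.956, 0             # iteration 0: the manual edit
sentences = 300
for i in range(1, 7):                            # six automated iterations
    sentences, rate = rewrite_and_retest(sentences)
    if rate > best_rate:
        best_rate, best_iteration = rate, i

print(best_iteration, sentences)  # baseline wins; corpus passes 500+ sentences
```

Under these (invented) dynamics, the corpus grows past 500 sentences while the pass rate only falls, matching the real runs' trajectory.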
My manual edit to the spec — backing off the aggressive specificity and adding guidance about not over-specifying — remained the best-performing version across all iterations.
What I learned
Prose captures behavioral intent well. A few paragraphs of plain English reliably reproduce how a tool works. 95.6% is remarkable for natural language as a specification format.
Prose captures visual intent poorly. Layout, positioning, sizing, and visual details are where patterns break down. The format works for what a tool does, not what it looks like.
Specificity has a sweet spot. Too vague and the LLM fills in everything itself. Too specific and you’re fighting its reasonable defaults. The best patterns describe the interactions that define the tool and leave everything else to the LLM’s judgment.
The effort doesn’t justify itself. This is the honest conclusion. To get value from patterns, you need to maintain them during development — write them, run regeneration tests, revise the language, keep history. The payoff is that a stranger can build a behavioral equivalent of your tool. But for single-page web tools, that stranger would get a better result faster by just looking at the live tool and telling an LLM “build something like this.”
The pattern adds a telephone-game layer between the original and the copy. It captures why the tool is the way it is, which is intellectually interesting — but in practice, “copy this URL” is simpler, faster, and produces higher-fidelity results.
What was worth it
The regeneration test — describe what you want, generate it, check each sentence — is a solid workflow for any LLM-assisted development. And the finding that over-specification hurts is broadly useful: when working with LLMs, fighting their defaults costs more than working with them.
The takeaway
Open-sourcing ideas is a real concept — the intent layer is genuinely more portable and more interesting than the code layer. But the mechanism for sharing it might just be conversation, not a file format. You describe what you built and why to another person (or an LLM), and they build their own version informed by your thinking. That's what's always happened. The spec tried to formalize it, and the formalization added overhead without enough payoff.