Very interesting read. I wonder if the LLM is getting tripped up by the system prompt statement to not refactor code. Modifying failing code could be considered refactoring to it, since it lacks understanding of the code (it only mimics understanding…). I wonder if an extended thinking model may thrive because of this.
I wonder how the system's performance might change if the restrictive “begging” to avoid refactoring were relaxed, whether that would allow the use of a simpler/cheaper model, and whether it would have helped the failed implementations.
This is great! In my experience, AI often simply ignored my instructions to follow a TDD cycle. But I found a way to keep the AI and me in the TDD loop together:
I once wrote an extension to visualize the current TDD phase and, with a command, actively switch to the next phase (VSCode Marketplace: tdd-helper). It helped me stay in the TDD cycle while developing. It turns out this extension now helps the AI as well: I trigger the next phase, the extension writes the updated TDD phase to a JSON file, and the AI reads that phase before doing anything. This way, the AI reliably follows the guidelines I gave for the specific phase, and I stay in control of when to switch to the next phase, which gives me time to review what it implemented.
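For illustration, here is a minimal sketch of that mechanism in TypeScript. The file name, JSON shape, and guideline texts are hypothetical, not necessarily what tdd-helper actually uses:

```typescript
// Sketch of the phase-gate idea: the extension persists the current TDD
// phase to a JSON file, and the agent reads it before doing anything.
// File name, shape, and guideline texts are made-up placeholders.
import * as fs from "fs";

type TddPhase = "red" | "green" | "refactor";

const PHASE_FILE = ".tdd-phase.json"; // assumed location

// Agent side: read the current phase first.
function readPhase(): TddPhase {
  if (!fs.existsSync(PHASE_FILE)) return "red";
  return JSON.parse(fs.readFileSync(PHASE_FILE, "utf8")).phase;
}

// Extension side: on command, advance to the next phase and persist it.
function advancePhase(): TddPhase {
  const order: TddPhase[] = ["red", "green", "refactor"];
  const next = order[(order.indexOf(readPhase()) + 1) % order.length];
  fs.writeFileSync(PHASE_FILE, JSON.stringify({ phase: next }));
  return next;
}

// Phase-specific guidelines the agent is told to follow.
const guidelines: Record<TddPhase, string> = {
  red: "Write exactly one failing test; don't touch production code.",
  green: "Make the failing test pass with the simplest change.",
  refactor: "Improve structure only; keep all tests green.",
};

console.log(`Phase: ${readPhase()} -> ${guidelines[readPhase()]}`);
```

The key design point is that the human, not the AI, triggers `advancePhase`, so the model can never talk itself into skipping a phase.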
Dear Mr Beck, thank you for the article! I'd love it if you could share (maybe in a standalone article?) how a code artisan like you manages to find joy in LLM-generated code, as this is something I personally struggle with (seeing programming as art, LLM-generated code feels to me like it has no soul).
Great question! I'm going to set aside the "is programming art?" debate for a later time & talk in terms of joy. I find joy & satisfaction in matching difficult patterns (this applies to art, music, & poker as well). I absolutely get this sense of pattern matching when augmented coding. A few examples:
* The genie had written a lot of code like if (Some ...). I knew it was wrong but I didn't know about the Option combinators. Eventually I found them and the subsequent refactorings felt wonderful as deeply nested conditionals melted away (see the sketch after this list).
* Every time I add to the invariant tester, a few tests break, and then the genie gets them passing, which feels great.
* When I think of a clever way to use a tool, like using a function size linter with a 1-line limit to extract function lengths, that feels great.
* I still use IntelliJ's refactoring tools sometimes.
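A sketch of that first point, purely as an illustration. The project presumably wasn't TypeScript, but the fp-ts library's Option shows the same melting-away; the names and types here are invented:

```typescript
import { pipe } from "fp-ts/function";
import * as O from "fp-ts/Option";

// Hypothetical shape, just to have something to navigate.
interface User {
  profile?: { nickname?: string };
}

// Before: the deeply nested conditional shape the genie kept producing.
function displayNameNested(user: User | null): string {
  if (user !== null) {
    if (user.profile !== undefined) {
      if (user.profile.nickname !== undefined) {
        return user.profile.nickname.trim();
      }
    }
  }
  return "anonymous";
}

// After: the same logic as a flat pipeline of Option combinators.
const displayName = (user: User | null): string =>
  pipe(
    O.fromNullable(user),
    O.chain((u) => O.fromNullable(u.profile)),
    O.chain((p) => O.fromNullable(p.nickname)),
    O.map((n) => n.trim()),
    O.getOrElse(() => "anonymous")
  );
```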
Thank you, I appreciate your insights about joy, and I'll be looking forward to more of your future writings / talks on this topic (and the art topic). I find it deeply enriching how your books emphasize joy and human connection, and I therefore believe you're uniquely positioned to provide invaluable insights into the changes programmers are now living through.
I've started following this augmented process at work and it's already paying dividends... Our principal lead is using the term genie already! (Fully attributed, of course)
At the moment I'm iterating on the genie guide to find the sweet spot for our internal guidelines, but watching the genies carry out TDD is extremely satisfying.
Very interesting!
About "intruding more on the design", how do you:
a. avoid review fatigue?
b. avoid being slightly steered into the wall whenever you let your guard down?
---
My alternative — Charted Coding, but is it an alternative? or is it just Augmented Coding? 🤔 — starts with a design doc, and with the help of an MCP Server, the workflow goes like this:
1. Human writes the design doc
2. Genie helps with minor tasks like drawing diagrams if necessary
3. Genie reviews the design doc — this improves the design and also raises alerts in terms of what the genie (mis)understood.
4. Genie(s) do TDD
5. Human reviews and asks genie(s) for tidyings
(moving from one step to the next is human-driven on purpose)
Note: The design doc could include a plan of what should be tidied first to prepare the runway for the next behavior. This could even indicate tasks that can be parallelized on different genies etc...
---
c. 😅 Rereading myself, this sounds like a shameless plug, but I've always wondered: what's your take on design docs? (Obviously, they come with a considerable risk of falling into mini-waterfall.)
90s Distilled video: https://youtu.be/oWYnuz2dI7I
Full video: https://youtu.be/8z9tUsSoros
Simple diagram: https://bsky.app/profile/younesjd.dev/post/3lqwtil3bj22u
Design docs have the problem that I address in the Tidying books: when do you make design decisions? If you're not going to learn anything & nothing external is going to change, then go ahead and make those decisions today. The greater the pace of change, the greater the value of deferring decisions.
That said, we are all on unstable new ground with augmented coding, so try all the things.
Actually, vibe coding comes in handy for spikes and learning input.
Spike with vibe coding
=> learn and throw away
=> write design doc to make sure everybody’s on the same page, including the genies
=> TDD
=> Tidy when necessary
=> if you hit a wall, no problem. Edit the design doc and let the genies adapt, or throw away everything except the learnings in the design doc and ask the genies to try again
In the video, it actually happened to me. When the genie integrated the paginator, there were too many changes for my little brain to follow.
I needed a tidying first (adapting the data fetching services before using them).
I could have thought of the tidying timing and I didn't, but it's ok: the cost of change is so cheap and ego-less that reverting and taking another path is even easier than before.
I am curious about the dynamics between humans and genies in such workflows. Will this isolate devs even more? Will devs discuss design more?
Very cool! Thank you for sharing your system prompt as well.
I've had pretty limited success in doing anything complex this way, but it's been helpful for getting "nice to have" / low time-value stuff done that otherwise would have lingered indefinitely in the todo list.
Babysitting it to stop it from going down rabbit trails or cheating is absolutely necessary. Left to its own devices it'll often do decent work, then destroy it, and even destroy previously working, committed code.
Between the babysitting and guidance it only gets these small tasks done about 2-3x as fast as doing it manually, but that's still sometimes fast enough to justify a feature that otherwise wouldn't make the cut.
Having it prototype in a simpler language and then translating to the production language is an interesting idea. The next time I feel like throwing 5 bucks at an "agent" I'll have to give that a try.
Great article, I wrote something related here
https://open.substack.com/pub/tostring/p/software-engineers-are-dead-long?r=clbp9&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
Having read all the posts in this "series" about your journey pair-programming with a genie, it sounds more than a little like my experiences working with offshore programmers back in the early part of this century. We (my coworkers, tech leaders, others I knew at other companies going through the same) learned that you got back exactly what you asked for, nothing more and (usually) nothing less. We also came to the conclusion that, in the long run, it wasn't actually saving as much money (money.equals(time), mostly) as anyone claimed it would - for a variety of reasons.
Are we retreading a path we've all been down before? If you experienced that journey 20-something years ago, your comparison between the two journeys so far would make for an interesting read.
I do have a couple of thoughts on Vibe Coding that may be of interest.
The first is from Neil Lawrence (Professor of AI at Cambridge University, ex-Head of Automation at AWS), from his recent book ‘The Atomic Human’. He argues that the practice of anthropomorphism in language model prompts, e.g. ‘you are a senior developer…’ (‘Anthoxing’ for short), is actually a major reason why errors occur: having the model act as a human causes the context set by the prompt to ‘leak’. In other words, the language model can wander off outside what you think the context is, because somewhere in its training it may have been exposed to the idea that this senior developer was also good at playing the violin, which lets the language model go off at a tangent.
We must remember that language models are code and data. Getting the best results requires accuracy, specificity, and clear commands. For example: "Write Python code that will perform these functions [function list] adhering to this list of standards and best practices [standards and best practice list]." Yes, this creates a longer prompt, but it greatly increases the quality of the results.
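As a toy sketch of that command-style prompting in TypeScript, where the list contents are made-up placeholders standing in for the bracketed lists above:

```typescript
// Assemble a precise, command-style prompt from explicit lists rather
// than a persona. The list contents are invented placeholders.
const functionList = [
  "parse a CSV file into records",
  "validate each record against a schema",
];

const standardsList = [
  "PEP 8 formatting",
  "type hints on all public functions",
  "unit tests for every function",
];

const prompt = [
  "Write Python code that will perform these functions:",
  ...functionList.map((f) => `- ${f}`),
  "adhering to this list of standards and best practices:",
  ...standardsList.map((s) => `- ${s}`),
].join("\n");

console.log(prompt);
```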
On to my second point: with Claude Code, Anthropic has created a remarkable tool. By staying out of the way of other tools and focusing on compatibility and the model, it gains the capability to learn thanks to its memory, brings AI power even to the command line, and offers extension possibilities via an SDK.
Vibe coding, by contrast, doesn't offer the ability to learn and therefore improve. Claude Code, while certainly built around Anthropic's Claude model, represents a different approach, and you can see that Google has also recognized this advantage with their recent Gemini CLI. That makes me optimistic about the future of development.
The lowered mental barrier to coding (or in some cases, "coder's high") is real 😂
Thx for the great post!
I'm just wondering whether "Loops" here means 1) the AI repeating similar behaviors or 2) loops in programming (e.g., for / while). I guess the former; am I right?
> Q. What were the warning signs that told you the AI was going off track?
> 1. Loops.
The genie would go into an infinite loop. I haven’t seen this behavior for a couple of weeks though so maybe they have this one solved.
Thx for the answer!