I once developed a library that ran in production under very high and diverse load in big tech. The library had only one minor bug, which was found after a full year in production. Did it take me a long time to write? No, it was actually pretty fast (maybe 2 months of actual coding). What’s the secret? Religious application of test driven development from start to finish.
In other words, I had lots of bugs, but they all showed up upfront, before production. The strong test coverage also let me do all kinds of refactoring with confidence, so I could move faster when I discovered structural problems or did performance optimizations based on profiles from prod. All of this to say that TDD is great for reducing defects before they hit production…
Do you know if those XP teams used TDD in addition to pairing?
I want more developers to truly understand TDD. To get the point of it. Maybe those who aren't using it can ask themselves some serious questions like these:
1. How do you currently know the acceptance criteria are sound while coding?
2. How do you ensure the code you're currently writing meets all of those criteria?
3. What are the benefits of doing it the way you're doing it?
4. What do you see as the number 1 drawback to TDD?
Same for >1p programming and the shared code principle (including branches, the words "my/your/<name>'s" and "branch" don't belong together). Software development works so much better as a team activity. Maybe there are plenty of devs who don't want to work with others for whatever reasons, but that's a people (incentives?) problem not a "what works better" problem.
Reflecting more on that experience, I think I was motivated by the fear of failure. The project was ambitious, the timeline was tight. And despite being a young hotshot engineer with a few notches on my belt, this project felt daunting. It was technically complex, so correctness alone was not trivial; it had to run at scale, so a naive implementation would not cut it; and I didn’t know where the hotspots would be, since I had never done this type of work before. So I had to build time in for integration, profiling, and refactoring without compromising correctness.
So maybe engineers aren’t afraid enough to use TDD? I have certainly been overconfident: “oh, I can just hack this together!” And it worked fine for some things. And for others, I have followed what I call the “backwards debugging sequence”:
1) write code
2) realize that it does not exactly do what I want
3) spend time debugging
4) realize that I don’t know exactly what I want the code to do
5) change the code, if I still have confidence/arrogance left, go back to 2
6) give up, and write a test to specify exactly what I want
7) fix the code until the test passes
And this process usually ends up taking a lot longer than if I had just started by writing a test (and faced the fact that I need to figure out exactly what I want first).
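A minimal sketch of what jumping straight to step 6 could look like, in Python. The function `collapse_whitespace` and its rules are hypothetical, made up only for illustration. The point is that writing the assertions first forces the decisions (what happens to tabs, to an empty string) that step 4 otherwise discovers the hard way.

```python
# Hypothetical example: the function name and its rules are assumptions
# chosen for illustration, not taken from any real project.


def test_collapse_whitespace():
    # Written first: each assertion is a decision about what I actually want.
    assert collapse_whitespace("a  b") == "a b"       # runs of spaces become one
    assert collapse_whitespace("  a\tb\n") == "a b"   # tabs/newlines count, edges trimmed
    assert collapse_whitespace("") == ""              # empty input stays empty
    assert collapse_whitespace("   ") == ""           # whitespace-only input becomes empty


def collapse_whitespace(text: str) -> str:
    # Written second: the simplest code that makes the test pass.
    # str.split() with no arguments splits on any whitespace run
    # and drops leading/trailing whitespace.
    return " ".join(text.split())
```

With a runner such as pytest the test function is picked up by name; calling `test_collapse_whitespace()` directly works too.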
I think the other reason I have sometimes avoided testing in general (not just TDD) is poorly structured legacy code with a lot of side-effects/state dependencies. Making tests for this kind of codebase is hard because you have to refactor it first and separate out the pure functional code (that is testable) from the code that depends on external state. So, if I could get away with a hit-and-run fix with a manual test, I am guilty of having contributed to the mess.
Maybe AI will make it easier to modify 10s of functions used all over the codebase to pass the state as a parameter instead of reading global state in a number of places?
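Whoever ends up doing the edit, a person or a tool, the shape of the change is roughly this. A minimal Python sketch; `SETTINGS`, `PricingState`, and the pricing functions are hypothetical names, not from any real codebase.

```python
from dataclasses import dataclass

# Hypothetical global state that legacy code reads from many places.
SETTINGS = {"tax_percent": 20}


def gross_cents_legacy(net_cents: int) -> int:
    """Before: the result silently depends on the global SETTINGS."""
    return net_cents * (100 + SETTINGS["tax_percent"]) // 100


@dataclass(frozen=True)
class PricingState:
    """Everything the calculation needs, made explicit."""
    tax_percent: int


def gross_cents(net_cents: int, state: PricingState) -> int:
    """After: a pure function; the same inputs always give the same output."""
    return net_cents * (100 + state.tax_percent) // 100


def gross_cents_from_settings(net_cents: int) -> int:
    """Thin wrapper that keeps old call sites working during the migration."""
    return gross_cents(net_cents, PricingState(SETTINGS["tax_percent"]))


def test_gross_cents_needs_no_global_setup():
    # No global has to be patched or restored to test the pure version.
    assert gross_cents(10_000, PricingState(tax_percent=20)) == 12_000
    assert gross_cents(10_000, PricingState(tax_percent=0)) == 10_000
```

The wrapper is the part that makes this doable in small steps: call sites can move to the explicit version one at a time while the old behavior stays intact.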
TDD is one of the practices that define XP
One other thing I've observed about bugs in the forest is that they often generate a lot of curiosity. When the rare report of a Genuine Production Bug appears, it's common for lots of people to voluntarily down tools and huddle round to see what this interesting, exotic specimen is.
I once witnessed that bugs in production are a choice rather than something inevitable. The choice was driven by the perception that only client-reported bugs are worth fixing, and that finding and fixing bugs upfront is hard and costly - there were only manual tests. It was an environment where costs were an important factor in how the product was developed - something that I rarely see taken into account.
Came here to also comment about perception. In my experience a lot of the time a bug is more of a "new scenario" or something that wasn't fully considered. So it's much easier to treat it as a feature extension rather than a bug and ask questions like "Do you want this doing now? Later? Never?". Customers are also more appreciative of a new scenario being put in place "just for them". So it's no longer a bug, and that makes it easier to manage.
It's the prisoner's dilemma, isn't it?
On one hand you have the team wasting time on bug tracking to justify problems reported by the customer to the management.
On the other hand you have customer/management unsatisfied by the amount of bugs and the slow pace of development, so they keep adding pressure.
It's an equilibrium, and it can only get better if both sides are open to change the way they work. But if one side changes and the other doesn't, the situation gets worse for those who embraced the change.
They're surviving, and it's hard to let go of a habit when your life depends on it.
I’ve been meaning to write up Prisoners Dilemma as a Thinkie but I lacked an example. Thank you 🙏
I lead a small eng team (well maybe organization at this point...) I would love to get to no bugs in production. We are definitely in a Desert mindset. How do I do this??
For one defect, spend one hour digging into it. Create a timeline of the events leading to it. Ask if there was a test that would have caught it. If not, ask what the design would have to be so that there could be a test that would have caught it.
If this proves to be valuable, do it once a week. Start explicitly prioritizing a little bit of the follow-up activities you uncover.
But start with that one hour. You can do that.
It depends a lot on what is keeping you from zero bugs in production.
If you have the authority, one thing to try is saying explicitly: "every bug is a problem. I don't care about schedules: I care about our confidence in our software." This is especially useful alongside a retro like Kent describes.
It is possible that people don't know you think the status quo is a problem. They might disagree that bugs are more important than schedules. Or maybe they do agree, but...
Depending on what forces are pushing back, there are a bunch of different techniques I've seen help. But regardless of the specific problem or solutions, you'll be able to tell when you are making progress: engineers will get together & talk amongst themselves about how they are making things better.
The real trick is how to get started, and that depends on the current state of things.
Having dependable, reliable, automatic tests that validate all use cases is a huge head start. If you don't have that, start writing automated acceptance tests for all new features *as part of the software engineering practice. If it's a new skill to the team, really learn and understand the skill first. It's easy to paint yourselves into a corner by writing unit tests to cover things at a low level at the cost of flexibility. The most important tests cover the system at its entry points (API or User Interface). They cover how the system should behave given some state and action. If you only have these tests, the implementation details can change **mostly safely.
*so you just do it autonomously instead of something the powers that be outside of engineering have to approve. It's part of the engineering discipline, not a QA function.
**because some changes will affect resource usage such as CPU, MEM, Network, Disk IO which aren't typically covered by acceptance tests. Also, some bugs are due to async timing issues and those are difficult to test for in advance.
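To make "given some state and action" concrete, here is a minimal Python sketch of a test written against an entry point only. `ShoppingCart` is a hypothetical stand-in for whatever your API or UI layer exposes; the tests never touch its internals, so the dictionary inside could become a list or a database table without any test changing.

```python
class ShoppingCart:
    """Hypothetical system under test; add() and total_quantity() are its entry points."""

    def __init__(self) -> None:
        self._items: dict[str, int] = {}  # implementation detail, free to change

    def add(self, sku: str, quantity: int = 1) -> None:
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self._items[sku] = self._items.get(sku, 0) + quantity

    def total_quantity(self) -> int:
        return sum(self._items.values())


def test_adding_same_item_again_accumulates():
    # Given a cart that already holds one apple
    cart = ShoppingCart()
    cart.add("apple")
    # When the customer adds two more
    cart.add("apple", 2)
    # Then the cart reports three items
    assert cart.total_quantity() == 3


def test_rejects_non_positive_quantity():
    # Given an empty cart
    cart = ShoppingCart()
    # When the customer tries to add zero items
    try:
        cart.add("apple", 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
    # Then the cart is unchanged
    assert cart.total_quantity() == 0
```

As the second footnote says, tests like these say nothing about resource usage or timing; they only pin down behavior.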
I had a small epiphany when you provided the context of desert and forest together with the whole no bugs approach. A manager once came with this no bugs strategy, but we were deep in the desert and could not conceive how that would solve anything. (It didn't, because we had too many bugs to drop everything all the time.) But with the context of being in the forest it makes a lot of sense to drop everything and fix whatever is found. It's also possible. So thanks for that.
Thanks for noting that talking across the chasm is truly difficult. Assumptions in one sound like fantasies in the other.
Yeah exactly, and I wanted to add that this explains the phenomenon I've seen multiple times where some new approach is presented, but the people presenting it don't differentiate the contexts of both places. They try to transplant the approach based on what they did back at the old company.
I’d be interested to know if the rare bugs that *are* encountered tend to be sins of commission (the wrong thing is done; you should have known better) or sins of omission, where you find yourself in a state that somehow wasn’t anywhere on the menu? [I may have these wrong, not being of that pew.]
I wonder about the utility of semiformal methods to get ahead of such things or at least having some idea of where they might lurk.
I'd love to see data on this. Not "what are the daily swarms of bugs?" but "when there's one a quarter, what does it look like?".