This is particularly pernicious when your code is a service that executes other people's (as yet unknown) code. There are infinities of unknowns when your workload is arbitrary. We test as much as the budget will allow and monitor the shit out of everything we can in production to catch failure modes we couldn't have even imagined. People will do weird shit with your bananas that you never intended nor could have anticipated.
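For what it's worth, here is a minimal sketch of what that "monitor everything in production" stance can look like when the workload is someone else's code. Everything in it (the run_untrusted wrapper, the timeout, the outcome counter) is a hypothetical illustration, not anyone's actual service:

```python
import logging
import traceback
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, TimeoutError

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sandbox")

# Tally of outcome kinds; a real service would export this to a metrics backend.
outcomes = Counter()

# Shared worker pool. Note: a hung thread can't be killed, so a production
# sandbox would run workloads in a subprocess or container it can terminate.
pool = ThreadPoolExecutor(max_workers=4)

def run_untrusted(fn, *args, timeout_s=5.0):
    """Run an arbitrary user-supplied callable and record how it fails."""
    future = pool.submit(fn, *args)
    try:
        result = future.result(timeout=timeout_s)
        outcomes["ok"] += 1
        return result
    except TimeoutError:
        outcomes["timeout"] += 1
        log.warning("workload exceeded %.1fs", timeout_s)
    except Exception as exc:  # intentionally broad: we want every failure counted
        # Bucket by exception type so novel failure modes show up as named,
        # countable things in production telemetry.
        outcomes[f"error:{type(exc).__name__}"] += 1
        log.warning("workload failed:\n%s", traceback.format_exc())
    return None

if __name__ == "__main__":
    run_untrusted(lambda: 1 / 0)   # someone doing weird things with your bananas
    run_untrusted(lambda: "fine")
    print(dict(outcomes))          # e.g. {'error:ZeroDivisionError': 1, 'ok': 1}
```

The bucketing is the point: failure modes you never imagined still surface as a countable signal you can alert on, rather than vanishing into a generic error log.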
Slightly different context, but the same basic theme: Why Milk is Thicker Than Water https://flowchainsensei.wordpress.com/2012/11/06/why-milk-is-thicker-than-water/
That's one of the best metaphors for a compounding error I've read in a while 😁
Some folks like to make banana bread from soggy bananas. Keeps us all gainfully engaged.
"The lack of automated testing is a big surprise. Over time I came to realize one good reason for skipping automated testing: scale. If problems only appear at scale, then the first time you can possibly catch them is in production."
This might be the most astute observation.
Especially now that software is distributed across cloud platforms and microservices, with rapid deployment.
Case in point: CrowdStrike.
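Since the thread keeps circling back to "the first time you can catch it is production," here is a rough sketch of the usual mitigation: stage the rollout so production only tests a slice of the fleet at a time. This is a generic illustration with made-up names (in_rollout, the host IDs), not a description of how CrowdStrike or anyone else actually ships updates:

```python
import hashlib

def in_rollout(host_id: str, percent: float) -> bool:
    """Deterministically map a host into [0, 100) and gate on the rollout percentage."""
    digest = hashlib.sha256(host_id.encode()).digest()
    bucket = (int.from_bytes(digest[:8], "big") % 10000) / 100.0  # 0.00 .. 99.99
    return bucket < percent

# Push an update to ~1% of the fleet first, watch production signals, then widen.
fleet = [f"host-{i}" for i in range(10_000)]
canary = [h for h in fleet if in_rollout(h, 1.0)]
print(f"{len(canary)} hosts get the update first")  # roughly 100
```

The deterministic hash matters: a host lands in the same bucket on every push, so widening the rollout never flip-flops machines between versions.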
I must admit I react fairly strongly to the phrase "Mistakes only happen in production". Certainly in a complex distributed system, there are mistakes that are found in production. I don't particularly disagree with your suggestions, but when I read the above, I find it hard not to hear a recommendation *not* to do other forms of testing.
I should add that I'm assuming by "scale", you mean Google / Facebook scales, and I've certainly never worked at those. The experience that drives my reaction is from an on-prem distributed block storage virtualization system, using external enterprise storage arrays as the underlying storage, and providing I/O access to hundreds of front-end servers. The number of nodes in play for us probably topped out at 16 or so, and we used what I guess today would be called a distributed (and I think fairly modular) monolith.