Discussion about this post

Allan Mertner:

> Adding functionality to the central abstraction requires donning robes, burning incense, sacrifices. A mistake there is now a mistake for everybody.

Yes indeed, this can be pretty nasty. Are you going to talk about how to mitigate this downside?

Esko Luontola:

> Collect stories of software design as error reduction.

I inherited an application whose main job is to show, on one page, all the information the company has about a customer. The data is collected from numerous other internal systems, and it's an essential tool for customer support and sales.

In the original design, the data would arrive from the source systems as CSV files in S3. The app would then copy the data into its own Elasticsearch database, one table per file, about a dozen tables in total. Some of the data was also joined with data from our data warehouse before being written to Elasticsearch. When a user did a search, the app would do a full-text search over all the columns of all the tables in Elasticsearch.

This design had numerous issues which affected the users:

- When the data format of the source CSV files changed, people would usually remember to update another app that copies the data to our data warehouse, but nobody remembered to update this app, which read the same CSV files. Sometimes this broke the data integration for one table. In any case, there was duplicated effort in maintaining two systems that did the same thing.

- If there was a failure in loading one of the CSV files, some of the customers' data would be left partially out of date. A full refresh of all data would have taken over 10 hours, so it was done only in exceptional situations, and only one table at a time.

- One time the filename of a table's CSV files changed, after which the app no longer received new data for that table (i.e. there was no crash; the updates just silently stopped happening). It took four months before a user noticed and reported that the data was stale.

- When the data in the warehouse tables used for the joins changed, the change would not reach Elasticsearch until the rows referring to that data were themselves updated and included in a CSV delivery.

- Searches were very slow, often around 10 seconds. In Elasticsearch it's easy to query over all columns by specifying just one parameter, but under the covers it does one search index lookup per column. Since there were hundreds of columns, including many this app didn't even use, the whole query took many seconds even if each column took only a few milliseconds to search.

I redesigned it to work like this:

- There is one SQL query that reads from the data warehouse and produces one JSON document per customer. The data comes from over 20 tables and a couple hundred columns, so the query is quite big: around 400 lines of SQL, plus 2-3 times as much test code (e.g. populate each table and compare the resulting JSON document for equality). The query takes nearly 10 minutes to run and returns some tens of gigabytes of data when loading all of the few million customers. (A minimal sketch of the query's shape follows this list.)

- In Elasticsearch there is one table for all customer data. It contains just a few columns: (1) customer ID, (2) a gzipped JSON document with all customer data, (3) a small JSON document with the subset of customer data that is shown in search results, (4) a full-text searchable text field with a concatenation of all customer fields that somebody could possibly want to search (i.e. a couple dozen fields that uniquely identify a customer or one of their contracts).

- Each night, the app recreates the Elasticsearch tables from the data warehouse, which takes about one hour for the customer table. This also gets rid of the data of any deleted customers. Throughout the day, it checks the source tables for updates and refreshes only the customers whose data has changed (the change detection is also sketched below).
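
To make the shape of that per-customer query concrete, here is a heavily reduced sketch. It assumes a PostgreSQL-style warehouse and made-up table and column names (customer, contract); the real query joins over 20 tables, presumably with one aggregated subquery like this per child table.

```sql
-- Minimal sketch: build one JSON document per customer.
-- Table and column names are illustrative, not the real schema.
SELECT c.customer_id,
       json_build_object(
           'customerId', c.customer_id,
           'name',       c.name,
           'email',      c.email,
           'contracts',  COALESCE(ct.contracts, '[]'::json)
       ) AS customer_json
FROM customer AS c
LEFT JOIN (
    -- Collapse each customer's child rows into a JSON array.
    SELECT customer_id,
           json_agg(json_build_object(
               'contractId', contract_id,
               'status',     status,
               'startDate',  start_date
           ) ORDER BY start_date) AS contracts
    FROM contract
    GROUP BY customer_id
) AS ct ON ct.customer_id = c.customer_id;
```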
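
The intraday refresh can then be as simple as collecting the IDs of customers whose source rows have changed since the previous sync, and re-running the big query for just those customers. Again a hedged sketch with assumed names: per-row updated_at audit columns (hinted at by the timestamp columns mentioned below) and a sync_state bookkeeping table.

```sql
-- Minimal sketch: find customers touched since the last successful sync.
-- sync_state and the updated_at columns are assumed, not the real schema.
SELECT DISTINCT customer_id
FROM (
    SELECT customer_id, updated_at FROM customer
    UNION ALL
    SELECT customer_id, updated_at FROM contract
) AS changes
WHERE updated_at > (SELECT last_synced_at
                    FROM sync_state
                    WHERE job = 'customer_json');
```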

This resulted in multiple benefits:

- Since the data is loaded directly from the data warehouse, there are multiple other people who are already keeping an eye on the data integrations and making sure that the data is up to date.

- The app is now self-healing. Even if the app fails to detect some changes to the source tables (for example, if somebody does a manual update and forgets to touch the timestamp columns, or there is a bug), all data will be recreated no later than the next night. Every piece of a customer's data is also equally up to date with the warehouse data, because a customer's JSON document is always replaced as a whole instead of updating individual tables that each contain only part of the data.

- The searches are much faster: p95 went from 6 seconds to 40 milliseconds. Searching one big column is much faster than searching hundreds of small ones. The app can also return the JSON documents as-is to the UI, instead of joining data from multiple tables. The backend doesn't even need to parse the JSON: the full customer document is returned in its precompressed form with an HTTP Content-Encoding: gzip header, and the search response can be constructed safely with string concatenation instead of needing a full-fledged JSON generator.

- Server costs were reduced by 80%, because it was possible to use much smaller server instances than before.

- The app's maintainability increased. For example, the customer query's tests show the customer JSON document with every possible field populated, as well as the subset JSON documents for the search results and the searchable fields. Previously you would have had to inspect the database, which also contained many fields the app didn't use. The app also used to have no tests at all, but now the legacy code has been rescued and is what you'd expect from TDD'd code, including unit tests for the UI components.
