4 Comments

> Adding functionality to the central abstraction requires donning robes, burning incense, sacrifices. A mistake there is now a mistake for everybody.

Yes indeed, this can be pretty nasty. Are you going to talk about how to mitigate this downside?


> Collect stories of software design as error reduction.

I inherited an application whose main job is to show, on one page, all the information the company has about a customer. The data is collected from numerous other internal systems, and it's an essential tool for customer support and sales.

In the original design, the data arrived from the source systems as CSV files in S3. The app then copied the data into its own Elasticsearch database, one table per file, about a dozen tables in total. Some of the data was also joined with data from our data warehouse before being written to Elasticsearch. When a user did a search, the app ran a full text search over all the columns of all the tables in Elasticsearch.
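As a rough illustration of that per-table copy (the bucket, file, and index names below are invented, not the real ones), each source file became its own Elasticsearch table:

```python
# Hypothetical sketch of the original design: one CSV file in S3 becomes
# one Elasticsearch index, column for column.
import csv
import io

import boto3
from elasticsearch import Elasticsearch, helpers

s3 = boto3.client("s3")
es = Elasticsearch("http://localhost:9200")  # placeholder address

def copy_csv_to_index(bucket: str, key: str, index: str) -> None:
    """Download one source CSV and bulk-index every row into its own index."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = csv.DictReader(io.StringIO(body))
    actions = ({"_index": index, "_source": row} for row in rows)
    helpers.bulk(es, actions)

# One call like this per source file, about a dozen in total.
copy_csv_to_index("customer-data", "contracts.csv", "contracts")
```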

This design had numerous issues which affected the users:

- When the data format of the source CSV files changed, people would usually remember to update another app which copies the data to our data warehouse, but nobody remembered to update this app which read the same CSV files. Sometimes this broke a data integration for one table. And in any case there was duplicate effort in maintaining two systems doing the same thing.

- If there was a failure in loading one of the CSV files, it would result in some of the customers' data being partially out of date. Doing a full refresh of all data would have taken over 10 hours, so it was done in only exceptional situations and only one table at a time.

- One time the filename for a table's CSV files changed, leading to the app no longer receiving new data for that table (i.e. there was no crash, but the updates just silently stopped happening). It took 4 months until a user noticed and reported that the data was stale.

- When the data in the warehouse's joined tables changed, it would not be updated in Elasticsearch until the rows that referred to that data were themselves updated and included in a CSV delivery.

- Searches were very slow, often around 10 seconds. In Elasticsearch it's easy to do a query over all columns by specifying just one parameter, but under the covers it will do one search index lookup per column. Since there were hundreds of columns, including many columns this app didn't use, even if each column took just a few milliseconds to search, the whole query took many seconds.
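As a rough sketch (index and field values invented), the convenient "search everything" query looked something like this; the single parameter hides the per-column fan-out:

```python
# Hypothetical sketch of the old "search everything" query. One parameter is
# enough to ask Elasticsearch to match against every field, but internally it
# expands to a separate lookup per field, so hundreds of columns add up.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

query = {
    "query_string": {
        "query": "acme",       # the user's search text
        "default_field": "*",  # expands to every indexed field
    }
}
results = es.search(index="contracts,invoices,addresses", query=query)
```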

I redesigned it to work like this:

- There is one SQL query which reads from the data warehouse and produces one JSON document per customer. The data comes from over 20 tables and a couple hundred columns, so the SQL query is quite big: around 400 lines of SQL, and 2-3 times as much test code for testing it (e.g. populate each table and compare the resulting JSON document for equality). The query takes nearly 10 minutes to run and returns some tens of gigabytes of data when loading all the few million customers.

- In Elasticsearch there is one table for all customer data. It contains just a few columns: (1) customer ID, (2) a gzipped JSON document with all customer data, (3) a small JSON document with the subset of customer data that is shown in search results, (4) a full-text searchable text field with a concatenation of all customer fields that somebody could possibly want to search (i.e. a couple dozen fields that uniquely identify a customer or one of their contracts). A sketch of this document shape follows this list.

- Each night, the app recreates the Elasticsearch tables from the data warehouse, which takes about one hour for the customer table. This also gets rid of the data of any deleted customers. Throughout the day, it checks the source tables for updates and refreshes only the customers whose data has changed.
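As a rough sketch (field names and values invented, not the real schema), one document in the new customer table could look like this:

```python
# Hypothetical shape of one document in the redesigned customer table.
import base64
import gzip
import json

# One row of the big SQL query: the complete customer as a JSON document.
full_customer = {"customer_id": "123", "name": "Acme Oy", "contracts": [{"id": "C-456"}]}

document = {
    # (1) customer ID
    "customer_id": "123",
    # (2) the whole customer document, gzipped once at indexing time
    "customer_json_gz": base64.b64encode(
        gzip.compress(json.dumps(full_customer).encode("utf-8"))
    ).decode("ascii"),
    # (3) the small subset shown in the search result list
    "search_result_json": json.dumps({"customer_id": "123", "name": "Acme Oy"}),
    # (4) one concatenated, full-text-searchable field with every value
    #     somebody could plausibly search for
    "search_text": "123 Acme Oy C-456",
}
```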

This resulted in multiple benefits:

- Since the data is loaded directly from the data warehouse, there are multiple other people who are already keeping an eye on the data integrations and making sure that the data is up to date.

- The app is now self-healing. Even if the app fails to detect some changes to the source tables (for example if somebody does a manual update and forgets to update the timestamp columns, or there is a bug), all data will be recreated at the latest the next night. Also, every piece of a customer's data will be equally up to date with the warehouse data, because a customer's JSON document is always replaced as a whole, instead of updating individual tables that each contain only part of the data.

- The searches are much faster: p95 went from 6 seconds to 40 milliseconds. Searching one big column is much faster than searching hundreds of small columns. Also, the app can just return the JSON documents as-is to the UI, instead of joining the data from multiple tables. The backend doesn't even need to parse the JSON: the full customer document is returned in its precompressed form with an HTTP `Content-Encoding: gzip` header, and the search response can be constructed safely with string concatenation instead of needing a fully fledged JSON generator (sketched at the end of this comment).

- Server costs were reduced by 80%, because it was possible to use much smaller server instances than before.

- The app's maintainability increased. For example, from the customer query's tests you can see the customer JSON document with every possible field populated, and also the subset JSON documents for the search results and the searchable fields. Previously you would have had to inspect the database, which also contained many fields that the app didn't use. The app also previously didn't have any tests, but now the legacy code has been rescued and it's what you'd expect from TDD'd code, including unit tests for the UI components.
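As a rough sketch of that response handling (the web framework and the two helper functions here are assumptions for illustration, not the app's actual stack):

```python
# Hypothetical sketch: serve the precompressed customer document as-is, and
# build the search response by string concatenation.
from flask import Flask, Response, request

app = Flask(__name__)

def load_customer_json_gz(customer_id: str) -> bytes:
    """Placeholder: look up the customer's precompressed JSON document."""
    raise NotImplementedError

def find_matching_customers(text: str) -> list:
    """Placeholder: query the single full-text column and return the hits."""
    raise NotImplementedError

@app.route("/customers/<customer_id>")
def get_customer(customer_id):
    # The stored document is already gzipped JSON, so it is passed through
    # untouched and the browser decompresses it.
    gz_bytes = load_customer_json_gz(customer_id)
    return Response(gz_bytes, mimetype="application/json",
                    headers={"Content-Encoding": "gzip"})

@app.route("/search")
def search():
    # Every hit already carries its result subset as a JSON string, so the
    # response body is plain string concatenation, no JSON generator needed.
    hits = find_matching_customers(request.args.get("q", ""))
    body = "[" + ",".join(hit["search_result_json"] for hit in hits) + "]"
    return Response(body, mimetype="application/json")
```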


I've actually written an article about improving software design, but anyway I think the whole point of changing a software design is to save money (like you mentioned) and save time. If a software design is easier to understand, then programmers can move fast, fix all the problems, and also add new features more easily.


> Collect stories of software design as error reduction.

Some simple, generic ones that are not directly related:

All the static analysis and code formatting tools that we use.

When you put them in place for the first time, they force you to change everything to match their way of writing, for "no" benefit at all. But after that, any error they report could be that New York Times error you want to avoid.

My code is hard to test, so I don't manage to test everything. Rework its implementation to split out pure functions that are easy to test from the state-managing ones, and move small specific code into a bigger generic one or an external common library. Maybe the code doesn't look "better" at first glance, but it's testable. And then the test you wrote finds a bug that you would have missed. (I would even argue it's better written on its own merits, but that's harder to sell alone and not the point here.)
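A generic, invented illustration of that kind of split: the rule becomes a pure function a unit test can call directly, and the state-managing wrapper stays thin.

```python
# Invented example of splitting a pure, easily testable function out of
# state-managing code.

# Before: logic and I/O tangled together, hard to test without a database.
def apply_discount_and_save(db, order_id):
    order = db.load_order(order_id)
    if order["total"] > 100:
        order["total"] = round(order["total"] * 0.9, 2)
    db.save_order(order)

# After: the rule is a pure function...
def discounted_total(total: float) -> float:
    """Orders over 100 get 10% off; pure, so trivially unit-testable."""
    return round(total * 0.9, 2) if total > 100 else total

# ...and the state-managing wrapper stays thin.
def apply_discount_and_save_v2(db, order_id):
    order = db.load_order(order_id)
    order["total"] = discounted_total(order["total"])
    db.save_order(order)

# The unit test needs no database at all:
assert discounted_total(250.0) == 225.0
assert discounted_total(80.0) == 80.0
```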

Your on-target integration tests pass only 90% of the time because of hardware/timing/real-life issues; there are always flaky tests that you know about. Working on making them reliable has no direct benefit for the product, since only test code is being fixed. But if you fix them, then maybe on the next release you will realize that having even one error is not normal, and you won't ship something broken that was mistaken for a false positive. (Usually it also reveals a problem somewhere in the software stack, but that would already mean it matters to a customer, which isn't necessary for the argument.)

This also has an effect on the team itself.

Linus commented that with `svn`, since branches took ages, nobody would use them at all, except for a stable branch or a release. `git` made them instant, and now we can finally really work with them to split up our work. The tool's capability made us work differently.

If your test suite goes from "giving some overall information" to being really reliable, quick, and free of false positives, then it becomes an important part of your normal workflow.

If your unit-test suite is comprehensive enough, you trust it more: "passing is enough to ship".

If all the strict warnings are fixed and enabled, then you can run with `-Werror` and not let these problems slip through for a long time.

This is not directly "software architecture", but these things force you to consider what is "not developed for the customer", similar to all the tooling helpers you can have.
