Discussion about this post

Allan Mertner:

> Adding functionality to the central abstraction requires donning robes, burning incense, sacrifices. A mistake there is now a mistake for everybody.

Yes indeed, this can be pretty nasty. Are you going to talk about how to mitigate this downside?

Esko Luontola:

> Collect stories of software design as error reduction.

I inherited an application whose main job is to show, on one page, all the information the company has about a customer. The data is collected from numerous other internal systems, and it's an essential tool for customer support and sales.

In the original design, the data would arrive from the source systems as CSV files in S3. The app would then copy the data into its own Elasticsearch database, one table per file, about a dozen tables in total. Some of the data was also joined with data from our data warehouse before being written to Elasticsearch. When a user did a search, the app would do a full-text search over all the columns of all the tables in Elasticsearch.

This design had numerous issues which affected the users:

- When the data format of the source CSV files changed, people would usually remember to update another app that copies the data to our data warehouse, but nobody remembered to update this app, which read the same CSV files. Sometimes this broke the data integration for one table. In any case, there was duplicated effort in maintaining two systems that did the same thing.

- If there was a failure in loading one of the CSV files, some of the customers' data would be left partially out of date. A full refresh of all data would have taken over 10 hours, so it was done only in exceptional situations, and only one table at a time.

- One time the filename of a table's CSV files changed, after which the app no longer received new data for that table (i.e. there was no crash; the updates just silently stopped happening). It took four months before a user noticed and reported that the data was stale.

- When the data in the warehouse tables used for the joins changed, the change would not reach Elasticsearch until the rows referring to that data were themselves updated and included in a CSV delivery.

- Searches were very slow, often around 10 seconds. In Elasticsearch it's easy to query over all columns by specifying just one parameter, but under the covers it does one search index lookup per column. Since there were hundreds of columns, including many this app didn't even use, the whole query took many seconds even if each column took only a few milliseconds to search.

I redesigned it to work like this:

- There is one SQL query that reads from the data warehouse and produces one JSON document per customer. The data comes from over 20 tables and a couple hundred columns, so the query is quite big: around 400 lines of SQL, plus 2-3 times as much test code (e.g. populate each table and compare the resulting JSON document for equality). The query takes nearly 10 minutes to run and returns some tens of gigabytes of data when loading all of the few million customers. (A minimal sketch of the query's shape follows this list.)

- In Elasticsearch there is one table for all customer data. It contains just a few columns: (1) customer ID, (2) a gzipped JSON document with all customer data, (3) a small JSON document with the subset of customer data that is shown in search results, (4) a full-text searchable text field with a concatenation of all customer fields that somebody could possibly want to search (i.e. a couple dozen fields that uniquely identify a customer or one of their contracts).

- Each night, the app recreates the Elasticsearch tables from the data warehouse, which takes about one hour for the customer table. This also gets rid of the data of any deleted customers. Throughout the day, it checks the source tables for updates and refreshes only the customers whose data has changed (the change detection is also sketched below).
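
To make the shape of that per-customer query concrete, here is a heavily reduced sketch. It assumes a PostgreSQL-style warehouse and made-up table and column names (customer, contract); the real query joins over 20 tables, presumably with one aggregated subquery like this per child table.

```sql
-- Minimal sketch: build one JSON document per customer.
-- Table and column names are illustrative, not the real schema.
SELECT c.customer_id,
       json_build_object(
           'customerId', c.customer_id,
           'name',       c.name,
           'email',      c.email,
           'contracts',  COALESCE(ct.contracts, '[]'::json)
       ) AS customer_json
FROM customer AS c
LEFT JOIN (
    -- Collapse each customer's child rows into a JSON array.
    SELECT customer_id,
           json_agg(json_build_object(
               'contractId', contract_id,
               'status',     status,
               'startDate',  start_date
           ) ORDER BY start_date) AS contracts
    FROM contract
    GROUP BY customer_id
) AS ct ON ct.customer_id = c.customer_id;
```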
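
The intraday refresh can then be as simple as collecting the IDs of customers whose source rows have changed since the previous sync, and re-running the big query for just those customers. Again a hedged sketch with assumed names: per-row updated_at audit columns (hinted at by the timestamp columns mentioned below) and a sync_state bookkeeping table.

```sql
-- Minimal sketch: find customers touched since the last successful sync.
-- sync_state and the updated_at columns are assumed, not the real schema.
SELECT DISTINCT customer_id
FROM (
    SELECT customer_id, updated_at FROM customer
    UNION ALL
    SELECT customer_id, updated_at FROM contract
) AS changes
WHERE updated_at > (SELECT last_synced_at
                    FROM sync_state
                    WHERE job = 'customer_json');
```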

This resulted in multiple benefits:

- Since the data is loaded directly from the data warehouse, there are multiple other people who are already keeping an eye on the data integrations and making sure that the data is up to date.

- The app is now self-healing. Even if the app fails to detect some changes to the source tables (for example, if somebody does a manual update and forgets to touch the timestamp columns, or there is a bug), all data will be recreated no later than the next night. Every piece of a customer's data is also equally up to date with the warehouse data, because a customer's JSON document is always replaced as a whole instead of updating individual tables that each contain only part of the data.

- The searches are much faster: p95 went from 6 seconds to 40 milliseconds. Searching one big column is much faster than searching hundreds of small ones. The app can also return the JSON documents as-is to the UI, instead of joining data from multiple tables. The backend doesn't even need to parse the JSON: the full customer document is returned in its precompressed form with an HTTP Content-Encoding: gzip header, and the search response can be constructed safely with string concatenation instead of needing a full-fledged JSON generator.

- Server costs were reduced by 80%, because it was possible to use much smaller server instances than before.

- The app's maintainability increased. For example, the customer query's tests show the customer JSON document with every possible field populated, as well as the subset JSON documents for the search results and the searchable fields. Previously you would have had to inspect the database, which also contained many fields the app didn't use. The app also used to have no tests at all, but now the legacy code has been rescued and is what you'd expect from TDD'd code, including unit tests for the UI components.
