We had just gone into production with a system that had a lot of moving parts. The whole system consisted of several big monolithic applications and half a dozen newer microservice-style services integrated into a “living organism”.
A user was not getting the results she expected. We investigated the problem and found out that a service we were calling was not returning some of the data we were interested in. In particular, we were not seeing any data older than three months. A support person for the service said: ‘you should use wide search if you want to see the older items’. But wait – we were using wide search, always had been!
I double-checked the code and sure enough, it did say widesearch="true". Strange. So I blew the dust off the API spec and found this: widesearch="1|0". Yikes, the value should be a number! And I had coded this, totally my bad. I guess I’ve grown too used to using true and false for boolean values in XML. We were a month into production when we found out we were not using wide search. Never had been!
Why did we not get an error when we called the service with bad parameters? 1 and 0 look like numbers to me, but the service had been happily accepting the string “true” all along. I decided to run a little experiment against the service in the test environment:
<request key="..." widesearch="anything goes?" /> → <response>OK ...valid search results...</response>
Okay, so it seems there is no validation of the parameter whatsoever. The code probably reads something like this:
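What follows is only my guess, sketched in Python. I have never seen the service's source code and do not know what language it is written in, and the function name parse_widesearch_lenient is made up for illustration:

```python
def parse_widesearch_lenient(raw):
    """Hypothetical reconstruction of the service's parameter handling.

    Only the exact string "1" turns wide search on. Everything else,
    including "0", "true" and "anything goes?", silently means "off".
    """
    return raw == "1"


# The value we had been sending for a month quietly disables wide search.
print(parse_widesearch_lenient("true"))
print(parse_widesearch_lenient("1"))
```

No branch ever rejects anything, so a caller cannot tell the difference between "wide search off" and "I did not understand your value".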
Why did we not find this bug during testing? This integration point was hammered a lot because during the project there were a lot of version updates on both sides of the API fence. Before going into production, the integration was in use for at least half a year in the testing environment. Somehow we just managed to not hit the three-month limit. For us ‘widesearch’ was just a mandatory constant we had to set in the API, not a thing that would be interesting to test.
Now, imagine that the service actually checked that the parameters in the API call have correct data types. We would have found our bug on day one of testing. Actually earlier – the first time we tried to call the test service. Silently choosing a default when the input is garbage undermines the robustness of the whole system.
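Here is a minimal sketch of what such a check could look like, in Python. The names parse_widesearch_strict and InvalidParameterError are hypothetical, and how the real service would report the error back to the caller is an assumption on my part:

```python
class InvalidParameterError(ValueError):
    """Hypothetical error type for a request parameter that violates the spec."""


def parse_widesearch_strict(raw):
    """Accept only the two values the API spec allows: "1" or "0"."""
    if raw == "1":
        return True
    if raw == "0":
        return False
    # Anything else is a caller bug: reject it loudly instead of guessing.
    raise InvalidParameterError(
        'widesearch must be "1" or "0", got {!r}'.format(raw)
    )


try:
    parse_widesearch_strict("true")
except InvalidParameterError as err:
    print("rejected:", err)
```

With a check like this, our widesearch="true" would have been rejected on the very first call instead of being quietly treated as "off".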
When computers talk to computers, it is better to be strict in the protocol. Problems get flushed out way quicker.