Web Log of Ross Chapman

499 closed connections

Bugs reveal. I look, observe. I learn things. I just experienced another one.

The customer can’t publish. Ensue existential how come???

After poking around I noticed our client code was deleting a parent entity too eagerly during a fail case while create operations were in flight for hierarchically bound entities – too sanguine, our home-baked front-end ROLLBACK. If the parent save call failed, subsequent saves of child data would nevertheless proceed, leaving unhooked child data stranded in the db. When the user would later hit “Publish”, our system would crash, unable to reconcile the ill-begotten state.

Take a look at the code (simplified for example):

Can you see how this code was written a bit too simplistically? From what I can tell there are at least two latent problems that make this code prone to fail in a way we don’t want.

  1. First, a parse error may be thrown during “other synchronous things” after the saveChildEntity promise is fulfilled. See a contrived example of that: async/await with synchronous error

  2. Second, it’s possible that the POST request initiated by saveChildEntity may succeed on the backend and persist the child data, but the connection between browser and server may be severed before the browser receives the 200 and the promise becomes fulfilled! When that happens, the promise is actually rejected and the runtime goes into the catch block.

In the end, it was the latter.
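Both failure paths can be reproduced in a small simulation. This is a sketch under assumptions, not our actual code: only saveStuffThunk and saveChildEntity are real names, while saveParentEntity, deleteParentEntity, findOrphans, the in-memory db, and the Failure type are hypothetical stand-ins.

```typescript
// Illustrative simulation only; not the original application code.
// The in-memory "db" and every helper except the two named in the post are hypothetical.

type Failure = "none" | "parseError" | "droppedConnection";

interface FakeDb {
  parents: Set<string>;
  children: Map<string, string>; // childId -> parentId
}

function createDb(): FakeDb {
  return { parents: new Set<string>(), children: new Map<string, string>() };
}

async function saveParentEntity(db: FakeDb, parentId: string): Promise<void> {
  db.parents.add(parentId);
}

async function deleteParentEntity(db: FakeDb, parentId: string): Promise<void> {
  db.parents.delete(parentId);
}

// Simulated POST: the "server" always persists the child, but the client
// may never see the response if the connection is cut first.
async function saveChildEntity(
  db: FakeDb,
  childId: string,
  parentId: string,
  failure: Failure
): Promise<string> {
  db.children.set(childId, parentId); // server-side write has already happened
  if (failure === "droppedConnection") {
    throw new TypeError("simulated network error"); // all the client observes
  }
  return failure === "parseError" ? "{not valid json" : '{"ok":true}';
}

async function saveStuffThunk(db: FakeDb, failure: Failure): Promise<void> {
  await saveParentEntity(db, "parent-1");
  try {
    const body = await saveChildEntity(db, "child-1", "parent-1", failure);
    JSON.parse(body); // stands in for "other synchronous things"; throws on bad input
  } catch (err) {
    // Optimistic rollback: treats every error as "the child was not saved".
    await deleteParentEntity(db, "parent-1");
  }
}

// Returns the ids of children whose parent no longer exists.
function findOrphans(db: FakeDb): string[] {
  const orphans: string[] = [];
  db.children.forEach((parentId, childId) => {
    if (!db.parents.has(parentId)) orphans.push(childId);
  });
  return orphans;
}
```

Running saveStuffThunk with "droppedConnection" or "parseError" leaves child-1 in the db with no parent, because the catch block cannot tell "nothing was saved" apart from "saved, but I never heard back".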

It seems obvious in retrospect. We allow the user to hit Publish anytime, which kicks off a heavy network sequence that finishes with a full page reload. Yet while the publish sequence is in flight the user can still interact with the page, meaning they could click another visible button – “Save” – that triggers saveStuffThunk. Based on the server logs, it seems that fairly often the Publish sequence would complete and then start to reload the page right in the middle of the second try/catch block of saveStuffThunk. When that happens nginx logs a special 499 status code, meaning the client closed the connection before the server could respond to the request. The browser never sees that status; it only sees the request fail. The client code then interprets this as an error and sends the runtime into the catch-delete block.

The server logs (simplified):

  • POST /save/
  • DELETE / 499

It still blows my mind this happened consistently enough to affect hundreds of records. The browser deterministically queues/coordinates? It was a very strange UX-driven race condition.

In addition to realizing that our thunk code was written too optimistically, another aspect of this bug that fascinates me was the discovery that we had missed the really really important requirement of locking the page for the user when they click the “Publish” button. This was actually implemented for other similar interfaces, but when my team implemented a new screen with similar access to the “Publish” button, we didn’t fully understand the potential for this race condition. Or how to prevent it.
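A minimal sketch of that kind of lock, assuming a single in-memory flag is enough for one page. createPublishLock and its method names are hypothetical, not what our other screens actually use; a real version would also disable the buttons visually.

```typescript
// Hypothetical page lock: once Publish starts, Save becomes a no-op.
type Action = () => Promise<void>;

function createPublishLock() {
  let publishInFlight = false;

  return {
    isLocked: (): boolean => publishInFlight,

    // Locks the page for the whole publish sequence. On success the lock is
    // deliberately left on, since the real flow ends in a full page reload.
    async publish(runPublish: Action): Promise<void> {
      publishInFlight = true;
      try {
        await runPublish();
      } catch (err) {
        publishInFlight = false; // publish failed, page stays, so unlock
        throw err;
      }
    },

    // Runs the save only when no publish is in flight.
    // Resolves to true if the save ran, false if it was skipped.
    async save(runSave: Action): Promise<boolean> {
      if (publishInFlight) return false;
      await runSave();
      return true;
    },
  };
}
```

With this in place, a click on “Save” during the publish sequence never starts saveStuffThunk, so there is no in-flight POST for the reload to sever.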

Organizational debt becoming bug. A big complex system with fast-shifting pubertal code and fugitive ownership creating blind spots.

“Every existing feature, and even past bugs, makes every new feature harder. Every user with expectations is a drag on change.” - jessitron

It was a weird one but we observed some new things and thereby pushed our learning edge farther out.

Like, tangential learning came from investigating potential sources of 499s. While digging, following a hunch about the load balancers sitting in front of our API servers, I discovered that connections could be cancelled eagerly by such a load balancer upon a timeout. Because Publishing was a long-ish operation, at one point in the debugging adventure we surmised a heavy query might be exceeding a timeout interval. See: Nginx 499 error codes. Nevertheless that was a false start; our ops folks were able to confirm we didn’t have load balancers managing these particular requests.

Demystifying architecture is an important part of this process for the perspicacious dev.

I’m just hard reflecting on how signals of “broken” – like bad data – can reveal many interesting things about the system. Just think about how much our client promise handling hid national treasures.