Sanity Checking an Event Sourced System

March 19, 2017

Around one and a half years ago I stumbled upon Domain Driven Design, Event Sourcing and CQRS. I'm still amazed by the world behind that door and its sheer size. Without countless blog posts and talks I wouldn't have started walking this path, so here is my first attempt at giving back.

My side project dartboard.io helps you play darts by offering an easy way to keep track of scores and by calculating interesting statistics. It is also my playing field for experimenting with Event Sourcing and learning to understand it. The application started one year before I learned about Domain Driven Design, so it should be no surprise that the first iteration was built on CRUD concepts. After one year in production, the CRUD version was transformed into the second iteration, which is based on Event Sourcing principles. Every dart match that is played still results in a conventional CRUD model, but that model has an events attribute that is used for the Event Sourcing part. For lack of a better word I call this implementation In-Place Event Sourcing.

Why do a sanity check of the events?

While it is not yet implemented, a future iteration of the application will allow players to play a remote dart match. To keep multiple players up to date, the events present themselves as an ideal model for synchronisation. Once single events get pushed or pulled within your system, it seems easier to implement a classic event store and say farewell to the In-Place Event Sourcing implementation.

Before making such a transition I wanted to verify how robust my implementation is. After all it's my first event sourced system and I knew from the past that there were one or two bugs that resulted in impossible event sequences. I just did not know how many of them would be there.

How did the sanity check work?

First I manually built a map of events and their possible successor events. These instructions basically laid out the rules for the computer: "If you have an event X, the next one can be Y or Z". You can take a look at the real map or be happy with my abstract example:

Instructions for the sanity check

PlayerAdded  -> PlayerAdded, MatchStarted
MatchStarted -> LegAdded
PlayerScored -> PlayerWonLeg, TurnChanged

Based on this map the computer would iterate through the events of a match and check the sequence at every step. If a violation was detected, it would output the event at which it happened, what the successor event was and which successors it expected, and then move on to the next match. In addition, it would group the violations by their pattern and output a summary to make analysis easier.

Grouped output of a specific violation

Pattern:       MatchStarted -> PlayerAdded
Occurrences:   3
Match IDs:     1, 20, 45
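
The check described above could be sketched roughly like this in Python. This is not the actual dartboard.io code: the function names, the in-memory shape of a match (a list of event names keyed by match ID) and the decision to skip events that have no entry in the map are all assumptions made for illustration. The map itself is the abstract example from above, not the real one.

```python
from collections import defaultdict

# Abstract successor map from the example above (the real map is larger).
ALLOWED_SUCCESSORS = {
    "PlayerAdded": {"PlayerAdded", "MatchStarted"},
    "MatchStarted": {"LegAdded"},
    "PlayerScored": {"PlayerWonLeg", "TurnChanged"},
}


def first_violation(events):
    """Return (index, event, successor, expected) for the first invalid
    transition in a list of event names, or None if the sequence is fine."""
    for index, (event, successor) in enumerate(zip(events, events[1:])):
        expected = ALLOWED_SUCCESSORS.get(event)
        if expected is None:
            # Assumption: events without an entry in the map are not checked.
            continue
        if successor not in expected:
            return index, event, successor, sorted(expected)
    return None


def check_matches(matches):
    """Group match IDs by violation pattern. Only the first violation of
    each match is recorded, then the check moves on to the next match."""
    grouped = defaultdict(list)
    for match_id, events in matches.items():
        violation = first_violation(events)
        if violation is not None:
            _, event, successor, _ = violation
            grouped[(event, successor)].append(match_id)
    return dict(grouped)


def format_summary(grouped):
    """Render the grouped violations in the summary format shown above."""
    blocks = []
    for (event, successor), match_ids in sorted(grouped.items()):
        blocks.append(
            f"Pattern:       {event} -> {successor}\n"
            f"Occurrences:   {len(match_ids)}\n"
            f"Match IDs:     {', '.join(str(m) for m in match_ids)}"
        )
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real system.
    sample = {
        1: ["PlayerAdded", "MatchStarted", "PlayerAdded"],
        2: ["PlayerAdded", "PlayerAdded", "MatchStarted", "LegAdded"],
        20: ["MatchStarted", "PlayerAdded"],
    }
    print(format_summary(check_matches(sample)))
```

Keeping the map as plain data separate from the checking logic makes it cheap to adjust the rules when the map itself turns out to be incomplete.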

What was the outcome?

An interesting and unexpected look at the system that I built. After checking 16,391 matches with a total of 3,463,360 events, there are several things to point out.

1.) I once more fully appreciated the idea behind Event Sourcing. It was a really nice feeling realising how easy it is to go back in time and see how a dart match evolved. For me personally this is probably the biggest benefit of Event Sourcing.

2.) My initial event map used for detecting violations was incomplete. There were three situations that could happen in the lifetime of a match that were not obvious to me. It took me a while to be sure that the system worked correctly and that the real problem was my not being aware of certain scenarios. As a result I adapted the sanity check itself and also wrote new unit tests covering these scenarios specifically.

3.) The sanity check showed that past bug fixes were indeed effective. There were edge cases in the past that would produce event sequences that made no sense. By correlating the timestamp of a match in which a certain violation occurred with the timestamp of a commit, I could tell that the bug fix was working.

4.) It also turned up new bugs that were not visible in the UI but would slightly skew statistical projections. On a technical level it showed that having thorough invariant checks in the match aggregate is vital. Errors raised by these checks would have surfaced the problems earlier.

5.) I realised I should really add timestamps to relevant events. Some issues would be easier to investigate if you could tell how much time passed between certain events.

6.) Old matches that were ported from the first iteration (CRUD) were missing events that were added later. In my case additional events would be emitted to make it possible to generate more advanced statistics.
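
To make points 4 and 5 a bit more concrete, here is a minimal Python sketch of an aggregate that rejects an impossible successor event and stamps every recorded event with a time. Everything in it is hypothetical: `Match`, `Event`, `InvalidEventSequence` and `record` are illustrative names, not the actual dartboard.io code, and real invariant checks would most likely look at the state of the match rather than only at the previous event.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Abstract successor map, reused purely as a stand-in for real invariants.
ALLOWED_SUCCESSORS = {
    "PlayerAdded": {"PlayerAdded", "MatchStarted"},
    "MatchStarted": {"LegAdded"},
    "PlayerScored": {"PlayerWonLeg", "TurnChanged"},
}


class InvalidEventSequence(Exception):
    """Raised when an event may not follow the previously recorded one."""


@dataclass(frozen=True)
class Event:
    name: str
    occurred_at: datetime  # point 5: every event carries a timestamp


@dataclass
class Match:
    events: list = field(default_factory=list)

    def record(self, name, now=None):
        """Append an event, raising instead of silently storing a bad sequence."""
        if self.events:
            last = self.events[-1].name
            allowed = ALLOWED_SUCCESSORS.get(last)
            if allowed is not None and name not in allowed:
                raise InvalidEventSequence(f"{name} cannot follow {last}")
        event = Event(name, now or datetime.now(timezone.utc))
        self.events.append(event)
        return event
```

With a check like this in place, an impossible sequence such as MatchStarted followed by PlayerAdded raises an error at the moment it happens instead of being discovered later by a sanity check.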

Overall I'm very happy with the outcome of the sanity check and my decision to dive into Event Sourcing one and a half years ago. I'm not yet sure how to approach the sixth point with the missing events. Maybe I will know what to do after reading Versioning in an Event Sourced System. I hope it will also help me with introducing timestamps for certain existing events.

Thanks for reading, and since I'm very inexperienced with Event Sourcing, any feedback is welcome.