Debugging distributed systems is terrible. There are many reasons for this – they are complicated. Errors happen that are transient and rare and dependent on precise timings. Impossible to reproduce locally to verify.
ways: breaking an unwritten contract. should be a then b but eventually happens to be b then a asymptomatic non-idempotent packets, a lost, retry success, a reappears
There is no repeatability in our debugging. The real problem is the messy dirty universe has intruded on our beautiful pristine land of pure functions. The execution history of this system is not a pure function of its inputs. There’s a source of entropy, of randomness, which you do not control – which is the network. And this happens in threads or disks or whatever.
FoundationDB is a database that makes meaningful guarantees to its users. ACID is not a property of a system that emerges by accident. There are a lot more ways of getting it wrong than getting it right. What’s even scarier is that there are many more ways of getting it right, then having it fail as soon as something unexpected happens. Than there are of getting it right and having it stick.
Don’t debug your system, debug a simulation.
FoundationDB didn’t start by writing a database, they started by writing a deterministic simulation of a database. There are 3 ingredients to deterministic simulation:
- Single threaded pseudo-concurrency
- Simulated implementation of all external communication
Single threaded pseudo-concurrency
We can’t test using threads since that can produce non-deterministic results. To test this in C++ they used callbacks and needed higher level abstractions than what were provided by the language, so they created Flow, which added actor-model concurrency to the language.
They used C++ classes as their network instead of the real network so they can simulate latency and other abnormalities. Use code to smoke out any code that assumes the network is reliable.
Once concurrency and the network are simulated, then make sure everything else is deterministic. This could include if your code generates and uses random numbers – then we’d want a pseudo random number generator so we can control the number generated. Time is also another, or disk space available – so running it at two different times could produce two different results.
They will often run the test with the exact same input twice and at the very end they check the random number generators and if it doesn’t produce the same numbers it’ll be seen as non-deterministic.
The past three things becomes your simulator. Your system can now be tested in a completely deterministic way.
Once determinism is established, test files are written to find bugs. Test files declare two separate things: one is the stuff the system is trying to achieve, and the other is a set of stuff that will attempt to prevent the system from achieving that.
Other popular ways to wreak havoc:
- broken machine (machine down, or breaks in some way like a log file is too full, broken clock)
- nukes (killing a data center)
- dumb sysadmin (swap the ip addresses or swap the data files of two nodes randomly)
The universe is looking for bugs, so are you. How can you make your CPUs more efficient than the universe’s? Another way of looking at this is that you need to find more bugs per CPU hour than the real world. Here’s how:
Disasters here happen more frequently than in the real world. We make things like disc failure happen every 2-3 minutes. We also speed up time and make many more sim seconds pass than in the real world.
Buggification, or randomly change what your code does so that the consumer of that code expands its understand of what could happen. This goes back to the example where there was a misunderstood contract between two nodes, where node one will send packet a, then b, but the other node may sometimes send packet b, then a. Buggify implies that you break the contract often to ensure the partners truly understand the contract. One example is to never send a timeout.
The Hurst exponent, it is a way to quantify the lack of independence between events. How does this relate to distributed systems? Hardware failures are not random independent events. This happens in the real world – a lot. Hardware should fail in some sort of random but in some way fail together in a cascading way.
By following these approaches, they estimate they have run the equivalent of trillions of real world CPU hours of real world tests. They can run 5-10 million simulations a night on their cluster, and depending on the test each one can take between 5 minutes and an hour or two.
The bad news is using callbacks has ruined their ability to debug by making it incredibly complex. They basically just printf their stuff, but it’s not as bad with deterministic simulations since the same exact environment will be triggered on future runs.
There is still the nightmare case which is: what if our simulations are wrong? There are two ways in which a simulation can be wrong: either it’s not brutal enough (there are a series of failures that happen in the real world which have not been put under simulation) or maybe they just misunderstand the contracts between the systems, or porting to a new platform and there are new guarantees or the OS has a bug. The simulations are only as good as their understanding of what the OS and the hardware actually do.
To try and catch both of these sorts of problems during simulation, they have something called Sinkhole. Sinkhole is an actual physical cluster of server motherboards which are all connected to programmable network power supplies. So they program these things to turn on and off all night while the database is running. This is particular to finding power safety bugs.
- Dedicated “Red Team” specifically to introduce bugs into the system
- More hardware to make the simulations run faster, and give more power to debugging
- Try to catch bugs that “evolve”. If I write a bug and it is caught by the simulation framework, I immediately get feedback. If I write a bug and it is not caught, I get no feedback. So I am slowly but surely being trained to write bugs that slip through the simulation framework – and somewhat bears resemblance to bacteria evolving to become resistant to antibiotics. One idea is to have two separate simulation frameworks and only run one before releases.
- More real world testing like Sinkhole, but dealing with hardware is a pain.