Over the last two years, I’ve been involved in a large, distributed software project. We’re getting closer to final delivery, so what follows are a few lessons learned while building and thinking about distributed systems.

Things I’ve Learned

Have a plan for communication and coordination

(Note: I’m re-reading this before hitting publish and… bleh… I still don’t have a good recommendation for how this communication/coordination should be done. This is still an opportunity.)

This one is pretty key to a successful distributed project. Communicating change and schedule on a large project is exceedingly difficult. Getting this figured out from day one is nearly as critical as figuring out what you’re building.

Creating too much traffic for the rest of the system

Especially if you are coding a “worker” style system, creating traffic and load for other parts of the system is wonderfully easy to do. Without a doubt, time pressures will be real, developers will be forced to make compromises, and less-than-optimal code will ship. The sooner you understand this and monitor for it, the easier time you’ll have when it’s time to deliver.
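One cheap way to keep a worker from hammering its neighbors is to put a simple rate cap on its outgoing calls. Here’s a minimal sketch, assuming a worker loop where `fetch_job` and `call_downstream` are hypothetical placeholders for pulling work and for the call that creates load elsewhere; the rate itself is an assumption you’d agree on with the downstream team.

```python
import time

MAX_CALLS_PER_SECOND = 10  # assumed budget agreed with the downstream team

def run_worker(fetch_job, call_downstream):
    """Process jobs, but never exceed the agreed outgoing request rate."""
    interval = 1.0 / MAX_CALLS_PER_SECOND
    while True:
        job = fetch_job()            # placeholder: pull the next unit of work
        if job is None:
            time.sleep(interval)     # nothing to do; don't busy-loop
            continue
        started = time.monotonic()
        call_downstream(job)         # placeholder: the call that creates load elsewhere
        # Sleep off the remainder of the interval so bursts get smoothed out.
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, interval - elapsed))
```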

Retries are a requirement in a distributed system; trust no one

This is just common sense: as soon as a request or data leaves your system, you have given up all control over the success or failure of your operation. However, once again, the realities of day-to-day work make this very easy to overlook. A simple REST call can fail in ways you haven’t even thought of. The same data can be given to you twice. Services can be manually shut down. The cloud provider you’re using might decide to “take a break”. 99% uptime means you are pretty much guaranteed to experience the 1% downtime far more often than you’d like. Distributed systems are tricky; everything will fail at some point. Trust no one.
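The usual shape of a retry is exponential backoff with a little jitter so a fleet of callers doesn’t retry in lockstep. A minimal sketch, where `request` is any zero-argument callable that raises on failure and the attempt count and delays are illustrative, not prescriptive:

```python
import random
import time

def call_with_retries(request, attempts=5, base_delay=0.5):
    """Retry a flaky call with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return request()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; let the caller decide what to do
            # Back off exponentially, plus jitter so callers spread out.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Remember that retries are also why the same data can show up twice; the receiving side needs to tolerate duplicates.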

Buffer requests/work that are coming into your system

So this was a great reminder for me. We were very busy trying to find ways to make my system a better network neighbor by reducing the amount of network traffic it was creating. We made good strides but eventually hit diminishing returns. Someone else on the team suggested we buffer the input from my system so that the other system could regulate the flow of traffic. It worked.

Buffering works both ways.
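The mechanism can be as simple as a bounded queue between the chatty side and the side doing the work. A minimal sketch, assuming a producer/consumer pair where `receive_message` and `process` are hypothetical placeholders and the capacity is something you’d tune for your own load:

```python
import queue

# A bounded queue lets the consumer drain work at its own pace; when the
# buffer is full, put() blocks, which pushes back on the producer instead
# of flooding the rest of the system.
incoming = queue.Queue(maxsize=1000)  # capacity is an assumption; tune it

def producer(receive_message):
    while True:
        msg = receive_message()       # placeholder: however input arrives
        incoming.put(msg)             # blocks when full (back-pressure)

def consumer(process):
    while True:
        msg = incoming.get()
        process(msg)                  # placeholder: the real work
        incoming.task_done()
```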

Have unique, in-code scaling options

Scaling in the cloud is “easy”… right? Nope! Not always.

As load begins to increase, you will start to see behavior you haven’t seen in your system before. What’s more, the load testing… may not happen until a week before launch, when it’s too late to make necessary changes. Having software-driven options to scale your system is a very convenient way to get just a little more out of it.
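In practice that can be as small as exposing a couple of knobs the code reads at startup, so you can turn parallelism up or down without touching infrastructure. A minimal sketch, where the environment variable names, defaults, and `handle` function are all assumptions for illustration:

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical knobs, read from the environment so they can be adjusted
# without a code change when load testing turns up surprises.
WORKER_THREADS = int(os.environ.get("WORKER_THREADS", "4"))
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "50"))

def process_backlog(jobs, handle):
    """Work through a backlog with a configurable amount of parallelism."""
    with ThreadPoolExecutor(max_workers=WORKER_THREADS) as pool:
        for start in range(0, len(jobs), BATCH_SIZE):
            batch = jobs[start:start + BATCH_SIZE]
            list(pool.map(handle, batch))  # handle() is a placeholder for the real work
```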

Consider the data (order of events)

It’s easy to say that your transactions should be idempotent or functional. However, it really depends on the data itself. For instance, some data behaves like a ledger: as long as you start with the correct balance and as long as you receive all of the transactions, you will eventually be consistent.

Other data behaves differently. Some data acts as a toggle, a representation of state. Think of a light switch. You can certainly toggle a switch to either of its two states, on or off, in any sequence you like. However, at the end of all your switching, the order in which the toggles were applied determines the final state of the switch, on or off.

Consider the data itself before deciding if eventual consistency is possible.
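To make the contrast concrete, here is a small sketch under those two assumptions: ledger-style data where order of arrival doesn’t change the result, and toggle-style data where the last event wins, so you need the order (or a timestamp to order by). The function names are mine, not from any library:

```python
# Ledger-style data: applying the same set of transactions in any order
# yields the same balance, so out-of-order delivery is tolerable.
def apply_ledger(balance, transactions):
    for amount in transactions:
        balance += amount
    return balance

# Toggle-style data: the final state is whatever arrived last, so delivery
# order (or a reliable timestamp to sort by) matters.
def apply_toggles(events):
    state = None
    for timestamp, switch_state in sorted(events):  # order by timestamp first
        state = switch_state
    return state

assert apply_ledger(100, [5, -3, 10]) == apply_ledger(100, [10, 5, -3])  # order-independent
assert apply_toggles([(2, "off"), (1, "on")]) == "off"                   # last event wins
```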

Consider asynchronous possibilities

A distributed system is a system where everything must be questioned. If something is supposed to be available, there will definitely be times when it is not. If an event indicates that some data is available, there will absolutely be times when it is not. Asynchronous systems require that you think as if everything will work as designed, but code as if nothing will.
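One way that plays out in code: when an event claims the data is ready, verify it rather than assume it, and treat “it never showed up” like any other failure. A minimal sketch, where `fetch` is a hypothetical placeholder that returns the data or None, and the timeout and poll interval are illustrative:

```python
import time

def wait_for_data(fetch, timeout=30.0, poll_interval=1.0):
    """An event said the data should exist; check instead of assuming."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        data = fetch()               # placeholder: attempt to read the data
        if data is not None:
            return data
        time.sleep(poll_interval)    # not there yet; try again shortly
    raise TimeoutError("data never showed up; handle it like any other failure")
```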