In the past we’ve discussed Cardinal’s Architecture and how Rivet uses data replication and reorganization to scale efficiently and respond to requests quickly.
Normally, our Geth Master servers connect to peers on the internet, validate blocks, and stream information over Kafka to our Cardinal and Flume replicas. With Polygon, we adapted this behavior to Bor, and since Bor is derived from Geth, getting data out of Polygon and into Cardinal looked like it would be pretty simple. Or so we thought at first.
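To make the shape of that pipeline concrete, here's a minimal sketch of a master publishing a block notification to Kafka, using the segmentio/kafka-go client. The broker address, topic name, and message schema are illustrative stand-ins, not Cardinal's actual wire format.

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/segmentio/kafka-go"
)

// BlockNotification is a stand-in for whatever a master publishes when it
// finishes validating a block; the real Cardinal payload is richer than this.
type BlockNotification struct {
	Number uint64 `json:"number"`
	Hash   string `json:"hash"`
}

func main() {
	// Writer pointed at the Kafka brokers the Cardinal and Flume replicas consume from.
	w := &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"), // illustrative broker address
		Topic:    "blocks",                    // illustrative topic name
		Balancer: &kafka.LeastBytes{},
	}
	defer w.Close()

	block := BlockNotification{Number: 42_000_000, Hash: "0xabc123"}
	payload, err := json.Marshal(block)
	if err != nil {
		log.Fatal(err)
	}

	// Key by block hash so consumers (and other masters) can spot duplicates cheaply.
	if err := w.WriteMessages(context.Background(), kafka.Message{
		Key:   []byte(block.Hash),
		Value: payload,
	}); err != nil {
		log.Fatal(err)
	}
}
```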
Polygon introduced a new set of problems we weren’t quite prepared for. While Ethereum and most of the other networks we support have block times on the order of 12 seconds, Polygon produces new blocks every 2 seconds. This reduced block time poses some interesting challenges.
First is database compaction. Geth (and by extension Bor) uses a database engine called LevelDB, which is a straightforward key-value store. But every so often, LevelDB has to do a database compaction, where it consolidates recently written information with older data and deduplicates values that have been written multiple times. When this happens, it tends to lock up the database for about 30 seconds. In Geth, this means you fall about 2 blocks behind, and you're caught up within a second or two. With Polygon's 2 second block time, a 30 second compaction means you fall 15 blocks behind, and it takes about another 30 seconds to catch back up. That's a significant difference. Running two masters, this typically isn't an issue, unless one of the masters is in the process of restarting (which happens occasionally) or a state trie commit happens at the same time.
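To put numbers on that, here's a quick back-of-the-envelope sketch of how a fixed-length stall translates into block lag on each chain. The durations are the rough figures quoted above, not measurements.

```go
package main

import (
	"fmt"
	"time"
)

// blocksBehind estimates how far a node falls behind while block processing
// is stalled (e.g. by a LevelDB compaction), given the chain's block time.
func blocksBehind(stall, blockTime time.Duration) int {
	return int(stall / blockTime)
}

func main() {
	compaction := 30 * time.Second

	// Ethereum: ~12s blocks, so a 30s stall costs only a couple of blocks.
	fmt.Println("Ethereum:", blocksBehind(compaction, 12*time.Second), "blocks behind")

	// Polygon: ~2s blocks, so the same stall costs 15 blocks, and catching
	// up takes roughly as long again as the stall itself.
	fmt.Println("Polygon: ", blocksBehind(compaction, 2*time.Second), "blocks behind")
}
```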
State trie commits are challenge number two. Geth (and by extension Bor) keeps a “dirty cache” in memory, which represents updates to the state trie that have not yet been committed to disk. Every so often, the dirty cache gets flushed to disk so that if your node restarts abruptly, you only have to resync from the point that was last flushed. If the dirty cache is configured to 1 GB, it takes about 2 minutes to flush the state trie to disk, during which time blocks stop processing. When the process finishes, not only are you two minutes behind on block processing, but data that used to be cached in memory is now on disk, so block processing is a bit slower. In Geth, a 2 minute flush puts you about 10 blocks behind, and recovery happens in about 5 seconds. With Polygon’s 2 second block time, you end up 60 blocks behind, which can take a good 5 minutes to recover from in total. Counterintuitively, adding memory actually makes this problem worse: a larger dirty cache means more state trie data has to be committed to disk, which takes longer. Again, this isn’t a problem except when both masters do state trie commits at the same time, or a combination of state trie commits and LevelDB compactions happens at the same time.
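The “more memory makes it worse” effect is easiest to see with a toy model. The sketch below assumes, per the figures above, that flushing scales linearly at roughly two minutes per gigabyte of dirty cache; that constant and the linear scaling are illustrative assumptions, not Geth's actual flushing behavior.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative assumption based on the figures above: flushing 1 GB of dirty
// state trie data to disk stalls block processing for about two minutes.
const flushSecondsPerGB = 120

// flushStall estimates how long a state trie commit pauses block processing
// for a given dirty cache size. A back-of-the-envelope model, not Geth's
// actual flushing logic.
func flushStall(cacheGB float64) time.Duration {
	return time.Duration(cacheGB*flushSecondsPerGB) * time.Second
}

func blocksBehind(stall, blockTime time.Duration) int {
	return int(stall / blockTime)
}

func main() {
	// Doubling the dirty cache doubles the stall, and the block lag with it.
	for _, gb := range []float64{1, 2} {
		stall := flushStall(gb)
		fmt.Printf("%v GB dirty cache: %s stall, Ethereum ~%d blocks behind, Polygon ~%d blocks behind\n",
			gb, stall,
			blocksBehind(stall, 12*time.Second),
			blocksBehind(stall, 2*time.Second))
	}
}
```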
While database compactions and state trie commits happen on Ethereum as well as Polygon, they never presented much of a problem with 12 second block times. They seldom happen at the same time on both masters, and when they do, recovery is quick. But on Polygon, the longer recovery time makes it far more likely that multiple masters fall behind concurrently, and on several occasions we found our nodes behind for minutes at a time.
We took two main approaches to fixing this.
First, our master servers have historically operated 100% independently of one another. They didn’t know whether they were the only master or one of several, and since they had no idea whether others existed, they certainly didn’t know whether others were healthy. We updated our masters’ communication protocol to allow them to keep track of each other. They each send a heartbeat at regular intervals, and track each other’s heartbeats to determine how many healthy counterparts they have. When they start a state trie commit, they stop sending heartbeats, allowing other masters to mark them as unhealthy. When a master finds that it has no healthy counterparts, it will not start a state trie commit until it sees another master back up and serving. This ensures that state trie commits don’t happen concurrently, and that if one master is down, the other won’t begin a state trie commit until its companion is back.
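Here's a simplified sketch of that bookkeeping. The struct and method names are illustrative (the real protocol rides on our masters' existing communication channel), but the rule is the one described above: a master only starts a state trie commit if it can see at least one healthy counterpart.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// peerTracker tracks heartbeats from other masters.
type peerTracker struct {
	mu        sync.Mutex
	lastSeen  map[string]time.Time // master ID -> last heartbeat received
	staleTime time.Duration        // how long before a silent peer counts as unhealthy
}

func newPeerTracker(staleTime time.Duration) *peerTracker {
	return &peerTracker{lastSeen: make(map[string]time.Time), staleTime: staleTime}
}

// recordHeartbeat is called whenever a heartbeat arrives from another master.
func (t *peerTracker) recordHeartbeat(id string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSeen[id] = time.Now()
}

// healthyPeers counts counterparts that have heartbeated recently.
func (t *peerTracker) healthyPeers() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	n := 0
	for _, seen := range t.lastSeen {
		if time.Since(seen) < t.staleTime {
			n++
		}
	}
	return n
}

// canStartStateTrieCommit implements the rule from the text: only begin a
// commit (and stop sending our own heartbeats) if at least one healthy
// counterpart can keep serving while we are stalled.
func (t *peerTracker) canStartStateTrieCommit() bool {
	return t.healthyPeers() > 0
}

func main() {
	t := newPeerTracker(5 * time.Second) // staleness threshold is an illustrative value
	t.recordHeartbeat("master-b")

	if t.canStartStateTrieCommit() {
		fmt.Println("at least one healthy counterpart; safe to commit")
	} else {
		fmt.Println("no healthy counterparts; defer the commit")
	}
}
```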
In addition to tracking each other’s heartbeats, masters also start watching Kafka to see which blocks other masters have published. If a master sees a block come across Kafka before it has broadcast that block itself, it doesn’t rebroadcast it. This reduces the amount of duplication downstream consumers (Cardinal and Flume) have to deal with. And because there is less duplicated traffic on Kafka, we can run even more masters concurrently without impacting downstream performance, making it unlikely that state trie commits and compactions hit all of our masters at the same time.
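The dedup logic itself is small. This sketch abstracts the Kafka transport away behind a seen-set keyed by block hash; the type and method names are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// blockDeduper sketches the "don't rebroadcast what's already on Kafka" rule.
// In the real system the seen set is fed by a consumer watching the block
// topic; here the transport is left out.
type blockDeduper struct {
	mu   sync.Mutex
	seen map[string]bool // block hash -> already published by some master
}

func newBlockDeduper() *blockDeduper {
	return &blockDeduper{seen: make(map[string]bool)}
}

// markSeen is called for every block observed on the Kafka topic, whether we
// published it or another master did.
func (d *blockDeduper) markSeen(hash string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.seen[hash] = true
}

// shouldBroadcast reports whether this master still needs to publish the
// block; if another master got there first, we skip it and save downstream
// consumers the duplicate.
func (d *blockDeduper) shouldBroadcast(hash string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen[hash] {
		return false
	}
	// Claim it locally so we don't double-publish from our own side either.
	d.seen[hash] = true
	return true
}

func main() {
	d := newBlockDeduper()
	d.markSeen("0xabc") // another master already published this block

	fmt.Println(d.shouldBroadcast("0xabc")) // false: skip the rebroadcast
	fmt.Println(d.shouldBroadcast("0xdef")) // true: we publish it first
}
```

A nice property of this approach is that a race is harmless: if two masters both decide to broadcast before either sees the other's message, downstream consumers just see one duplicate, which is exactly what they already had to tolerate.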