We had set up some Qlik Sense servers without any real issues (rather the opposite: it is incredibly quick and easy to deploy an enterprise-scale Sense system!). Then one day one of the rim nodes fell off the cluster and showed up as red in the QMC.
The Windows server itself was fine; we could RDP into it without any problems whatsoever. But when we tried to restart the Sense services on that server, none of them came back up except for the Repository Database service. Nothing really helpful in the logs either. Rebooting the entire server didn’t help. That rim node was dead in the water.
Ok, let’s spin up another rim node to replace the failed one. At this point we were not happy about the stability of Sense... but spinning up new virtual machines just takes minutes, so no big deal. Half an hour later we were back in business and all was well. This all happened late Friday afternoon/evening, of course…
Come the next day, Saturday, the central node experienced the same issue. This was not good: all metadata etc. lives on the central node, so we can’t replace it as easily as a rim node. We googled like crazy to find an explanation for what was going on, without any real progress.
After an hour or two I recalled that, for security reasons, we did not allow any outbound Internet access from the Sense servers, at least not to begin with. Aha, that could be it. No outbound Internet, no possibility to do license checks or other tasks requiring Internet connectivity. But as there was no mention of failed license checks in the logs, that did not seem like a likely explanation. Either way, we added outbound Internet connectivity, and suddenly the Sense services on the central node could be started right away.
At this point it’s still unclear exactly what the root cause of the problem was. Our best guess is that the Windows servers making up a Sense cluster all need access to a time server to stay in sync. If they can’t update their clocks, the clocks will eventually drift ever so slightly and end up out of sync with each other. Many other multi-node software packages are quite sensitive to this, so why not Sense too? After all, Sense relies heavily on all sorts of secure communication between the cluster nodes, and synced time is usually a key component in such communication schemes, since things like certificates and tokens are typically only valid within a time window judged by the local clock.
Either way, giving the Sense servers access to a time server, either one on the public Internet or one within the corporate internal network, seems to solve the problem. The cluster has been rock steady since that was put in place.
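If you want to verify that a node can actually reach a time server, and see roughly how far off its clock is, a small SNTP query is enough. The following is a minimal sketch, not anything Qlik ships or recommends: the function names are made up for illustration, and it assumes the server answers plain SNTP on UDP port 123 and that the firewall lets that traffic through.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def _to_unix(seconds, fraction):
    """Convert a 32.32 fixed-point NTP timestamp to Unix time in seconds."""
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def offset_from_reply(reply, t1, t4):
    """Estimate the local clock offset (seconds) from a raw SNTP reply.

    t1 is the local time when the request was sent and t4 the local time
    when the reply arrived. A positive result means the local clock is
    behind the server.
    """
    if len(reply) < 48:
        raise ValueError("short NTP reply")
    # Bytes 32-39 hold the server's receive timestamp, 40-47 its transmit timestamp.
    rx_s, rx_f, tx_s, tx_f = struct.unpack("!4I", reply[32:48])
    t2 = _to_unix(rx_s, rx_f)
    t3 = _to_unix(tx_s, tx_f)
    return ((t2 - t1) + (t3 - t4)) / 2


def query_offset(server, timeout=5.0):
    """Send one SNTP request to `server` and return the estimated offset."""
    # First byte 0x1b = leap indicator 0, version 3, mode 3 (client).
    request = b"\x1b" + 47 * b"\0"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t1 = time.time()
        sock.sendto(request, (server, 123))
        reply, _ = sock.recvfrom(512)
        t4 = time.time()
    return offset_from_reply(reply, t1, t4)
```

Calling `query_offset("pool.ntp.org")` (or the name of an internal time server) on each node returns the estimated offset in seconds; a timeout here is a strong hint that the node cannot reach the time server at all. On the Windows servers themselves, `w32tm /stripchart /computer:<server>` gives a similar reading without any extra code.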