NiFi Cluster with lots of SUSPENDED, RECONNECTED, LOST events
We have a 2-node NiFi cluster running with 3 replicated ZooKeeper instances
in a Docker Swarm cluster.
Most of the time the cluster operates fine, but from time to time we notice
that NiFi stops processing messages completely. It eventually resumes after
a while (sometimes after a couple of seconds, sometimes after a couple of
minutes).
When I do a grep o.a.n.c.l.e.CuratorLeaderElectionManager
/srv/nifi/logs/nifi-app.log on the primary node, I see a lot of suspended /
reconnected messages. Likewise, on the other node I see similar messages.
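To quantify the churn, the state transitions can be tallied straight from the log; a small sketch, assuming the Curator listener logs lines of the form "Connection State changed to <STATE>" (the exact wording may vary by NiFi version):

# Count connection state transitions per state (message format assumed;
# adjust the pattern if your NiFi version words it differently)
grep -o 'Connection State changed to [A-Z]*' /srv/nifi/logs/nifi-app.log \
  | sort | uniq -c | sort -rn

A burst of SUSPENDED/LOST entries clustered around the stalls would point at ZooKeeper connectivity rather than at the processors themselves.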
The only real exceptions I'm seeing in the logs are these:
I also see this on the UI from time to time:
java.net.SocketTimeoutException: Read timed out
Is there anything I can do to further debug this?
Is it normal to see that many connection state changes? (The logs are full
of them.)
The solution runs on 3 VMs using Docker Swarm. NiFi runs on 2
of those 3 VMs, and we have a ZooKeeper ensemble running on all 3.
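One thing worth trying before deeper debugging: on a Swarm overlay network, short ZooKeeper timeouts make exactly this kind of SUSPENDED/RECONNECTED churn more likely, because brief network hiccups exceed the session timeout. A minimal tuning sketch for nifi.properties on each NiFi node; the property names exist in stock NiFi, but the values and the conf path are assumptions to adjust for your install:

# /srv/nifi/conf/nifi.properties (path assumed from the log location above)
nifi.zookeeper.connect.timeout=30 secs
nifi.zookeeper.session.timeout=30 secs
# These bound the inter-node requests that surface as SocketTimeoutException in the UI
nifi.cluster.node.connection.timeout=30 secs
nifi.cluster.node.read.timeout=30 secs

Both nodes need a restart for the changes to take effect; if the stalls stop correlating with state-change events afterwards, overlay-network latency was the likely culprit.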
I was seeing this kind of problem and traced it to a PutHDFS that ran after a
MergeContent (also using Docker). It would run fine until it started trying
to write the files to HDFS, and then the network I/O overwhelmed the container
(my theory, anyway). By using a shorter time interval I was able to get past
it (smaller, more frequent file writes). I tried to get MergeContent to
bundle based on max bytes but could never get it working right, so I used
the time-based approach instead; see the settings sketch below.
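For what it's worth, size-based bundling in MergeContent tends to stall unless a Max Bin Age is also set, since a bin that never reaches its minimum size will otherwise sit and wait indefinitely. A sketch of the combination I mean, using the stock processor's property names with illustrative values (the sizes are assumptions, not recommendations):

# MergeContent properties (set in the NiFi UI)
Merge Strategy      = Bin-Packing Algorithm
Minimum Group Size  = 64 MB    # wait until roughly this many bytes accumulate
Maximum Group Size  = 100 MB   # hard cap per bundle
Max Bin Age         = 30 sec   # flush anyway after this, so a bin never stalls

With Max Bin Age in place, the shorter-interval workaround and the max-bytes approach stop being mutually exclusive: small loads flush on age, large loads flush on size.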