NiFi Cluster with lots of SUSPENDED, RECONNECTED, LOST events

NiFi Cluster with lots of SUSPENDED, RECONNECTED, LOST events

ddewaele
We have a two-node NiFi cluster running with 3 ZooKeeper instances (replicated)
in a Docker Swarm cluster.

Most of the time the cluster operates fine, but from time to time we notice
that NiFi stops processing messages completely. It eventually resumes after
a while (sometimes after a couple of seconds, sometimes after a couple of
minutes).

When I grep for o.a.n.c.l.e.CuratorLeaderElectionManager in
/srv/nifi/logs/nifi-app.log on the primary node, I see a lot of suspended /
reconnected messages.




Likewise, on the other node I see similar messages:



The only real exceptions I'm seeing in the logs are these:



I also see this on the UI from time to time:

com.sun.jersey.api.client.ClientHandlerException:
java.net.SocketTimeoutException: Read timed out

Is there anything I can do to further debug this?
Is it normal to see that many connection state changes? (The logs are full
of them.)
The solution runs on 3 VMs using Docker Swarm. NiFi is running on 2 of
those 3 VMs, and we have a ZooKeeper ensemble running on all 3 VMs.

I don't see any errors in the ZooKeeper logs.
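For completeness: the NiFi nodes point at all three ZooKeeper instances via
the connect string in nifi.properties, along these lines (the hostnames below
are placeholders for our swarm service names):

nifi.zookeeper.connect.string=zk1:2181,zk2:2181,zk3:2181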






--
View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/NiFi-Cluster-with-lots-of-SUSPENDED-RECONNECTED-LOST-events-tp2194.html
Sent from the Apache NiFi Users List mailing list archive at Nabble.com.

Re: NiFi Cluster with lots of SUSPENDED, RECONNECTED, LOST events

ddewaele
It seems Nabble doesn't send the raw-text-formatted log snippets.

I've added them to this gist:
https://gist.github.com/ddewaele/67ca6cb95b9c894a9eb8d782b2ad99a2



--
View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/NiFi-Cluster-with-lots-of-SUSPENDED-RECONNECTED-LOST-events-tp2194p2195.html
Sent from the Apache NiFi Users List mailing list archive at Nabble.com.

Re: NiFi Cluster with lots of SUSPENDED, RECONNECTED, LOST events

Bryan Bende
There are a couple of settings in nifi.properties related to timeouts
that might be worth increasing and playing with:

nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec

nifi.zookeeper.connect.timeout=3 secs
nifi.zookeeper.session.timeout=3 secs

I would expect the error about "KeeperError = Connection Loss" to be
related to the ZooKeeper timeouts.
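
For example (the numbers here are just illustrative starting points, not
recommendations), a more forgiving nifi.properties might look like:

nifi.cluster.node.connection.timeout=30 sec
nifi.cluster.node.read.timeout=30 sec

nifi.zookeeper.connect.timeout=10 secs
nifi.zookeeper.session.timeout=10 secs

The nodes need to be restarted after editing nifi.properties for the new
values to take effect.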


On Tue, Jun 13, 2017 at 5:38 PM, ddewaele <[hidden email]> wrote:

> Seems nabble doesn't send the raw-text-formatted log snippets.
>
> Added them in this gist :
> https://gist.github.com/ddewaele/67ca6cb95b9c894a9eb8d782b2ad99a2
>
>
>
> --
> View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/NiFi-Cluster-with-lots-of-SUSPENDED-RECONNECTED-LOST-events-tp2194p2195.html
> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.

Re: NiFi Cluster with lots of SUSPENDED, RECONNECTED, LOST events

cupdike
In reply to this post by ddewaele
I was seeing this kind of problem and traced it to a PutHDFS that ran after a
MergeContent (also using Docker). It would run fine until it started writing
the files to HDFS, at which point the network IO overwhelmed the container
(my theory, anyway). By using a shorter time interval I was able to get past
it (smaller, more frequent file writes). I tried to get MergeContent to
bundle based on max bytes but could never get it working right, so I used
time-based bundling instead.
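
Roughly, the time-based bundling looks like this on the MergeContent
processor (property names are from the standard processor; the values here
are illustrative rather than my exact settings):

Merge Strategy: Bin-Packing Algorithm
Minimum Number of Entries: 1
Maximum Number of Entries: 10000
Max Bin Age: 30 sec

so bins get flushed every 30 seconds or so rather than waiting to hit a
size threshold.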



--
View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/NiFi-Cluster-with-lots-of-SUSPENDED-RECONNECTED-LOST-events-tp2194p2203.html
Sent from the Apache NiFi Users List mailing list archive at Nabble.com.