RPG + FlowFiles In

RPG + FlowFiles In

Neil Derraugh
I have a three-node cluster and I am trying to rewrite a dataflow that's used in several places so that the common parts distribute the data across the cluster in a more efficient and load-balanced way.  This is my first experience with RPGs, so I was just starting from the basics and working my way up, but I am just out of the gate and already confused.

Here's the setup.  I have an input port on my root dataflow which points to a LogMessage processor.  In another process group I have an RPG configured with the three endpoints of the cluster separated by commas.  Feeding into that is a GenerateFlowFile processor which is running every 5ms with 9 concurrent tasks on the primary node only.  Everything else has default values.
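
For reference, the URLs property on that RPG looks like this (the hostnames and ports here are placeholders for my actual nodes):

    http://node1:8080/nifi,http://node2:8080/nifi,http://node3:8080/nifi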

When I start the dataflow it more or less works as expected, except that the distribution of FlowFiles looks uneven.  That is, if I look at the Status History of the LogMessage processor and select FlowFiles In, it looks like the two non-primary nodes have the bulk of the FlowFiles moving through them.  I can wrap my head around that.

But then I rewrote it to put a DistributeLoad processor in front of three RPGs, one for each node in the cluster, and left it set to `round robin`.  The FlowFiles In on the LogMessage processor looks exactly the same as before.  The bulk of the FlowFiles In are on the two non-primary nodes.

In 5 minutes about 500K FlowFiles are processed, and the two non-primary nodes process 234,238 and 233,089, while the primary node processes only 47,597.

What am I missing?  Why doesn't a round robin distribute them evenly?

Neil

Re: RPG + FlowFiles In

Kevin Doran

Hi Neil,

 

I am also new to working with RPGs and NiFi clusters, but I know enough about the NiFi Site-to-Site protocol to speculate as to what is going on here (if others on this list more knowledgeable than I am are willing to chime in to confirm or correct this guess, that would be welcome!).

 

If I understand the flow you described where you were attempting to achieve round-robin / even distribution, you have 3 RPGs set up, each one configured to know about 1 node in your cluster. Therefore, the expectation is that putting a DistributeLoad processor (set to round robin) upstream of the RPGs will round-robin the FlowFiles across the nodes. I can see how that would be the expectation given that configuration.
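
To spell that expectation out with a toy example (plain Java, not anything NiFi-specific): a strict round robin hands each FlowFile to the next destination in turn, so the three per-node counts should stay within one of each other.

    import java.util.List;

    // Toy model of DistributeLoad in round robin mode: each FlowFile goes to the
    // next destination in turn, so the per-node counts end up essentially equal.
    public class RoundRobinExpectation {
        public static void main(String[] args) {
            List<String> nodes = List.of("primary", "node2", "node3");
            int[] counts = new int[nodes.size()];

            for (int flowFile = 0; flowFile < 500_000; flowFile++) {
                counts[flowFile % nodes.size()]++;   // strict round robin
            }

            for (int i = 0; i < nodes.size(); i++) {
                System.out.println(nodes.get(i) + ": " + counts[i]);   // ~166,667 each
            }
        }
    }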

 

However, I think it's possible that a little bit more is going on under the hood with the RPG connection(s). If I understand the details correctly, when an RPG is configured, the cluster endpoint(s) you specify are only used to create the initial connection. Once the connection is made, the client asks the cluster endpoint it knows about for all the nodes in the cluster, so that if nodes are added to or removed from the cluster, all connected peers get updated.
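
To make that concrete, here's a rough sketch using the nifi-site-to-site-client library (I'm going from memory on the API, and the URL, port name, and payload below are placeholders, so treat it as a sketch rather than a recipe). Even though only one node's URL is configured, the client fetches the peer list from that node and spreads transactions across every peer it discovers:

    import java.nio.charset.StandardCharsets;
    import java.util.Map;

    import org.apache.nifi.remote.Transaction;
    import org.apache.nifi.remote.TransferDirection;
    import org.apache.nifi.remote.client.SiteToSiteClient;

    public class SiteToSitePeerDiscoverySketch {
        public static void main(String[] args) throws Exception {
            // Only one bootstrap URL is configured, just like an RPG pointed at a
            // single node (hostname and port name are placeholders).
            try (SiteToSiteClient client = new SiteToSiteClient.Builder()
                    .url("http://node1:8080/nifi")
                    .portName("My Input Port")
                    .build()) {

                // Each transaction can land on any node in the cluster, because the
                // client asks node1 for the full peer list and balances across it.
                for (int i = 0; i < 10; i++) {
                    Transaction transaction = client.createTransaction(TransferDirection.SEND);
                    transaction.send(("message " + i).getBytes(StandardCharsets.UTF_8),
                            Map.of("index", String.valueOf(i)));
                    transaction.confirm();
                    transaction.complete();
                }
            }
        }
    }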

 

If that's indeed the case, then the stable state in your flow is 3 RPGs that all know about all 3 nodes in the cluster. That would explain why adding DistributeLoad did not change the behavior you observed in your initial flow (one RPG configured with all three endpoints). If you wanted to further verify this, you could create a flow with a single RPG that is configured for only 1 endpoint in your cluster. Over enough time (after the other nodes in the cluster have been discovered) you should see FlowFiles reach the nodes you did not specify.
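
Another way to check is to ask a node directly which peers it advertises. I believe the REST API exposes this at /nifi-api/site-to-site/peers (worth double-checking against the REST API docs for your NiFi version); a quick client might look like this, with a placeholder hostname:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ListSiteToSitePeers {
        public static void main(String[] args) throws Exception {
            // Ask one node which site-to-site peers it advertises; the response
            // should list every node in the cluster, not just the node queried.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://node1:8080/nifi-api/site-to-site/peers"))
                    .header("Accept", "application/json")
                    // I believe the HTTP transport wants this header; drop it if
                    // your version complains.
                    .header("x-nifi-site-to-site-protocol-version", "1")
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }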

 

As to why you are not seeing even distribution, I'm not sure, as I don't know the specifics of the load-balancing logic for sending files to RPGs. I know it is designed to evenly distribute load over time, so it's possible the time window over which you are collecting stats is smaller than the time period for which the RPG load balancing is optimized. In other words, if you let it run for longer and checked, is the load more evenly distributed? My speculation is that the load balancing is based on a periodic check of how many files have been processed by each node (rather than a check before every send, which would have a lot of overhead), and that the period over which it shifts load between destinations is longer than would show up in your 5-minute window. Again, a lot of guessing on my part. Maybe others can confirm.
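
To illustrate the kind of behavior I'm guessing at (a toy model, not NiFi's actual implementation): if the sender only re-weights its peer choices periodically based on reported load, a short sampling window can look very lopsided even though the distribution evens out over a longer run.

    import java.util.List;
    import java.util.Random;

    // Toy model (not NiFi's code) of periodic, load-based peer weighting: weights
    // are refreshed only every 100,000 FlowFiles, so a short window can look
    // skewed toward whichever peers started with higher weights.
    public class PeriodicallyWeightedSender {
        public static void main(String[] args) {
            List<String> peers = List.of("primary", "node2", "node3");
            int[] sent = new int[peers.size()];
            double[] weights = {0.1, 0.45, 0.45};   // hypothetical starting weights
            Random random = new Random(42);

            for (int flowFile = 0; flowFile < 500_000; flowFile++) {
                if (flowFile > 0 && flowFile % 100_000 == 0) {
                    // Periodic rebalance: shift weight toward the least-loaded peer.
                    int least = 0;
                    for (int i = 1; i < sent.length; i++) {
                        if (sent[i] < sent[least]) least = i;
                    }
                    for (int i = 0; i < weights.length; i++) {
                        weights[i] = (i == least) ? 0.6 : 0.2;
                    }
                }

                // Pick a peer according to the current weights.
                double r = random.nextDouble();
                int target = (r < weights[0]) ? 0 : (r < weights[0] + weights[1]) ? 1 : 2;
                sent[target]++;
            }

            for (int i = 0; i < peers.size(); i++) {
                System.out.println(peers.get(i) + ": " + sent[i]);
            }
        }
    }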

 

I hope this helps. If you have more findings or questions, post them back here.

 

Thanks,
Kevin

 

 
