Merging Small Files

Merging Small Files

Steve Champagne
Hello,

I'm pulling data from API endpoints every five minutes and writing it to HDFS. This, however, is giving me quite a few small files: 288 files per day times however many endpoints I am reading. My current approach is to land the small files in a staging directory under each endpoint's directory. I then have ListHDFS and FetchHDFS processors pulling them back into NiFi so that I can merge them by size; this way the files stay in HDFS while they wait to be merged, so they can be queried at any time. Once a batch gets close to an HDFS block size, I merge it into an archive directory and delete the small files that were merged.
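The merge decision described above can be sketched as a greedy bin-packing pass: accumulate small files into bins and only merge a bin once its total size approaches an HDFS block. This is a hypothetical illustration of the logic, not NiFi's implementation; the names, the 128 MiB block size, and the 90% fullness threshold are assumptions.

```python
# Sketch of a size-based merge plan (illustrative names and thresholds).
HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, the common HDFS default

def plan_merges(file_sizes, block_size=HDFS_BLOCK_SIZE):
    """Greedily group file sizes into bins no larger than block_size.

    Returns a list of bins, each a list of indices into file_sizes.
    Only bins "close" to the block size (here, >= 90% full) are
    returned for merging; smaller bins keep waiting for more files.
    """
    bins, current, current_size = [], [], 0
    for i, size in enumerate(file_sizes):
        # Close the current bin if adding this file would overflow it.
        if current and current_size + size > block_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        bins.append(current)
    # Keep only bins full enough to be worth merging now.
    return [b for b in bins if sum(file_sizes[i] for i in b) >= 0.9 * block_size]
```

For example, five 60 MiB files yield two mergeable bins of two files each, while the fifth file stays behind waiting for more data.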

My biggest problem with this is that I have to pull the files into NiFi, where they might sit for extended periods waiting to be merged. This causes problems that I think are related to those brought up in NIFI-3376: my content repository grows unbounded and fills up my disk.

I was wondering what other patterns people are using for this sort of workload.

Thanks!

Re: Merging Small Files

Joe Witt
Steve

It is a very common case and should work very well.  Bring in the data.  Use MergeContent, even for the long periods required to reach the desired bin size.  Send it somewhere.
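A size-driven MergeContent configuration along the lines Joe describes might look like the following. The values are illustrative, not recommendations; a Max Bin Age bounds how long files can sit in NiFi waiting for a bin to fill.

```
Merge Strategy            : Bin-Packing Algorithm
Merge Format              : Binary Concatenation
Minimum Group Size        : 100 MB      # start merging once a bin nears a block
Maximum Group Size        : 128 MB      # cap near the HDFS block size
Minimum Number of Entries : 1
Maximum Number of Entries : 10000
Max Bin Age               : 24 hours    # flush partial bins eventually
```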

If you are seeing the content repo fill up, then let's look into that.  How large is the content repo?  Is it on its own partition?  What are the nifi.properties repo settings?  When the disk appears full, how much data does NiFi see as active in the flow?
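The nifi.properties settings that usually govern content-repository growth are the archive-related ones below. The values shown are the shipped defaults in NiFi 1.x; worth double-checking against the actual install when diagnosing a full disk.

```
nifi.content.repository.directory.default=./content_repository
nifi.content.repository.archive.enabled=true
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
```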

Thanks
Joe

On Aug 6, 2017 8:29 AM, "Steve Champagne" <[hidden email]> wrote:
[quoted message trimmed]