Best practices for handling large files

Best practices for handling large files

Mike Thomsen
I have one flow that will have to handle files that are anywhere from 500 MB to several GB in size. The current plan is to store them in HDFS or S3 and then bring them down for processing in NiFi. Are there any suggestions on how to handle such large single files?

Thanks,

Mike

Re: Best practices for handling large files

Andy LoPresto-2
Mike,

Are the files a single coherent piece of information (e.g. a video file) or collections of smaller atomic units of data (e.g. CSV or JSON batches)? In the first case, it’s important to ensure that the processors which deal with the content do so in a streaming manner so as not to exhaust your heap (and ensure any custom processors you develop do the same). In the latter case, when splitting and merging these records, we generally propose a two-step approach, where a single giant file is split into medium-sized flowfiles, and then each of those is split into individual records (i.e. 1 * 1MM -> 10 * 100K -> 10 * 100K * 1, as opposed to 1 * 1MM -> 1MM * 1).
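
(A minimal sketch of the streaming approach in a custom processor, assuming the standard ProcessSession / InputStreamCallback API; the class name, relationship, and buffer handling below are made up for illustration, not taken from an existing processor.)

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.InputStreamCallback;

// Hypothetical processor that reads multi-GB content as a stream instead of
// loading it into memory.
public class StreamingExampleProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("FlowFiles whose content was processed as a stream")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        final FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        // session.read() hands us an InputStream over the content repository,
        // so only a small buffer is ever held on the heap, even for multi-GB files.
        session.read(flowFile, new InputStreamCallback() {
            @Override
            public void process(final InputStream rawIn) throws IOException {
                try (final InputStream in = new BufferedInputStream(rawIn)) {
                    final byte[] buffer = new byte[8192];
                    int len;
                    while ((len = in.read(buffer)) != -1) {
                        // handle each chunk here rather than accumulating the whole file
                    }
                }
            }
        });

        session.transfer(flowFile, REL_SUCCESS);
    }
}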

Other than that, be sure to follow the best practices for configuration in the Admin Guide [1] and read about performance expectations [2].

[1] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#configuration-best-practices
[2] https://nifi.apache.org/docs/nifi-docs/html/overview.html#performance-expectations-and-characteristics-of-nifi

Andy LoPresto
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69


Re: Best practices for handling large files

Mike Thomsen
Thanks, that's actually what I ended up doing. In case anyone comes along looking for this, the approach we used for development was:

GetFile -> SplitText (50k chunks) -> SplitText (1 line/flowfile) -> the rest
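
(For anyone reproducing this, the relevant knob is SplitText's Line Split Count property; the values below mirror the flow above and are illustrative, not a recommendation for every data set.)

GetFile
SplitText    Line Split Count = 50000    (first pass: medium-sized chunks)
SplitText    Line Split Count = 1        (second pass: one line per flowfile)
... the rest of the flow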


Re: Best practices for handling large files

Joe Witt
Mike,

A great advance that has occurred with the Apache NiFi 1.2.0 release is support for record readers/writers (controller services) and a set of processors that leverage them. This allows for far more efficient processing and in many cases completely eliminates the past need to split down to single-event flowfiles. Definitely worth a look. Here is a blog from today that highlights it a bit. Happy to talk through your case with you to help see how it can be done using this method. I've got a flow running now where each box is able to run SQL queries against record streams at a rate of several hundred events/sec with full content archive/provenance turned on and live indexing. Far more efficient than the previous approach.

https://blogs.apache.org/nifi/entry/real-time-sql-on-event
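
(As a rough sketch of what a record-based version of the flow could look like; QueryRecord, CSVReader, and JsonRecordSetWriter are components shipped with NiFi 1.2.0, but the query and property values here are purely illustrative.)

GetFile -> QueryRecord -> the rest

QueryRecord
  Record Reader  = CSVReader (controller service)
  Record Writer  = JsonRecordSetWriter (controller service)
  matched        = SELECT * FROM FLOWFILE WHERE status = 'ERROR'
                   (dynamic property; its name becomes an outbound relationship)

The large file is never split: the reader streams records out of the single flowfile and the writer streams the query results back into new content.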

Thanks
Joe
