NiFi 1.3: Simplest way possible of creating CSV files from SQL queries

NiFi 1.3: Simplest way possible of creating CSV files from SQL queries

Márcio Faria
Hi,

I'm looking for the simplest possible way to create CSV files from SQL queries using Apache NiFi 1.3.

The flow I currently have (the files are to be SFTP'ed to a remote server):

ExecuteSQL -> UpdateAttribute -> ConvertRecord [3 CSs] -> PutSFTP

The concept of a SchemaRegistry is new to me, but if I understood it correctly, in order for ConvertRecord to work properly it is necessary to have 3 Controller Services ([3 CSs]) associated with it:
  • AvroSchemaRegistry, with the schema defined as an Avro schema in JSON (see the sketch below);
  • AvroReader, referring to the above schema;
  • CSVRecordSetWriter, also referring to the same schema.
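
For concreteness, here is a minimal sketch of the kind of Avro schema (JSON) the AvroSchemaRegistry entry holds in my setup. The record and field names are placeholders; in practice the fields have to match the columns returned by the query:

{
  "type": "record",
  "name": "query_result",
  "namespace": "example",
  "doc": "Placeholder schema; field names and types must match the columns returned by the query",
  "fields": [
    { "name": "id",        "type": "long" },
    { "name": "name",      "type": ["null", "string"], "default": null },
    { "name": "hire_date", "type": ["null", "string"], "default": null }
  ]
}

In my case the schema is registered as a dynamic property on the AvroSchemaRegistry (property name = schema name, value = schema text), and the UpdateAttribute step only exists to set the flow file's schema.name attribute so the AvroReader and CSVRecordSetWriter can look the schema up.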

There seem to be many benefits to using the schema registry, including versioning, validation, etc., but for my example a simpler configuration would be welcome.

Isn't the schema already defined by ExecuteSQL? Can I have ConvertRecord alone, with no dedicated SchemaRegistry (property), AvroReader (instance), or CSVRecordSetWriter (instance)? Of course, we'd still need to specify that the output is CSV, so perhaps a shared CSVRecordSetWriter that also gets its schema from the flow file would still be useful.

By the way, would the Schema Access Strategy named "Use Embedded Avro Schema" be part of a simpler solution? How?

In the same vein, what about having the schema.name attribute optionally set by ExecuteSQL itself, so we don't have to depend on the UpdateAttribute processor?

In summary, I'm wondering if it's possible to have 3 (+ 1 generic) components instead of 6 per query:

ExecuteSQL -> ConvertRecord [CSVRecordSetWriter] -> PutSFTP

That would make a difference when defining multiple conversions from SQL to CSV, or other equivalent flows.

In addition, consider that someone might want maximum flexibility, meaning it would be totally acceptable to change the query and get a different layout for the resulting CSV file without having to change any SchemaRegistry, Reader, or Writer.
 
I've found a few tickets out there covering a similar topic. In particular, [1] mentions the difficulty with more complex Avro data types. But I don't see that being a blocker when the data source is an old-fashioned SQL query. 

Recommendations?

P.S.1 Maybe templates would save some of the effort, but since Controller Services are "global", I'm still wondering whether having so many parts would make managing lots of flows more difficult than it needs to be.

P.S.2 Will my first flow perform well? I'm wondering if another advantage of using the SchemaRegistry etc. is that it prevents the creation of too many records at once.

Thank you,

Marcio

[1] [NIFI-1372] Create ConvertAvroToCSV - ASF JIRA



RE: [EXT] NiFi 1.3: Simplest way possible of creating CSV files from SQL queries

Peter Wicks (pwicks)

I hate to respond with “me too”, but I haven’t seen a response and this kind of simplification is of interest to me.

The PutDatabaseRecord processor already does something similar, and there I have only needed the AvroReader controller service, without a schema registry.
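
Roughly, that setup is just the following (property names from memory, so treat this as a sketch rather than exact configuration):

AvroReader (controller service)
    Schema Access Strategy = Use Embedded Avro Schema

PutDatabaseRecord
    Record Reader                       = AvroReader (above)
    Statement Type                      = INSERT
    Database Connection Pooling Service = <your DBCPConnectionPool>
    Table Name                          = <target table>

The schema embedded in the Avro coming out of ExecuteSQL is enough for the reader, which is why no registry is needed.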

 

 


RE: [EXT] NiFi 1.3: Simplest way possible of creating CSV files from SQL queries

Joe Witt
There are some great points here. I'm on a phone right now so will be very brief. The current 1.4.0 snapshot on master has some nice improvements for ease of use with record handling, including writers inheriting the schema from the reader, among other things.

Thanks


Re: [EXT] NiFi 1.3: Simplest way possible of creating CSV files from SQL queries

Bryan Bende
I would think we could modify ExecuteSQL to have a property for an optional RecordWriter. It would still create the Avro schema from the DB schema as it does today, and then if the RecordWriter was specified it would use that to write records, otherwise it would write the Avro as it does today. This way you could go straight to CSV.

Without that, as Joe mentioned, with the improvements in master (the 1.4.0 snapshot) you should be able to have an AvroReader using a "Schema Access Strategy" of "Use Embedded Avro Schema" (the schema coming from ExecuteSQL), and then a CSVRecordSetWriter with a "Schema Access Strategy" of "Inherit Record Schema". So you wouldn't need a schema registry at all, and you wouldn't need the UpdateAttribute either.
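
A rough sketch of what that would look like (names per the 1.4.0 snapshot; treat it as an outline, not exact configuration):

AvroReader (controller service)
    Schema Access Strategy = Use Embedded Avro Schema

CSVRecordSetWriter (controller service)
    Schema Access Strategy = Inherit Record Schema

ConvertRecord
    Record Reader = AvroReader
    Record Writer = CSVRecordSetWriter

So the flow becomes ExecuteSQL -> ConvertRecord -> PutSFTP, with just the two controller services and no registry.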

-Bryan

