I want to use Sqoop to import Teradata tables into Impala's Parquet tables. Because Sqoop doesn't support writing Parquet files directly, it SEEMS very promising to use Sqoop HCatalog to write Parquet tables. Unfortunately, after several days of trial and error, I realized that it does not work at this time (CDH 5.1.0 with Hive 0.12).
"Should never be used", you will see this error like this page. Hive 0.13 won't help too. Check out this Jira. This is a HCatalog problem.
I tried this solution (both steps are sketched below):
- Dump the data into an ORCFile table using Sqoop HCatalog.
- Run a Hive query to insert into the Parquet table.
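For reference, the two steps look roughly like this. This is only a sketch: the connection string, credentials, and database/table names are placeholders, and the exact connector options for Teradata depend on your driver setup.

```bash
# Step 1: land the Teradata table in an ORC-backed Hive table via HCatalog.
# URL, user, and table names below are placeholders.
sqoop import \
  --connect jdbc:teradata://td-host/DATABASE=mydb \
  --username etl_user -P \
  --table SOURCE_TABLE \
  --hcatalog-database staging \
  --hcatalog-table source_table_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "STORED AS ORC"

# Step 2: copy the staged rows into the Parquet table that Impala queries.
hive -e "INSERT OVERWRITE TABLE warehouse.source_table_parquet
         SELECT * FROM staging.source_table_orc;"
```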
I hate this solution!!! So I studied the HCatalog code to see why Parquet tables failed, and I figured out a hack that lets me use HCatalog to dump data into Parquet directly.
The basic idea is to extend MapredParquetOutputFormat to support getRecordWriter. Check out my GitHub project hcat-parquet. The readme explains why HCatalog doesn't work with Parquet, and how to use the new FileOutputFormat.
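To give a flavor of the hack, here is a simplified sketch, not the exact code from the repo. The classes used (ParquetRecordWriterWrapper, DataWritableWriteSupport, HiveSchemaConverter) come from the Hive 0.12/0.13 Parquet integration, and reading the schema from the "columns"/"columns.types" properties is an assumption for illustration; see the readme for how the real code recovers the HCatalog schema.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat;
import org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter;
import org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport;
import org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.util.Progressable;

public class HCatParquetOutputFormat extends MapredParquetOutputFormat {

  @Override
  public RecordWriter<Void, ArrayWritable> getRecordWriter(
      FileSystem ignored, JobConf job, String name, Progressable progress)
      throws IOException {
    // Recover the column names and types from the job conf. Where exactly
    // the schema lives differs by version; these property names are an
    // assumption for the sketch.
    List<String> columnNames = Arrays.asList(job.get("columns").split(","));
    List<TypeInfo> columnTypes =
        TypeInfoUtils.getTypeInfosFromTypeString(job.get("columns.types"));

    // Hand the converted Parquet schema to the write support, which is
    // what getHiveRecordWriter does on Hive's normal write path.
    DataWritableWriteSupport.setSchema(
        HiveSchemaConverter.convert(columnNames, columnTypes), job);

    // Return the same wrapper Hive uses instead of throwing
    // "Should never be used".
    return new ParquetRecordWriterWrapper(realOutputFormat, job, name, progress);
  }
}
```

With an output format like this in place, HCatalog can hand records to a real Parquet record writer instead of hitting the exception.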
Hi, can you post the details of your code change? Thanks.
I shared the code on GitHub: https://github.com/bewang-tech/hcat-parquet
Hi Ben, thanks so much for posting your code! Can you confirm if there is support for writing out datetime/timestamp fields in hcat-parquet? Looking at the code I'm guessing no?
It doesn't support DateTime/Timestamp. Parquet itself already supports it, but I don't think CDH 5.1 includes that fix. I always use string for datetime and timestamp. I remember HCatalog uses hive-parquet internally, which doesn't support timestamp as of CDH 5.1.0.
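For example, with Sqoop you can force the Hive-side type to string at import time; the column names here are placeholders:

```bash
sqoop import ... \
  --map-column-hive created_at=string,updated_at=string
```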
Awesome... I assumed Impala timestamp fields should have the DATETIME datatype in the Parquet file... String worked like a charm. Thanks for the pointer!
Spoke too soon... I had the table in Impala as string, that's why it worked.
Also, are you submitting your code to be merged into the mainstream project?!
I would call this a HACK. Because I couldn't wait for the fix, I had to do it myself. I think there should be a better way, but I don't have the time to find it.
Hi Ben,
Thanks for the article. I want to use HCatalog with Parquet and am having a difficult time understanding whether it works or not. Is there still a compatibility issue, or has it been resolved? I want to know why the error "Should never be used" is thrown and how we can get around it to make it work. Or should I use Avro instead? Please help and advise.
Hi Munish,
You can understand this error like this:
1. Writing a Parquet file needs a schema.
2. When you use ParquetOutputFormat#getRecordWriter directly, how could you pass the schema in? This might be why the developers of Parquet chose to throw an exception like "Should never be used" (the relevant Hive code is sketched below).
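The mapred entry point in Hive's MapredParquetOutputFormat receives no table properties, so there is no schema to convert. Paraphrased from the Hive 0.12/0.13 source, it looks like this:

```java
// No table properties here, hence no schema, so the method refuses to run.
@Override
public RecordWriter<Void, ArrayWritable> getRecordWriter(
    FileSystem ignored, JobConf job, String name, Progressable progress)
    throws IOException {
  throw new RuntimeException("Should never be used");
}
```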
HCatalog doesn't support Parquet well. I haven't checked the latest version, but this is a fundamental issue with HCatalog: how to pass the schema to the Parquet writers.
My project is a HACK to make HCatalog work with Parquet tables.