I want to use Sqoop to import Teradata tables into Impala's Parquet tables. Because Sqoop doesn't support writing Parquet files directly, it SEEMS very promising to use Sqoop HCatalog to write Parquet tables. Unfortunately, after several days of trial and error, I realized that it does not work at this time (CDH 5.1.0 with Hive 0.12).
"Should never be used", you will see this error like this page. Hive 0.13 won't help too. Check out this Jira. This is a HCatalog problem.
I tried this solution (both steps are sketched below):
- Dump the data into an ORCFile table using Sqoop HCatalog.
- Run a Hive query to insert into the Parquet table.
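For reference, the two steps look roughly like this. This is only a sketch: the connection string, credentials, and database/table names are placeholders, and the exact connector options for Teradata depend on your driver setup.

```bash
# Step 1: land the Teradata table in an ORC-backed Hive table via HCatalog.
# URL, user, and table names below are placeholders.
sqoop import \
  --connect jdbc:teradata://td-host/DATABASE=mydb \
  --username etl_user -P \
  --table SOURCE_TABLE \
  --hcatalog-database staging \
  --hcatalog-table source_table_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "STORED AS ORC"

# Step 2: copy the staged rows into the Parquet table that Impala queries.
hive -e "INSERT OVERWRITE TABLE warehouse.source_table_parquet
         SELECT * FROM staging.source_table_orc;"
```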
I hate this solution!!! So I studied the HCatalog code to see why Parquet tables failed, and I figured out a hack that lets me use HCatalog to dump data into Parquet directly.
The basic idea is to extend MapredParquetOutputFormat to support getRecordWriter. Check out my GitHub project hcat-parquet. The readme explains why HCatalog doesn't work with Parquet, and how to use the new FileOutputFormat.
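To give a flavor of the hack, here is a simplified sketch, not the exact code from the repo. The classes used (ParquetRecordWriterWrapper, DataWritableWriteSupport, HiveSchemaConverter) come from the Hive 0.12/0.13 Parquet integration, and reading the schema from the "columns"/"columns.types" properties is an assumption for illustration; see the readme for how the real code recovers the HCatalog schema.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat;
import org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter;
import org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport;
import org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.util.Progressable;

public class HCatParquetOutputFormat extends MapredParquetOutputFormat {

  @Override
  public RecordWriter<Void, ArrayWritable> getRecordWriter(
      FileSystem ignored, JobConf job, String name, Progressable progress)
      throws IOException {
    // Recover the column names and types from the job conf. Where exactly
    // the schema lives differs by version; these property names are an
    // assumption for the sketch.
    List<String> columnNames = Arrays.asList(job.get("columns").split(","));
    List<TypeInfo> columnTypes =
        TypeInfoUtils.getTypeInfosFromTypeString(job.get("columns.types"));

    // Hand the converted Parquet schema to the write support, which is
    // what getHiveRecordWriter does on Hive's normal write path.
    DataWritableWriteSupport.setSchema(
        HiveSchemaConverter.convert(columnNames, columnTypes), job);

    // Return the same wrapper Hive uses instead of throwing
    // "Should never be used".
    return new ParquetRecordWriterWrapper(realOutputFormat, job, name, progress);
  }
}
```

With an output format like this in place, HCatalog can hand records to a real Parquet record writer instead of hitting the exception.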
Hi, can you post the details of your code change? Thanks.
I shared the code on GitHub: https://github.com/bewang-tech/hcat-parquet
Hi Ben, thanks so much for posting your code! Can you confirm if there is support for writing out datetime/timestamp fields in hcat-parquet? Looking at the code I'm guessing no?
It doesn't support DateTime/Timestamp. Parquet itself already supports it, but I don't think CDH 5.1 includes that fix. I always use string for datetime and timestamp. I remember HCatalog uses hive-parquet internally, which doesn't support timestamp as of CDH 5.1.0.
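For example, with Sqoop you can force the Hive-side type to string at import time; the column names here are placeholders:

```bash
sqoop import ... \
  --map-column-hive created_at=string,updated_at=string
```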
Awesome... I assumed Impala timestamp fields should have the DATETIME datatype in the Parquet file... String worked like a charm. Thanks for the pointer!
Spoke too soon... I had the table in Impala as string, that's why it worked.
Also, are you submitting your code to be merged into the mainstream project?!
I would call this a HACK. Because I couldn't wait for the fix, I had to do it myself. I think there should be a better way, but I don't have the time to find it.
Hi Ben,
Thanks for the article. I want to use HCatalog with Parquet and am having a difficult time understanding whether it works or not. Is there still a compatibility issue, or has it been resolved? I want to know why the error "Should never be used" is thrown and how we can get around it to make it work. Or should I use Avro instead? Please help and advise.
Hi Munish,
You can understand this error like this:
1. Writing a Parquet file needs a schema.
2. When you use ParquetOutputFormat#getRecordWriter directly, how could you pass the schema in? This might be why the developers of Parquet chose to throw an exception like "Should never be used" (the relevant Hive code is sketched below).
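The mapred entry point in Hive's MapredParquetOutputFormat receives no table properties, so there is no schema to convert. Paraphrased from the Hive 0.12/0.13 source, it looks like this:

```java
// No table properties here, hence no schema, so the method refuses to run.
@Override
public RecordWriter<Void, ArrayWritable> getRecordWriter(
    FileSystem ignored, JobConf job, String name, Progressable progress)
    throws IOException {
  throw new RuntimeException("Should never be used");
}
```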
HCatalog doesn't support Parquet well. I haven't checked the latest version, but this is a fundamental issue with HCatalog: how to pass the schema to the Parquet writers.
My project is a HACK to make HCatalog work with Parquet tables.