Tuesday, July 22, 2014

HCatalog and Parquet

I want to use Sqoop to import Teradata tables into Impala's Parquet tables. Because Sqoop doesn't support writing Parquet files directly, using Sqoop's HCatalog integration to write Parquet tables SEEMED very promising. Unfortunately, after several days of trial and error, I realized that it would not work at this time (CDH 5.1.0 with Hive 0.12).

"Should never be used", you will see this error like this page. Hive 0.13 won't help too. Check out this Jira. This is a HCatalog problem.

I tried this solution:

  • Dump the data into an ORCFile table using Sqoop's HCatalog support.
  • Run a Hive query to insert the data into the Parquet table.

Because Impala doesn't support ORCFile, I have to convert the data into Parquet. So I need 30 minutes to dump a big table to my small cluster, plus another 7 minutes for the Hive insert query. The two steps are sketched below.
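
Here is a minimal sketch of the two steps, driven from Java for illustration. Everything specific in it is a placeholder: the Teradata JDBC URL, credentials, and all database/table names are hypothetical, and I actually ran the steps from the command line, so treat this as a sketch of the workflow rather than the exact commands.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.sqoop.Sqoop;

    public class OrcThenParquet {
      public static void main(String[] args) throws Exception {
        // Step 1: Sqoop import into an HCatalog-managed ORCFile table.
        // The Teradata JDBC driver jar must be on the classpath.
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:teradata://td-host/DATABASE=mydb", // placeholder
            "--driver", "com.teradata.jdbc.TeraDriver",
            "--username", "etl", "--password-file", "/user/etl/.pw",
            "--table", "MY_TABLE",
            "--hcatalog-database", "staging",
            "--hcatalog-table", "my_table_orc",
            "--create-hcatalog-table",
            "--hcatalog-storage-stanza", "stored as orcfile"
        };
        int rc = Sqoop.runTool(sqoopArgs, new Configuration());
        if (rc != 0) {
          throw new RuntimeException("sqoop import failed with code " + rc);
        }

        // Step 2: convert ORC to Parquet with a Hive INSERT ... SELECT,
        // submitted through the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hs2-host:10000/default");
             Statement stmt = conn.createStatement()) {
          stmt.execute("INSERT OVERWRITE TABLE warehouse.my_table_parquet "
              + "SELECT * FROM staging.my_table_orc");
        }
      }
    }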

I hate this solution!!! That is why I studied the HCatalog code to see why Parquet tables fail. I figured out a hack so that I can use HCatalog to dump data into Parquet directly.

The basic idea is to extend MapredParquetOutputFormat to support getRecordWriter. Check out my GitHub project hcat-parquet. The README explains why HCatalog doesn't work with Parquet, and how to use the new FileOutputFormat.
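
If you are curious, here is a minimal sketch of the idea. This is not the actual code from hcat-parquet: the class name is made up, the exact signatures of MapredParquetOutputFormat, ParquetRecordWriterWrapper, and the protected realOutputFormat field vary between Hive/CDH versions, and reading the columns from the "columns"/"columns.types" JobConf properties is an assumption of this sketch.

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat;
    import org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter;
    import org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport;
    import org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper;
    import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
    import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
    import org.apache.hadoop.io.ArrayWritable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordWriter;
    import org.apache.hadoop.util.Progressable;

    public class HCatParquetOutputFormat extends MapredParquetOutputFormat {

      // Override the method that Hive's class deliberately leaves
      // unimplemented (it throws "Should never be used"), because this is
      // the method HCatalog's output container actually calls.
      @Override
      public RecordWriter<Void, ArrayWritable> getRecordWriter(
          FileSystem ignored, JobConf job, String name, Progressable progress)
          throws IOException {
        // getHiveRecordWriter would receive the table properties holding
        // the schema; HCatalog never calls it, so rebuild the Parquet
        // schema here from the standard Hive column properties (assumed
        // to be present in the JobConf).
        List<String> columnNames = Arrays.asList(job.get("columns").split(","));
        List<TypeInfo> columnTypes =
            TypeInfoUtils.getTypeInfosFromTypeString(job.get("columns.types"));

        // Hand the schema to the write support before creating the writer,
        // which is what getHiveRecordWriter normally does.
        DataWritableWriteSupport.setSchema(
            HiveSchemaConverter.convert(columnNames, columnTypes), job);

        return new ParquetRecordWriterWrapper(realOutputFormat, job, name, progress);
      }
    }

The real implementation in the repo derives the schema from HCatalog's output job info instead of these conf keys; see the README for the details.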

11 comments:

  1. Hi, can you post the details of your code change? Thanks.

    Replies
    1. I shared the code on GitHub: https://github.com/bewang-tech/hcat-parquet

  2. This comment has been removed by the author.

  3. Hi Ben, thanks so much for posting your code! Can you confirm if there is support for writing out datetime/timestamp fields in hcat-parquet? Looking at the code I'm guessing no?

    Replies
    1. It doesn't support DateTime/Timestamp. Parquet itself already supports timestamps, but I don't think CDH 5.1 includes that fix, so I always use string for datetime and timestamp columns. I remember HCatalog uses hive-parquet internally, which doesn't support timestamp as of CDH 5.1.0.

    2. Awesome... I assumed Impala timestamp fields should have the DATETIME datatype in the Parquet file... String worked like a charm. Thanks for the pointer!

    3. Spoke too soon... I had the table in Impala as string; that's why it worked.

  4. Also, are you submitting your code to be merged into the mainstream project?!

    Replies
    1. I would call this a HACK. Because I couldn't wait for the fix, I had to do it myself. I think there should be a better way, but I don't have the time to find it.

  5. Hi Ben,
    Thanks for the article. I want to use HCatalog with Parquet and am having a difficult time understanding whether it works or not. Is there still a compatibility issue, or has it been resolved? I want to know why the error "Should never be used" is thrown and how we can get around it to make it work. Or should I use Avro instead? Please help and advise.

  6. Hi Munish,

    You can understand this error like this:
    1. Writing a Parquet file needs a schema.
    2. When you use ParquetOutputFormat#getRecordWriter directly, how could you pass the schema in? This might be why the developers of Parquet chose to throw an exception like "Should never be used".
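
    For reference, the method in question looks roughly like this in the parquet-hive / Hive sources of that era (paraphrased from memory, so check the code of your own version; the schema-aware path is getHiveRecordWriter, which receives the table properties):

        @Override
        public RecordWriter<Void, ArrayWritable> getRecordWriter(
            FileSystem ignored, JobConf job, String name, Progressable progress)
            throws IOException {
          // No table properties reach this method, so the writer cannot
          // know the Parquet schema; hence the unconditional exception.
          throw new RuntimeException("Should never be used");
        }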

    HCatalog doesn't support Parquet well. I haven't checked the latest version, but how to pass the schema to the Parquet writers is a fundamental issue in HCatalog.

    My project is a HACK to make HCatalog work with Parquet tables.
