Thursday, August 14, 2014

Parquet Schema Incompatibility between Pig and Hive

When you use Pig to process data and load the result into a Hive table, you need to be careful about Pig namespaces. A complex Pig script can easily end up with namespace-prefixed column names after a group-by, and parquet.pig.ParquetStorer keeps that prefix in the Parquet schema (a sketch at the end of this post shows how). Unfortunately, Hive's Parquet reader maps table columns to Parquet columns by plain string comparison, so the prefixed names never match. See DataWritableReadSupport.java:
  @Override
  public parquet.hadoop.api.ReadSupport.ReadContext init(final Configuration configuration,
      final Map<String, String> keyValueMetaData, final MessageType fileSchema) {
    final String columns = configuration.get(IOConstants.COLUMNS);
    final Map<String, String> contextMetadata = new HashMap<String, String>();
    if (columns != null) {
      final List<String> listColumns = getColumns(columns);

      final List<Type> typeListTable = new ArrayList<Type>();
      for (final String col : listColumns) {
        // listColumns contains partition columns which are metadata only
        if (fileSchema.containsField(col)) { // containsField returns false: col has no namespace prefix, but the field name written by Pig does
          typeListTable.add(fileSchema.getType(col)); 
        } else {
          // below allows schema evolution
          typeListTable.add(new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, col));
        }
      }
      // ... (rest of init omitted)
and GroupType.java:
  public boolean containsField(String name) {
    return indexByName.containsKey(name);
  }
There is no name resolution at all. So when the names don't match, Hive falls into the schema-evolution branch above and silently reads NULL for every such column. If you want Pig to generate Parquet files that Hive can read, give every column a plain lower-case name (no namespace prefix) before storing, as in the sketch below. I haven't tried whether writing into an HCatalog table works. But HCatalog cannot write Parquet tables in CDH 5.1.0 anyway; see my earlier post for how to fix that.
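Here is a minimal sketch of both the problem and the workaround (the relation, path, and column names are made up for illustration):

-- FLATTEN after a group-by is a common way to pick up
-- namespace-prefixed column names.
raw = LOAD '/data/events' USING PigStorage('\t')
      AS (user:chararray, clicks:long);
grp = GROUP raw BY user;
flat = FOREACH grp GENERATE FLATTEN(raw);

DESCRIBE flat;
-- flat: {raw::user: chararray, raw::clicks: long}
-- ParquetStorer writes the fields as "raw::user" and "raw::clicks",
-- which a Hive table with columns (user, clicks) will never match.

-- Workaround: strip the namespace by renaming every column with AS,
-- using lower-case names, since Hive keeps column names in lower case.
clean = FOREACH flat GENERATE raw::user AS user, raw::clicks AS clicks;
STORE clean INTO '/warehouse/events_parquet'
    USING parquet.pig.ParquetStorer();

After the rename, the Parquet schema carries the plain names user and clicks, and the containsField lookup succeeds.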
