Thursday, August 14, 2014

Parquet Schema Incompatible between Pig and Hive

When you use Pig to process data and load it into a Hive table, you need to be careful about Pig namespaces. Complex Pig scripts often carry a namespace prefix on field names after a group-by (e.g. `alias::field`), and parquet.pig.ParquetStorer keeps that prefix in the Parquet schema. Unfortunately, Hive's Parquet reader maps table columns to Parquet columns by simple string comparison. See:
  public parquet.hadoop.api.ReadSupport.ReadContext init(final Configuration configuration,
      final Map<String, String> keyValueMetaData, final MessageType fileSchema) {
    final String columns = configuration.get(IOConstants.COLUMNS);
    final Map<String, String> contextMetadata = new HashMap<String, String>();
    if (columns != null) {
      final List<String> listColumns = getColumns(columns);

      final List<Type> typeListTable = new ArrayList<Type>();
      for (final String col : listColumns) {
        // listColumns contains partition columns which are metadata only
        if (fileSchema.containsField(col)) { // containsField returns false because col doesn't have the namespace prefix
          typeListTable.add(fileSchema.getType(col));
        } else {
          // below allows schema evolution
          typeListTable.add(new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, col));
        }
      }
      // ...
    }
    // ...
  }

And containsField, in parquet.schema.GroupType, is a plain exact-match lookup:

  public boolean containsField(String name) {
    return indexByName.containsKey(name);
  }
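To see why the lookup misses, here is a minimal, self-contained sketch (plain Java, not Hive's or Parquet's actual classes; the field names are made up) of the exact-string matching that `containsField` performs. `indexByName` is keyed by the full Parquet field name, namespace prefix included:

```java
import java.util.HashMap;
import java.util.Map;

public class FieldLookup {
    public static void main(String[] args) {
        // Parquet schema as written by Pig: field names keep the namespace prefix.
        Map<String, Integer> indexByName = new HashMap<>();
        indexByName.put("grouped::user", 0);
        indexByName.put("grouped::total", 1);

        // Hive asks for the bare column name from the table definition.
        System.out.println(indexByName.containsKey("user"));          // false: the lookup misses
        System.out.println(indexByName.containsKey("grouped::user")); // true: only an exact match works
    }
}
```

Because the miss falls through to the schema-evolution branch above, Hive treats the column as absent from the file and reads NULLs instead of your data.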
There is no name resolution at all. If you want Pig to generate Hive-readable Parquet files, you'd better give each column a plain lower-case name, with no namespace prefix, before storing. I haven't tried whether writing into an HCatalog table works or not. But HCatalog cannot write Parquet tables in CDH 5.1.0; see my blog on how to fix it.
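A sketch of the workaround in Pig Latin (the relation names, column names, and output path are hypothetical; adapt them to your script):

```pig
raw     = LOAD 'events' USING PigStorage('\t') AS (user:chararray, cnt:long);
grouped = GROUP raw BY user;
-- Without a rename, the stored schema would contain prefixed names
-- such as grouped::user, which Hive cannot match to its columns.
counted = FOREACH grouped GENERATE group AS user, SUM(raw.cnt) AS total;
STORE counted INTO '/warehouse/mydb.db/mytable' USING parquet.pig.ParquetStorer;
```

The `AS` clauses in the final FOREACH strip the namespace prefix, so the Parquet schema carries exactly the bare lower-case names Hive will look up.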
