When you use Pig to process data and store it into a Hive table, you need to be careful about Pig namespaces. Complex Pig scripts often carry a namespace prefix on each field after a group-by, and parquet.pig.ParquetStorer keeps that prefix in the Parquet schema (see the Pig sketch further down). Unfortunately, Hive's Parquet support maps each table column to a Parquet column by simply comparing strings. See
DataWritableReadSupport.java
@Override
public parquet.hadoop.api.ReadSupport.ReadContext init(final Configuration configuration,
    final Map<String, String> keyValueMetaData, final MessageType fileSchema) {
  final String columns = configuration.get(IOConstants.COLUMNS);
  final Map<String, String> contextMetadata = new HashMap<String, String>();
  if (columns != null) {
    final List<String> listColumns = getColumns(columns);
    final List<Type> typeListTable = new ArrayList<Type>();
    for (final String col : listColumns) {
      // listColumns contains partition columns which are metadata only
      if (fileSchema.containsField(col)) { // returns false here: the file's fields carry the namespace prefix, but col does not
        typeListTable.add(fileSchema.getType(col));
      } else {
        // below allows schema evolution
        typeListTable.add(new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, col));
      }
    }
and GroupType.java:

public boolean containsField(String name) {
  return indexByName.containsKey(name);
}
There is no name resolution at all.
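For example, here is a minimal sketch (the relation and field names are made up) of how a group-by followed by FLATTEN produces namespaced columns that this lookup can never match:

-- minimal sketch; 'users' and its fields are made-up names
users   = LOAD 'users.tsv' AS (name:chararray, city:chararray);
grouped = GROUP users BY city;
flat    = FOREACH grouped GENERATE FLATTEN(users);
DESCRIBE flat;
-- flat: {users::name: chararray, users::city: chararray}
-- ParquetStorer writes the columns as "users::name" and "users::city",
-- so fileSchema.containsField("name") on the Hive side returns false.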
If you want Pig to generate Hive-readable Parquet files, give each column a plain lower-case name, with no namespace prefix, before storing.
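Continuing the sketch above, a FOREACH ... GENERATE ... AS right before the STORE strips the prefixes:

-- rename to plain lower-case column names before storing
clean = FOREACH flat GENERATE users::name AS name, users::city AS city;
-- the output path is just an example; point it at your table's location
STORE clean INTO '/user/hive/warehouse/mytable' USING parquet.pig.ParquetStorer;

After this, Hive's string comparison finds "name" and "city" in the file schema and reads the columns normally.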
I haven't tried whether writing into an HCatalog table works. Note that HCatalog cannot write Parquet tables in CDH 5.1.0 anyway; see my blog for how to fix that.