Thursday, August 28, 2014

Impala-shell may have a control sequence in its output

Assume you have a table with two rows. What will the output file /tmp/my_table_count contain after the following command? Just "2"? Actually it is not: on my terminal there is also a control sequence "ESC[?1034h" in front of the count.
$ impala-shell -B -q "select count(1) from my_table" > /tmp/my_table_count
$ xxd /tmp/my_table_count
0000000: 1b5b 3f31 3033 3468 320a                 .[?1034h2.
This causes a problem when you use the result in a script that tries to update a partition's numRows in Impala:
  local a=$(impala-shell -B -q "select count(1) from my_table where part_col='2014-08-28'")
  impala-shell -q "alter table my_table partition(part_col='2014-08-28') set tblproperties('numRows'='$a')"
If you run this script, you will get a wrong value for #Rows because of the escape sequence:
Query: show table stats my_table
+------------+-------+--------+--------+--------------+---------+
| part_col   | #Rows | #Files | Size   | Bytes Cached | Format  |
+------------+-------+--------+--------+--------------+---------+
| 2014-08-28 | -1    | 2      | 2.87KB | NOT CACHED   | PARQUET |
| Total      | -1    | 2      | 2.87KB | 0B           |         |
+------------+-------+--------+--------+--------------+---------+
Returned 2 row(s) in 0.06s
You can fix it by clearing TERM like this (the sequence is apparently emitted by the readline library when it initializes the terminal, and an empty TERM suppresses it):
  local a=$(TERM= impala-shell -B -q "select count(1) from my_table where part_col='2014-08-28'")
  impala-shell -q "alter table my_table partition(part_col='2014-08-28') set tblproperties('numRows'='$a')"
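With TERM cleared, the output file should contain nothing but the count. A quick check, assuming the same two-row table as above:
$ TERM= impala-shell -B -q "select count(1) from my_table" > /tmp/my_table_count
$ xxd /tmp/my_table_count
0000000: 320a                                     2.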

Friday, August 15, 2014

Set replication for files in Hadoop

  • Change the replication factor of existing files
    hadoop fs -setrep -R -w 2 /data-dir
    
  • Set replication when loading a file
    hadoop fs -Ddfs.replication=2 -put local_file dfs_file
    
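You can verify the replication factor afterwards. The -stat command prints it with the %r format, and it also appears in the second column of a plain listing (dfs_file and /data-dir as above):
    hadoop fs -stat "%r" dfs_file
    hadoop fs -ls /data-dir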

Thursday, August 14, 2014

Parquet Schema Incompatibility between Pig and Hive

When you use Pig to process data and load it into a Hive table, you need to be careful about Pig namespaces. Complex Pig scripts may carry a namespace prefix on column names after a group-by, and parquet.pig.ParquetStorer keeps that prefix in the Parquet schema. Unfortunately, Hive's Parquet support maps a table column to a Parquet column by simple string comparison. See DataWritableReadSupport.java:
  @Override
  public parquet.hadoop.api.ReadSupport.ReadContext init(final Configuration configuration,
      final Map<String, String> keyValueMetaData, final MessageType fileSchema) {
    final String columns = configuration.get(IOConstants.COLUMNS);
    final Map<String, String> contextMetadata = new HashMap<String, String>();
    if (columns != null) {
      final List<String> listColumns = getColumns(columns);

      final List<Type> typeListTable = new ArrayList<Type>();
      for (final String col : listColumns) {
        // listColumns contains partition columns which are metadata only
        if (fileSchema.containsField(col)) { // containsField returns false because col doesn't have the namespace prefix
          typeListTable.add(fileSchema.getType(col));
        } else {
          // below allows schema evolution
          typeListTable.add(new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, col));
        }
      }
and GroupType.java:
  public boolean containsField(String name) {
    return indexByName.containsKey(name);
  }
There is no name resolution at all. If you want Pig to generate Hive-readable Parquet files, give each column a plain, lower-case name (with no namespace prefix) before storing. I haven't tried whether writing into an HCatalog table works or not, but HCatalog cannot write Parquet tables in CDH 5.1.0 anyway; see my blog post on how to fix that.
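If you want to see whether your files are affected, the parquet-tools jar can print the schema Pig wrote. A rough sketch; the file path and the grp:: prefix below are only illustrative:
  hadoop jar parquet-tools-*.jar schema /user/hive/warehouse/my_table/part-m-00000.parquet
  # a field printed as
  #   optional binary grp::name;
  # will never match the Hive column "name"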