Thursday, July 24, 2014

Teradata: create a table like another

create table target_db.target_table as src_db.src_table with no data;
create table target_db.target_table as (select * from src_db.src_view) with no data;

Tuesday, July 22, 2014

HCatalog and Parquet

I want to use Sqoop to import Teradata tables into Impala's Parquet tables. Because Sqoop doesn't support writing Parquet files directly, using Sqoop's HCatalog integration to write Parquet tables SEEMS very promising. Unfortunately, after several days of trial and error, I realized that it does not work at this time (CDH 5.1.0 with Hive 0.12).

You will see an error like "Should never be used", as on this page. Hive 0.13 won't help either; check out this Jira. This is an HCatalog problem.

I tried this solution:

  • Dump data to an ORCFile table using sqoop hcatalog.
  • Run a Hive query to insert the data into the Parquet table.
Because Impala doesn't support ORCFile, I have to convert the data into Parquet. On my small cluster, dumping a big table takes 30 minutes, and the Hive insert query takes another 7 minutes.
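As a sketch, the two steps look roughly like the following (the database, table, host, and user names are hypothetical, and the exact JDBC options depend on your Teradata connector setup):

```shell
# Step 1: dump the Teradata table into an ORCFile-backed Hive table via HCatalog
# (src_db, staging, staging_orc, td-host, etl_user are all placeholder names)
sqoop import \
  --connect jdbc:teradata://td-host/DATABASE=src_db \
  --username etl_user -P \
  --table src_table \
  --hcatalog-database staging \
  --hcatalog-table staging_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza 'STORED AS ORC'

# Step 2: convert to Parquet with a Hive insert so Impala can read it
hive -e 'INSERT OVERWRITE TABLE staging.target_parquet
         SELECT * FROM staging.staging_orc;'
```

Step 2 is pure overhead: the data is read and rewritten a second time only because Impala cannot read the ORCFile output of step 1.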

I hate this solution!!! That's why I studied the HCatalog code to see why Parquet tables failed. I figured out a hack so that I can use HCatalog to dump data into Parquet directly.

The basic idea is to extend MapredParquetOutputFormat to support getRecordWriter. Check out my GitHub project hcat-parquet. The README explains why HCatalog doesn't work with Parquet and how to use the new FileOutputFormat.
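The shape of the hack is roughly this (a simplified sketch, not the project's exact code; the subclass name and the wrapper constructor details are assumptions based on the Hive 0.12/0.13 Parquet classes):

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.util.Progressable;
import org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat;
import org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper;

/**
 * Sketch: HCatalog writes through the mapred getRecordWriter() method,
 * but Hive's MapredParquetOutputFormat implements that method by throwing
 * "Should never be used" (Hive itself only calls getHiveRecordWriter()).
 * Overriding it to return a real Parquet writer lets HCatalog's record
 * writer container hand rows to Parquet directly.
 */
public class HCatParquetOutputFormat extends MapredParquetOutputFormat {

  @Override
  public RecordWriter<Void, ArrayWritable> getRecordWriter(
      FileSystem fs, JobConf conf, String name, Progressable progress)
      throws IOException {
    // Build the same wrapper getHiveRecordWriter() would construct,
    // instead of throwing. realOutputFormat is the underlying
    // Parquet output format held by the parent class.
    return new ParquetRecordWriterWrapper(realOutputFormat, conf, name, progress);
  }
}
```

The actual project has to deal with a few more details (e.g. passing the table schema through the JobConf so the writer knows the Parquet schema), so treat this as an illustration of the idea rather than a drop-in class.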

Monday, July 21, 2014

Run Spark Shell locally

If you want to run Spark against the local file system, here is a simple way:
HADOOP_CONF_DIR=. MASTER=local spark-shell
If you don't set HADOOP_CONF_DIR, Spark will use /etc/hadoop/conf, which may point to a cluster running in pseudo-distributed mode. When HADOOP_CONF_DIR points to a directory without any Hadoop configuration, the file system will be local. The same trick works for spark-submit.
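For example (the application script and input path here are hypothetical), the same environment trick applied to spark-submit looks like:

```shell
# Point HADOOP_CONF_DIR at an empty directory so Spark finds no cluster
# configuration and falls back to the local file system.
mkdir -p /tmp/empty-hadoop-conf
HADOOP_CONF_DIR=/tmp/empty-hadoop-conf \
  spark-submit --master local my_job.py /path/to/local/input.txt
```

Using an empty directory instead of `.` avoids accidentally picking up stray *-site.xml files in your working directory.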