Monday, July 28, 2014
Links of Hadoop Cluster Hardware
- Hardware
- Virtual Hadoop
- Disk setup
- Apache Hadoop wiki: Setting up Disks for Hadoop
- Hortonworks: Chapter 2. File System Partitioning Recommendations
- HP Solutions for Apache Hadoop
- IBM System x Reference Architecture for Hadoop: IBM InfoSphere BigInsights Reference Architecture
- Dell | Cloudera Solution Reference Architecture v2.1.0
Thursday, July 24, 2014
Teradata create a table like
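A minimal sketch of the Teradata syntax, with hypothetical table names; WITH NO DATA copies only the table definition, while WITH DATA copies the rows as well:

CREATE TABLE new_table AS old_table WITH NO DATA;
CREATE TABLE new_table_full AS old_table WITH DATA;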
Tuesday, July 22, 2014
HCatalog and Parquet
I want to use Sqoop to import Teradata tables into Impala's Parquet tables. Because Sqoop doesn't support writing Parquet files directly, using Sqoop's HCatalog integration to write Parquet tables seems very promising. Unfortunately, after several days of trial and error, I realized that it does not work at this time (CDH 5.1.0 with Hive 0.12).
You will see the error "Should never be used", as described on this page. Hive 0.13 won't help either; check out this Jira. This is an HCatalog problem.
I tried this workaround (sketched after the list):
- Dump the data into an ORCFile table using Sqoop's HCatalog support.
- Run a Hive query to insert from the ORC table into the Parquet table.
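A minimal sketch of the two steps; the host, database, table names, and credentials are placeholders, a Teradata JDBC connector for Sqoop is assumed to be installed, and the Parquet target table is assumed to already exist:

sqoop import \
  --connect jdbc:teradata://td-host/DATABASE=src_db \
  --username etl_user -P \
  --table SRC_TABLE \
  --hcatalog-database staging \
  --hcatalog-table src_table_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile"

hive -e "INSERT OVERWRITE TABLE warehouse.src_table_parquet SELECT * FROM staging.src_table_orc"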
I hate this solution! That is why I studied the HCatalog code to see why the Parquet table failed, and I figured out a hack so that I can use HCatalog to dump data into Parquet directly.
The basic idea is to extend MapredParquetOutputFormat to support getRecordWriter. Check out my GitHub project hcat-parquet. The README explains why HCatalog doesn't work with Parquet and how to use the new FileOutputFormat.
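A rough sketch of the idea against the Hive 0.12/0.13-era API, not the actual hcat-parquet code: the stock getRecordWriter only throws "Should never be used" because Hive itself always goes through getHiveRecordWriter, while HCatalog calls the plain mapred method. The class name and the column properties read from the JobConf below are assumptions:

import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.FileSinkOperator;
import org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

// Hypothetical class name; the real code lives in the hcat-parquet project.
public class ParquetHCatOutputFormat extends MapredParquetOutputFormat {

  @Override
  public RecordWriter<Void, ArrayWritable> getRecordWriter(
      final FileSystem ignored, final JobConf job, final String name,
      final Progressable progress) throws IOException {
    // Hive never calls this method (it uses getHiveRecordWriter), so the parent
    // class just throws "Should never be used". HCatalog does call it, so build
    // the table properties from the job conf and delegate to the Hive writer.
    final Properties tableProps = new Properties();
    tableProps.setProperty("columns", job.get("columns", ""));             // assumption
    tableProps.setProperty("columns.types", job.get("columns.types", "")); // assumption
    final FileSinkOperator.RecordWriter hiveWriter = getHiveRecordWriter(
        job, new Path(name), ArrayWritable.class, false, tableProps, progress);
    // Adapt the Hive-side writer back to the mapred RecordWriter HCatalog expects.
    return new RecordWriter<Void, ArrayWritable>() {
      public void write(final Void key, final ArrayWritable value) throws IOException {
        hiveWriter.write(value);
      }
      public void close(final Reporter reporter) throws IOException {
        hiveWriter.close(false);
      }
    };
  }
}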
Monday, July 21, 2014
Run Spark Shell locally
HADOOP_CONF_DIR=. MASTER=local spark-shell

If you don't set HADOOP_CONF_DIR, Spark uses /etc/hadoop/conf, which may point to a cluster running in pseudo-distributed mode. When HADOOP_CONF_DIR points to a directory without any Hadoop configuration, the file system will be local. The same trick works for spark-submit.
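For spark-submit, a minimal sketch with a hypothetical application class and jar (spark-submit takes the master via --master instead of the MASTER variable):

HADOOP_CONF_DIR=. spark-submit --master local --class com.example.MyApp myapp.jar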