Monday, July 28, 2014
Links of Hadoop Cluster Hardware
- Hardware
- Virtual Hadoop
- Disk setup
- Apache Hadoop wiki: Setting up Disks for Hadoop
- Hortonworks: Chapter 2. File System Partitioning Recommendations
- HP Solutions for Apache Hadoop
- IBM System x Reference Architecture for Hadoop: IBM InfoSphere BigInsights Reference Architecture
- Dell | Cloudera Solution Reference Architecture v2.1.0
Thursday, July 24, 2014
Teradata create a table like
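A minimal sketch of the Teradata syntax, with hypothetical table names; WITH NO DATA copies only the table definition, while WITH DATA copies the rows as well:

CREATE TABLE new_table AS old_table WITH NO DATA;
CREATE TABLE new_table_full AS old_table WITH DATA;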
Tuesday, July 22, 2014
HCatalog and Parquet
I want to use Sqoop to import Teradata tables into Impala's Parquet tables. Because Sqoop doesn't support writing Parquet files directly, using Sqoop's HCatalog integration to write Parquet tables seems very promising. Unfortunately, after several days of trial and error, I realized that it does not work at this time (CDH 5.1.0 with Hive 0.12).
You will see the error "Should never be used", as described on this page. Hive 0.13 won't help either; check out this Jira. This is an HCatalog problem.
I tried this workaround (sketched after the list):
- Dump the data into an ORCFile table using Sqoop's HCatalog support.
- Run a Hive query to insert from the ORC table into the Parquet table.
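A minimal sketch of the two steps; the host, database, table names, and credentials are placeholders, a Teradata JDBC connector for Sqoop is assumed to be installed, and the Parquet target table is assumed to already exist:

sqoop import \
  --connect jdbc:teradata://td-host/DATABASE=src_db \
  --username etl_user -P \
  --table SRC_TABLE \
  --hcatalog-database staging \
  --hcatalog-table src_table_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile"

hive -e "INSERT OVERWRITE TABLE warehouse.src_table_parquet SELECT * FROM staging.src_table_orc"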
I hate this solution! That is why I studied the HCatalog code to see why the Parquet table failed, and I figured out a hack so that I can use HCatalog to dump data into Parquet directly.
The basic idea is to extend MapredParquetOutputFormat to support getRecordWriter. Check out my GitHub project hcat-parquet. The README explains why HCatalog doesn't work with Parquet and how to use the new FileOutputFormat.
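A rough sketch of the idea against the Hive 0.12/0.13-era API, not the actual hcat-parquet code: the stock getRecordWriter only throws "Should never be used" because Hive itself always goes through getHiveRecordWriter, while HCatalog calls the plain mapred method. The class name and the column properties read from the JobConf below are assumptions:

import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.FileSinkOperator;
import org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

// Hypothetical class name; the real code lives in the hcat-parquet project.
public class ParquetHCatOutputFormat extends MapredParquetOutputFormat {

  @Override
  public RecordWriter<Void, ArrayWritable> getRecordWriter(
      final FileSystem ignored, final JobConf job, final String name,
      final Progressable progress) throws IOException {
    // Hive never calls this method (it uses getHiveRecordWriter), so the parent
    // class just throws "Should never be used". HCatalog does call it, so build
    // the table properties from the job conf and delegate to the Hive writer.
    final Properties tableProps = new Properties();
    tableProps.setProperty("columns", job.get("columns", ""));             // assumption
    tableProps.setProperty("columns.types", job.get("columns.types", "")); // assumption
    final FileSinkOperator.RecordWriter hiveWriter = getHiveRecordWriter(
        job, new Path(name), ArrayWritable.class, false, tableProps, progress);
    // Adapt the Hive-side writer back to the mapred RecordWriter HCatalog expects.
    return new RecordWriter<Void, ArrayWritable>() {
      public void write(final Void key, final ArrayWritable value) throws IOException {
        hiveWriter.write(value);
      }
      public void close(final Reporter reporter) throws IOException {
        hiveWriter.close(false);
      }
    };
  }
}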
Monday, July 21, 2014
Run Spark Shell locally
HADOOP_CONF_DIR=. MASTER=local spark-shell

If you don't set HADOOP_CONF_DIR, Spark uses /etc/hadoop/conf, which may point to a cluster running in pseudo-distributed mode. When HADOOP_CONF_DIR points to a directory without any Hadoop configuration, the file system will be local. The same trick works for spark-submit.
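For spark-submit, a minimal sketch with a hypothetical application class and jar (spark-submit takes the master via --master instead of the MASTER variable):

HADOOP_CONF_DIR=. spark-submit --master local --class com.example.MyApp myapp.jar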