Tuesday, December 23, 2014

Spark and Parquet with large block size

One issue I ran into when running a Spark application in yarn-cluster mode is that my executor containers get killed because their memory exceeds the memory limits. The NodeManager's log shows the following messages:

2014-12-22 09:22:48,906 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 26453 for container-id container_1418853356063_0165_01_000005_01: 45.8 GB of 44.5 GB physical memory used; 46.5 GB of 93.4 GB virtual memory used
2014-12-22 09:22:48,907 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Process tree for container: container_1418853356063_0165_01_000005_01 has processes older than 1 iteration running over the configured limit. Limit=47781511168, current usage = 49199300608
2014-12-22 09:22:48,907 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=26453,containerID=container_1418853356063_0165_01_000005_01] is running beyond physical memory limits. Current usage: 45.8 GB of 44.5 GB physical memory used; 46.5 GB of 93.4 GB virtual memory used. Killing container.
Dump of the process-tree for container_1418853356063_0165_01_000005_01 :
        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
        |- 26526 26453 26453 26453 (java) 315870 8116 49777754112 12011233 /usr/java/jdk1.7.0_67-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms45056m -Xmx45056m -Djava.io.tmpdir=/grid/2/yarn/nm/usercache/bwang/appcache/application_1418853356063_0165/container_1418853356063_0165_01_000005_01/tmp org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@hadoop-data:48923/user/CoarseGrainedScheduler 3 hadoop-data 23
        |- 26453 26450 26453 26453 (bash) 1 1 108658688 315 /bin/bash -c /usr/java/jdk1.7.0_67-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms45056m -Xmx45056m  -Djava.io.tmpdir=/grid/2/yarn/nm/usercache/bwang/appcache/application_1418853356063_0165/container_1418853356063_0165_01_000005_01/tmp org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@hadoop-data:48923/user/CoarseGrainedScheduler 3 hadoop-data 23 1> /var/log/hadoop-yarn/container/application_1418853356063_0165/container_1418853356063_0165_01_000005_01/stdout 2> /var/log/hadoop-yarn/container/application_1418853356063_0165/container_1418853356063_0165_01_000005_01/stderr

If an executor is lost, Spark spawns another executor to recover from the failure. In this situation, any RDDs persisted or cached on that executor are lost too, and Spark must re-process them from the beginning, which usually leads to a lengthy processing time. In any case, you should avoid this situation. For example, one of my applications finishes in 5 minutes when no executor is lost, but can take more than 40 minutes otherwise.

Spark builds an in-memory hash when doing groupByKey or combineByKey, which needs a lot of memory. One common suggestion is to use a larger parallelism so that each reduce task is smaller.

If your application reads Parquet files, you must choose executor memory and cores carefully, taking the Parquet block size into account, because reading Parquet files may consume a lot of memory. For example, Impala writes Parquet files with a 1GB HDFS block size by default. My application needs to read 136 Parquet files written by Impala. It runs on 4 nodes with 24 virtual cores each, using "--executor-memory 44G --executor-cores 23 --num-executors 4", and my executors get killed. But with 10 cores per executor ("--executor-cores 10"), everything goes through without a single executor being lost. The reason is that when you read the 136 Parquet files, 23 tasks run concurrently in one executor, so roughly 23GB is allocated just for reading Parquet, while those same 23 tasks also try to build their hashes in memory. With 10 cores, each task gets more memory and has enough room to build its hash. Even though the job is a little slower, it actually saves time because there is no re-computation due to lost executors.
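
To make the parallelism suggestion above concrete, here is a minimal Scala sketch; the input path, the key extraction, and the partition count of 400 are made-up placeholders, not values from my job:

// Sketch only: a pair RDD built from some input (path and parsing are illustrative).
val pairs = sc.textFile("/data/events").map(line => (line.split("\t")(0), 1L))

// Passing an explicit partition count makes each reduce task smaller,
// so the per-task hash map is more likely to fit in executor memory.
val counts = pairs.combineByKey(
  (v: Long) => v,                    // createCombiner
  (acc: Long, v: Long) => acc + v,   // mergeValue
  (a: Long, b: Long) => a + b,       // mergeCombiners
  400)                               // much larger parallelism than the default

// groupByKey accepts a partition count the same way.
val grouped = pairs.groupByKey(400)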

Monday, November 3, 2014

"NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;" when running spark-cassandra-connector

When I tried to update a Cassandra table using spark-cassandra-connector in a Spark application, I ran into this problem. The reason is that multiple versions of com.google.guava:guava exist on the classpath. I'm using CDH 5.1.0 with Spark 1.0.0: Spark uses guava-14.0.1, Hadoop MapReduce uses guava-11.0.2, and spark-cassandra-connector uses guava-15.0.0. Similar issues are reported here: https://github.com/datastax/spark-cassandra-connector/issues/292 and https://github.com/datastax/spark-cassandra-connector/issues/326. I tried spark.files.userClassPathFirst=true and hit other errors. I tried putting the guava-15.0 jar on SPARK_CLASSPATH; the driver side didn't report an error, but it failed on the worker side. The actual solution is very simple: in your Spark project, exclude guava from spark-cassandra-connector.
  
  <dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_${scala.major}</artifactId>
    <exclusions>
      <exclusion>
        <groupId>com.codahale.metrics</groupId>
        <artifactId>metrics-core</artifactId>
      </exclusion>
      <exclusion>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
      </exclusion>
    </exclusions>
    <version>1.1.0-beta1</version>
  </dependency>
  
When you run spark-submit, don't add guava-15.0 to --jars or the classpath.
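
For reference, the kind of table update that exercised the connector in my application looks roughly like the sketch below; the connection host, keyspace, table, and column names are made-up placeholders:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Made-up connection settings; point this at your own Cassandra cluster.
val conf = new SparkConf()
  .setAppName("cassandra-update")
  .set("spark.cassandra.connection.host", "cassandra-host")
val sc = new SparkContext(conf)

// Hypothetical keyspace/table with columns (id text, score int).
val rows = sc.parallelize(Seq(("a", 1), ("b", 2)))
rows.saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "score"))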

Friday, October 24, 2014

Build scalatest-eclipse-plugin for Scala-IDE 4.0.0 RC1

I upgraded Scala-IDE in Eclipse Luna to 4.0.0 RC1, but didn't find the ScalaTest plugin. It is inconvenient and time-consuming to run "mvn test" every time I change a test, so I decided to build the ScalaTest plugin myself. It turns out to be pretty easy; here is how:
  • Clone from github: git clone https://github.com/scalatest/scalatest-eclipse-plugin.git
  • Use branch kepler-nightly-2.11: git checkout -b nightly-2.11 kepler-nightly-2.11
  • Change luna profile in scalatest-eclipse-plugin/pom.xml: replace "ecosystem/next" with "sdk".
        
    <profile>
      <id>scala-ide-4.0-2_11-luna</id>
      <properties>
        <!-- property element names as in the project's pom.xml (stripped in this excerpt) -->
        2.11.2
        2_11
        2.11
        v-4
        ${repo.scala-ide.root}/sdk/lithium/e44/scala211/dev/site/
      </properties>
    </profile>
  • Build: mvn clean package
  • Install in eclipse, use "file:/${path}/scalatest-eclipse-plugin/org.scala-ide.sdt.scalatest.update-site/target/site/" in "Install new software ..."

Wednesday, October 22, 2014

Solarized Light (Scala) for Eclipse

I just installed Scala-IDE 4.0.0-RC1 on my Eclipse Luna. I really like the "Solarized Light (Scala)" color theme shown in the release notes.
  • You need to install the "Eclipse Color Theme" plugin, otherwise you will not find "Preferences->General->Appearance->Color Theme".
  • You need to download the theme from this page. Just click "Eclipse Color Theme (XML) - for Eclipse Color Theme Plugin".
  • Go to "Preferences->General->Appearance->Color Theme", click the "Import a theme" button, choose the downloaded file, and import it.
  • Then you may find that not all syntax colors are the same as in the picture in the release notes. For example, class and trait names are not colored. If so, check the configuration in "Preferences->Scala->Editor->Syntax Coloring". Not all "Semantic" items are enabled by default. Just enable them and everything will be colored.

Thursday, October 16, 2014

Add Terminals View

I really like Eclipse Luna's new "Terminals" feature. It is convenient because you don't have to leave Eclipse and open a separate terminal to type your commands, and Mylyn tasks still record the time you spend on the command line. Each time you click "Window->Show View->Other->Terminals", a new Terminals view is created, and you can open one or more terminals in that view. I prefer to open ONLY one terminal and run GNU screen in it, so that I can switch to a new screen window or scroll up and down without touching the mouse. There is another benefit: even if you restart Eclipse, you won't lose your screen session. Just open a new Terminals view and resume the session with "screen -R".

There is one issue, though: when you are in a different perspective and try to "show view", a new blank Terminals view is created, and the terminals you already opened don't show up. How can the screen session be shared among different perspectives, for example java/java ee/scala? Here is how you can fix it: don't stop there; click the "Open a terminal" button in the blank Terminals view and click OK to create a new local terminal. In my Eclipse, a view "Terminals 1" then shows up with two terminals, one running GNU screen and the other newly created. Once you have the "Terminals 1" view, you can close the blank Terminals view. This sounds like a bug to me.

Thursday, August 28, 2014

Impala-shell may have a control sequence in its output

Assume you have a table with three rows. What is the result in the output file /tmp/my_table_count? "3"? Actually it is not: there is a control sequence "ESC[?1034h" in front of it on my machine.
$ impala-shell -B -q "select count(1) from my_table" > /tmp/my_table_count
$ xxd /tmp/my_table_count
0000000: 1b5b 3f31 3033 3468 320a                 .[?1034h2.
This causes a problem when you use the result in a script that tries to update a partition's numRows in Impala.
  local a=$(impala-shell -B -q "select count(1) from my_table where part_col='2014-08-28'")
  impala-shell -q "alter table my_table partition(part_col='2014-08-20') set tblproperties('numRows'='$a')"
If you run this script, you will get the wrong value for #Rows due to the escape control sequence.
Query: show table stats my_table
+------------+-------+--------+--------+--------------+---------+
| part_col   | #Rows | #Files | Size   | Bytes Cached | Format  |
+------------+-------+--------+--------+--------------+---------+
| 2014-08-28 | -1    | 2      | 2.87KB | NOT CACHED   | PARQUET |
| Total      | -1    | 2      | 2.87KB | 0B           |         |
+------------+-------+--------+--------+--------------+---------+
Returned 2 row(s) in 0.06s
You can fix it by unsetting TERM like this:
  local a=$(TERM= impala-shell -B -q "select count(1) from my_table where part_col='2014-08-28'")
  impala-shell -q "alter table my_table partition(part_col='2014-08-20') set tblproperties('numRows'='$a')"

Friday, August 15, 2014

Set replication for files in Hadoop

  • Change the replication factor of existing files
    hadoop fs -setrep -R -w 2 /data-dir
    
  • Set replication when loading a file
    hadoop fs -Ddfs.replication=2 -put local_file dfs_file
    

Thursday, August 14, 2014

Parquet Schema Incompatible between Pig and Hive

When you use Pig to process data and put it into a Hive table, you need to be careful about Pig namespaces. Complex Pig scripts may end up with namespace prefixes on column names after a group-by, and parquet.pig.ParquetStorer keeps the namespace prefix in the schema. Unfortunately, Hive's Parquet support maps a table column to a Parquet column by simple string comparison. See DataWritableReadSupport.java:
  @Override
  public parquet.hadoop.api.ReadSupport.ReadContext init(final Configuration configuration,
      final Map<String, String> keyValueMetaData, final MessageType fileSchema) {
    final String columns = configuration.get(IOConstants.COLUMNS);
    final Map<String, String> contextMetadata = new HashMap<String, String>();
    if (columns != null) {
      final List<String> listColumns = getColumns(columns);

      final List<Type> typeListTable = new ArrayList<Type>();
      for (final String col : listColumns) {
        // listColumns contains partition columns which are metadata only
        if (fileSchema.containsField(col)) { // containsField returns false because col doesn't have the namespace prefix
          typeListTable.add(fileSchema.getType(col));
        } else {
          // below allows schema evolution
          typeListTable.add(new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, col));
        }
      }
and GroupType.java:
  public boolean containsField(String name) {
    return indexByName.containsKey(name);
  }
There is no name resolution at all. If you want Pig to generate Hive-readable Parquet files, you'd better give each column a plain lower-case name before storing. I haven't tried whether writing into an HCatalog table works or not, but HCatalog cannot write Parquet tables in CDH 5.1.0 anyway; see my post on HCatalog and Parquet for how to fix it.

Thursday, July 24, 2014

Teradata create a table like

create table target_db.target_table as src_db.src_table with no data;
create table target_db.target_table as (select * from src_db.src_view) with no data;

Tuesday, July 22, 2014

HCatalog and Parquet

I want to use Sqoop to import Teradata tables into Impala's Parquet tables. Because Sqoop doesn't support writing Parquet files directly, it SEEMS very promising to use Sqoop's HCatalog integration to write Parquet tables. Unfortunately, after several days of trial and error, I realized that it does not work at this time (CDH 5.1.0 with Hive 0.12).

"Should never be used", you will see this error like this page. Hive 0.13 won't help too. Check out this Jira. This is a HCatalog problem.

I tried this solution:

  • Dump data to an ORCFile table using sqoop hcatalog.
  • Run a hive query to insert into the Parquet table.
Because Impala doesn't support ORCFile, I have to convert the data into Parquet. So I need 30 minutes to dump a big table to my small cluster and another 7 minutes for the Hive insert query.

I hate this solution!!! That is why I studied the HCatalog code to see why the Parquet table failed, and I figured out a hack so that I can use HCatalog to dump data into Parquet directly.

The basic idea is to extend MapredParquetOutputFormat to support getRecordWriter. Check out my GitHub project hcat-parquet. The README explains why HCatalog doesn't work with Parquet and how to use the new FileOutputFormat.

Monday, July 21, 2014

Run Spark Shell locally

If you want to run Spark against the local file system, here is a simple way:
HADOOP_CONF_DIR=. MASTER=local spark-shell
If you don't set HADOOP_CONF_DIR, Spark will use /etc/hadoop/conf, which may point to a cluster running in pseudo-distributed mode. When HADOOP_CONF_DIR points to a directory without any Hadoop configuration, the default file system is the local one. The same trick works for spark-submit.
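
Once the shell starts this way, paths resolve against the local file system. A quick sanity check (the file path below is just an example):

// In the spark-shell started above, "sc" is already defined.
// With an empty HADOOP_CONF_DIR, this reads from the local file system, not HDFS.
val lines = sc.textFile("/tmp/some_local_file.txt")
println(lines.count())

// You can also make the scheme explicit:
val same = sc.textFile("file:///tmp/some_local_file.txt")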

Thursday, April 10, 2014

Git subtree to manage vim plugins

I found these blog posts, http://blogs.atlassian.com/2013/05/alternatives-to-git-submodule-git-subtree and http://endot.org/2011/05/18/git-submodules-vs-subtrees-for-vim-plugins/, about how to use git subtree. It seems straightforward, but I actually ran into a lot of errors before I successfully created a repository.
  • You need to create a git repository with 'git init' in the directory that will hold all the plugins before running 'git subtree'.
    [ben@localhost vim-plugins]$ git subtree add --prefix ./vim-pathogen https://github.com/tpope/vim-pathogen.git master --squash
    fatal: Not a git repository (or any parent up to mount point /home)
    Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
    [ben@localhost vim-plugins]$ git init
    Initialized empty Git repository in /home/ben/git/vim-plugins/.git/
    
  • You have to have at least one commit.
    [ben@localhost vim-plugins]$ git subtree add --prefix ./vim-pathogen https://github.com/tpope/vim-pathogen.git master --squash
    fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
    Use '--' to separate paths from revisions, like this:
    'git  [...] -- [...]'
    Working tree has modifications.  Cannot add.
    [ben@localhost vim-plugins]$ vi README.md
    [ben@localhost vim-plugins]$ git add .
    [ben@localhost vim-plugins]$ git commit -m "initial commit"
    [master (root-commit) 70de380] initial commit
     1 file changed, 2 insertions(+)
     create mode 100644 README.md
    
  • Then you might see this error. What's wrong? Don't prefix the path with './'. I tried '$PWD/vim-pathogen' and got the same error. I'm using Git 1.8.3 on Fedora 19; not sure whether it is a version issue or not.
    [ben@localhost vim-plugins]$ git subtree add --prefix ./vim-pathogen https://github.com/tpope/vim-pathogen.git master 
    git fetch https://github.com/tpope/vim-pathogen.git master
    warning: no common commits
    remote: Counting objects: 499, done.
    remote: Compressing objects: 100% (235/235), done.
    remote: Total 499 (delta 148), reused 499 (delta 148)
    Receiving objects: 100% (499/499), 82.10 KiB | 0 bytes/s, done.
    Resolving deltas: 100% (148/148), done.
    From https://github.com/tpope/vim-pathogen
     * branch            master     -> FETCH_HEAD
    error: Invalid path './vim-pathogen/CONTRIBUTING.markdown'
    error: Invalid path './vim-pathogen/README.markdown'
    error: Invalid path './vim-pathogen/autoload/pathogen.vim'
    error: pathspec 'vim-pathogen' did not match any file(s) known to git.
    
    [ben@localhost vim-plugins]$ git subtree add --prefix vim-pathogen https://github.com/tpope/vim-pathogen.git master 
    git fetch https://github.com/tpope/vim-pathogen.git master
    From https://github.com/tpope/vim-pathogen
     * branch            master     -> FETCH_HEAD
    Added dir 'vim-pathogen'
    
  • You can define a repository with 'git remote add', then use the name instead of the URL.
    [ben@localhost vim-plugins]$ git remote add hive https://github.com/autowitch/hive.vim.git
    [ben@localhost vim-plugins]$ git subtree add -P hive.vim hive master
    git fetch hive master
    From https://github.com/autowitch/hive.vim
     * branch            master     -> FETCH_HEAD
    Added dir 'hive.vim'
    
  • You cannot use 'tabular/master' unless you fetch tabular first.
    [ben@localhost vim-plugins]$ git remote add tabular https://github.com/godlygeek/tabular.git
    [ben@localhost vim-plugins]$ git subtree add -P tabular tabular/master
    'tabular/master' does not refer to a commit
    
    [ben@localhost vim-plugins]$ git fetch tabular
    warning: no common commits
    remote: Reusing existing pack: 131, done.
    remote: Total 131 (delta 0), reused 0 (delta 0)
    Receiving objects: 100% (131/131), 32.36 KiB | 0 bytes/s, done.
    Resolving deltas: 100% (48/48), done.
    From https://github.com/godlygeek/tabular
     * [new branch]      gtabularize -> tabular/gtabularize
     * [new branch]      master     -> tabular/master
     * [new branch]      pattern_reuse -> tabular/pattern_reuse
     * [new branch]      trim_1st_field -> tabular/trim_1st_field
    [ben@localhost vim-plugins]$ git subtree add -P tabular tabular/master
    Added dir 'tabular'
    
  • Create a file at ~/git/vim-plugins/.vimrc like the one below, then symlink ~/.vimrc to it.
    source ~/git/vim-plugins/vim-pathogen/autoload/pathogen.vim
    execute pathogen#infect('~/git/vim-plugins/{}')
    
    syntax on
    filetype plugin indent on
    set sw=2
    
    augroup filetypedetect
      au BufNewFile,BufRead *.pig set filetype=pig syntax=pig
      au BufNewFile,BufRead *.hql set filetype=hive syntax=hive
    augroup END
    
  • qgit works with subtree.
  • Because you have to pass '--prefix name' to each subtree command, you cannot pull all repos at once. There is no 'git subtree pull-all' yet; see http://ruleant.blogspot.nl/2013/06/git-subtree-module-with-gittrees-config.html.

Tuesday, March 25, 2014

"Cannot get schema from loadFunc parquet.pig.ParquetLoader"

If you get this error
2014-03-25 14:17:48,933 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: Cannot get schema from loadFunc parquet.pig.ParquetLoader
Details at logfile: /xxxx/pig_1395782266074.log
check the log file immediately, because it is possible that the location in "alias = load 'location' using parquet.pig.ParquetLoader()" doesn't exist:
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/tmp/seq-part/2014-03-24
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:243)
        at parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:291) 

Sunday, March 23, 2014

Add new hard disk on CentOS6

  • Add a new hard disk.
  • Format the new disk with a "Master boot record" partition table using the disk tool.
  • Start system-config-lvm.
  • Initialize the new disk (initialize entity).
  • Add it to a volume group.
  • Select the logical volume in "Volume Groups->group->Logical View->lv_root", e.g. lv_root, and edit its properties.
  • Use the remaining space.

Friday, March 14, 2014

Run hadoop shell command Super Fast

If you run Hadoop shell commands on the console or use them in a script, you will hate how slow they are, because a JVM is loaded and started for every command. A command like "hadoop fs -ls /tmp/abc" usually takes 3~4 seconds on my VirtualBox VM running CentOS 6.5 with 8 virtual cores and 12GB of memory.

$ time hadoop fs -ls /tmp/abc
Found 2 items
drwxrwxrwx   - bwang supergroup          0 2014-03-10 16:25 /tmp/abc/2014-03-10
drwxr-xr-x   - bwang supergroup          0 2014-03-14 14:57 /tmp/abc/567

real 0m3.632s
user 0m4.146s
sys 0m2.650s

I have been curious whether Nailgun could help me save time when running those commands. I finally figured it out today, and it turns out to be pretty easy.

  • Install Nailgun: just clone it from GitHub and follow the instructions in README.md. I only ran "mvn clean package" and "make".
    $ cd ~/git/nailgun
    $ mvn clean package
    $ make
    $ ls
    Makefile        nailgun-examples  ng       README.md
    nailgun-client  nailgun-server    pom.xml
    $ ls nailgun-server/target/
    apidocs                 nailgun-server-0.9.2-SNAPSHOT.jar
    classes                 nailgun-server-0.9.2-SNAPSHOT-javadoc.jar
    javadoc-bundle-options  nailgun-server-0.9.2-SNAPSHOT-sources.jar
    maven-archiver          surefire
    maven-status
    
  • Start the Nailgun server: the trick is that you need to put the Hadoop classpath on the server's classpath.
    $ java -cp `hadoop classpath`:/home/bwang/git/nailgun/nailgun-server/target/nailgun-server-0.9.2-SNAPSHOT.jar com.martiansoftware.nailgun.NGServer
    NGServer 0.9.2-SNAPSHOT started on all interfaces, port 2113.
    
  • Set up aliases: with aliases you can type the same hadoop shell commands as before, but they run through Nailgun.
    $ alias hadoop='$HOME/git/nailgun/ng'
    $ hadoop ng-alias fs org.apache.hadoop.fs.FsShell
    $ hadoop ng-alias
    fs              org.apache.hadoop.fs.FsShell                      
    
    ng-alias        com.martiansoftware.nailgun.builtins.NGAlias      
                    Displays and manages command aliases
    
    ng-cp           com.martiansoftware.nailgun.builtins.NGClasspath  
                    Displays and manages the current system classpath
    
    ng-stats        com.martiansoftware.nailgun.builtins.NGServerStats
                    Displays nail statistics
    
    ng-stop         com.martiansoftware.nailgun.builtins.NGStop       
                    Shuts down the nailgun server
    
    ng-version      com.martiansoftware.nailgun.builtins.NGVersion    
                    Displays the server version number.
    $ time hadoop fs -ls /tmp/abc
    Found 2 items
    drwxrwxrwx   - bwang supergroup          0 2014-03-10 16:25 /tmp/abc/2014-03-10
    drwxr-xr-x   - bwang supergroup          0 2014-03-14 14:57 /tmp/abc/567
    
    real    0m0.046s
    user    0m0.000s
    sys     0m0.008s
    
  • Create a small shell script so that you don't have to remember those long commands.

Thursday, March 13, 2014

Parquet "java.lang.NoClassDefFoundError: org/apache/thrift/TEnum"

If you encounter this problem when using Cloudera parcels, here is the solution, according to this page:
org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoClassDefFoundError: org/apache/thrift/TEnum
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at parquet.hadoop.metadata.CompressionCodecName.(CompressionCodecName.java:21)
 at parquet.hadoop.ParquetOutputFormat.getCodec(ParquetOutputFormat.java:217)
 at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:254)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getRecordWriter(PigOutputFormat.java:84)
 at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.(ReduceTask.java:562)
 at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:636)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:404)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:160)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:155)
Caused by: java.lang.ClassNotFoundException: org.apache.thrift.TEnum
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 24 more
$ cd /opt/cloudera/parcels/CDH/lib/parquet
$ ls original-parquet*.jar
# delete the jars
$ rm original-parquet*.jar
$ cd /opt/cloudera/parcels/CDH/lib/hadoop
# delete those symlinks
$ rm original-parquet*.jar

Thursday, February 27, 2014

Yarn MapReduce Log Level

When you write a Pig or Hive UDF, a debug log can be very useful, and you don't have to ask your administrator for help. Just set the property 'mapreduce.map.log.level' or 'mapreduce.reduce.log.level' to 'DEBUG', and that is it.
set mapreduce.map.log.level 'DEBUG'

register '/home/bwang/workspace/pig-scratch/target/pig-scratch-0.0.1-SNAPSHOT.jar';

define asByteArray com.mycompany.pig.SaveObjectAsByteArray();

d_val = load '/tmp/double.txt' as (id: chararray, val: double);

dba_val = foreach d_val generate flatten(asByteArray(*)), 0 as pos;

g = group dba_val all;

dump g;

But this is not perfect in that you will see a lot of Hadoop DEBUG log information too when you check the syslog of a map task.

Can I output the DEBUG log only for my own classes? log4j definitely supports that, but it is not as straightforward in a YARN MapReduce job as you might think.

If you read MRApps.addLog4jSystemProperties, you will find that the log4j.configuration is actually hard coded to 'container-log4j.properties', which is packed in hadoop-yarn-server-nodemanager-2.0.0-cdh4.5.0.jar.

I found a way to fool the NodeManager into setting the log level just for my UDF class:

  • Find container-log4j.properties in maven dependencies.
  • Copy the content to a property file, e.g., as_byte_array.log4j.properties.
  • Add 'log4j.logger.com.mycompany.pig=${com.mycompany.pig.logger}' into as_byte_array.log4j.properties.
  • Build a package and make sure as_byte_array.log4j.properties in your udf jar.
  • Change the pig script like this:
    set mapreduce.map.log.level 'INFO,CLA -Dlog4j.configuration=as_byte_array.log4j.properties -Dcom.mycompany.pig.logger=DEBUG'
    
    This is totally a HACK. If you check 'ps -ef | grep mycompany' when the map task is running, you will see something like this:
    /usr/java/jdk1.7.0_25/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx825955249 -Djava.io.tmpdir=/yarn/nm/usercache/bwang/appcache/application_1393480846083_0011/container_1393480846083_0011_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.mapreduce.container.log.dir=/var/log/hadoop-yarn/container/application_1393480846083_0011/container_1393480846083_0011_01_000002 -Dyarn.app.mapreduce.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dlog4j.configuration=as_byte_array.log4j.properties -Dcom.mycompany.pig.logger=DEBUG,CLA org.apache.hadoop.mapred.YarnChild 127.0.0.1 60261 attempt_1393480846083_0011_m_000000_0 2
    
    Basically I inject a new log4j.configuration that points to my own log4j.properties, which overrides container-log4j.properties because it appears later on the command line. And "-Dcom.mycompany.pig.logger=DEBUG" lets me control the log level for my UDF.

Wednesday, January 22, 2014

Start screen with different windows in different working directories

Create ~/.screenrc with the following lines

chdir $HOME/workspace/core-project
screen -t core
chdir $HOME/workspace/hadoop-project
screen -t hdp
When you run screen, there are two windows named "core" and "hdp"; the working directory is "$HOME/workspace/core-project" in the "core" window and "$HOME/workspace/hadoop-project" in the "hdp" window.

Using .screenrc means every screen session will have the same windows. If that is not what you want, use an alias and a customized screenrc file like this:

$ alias xxxscr='screen -D -R -S xxx -c ~/.xxx.screenrc'
$ xxxscr

You can quit and kill all windows using this key CTRL+A+|

Sunday, January 19, 2014

Maven Test

  • Don't run unit tests, run integration tests directly
    $ mvn test-compile failsafe:integration-test
    
  • Run a single integration test in a multi-module project.
    $ mvn -am -pl :sub-module -Dit.test=MyIntegrationTest -DfailIfNoTests=false test-compile failsafe:integration-test
    
    • -am (also-make) builds the required modules from the working tree, so dependencies resolve without installing them first.
    • -pl :sub-module selects the module to test.
    • -DfailIfNoTests=false prevents the build from failing in modules that have no matching tests.