Tuesday, March 25, 2014

"Cannot get schema from loadFunc parquet.pig.ParquetLoader"

If you get this error
2014-03-25 14:17:48,933 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: Cannot get schema from loadFunc parquet.pig.ParquetLoader
Details at logfile: /xxxx/pig_1395782266074.log
check the log file right away: the location in "alias = load 'location' using parquet.pig.ParquetLoader" may not exist. In that case the log file contains a cause like this:
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/tmp/seq-part/2014-03-24
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:243)
        at parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:291) 
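
One way to catch this before Pig does is to test the path first. This is a minimal sketch, assuming the hadoop command is on your PATH; require_input is a hypothetical helper name, and the path in the usage comment is simply the one from the log above.

```shell
#!/bin/sh
# require_input: hypothetical helper, not part of Pig or Hadoop.
# Succeeds if the given path exists according to "hadoop fs -test -e";
# otherwise prints a message to stderr and returns 1.
require_input() {
    if hadoop fs -test -e "$1"; then
        return 0
    fi
    echo "Input path does not exist: $1" >&2
    return 1
}

# Usage (myscript.pig is a placeholder for your own script):
#   require_input file:/tmp/seq-part/2014-03-24 && pig -x local myscript.pig
```

Pass the path exactly as it appears in the log, including the scheme (file:/ here), so that the check looks at the same file system Pig reads from.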

Sunday, March 23, 2014

Add new hard disk on CentOS6

  • Add a new hard disk.
  • Use the disk tool to format the hard disk with "Master Boot Record".
  • Start system-config-lvm.
  • Initialize the entity.
  • Add it to a volume group.
  • Select the volume in "Volume Groups->group->Logical View->lv_root", e.g. lv_root, and edit its properties.
  • Choose "Use remaining".
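
For reference, the GUI steps above correspond roughly to a handful of LVM commands. The sketch below only prints them for review and does not run anything. The device /dev/sdb1 and the volume group name vg_example are assumptions (check yours with "fdisk -l" and "vgs"), and resize2fs assumes an ext3/ext4 file system.

```shell
#!/bin/sh
# lvm_extend_plan: hypothetical helper that PRINTS the commands, nothing more.
#   $1: new device or partition (assumed, e.g. /dev/sdb1)
#   $2: volume group name       (assumed, e.g. vg_example)
#   $3: logical volume name     (e.g. lv_root)
lvm_extend_plan() {
    echo "pvcreate $1"                        # "initialize entity"
    echo "vgextend $2 $1"                     # "add to a volume group"
    echo "lvextend -l +100%FREE /dev/$2/$3"   # "Use remaining"
    echo "resize2fs /dev/$2/$3"               # grow the file system to match
}

lvm_extend_plan /dev/sdb1 vg_example lv_root
```

Review the printed commands against your own device and volume names before running any of them as root.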

Friday, March 14, 2014

Run hadoop shell command Super Fast

If you run Hadoop shell commands on the console or use them in a script, you will hate how slow they are, because every command loads and starts a new JVM. A command like "hadoop fs -ls /tmp/abc" usually takes 3~4 seconds on my VirtualBox VM running CentOS 6.5 with 8 virtual cores and 12 GB of memory.

$ time hadoop fs -ls /tmp/abc
Found 2 items
drwxrwxrwx   - bwang supergroup          0 2014-03-10 16:25 /tmp/abc/2014-03-10
drwxr-xr-x   - bwang supergroup          0 2014-03-14 14:57 /tmp/abc/567

real 0m3.632s
user 0m4.146s
sys 0m2.650s

I had been curious whether Nailgun could save me time when running those commands, and today I finally tried it. It turns out to be pretty easy.

  • Install Nailgun: just clone it from GitHub and follow the instructions in README.md. I only ran "mvn clean package" and "make".
    $ cd ~/git/nailgun
    $ mvn clean package
    $ make
    $ ls
    Makefile        nailgun-examples  ng       README.md
    nailgun-client  nailgun-server    pom.xml
    $ ls nailgun-server/target/
    apidocs                 nailgun-server-0.9.2-SNAPSHOT.jar
    classes                 nailgun-server-0.9.2-SNAPSHOT-javadoc.jar
    javadoc-bundle-options  nailgun-server-0.9.2-SNAPSHOT-sources.jar
    maven-archiver          surefire
    maven-status
    
  • Start the Nailgun server: the trick is that you need to include the Hadoop classpath.
    $ java -cp `hadoop classpath`:/home/bwang/git/nailgun/nailgun-server/target/nailgun-server-0.9.2-SNAPSHOT.jar com.martiansoftware.nailgun.NGServer
    NGServer 0.9.2-SNAPSHOT started on all interfaces, port 2113.
    
  • Set up aliases: with aliases in place, you can run the same hadoop shell commands as before, only through Nailgun.
    $ alias hadoop='$HOME/git/nailgun/ng'
    $ hadoop ng-alias fs org.apache.hadoop.fs.FsShell
    $ hadoop ng-alias
    fs              org.apache.hadoop.fs.FsShell                      
    
    ng-alias        com.martiansoftware.nailgun.builtins.NGAlias      
                    Displays and manages command aliases
    
    ng-cp           com.martiansoftware.nailgun.builtins.NGClasspath  
                    Displays and manages the current system classpath
    
    ng-stats        com.martiansoftware.nailgun.builtins.NGServerStats
                    Displays nail statistics
    
    ng-stop         com.martiansoftware.nailgun.builtins.NGStop       
                    Shuts down the nailgun server
    
    ng-version      com.martiansoftware.nailgun.builtins.NGVersion    
                    Displays the server version number.
    $ time hadoop fs -ls /tmp/abc
    Found 2 items
    drwxrwxrwx   - bwang supergroup          0 2014-03-10 16:25 /tmp/abc/2014-03-10
    drwxr-xr-x   - bwang supergroup          0 2014-03-14 14:57 /tmp/abc/567
    
    real    0m0.046s
    user    0m0.000s
    sys     0m0.008s
    
  • Create some shell scripts so that you don't have to remember those long commands.
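
Such a script could look like the sketch below. It only wraps the commands already shown in the steps above; the function names (start_ng_server, register_fs_alias, hfs) are hypothetical, and the paths assume the same checkout location and jar version as in my session.

```shell
#!/bin/sh
# Location of the Nailgun checkout; override NG_HOME if yours differs.
NG_HOME="${NG_HOME:-$HOME/git/nailgun}"
# Jar name as built in the session above; the version may differ for you.
NG_JAR="$NG_HOME/nailgun-server/target/nailgun-server-0.9.2-SNAPSHOT.jar"

# Start the Nailgun server in the background with the Hadoop classpath.
start_ng_server() {
    java -cp "$(hadoop classpath):$NG_JAR" com.martiansoftware.nailgun.NGServer &
}

# Register the "fs" alias so that "ng fs ..." maps to FsShell.
register_fs_alias() {
    "$NG_HOME/ng" ng-alias fs org.apache.hadoop.fs.FsShell
}

# Run an FsShell command through Nailgun, e.g.: hfs -ls /tmp/abc
hfs() {
    "$NG_HOME/ng" fs "$@"
}
```

Source this file from your shell profile, call start_ng_server and register_fs_alias once, and then use hfs in place of "hadoop fs". Using a separate name rather than aliasing hadoop itself keeps the regular hadoop command available for everything else.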

Thursday, March 13, 2014

Parquet "java.lang.NoClassDefFoundError: org/apache/thrift/TEnum"

If you encounter this problem using Cloudera parcels, here is the solution according to this page. The error looks like this:
org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoClassDefFoundError: org/apache/thrift/TEnum
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at parquet.hadoop.metadata.CompressionCodecName.<clinit>(CompressionCodecName.java:21)
 at parquet.hadoop.ParquetOutputFormat.getCodec(ParquetOutputFormat.java:217)
 at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:254)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getRecordWriter(PigOutputFormat.java:84)
 at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:562)
 at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:636)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:404)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:160)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:155)
Caused by: java.lang.ClassNotFoundException: org.apache.thrift.TEnum
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 24 more
To fix it, delete the original-parquet*.jar files and the symlinks that point to them:
$ cd /opt/cloudera/parcels/CDH/lib/parquet
$ ls original-parquet*.jar
# delete the jars
$ rm original-parquet*.jar
$ cd /opt/cloudera/parcels/CDH/lib/hadoop
# delete those symlinks
$ rm original-parquet*.jar
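
Before deleting anything, it may help to see exactly which files match in both directories. A minimal sketch follows; list_original_parquet_jars is a hypothetical helper name, and it only lists files and symlinks without removing them.

```shell
#!/bin/sh
# list_original_parquet_jars: hypothetical helper, list-only.
# Prints every original-parquet*.jar (regular file or symlink) found
# directly inside each directory given as an argument.
list_original_parquet_jars() {
    for dir in "$@"; do
        find "$dir" -maxdepth 1 -name 'original-parquet*.jar' \
            \( -type f -o -type l \) -print
    done
}

# Usage with the two directories from the commands above:
#   list_original_parquet_jars /opt/cloudera/parcels/CDH/lib/parquet \
#                              /opt/cloudera/parcels/CDH/lib/hadoop
```

On a multi-node cluster the same check presumably has to be repeated on every node where the parcel is installed.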