My Tech Notes: June 2013

Tuesday, June 18, 2013

Avro-mapred-1.7.3-hadoop2 for AvroMultipleOutputs

I got the following error message in my MapReduce job when I ran it in CDH 4.2.0 cluster:

2013-06-18 12:50:11,095 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
 at org.apache.avro.mapreduce.AvroMultipleOutputs.getNamedOutputsList(AvroMultipleOutputs.java:218)
 at org.apache.avro.mapreduce.AvroMultipleOutputs.(AvroMultipleOutputs.java:351)

It turns out that avro-mapred-1.7.3 causes this problem. My sbt project has a dependency on hive-exec which depends on avro-mapred-1.7.3. To eliminate this error, you should exclude avro-mapred from hive-exec, and add avro-mapred-1.7.3-hadoop2

If you have hive-exec-0.10.0-cdh4.2.0 in your project, you have trouble to see the source code for Avro because this jar include a copy of all avro classes, and hive-exec-0.10.0-cdh4.2.0-sources.jar doesn't include the source codes of Avro.

AvroMultipleOutputs in 1.7.3 doesn't support different outputs have different output schema. See Avro-1266.

Thursday, June 13, 2013

Avro

It took me a while to figure out how to write an Avro file in MapReduce which can be imported into Hive and Impala.

There are a lot of OutputFormat in avro 1.7.3: AvroOutputFormat, AvroKeyOutputFormat, AvroKeyValueOutputFormat and AvroSequenceFileOutputFormat. Which one can be imported into Hive? You should use AvroKeyOutputFormat in MapReduce Job to output Avro Container Files.
You cannot specified any above output format in hive create table "stored as" clause because they don't implement HiveOutputFormat.
You need to set the schema using AvroJob.setOutputKeySchema

You have three ways to set avro compression codec as follows:

        AvroJob.setOutputKeySchema(job, getAvroSchema(schemaFile))
        job.setOutputFormatClass(classOf[AvroKeyOutputFormat[GenericRecord]])
        job.setOutputKeyClass(classOf[AvroKey[GenericRecord]])
        job.setOutputValueClass(classOf[NullWritable])
        FileOutputFormat.setCompressOutput(job, true)
        // You can us any of the following three ways to set compression
        // FileOutputFormat.setOutputCompressorClass(job, classOf[SnappyCodec])
        // job.getConfiguration().set(AvroJob.CONF_OUTPUT_CODEC, DataFileConstants.SNAPPY_CODEC)
        // job.getConfiguration().set(AvroJob.CONF_OUTPUT_CODEC, CodecFactory.snappyCodec().toString());

        FileOutputFormat.setOutputPath(job, new Path(outputPath))

Be careful to check Cloudera Manager compression settings because it may override your settings. If you find you cannot compress avro files using the above code, you can verify if the compress is enabled in your mapper or reducer like this:

class AvroGenericRecordMapper extends TableMapper[AvroKey[GenericRecord], NullWritable] {

  type Context = Mapper[ImmutableBytesWritable, Result, AvroKey[GenericRecord], NullWritable]#Context
  private var converter: AvroConverter = _

  // val outKey = new LongWritable(0L)
  val outVal = NullWritable.get()
  val outKey = new AvroKey[GenericRecord]()

  override def setup(context: Context) {
    converter = new AvroConverter(AvroJob.getOutputKeySchema(context.getConfiguration()))
    
    import org.apache.hadoop.mapreduce.TaskAttemptContext
    println("compress: " + FileOutputFormat.getCompressOutput(context.asInstanceOf[TaskAttemptContext]))
    println("codec: " + context.getConfiguration().get(AvroJob.CONF_OUTPUT_CODEC))
  }

  override def map(key: ImmutableBytesWritable, value: Result, context: Context) {
    outKey.datum(converter.convert(value))
    context.write(outKey, outVal)
  }
}

This mapper tries to dump Hbase table into avro files. AvroConverter is an class to convert Hbase Result to Avro GenericRecord.

Follow the example on this page: https://cwiki.apache.org/confluence/display/Hive/AvroSerDe. Unfortunately if you use "Avro Hive" to search, google shows your this page https://cwiki.apache.org/Hive/avroserde-working-with-avro-from-hive.html which has a wrong example, and you will get error message like:
```
FAILED: Error in metadata: Cannot validate serde: org.apache.hadoop.hive.serde2.AvroSerDe
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
```
What's wrong? The serde name should be org.apache.hadoop.hive.serde2.avro.AvroSerDe
You don't have to define the columns because it can get from avro schema.

Friday, June 7, 2013

Fix "value parse is not a member of object org.joda.time.DateTime"

When you search this error message, you will find the answer that you missed joda-convert in your dependencies. The answer is correct.

[error] MySpec.scala:27: value parse is not a member of object org.joda.time.DateTime
[error]       val from = DateTime.parse("2013-03-01")
[error]                           ^

However, our project still reported this error even we were 100% sure joda-convert was included in the classpath. Our project has transitive dependencies on joda-time:joda-time:2.1 and org.joda:joda-convert:1.2 through play:play_2.10:2.1.1. If typing "show test:dependency-classpath" in Play console, joda-convert is in the output, and the jar in ivy repository is ok.

It turns out that we have another dependency providing org.joda.time.DateTime: org.jruby:jruby-complete:1.6.5, a transitive dependency through org.apache.hbase:hbase:0.94.2-cdh4.2.0. If joda-time comes first, you have no problem, otherwise, you see the error. Because it doesn't always happen, you might be lucky not experience this problem at all.

The fix is really simple, excluding jruby-complete from the dependency:

"org.apache.hbase" % "hbase" % hbaseVersion exclude ("org.slf4j", "slf4j-log4j12") exclude ("org.slf4j", "slf4j-api") exclude
      ("org.jruby", "jruby-complete")

Wednesday, June 5, 2013

Impala build steps on CentOS 6.3

Build boost-1.42.0

Before building impala

change be/CMakeLists.txt. I removed all boost RPMs and built boost libraries from sources. boost_date_time will be /usr/local/lib/libboost_date_time-mt.*. The build failed without this change. If you have boost RPMs 1.41 installed, you may not need to change this. But the build will fail with other issues.

diff --git a/be/CMakeLists.txt b/be/CMakeLists.txt
index c14bd31..cd5abac 100644
--- a/be/CMakeLists.txt
+++ b/be/CMakeLists.txt
@@ -224,7 +224,7 @@ set (IMPALA_LINK_LIBS
   ${LIBZ}
   ${LIBBZ2}
   ${AVRO_STATIC_LIB}
-  -lrt -lboost_date_time
+  -lrt -lboost_date_time-mt
 )
 
 if ("${CMAKE_BUILD_TYPE}" STREQUAL "CODE_COVERAGE")

change build_public.sh to build release version and don't have to put -build_thirdparty in command line:

diff --git a/build_public.sh b/build_public.sh
index 6ea491b..28b445a 100755
--- a/build_public.sh
+++ b/build_public.sh
@@ -23,8 +23,8 @@ set -e
 # Exit on reference to uninitialized variable
 set -u
 
-BUILD_THIRDPARTY=0
-TARGET_BUILD_TYPE=Debug
+BUILD_THIRDPARTY=1
+TARGET_BUILD_TYPE=Release
 
 for ARG in $*
 do

After building, run shell/make_shell_tarball.sh. This can generate a shell/build dir to have all files for impala-shell.
Prepare hadoop, hbase and hive config files, copy from /var/run/cloudera-scm-agent/process.

change bin/set-classpath.sh like this

CLASSPATH=\
$HOME/hadoop/hadoop-conf:\
$HOME/hadoop/hbase-conf:\
$HOME/hadoop/hive-conf:\
#$IMPALA_HOME/fe/src/test/resources:\
#$IMPALA_HOME/fe/target/classes:\
#$IMPALA_HOME/fe/target/dependency:\
#$IMPALA_HOME/fe/target/test-classes:\
$IMPALA_HOME/fe/target/impala-frontend-0.1-SNAPSHOT.jar:\
${HIVE_HOME}/lib/datanucleus-core-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-enhancer-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-rdbms-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-connectionpool-2.0.3.jar:

for jar in `ls ${IMPALA_HOME}/fe/target/dependency/*.jar`; do
  CLASSPATH=${CLASSPATH}:$jar
done

export CLASSPATH

Otherwise, you might see if you don't include impala-frontend.jar

Exception in thread "main" java.lang.NoClassDefFoundError: com/cloudera/impala/common/JniUtil
Caused by: java.lang.ClassNotFoundException: com.cloudera.impala.common.JniUtil

Or this if you don't have hadoop-conf in the path

E0605 09:32:23.236434  5272 impala-server.cc:377] Unsupported file system. Impala only supports DistributedFileSystem but the LocalFileSystem was found. fs.defaultFS(file:///) might be set incorrectly
E0605 09:32:23.236655  5272 impala-server.cc:379] Impala is aborted due to improper configurations.

Impalad_flags

-beeswax_port=21001
-fe_port=21001
-be_port=22001
-hs2_port=21051
-enable_webserver=true
-mem_limit=-1
-webserver_port=25001
-state_store_subscriber_port=23001
-default_query_options
-log_filename=impalad
-use_statestore=false
-nn=5K04.corp.pivotlink.com
-nn_port=8020

create a tarball of impala build because no such a open-source script.

tar zcvf impala.tar.gz impala --exclude="*.class" --exclude="*.o" --exclude="impala/thirdparty" --exclude="impala/.git" --exclude="*.java" --exclude="*.cpp" --exclude="*.h" --exclude="expr-test"

start impalad

cd impala_home
export IMPALA_HOME=$PWD
bin/start-impalad.sh -build_type=release --flagfile=impalad_flags_path

start impala-shell

cd impala_home
export IMPALA_HOME=$PWD
export IMPALA_SHELL_HOME=$PWD/shell/build/impala-shell-1.0.1
$IMPALA_SHELL_HOME/impala-shell -i impalad-host:21001

Tuesday, June 4, 2013

Build boost for Impala in CentOS 6.3

CentOS 6.3 has only rpm for boost_1.41.0 at the time I made the build. I had to build boost from source by myself.

Clean up the old installation. Find all boost installations, then remove all old versions.

$ rpm -qa | grep boost
$ yum remove boost boost-filesystem ...

Download boost tarball and expand into a directory.
Make a build. Using tagged layout, this will generate /usr/local/lib/libboost_filesystem-mt.so. If you use --layout=system, /usr/local/lib/libboost_filesystem.so is created. Don't use --build_type=complete because the build takes too long.

cd boost_1.42.0
./bootstrap
sudo ./bjam --layout=tagged install