Thursday, February 27, 2014

Yarn MapReduce Log Level

When you write a Pig or Hive UDF, debug logging can be very useful, and you don't have to ask your administrator for help to turn it on. Just set the property 'mapreduce.map.log.level' or 'mapreduce.reduce.log.level' to 'DEBUG'. For example, in a Pig script:
set mapreduce.map.log.level 'DEBUG'

register '/home/bwang/workspace/pig-scratch/target/pig-scratch-0.0.1-SNAPSHOT.jar';

define asByteArray com.mycompany.pig.SaveObjectAsByteArray();

d_val = load '/tmp/double.txt' as (id: chararray, val: double);

dba_val = foreach d_val generate flatten(asByteArray(*)), 0 as pos;

g = group dba_val all;

dump g;

But this is not perfect: when you check the syslog of a map task, you will also see a lot of Hadoop's own DEBUG output.

Can I output DEBUG logs for just my own class? log4j definitely supports that, but in a YARN MapReduce job it is not as straightforward as you might think.

If you read MRApps.addLog4jSystemProperties, you will find that log4j.configuration is hard-coded to 'container-log4j.properties', a file packaged in hadoop-yarn-server-nodemanager-2.0.0-cdh4.5.0.jar.
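To make the hard-coding concrete, here is a minimal model of how the logging options for the task JVM appear to be assembled. This is a simplified sketch, not the actual Hadoop source: the class and method names (`Log4jArgsSketch`, `buildLog4jArgs`) are hypothetical, and the container log directory and file size options are left out.

```java
import java.util.ArrayList;
import java.util.List;

class Log4jArgsSketch {
    // Hypothetical stand-in for MRApps.addLog4jSystemProperties (simplified).
    // The configuration file name is a constant; only the log level comes
    // from the job configuration (e.g. mapreduce.map.log.level).
    static List<String> buildLog4jArgs(String logLevel) {
        List<String> vargs = new ArrayList<>();
        vargs.add("-Dlog4j.configuration=container-log4j.properties");
        // The level is concatenated as-is, followed by the ",CLA" appender suffix.
        vargs.add("-Dhadoop.root.logger=" + logLevel + ",CLA");
        return vargs;
    }

    public static void main(String[] args) {
        // Print the options produced for a plain 'DEBUG' level.
        System.out.println(String.join(" ", buildLog4jArgs("DEBUG")));
    }
}
```

In this model the only value the job can influence is the log level string, which is why a normal job property cannot point log4j at a different file.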

I found a way to trick the NodeManager into setting the log level for just my UDF class:

  • Find container-log4j.properties in your Maven dependencies.
  • Copy its content to a new properties file, e.g., as_byte_array.log4j.properties.
  • Add the line 'log4j.logger.com.mycompany.pig=${com.mycompany.pig.logger}' to as_byte_array.log4j.properties.
  • Build the package and make sure as_byte_array.log4j.properties is included in your UDF jar.
  • Change the Pig script like this:
    set mapreduce.map.log.level 'INFO,CLA -Dlog4j.configuration=as_byte_array.log4j.properties -Dcom.mycompany.pig.logger=DEBUG'
    
    This is totally a HACK. If you run 'ps -ef | grep mycompany' while the map task is running, you will see something like this:
    /usr/java/jdk1.7.0_25/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx825955249 -Djava.io.tmpdir=/yarn/nm/usercache/bwang/appcache/application_1393480846083_0011/container_1393480846083_0011_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.mapreduce.container.log.dir=/var/log/hadoop-yarn/container/application_1393480846083_0011/container_1393480846083_0011_01_000002 -Dyarn.app.mapreduce.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dlog4j.configuration=as_byte_array.log4j.properties -Dcom.mycompany.pig.logger=DEBUG,CLA org.apache.hadoop.mapred.YarnChild 127.0.0.1 60261 attempt_1393480846083_0011_m_000000_0 2
    
    Basically, I inject a second log4j.configuration that points to my own properties file. It overrides container-log4j.properties because it appears later on the command line. And "-Dcom.mycompany.pig.logger=DEBUG" lets me control the log level for my UDF.
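The effect of the injected value can be sketched in a few lines of Java. This is only a model under stated assumptions: `buildCommandLine` is a hypothetical, simplified stand-in for the way Hadoop concatenates the log level into the task command line, and `resolveSystemProperties` assumes that when the same -D option is given twice the later one wins, which is the behavior the hack relies on.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LogLevelInjectionSketch {
    // Hypothetical, simplified model of the task command line: the log level
    // is pasted in unchanged and ",CLA" is appended after it.
    static String buildCommandLine(String logLevel) {
        return "-Dlog4j.configuration=container-log4j.properties"
             + " -Dhadoop.root.logger=" + logLevel + ",CLA";
    }

    // Splits the command line on whitespace and collects the -Dkey=value
    // options; a later occurrence of a key replaces the earlier value.
    static Map<String, String> resolveSystemProperties(String commandLine) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String token : commandLine.trim().split("\\s+")) {
            int eq = token.indexOf('=');
            if (!token.startsWith("-D") || eq < 0) {
                continue; // not a -Dkey=value option
            }
            props.put(token.substring(2, eq), token.substring(eq + 1));
        }
        return props;
    }

    public static void main(String[] args) {
        // The value used for mapreduce.map.log.level in the Pig script.
        String injected = "INFO,CLA"
            + " -Dlog4j.configuration=as_byte_array.log4j.properties"
            + " -Dcom.mycompany.pig.logger=DEBUG";
        Map<String, String> props =
            resolveSystemProperties(buildCommandLine(injected));
        for (Map.Entry<String, String> e : props.entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

In this model log4j.configuration resolves to as_byte_array.log4j.properties, hadoop.root.logger stays at INFO,CLA, and the trailing ",CLA" lands on com.mycompany.pig.logger, giving DEBUG,CLA just as in the ps output. In log4j 1.x property syntax a logger value is a level followed by appender names, so that value is still valid.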
