Friday, May 27, 2016

Bring back google-chrome after upgrading to CentOS 6.8 and Chrome 51.

I don’t know which one is the root cause: the upgrade to CentOS 6.8 or the upgrade to Chrome 51. I had installed Google Chrome on my CentOS 6 VirtualBox VM, and it worked very well until this upgrade. Now when I run google-chrome, the window pops up, but it is almost entirely black.
I used the following command to investigate the problem:
google-chrome --disable-plugins --disable-extensions --user-data-dir=/tmp/chrome-user-dir --enable-logging --log-level=0
  • disable all plugins and extensions
  • use a new user dir
  • enable logs
Also check for a chrome process that has --type=gpu-process as a parameter:
$ ps -ef | grep chrome
bwang  1358  1284  1 10:58 pts/6    00:00:00 /opt/google/chrome/chrome --enable-features=... --disable-features=... --type=gpu-process --channel=1284.0.1688276239 --enable-logging --log-level=0 --window-depth=24 --user-data-dir=/tmp/chrome-user-dir --supports-dual-gpus=false --gpu-driver-bug-workarounds=4,54 --gpu-vendor-id=0x80ee --gpu-device-id=0xbeef --gpu-driver-vendor=Chromium --gpu-driver-version=1.9 --user-data-dir=/tmp/chrome-user-dir --enable-logging --log-level=0 --v8-natives-passed-by-fd --v8-snapshot-passed-by-fd
The log file /tmp/chrome-user-dir/chrome_debug.log shows
[2945:2945:0527/] [.CommandBufferContext.DisplayCompositor-0x3d53365d63c0]GL ERROR :GL_INVALID_ENUM : glTexImage2D: <- error from previous GL command
[23:23:0527/] MessageAttachmentSet destroyed with unconsumed descriptors: 0/1
[2945:2945:0527/] [.CommandBufferContext.CompositorWorker-0x3d53365d6280]GL ERROR :GL_INVALID_ENUM : GLES2DecoderImpl::DoBindTexImage2DCHROMIUM: <- error from previous GL command
[2945:2945:0527/] [.CommandBufferContext.CompositorWorker-0x3d53365d6280]GL ERROR :GL_INVALID_VALUE : ScopedTextureBinder::dtor: <- error from previous GL command
It looks like google-chrome uses the GPU for acceleration. So the solution is simple: google-chrome --disable-gpu brings Chrome back on my CentOS 6.8 VM.
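To check quickly whether a chrome_debug.log points at the GPU path, the GL errors can be tallied. The following is a minimal Python sketch, not part of the original investigation: the function name count_gl_errors and the regular expression are assumptions based only on the shape of the log lines shown above.

```python
import re
from collections import Counter

# Assumed pattern: "GL ERROR :<CODE>" as it appears in the log lines above.
GL_ERROR = re.compile(r"GL ERROR\s*:\s*(GL_[A-Z0-9_]+)")


def count_gl_errors(lines):
    """Count GL error codes in an iterable of chrome_debug.log lines."""
    counts = Counter()
    for line in lines:
        match = GL_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two sample lines in the same shape as the log excerpt above.
sample = [
    "[2945:2945:0527/] [.CommandBufferContext.X]GL ERROR :GL_INVALID_ENUM : glTexImage2D: <- error",
    "[23:23:0527/] MessageAttachmentSet destroyed with unconsumed descriptors: 0/1",
]
for code, count in count_gl_errors(sample).items():
    print(code, count)
```

To use it on a real log, pass an open file object instead of the sample list, for example open("/tmp/chrome-user-dir/chrome_debug.log", errors="replace").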

Wednesday, May 18, 2016

How to create .epub and .mobi versions of the Gradle User Guide?

The Gradle User Guide is written in DocBook, and the Gradle build already produces single-page HTML and PDF versions. But I really want to load it onto my Kindle. Because the DocBook stylesheets support converting DocBook to EPUB and EPUB3, I decided to build it myself.
You need to install docbook-xsl. On Cygwin, I installed 1.77.1-1:
$ cygcheck -c | grep docbook
build-docbook-catalog        1.5-2              OK
docbook-xsl                  1.77.1-1           OK

$ cygcheck -l docbook-xsl | grep epub
You should read epub3/README, which describes the steps to build an EPUB eBook from DocBook. The command looks like this:
 xsltproc --stringparam base.dir ebook/OEBPS/ --xinclude /usr/share/sgml/docbook/xsl-stylesheets/epub3/chunk.xsl ../gradle/subprojects/docs/build/src/userguide.xml
Pay attention to one thing: you must keep the trailing slash of ebook/OEBPS/. The above command generates mimetype and META-INF in the directory ebook.
$ ls ebook
META-INF/  mimetype  OEBPS/
If you don’t append the “/”, the command creates a directory named ebook/OEBPS.. instead.
To build the Gradle User Guide from DocBook to EPUB, do the following:
  • Add a cols attribute to every <tgroup> in the XML files in ~/gradle/subprojects/docs/src/docs/userguide that lacks one. Otherwise, you will encounter the error "Error: CALS tables must specify the number of columns". Set the value to the table's column count, e.g. cols="3" or cols="4". You can find the <tgroup> elements with:
    grep -R '<tgroup' ~/gradle/subprojects/docs/src/docs/userguide
  • Build docs:userguide first. The document has a lot of sample code, which is only added during a build. If you use gradle/subprojects/docs/src/docs/userguide/userguide.xml directly, you won’t see the sample code in the ebook.
  • After xsltproc, run zip -r -X ../gradle-user-guide.epub mimetype META-INF OEBPS in ebook.
  • If you want .mobi for Kindle, convert the epub file in Calibre.
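If the zip command is not available, the packaging step can be approximated in Python. This is a sketch under assumptions, not the method from epub3/README: the function name build_epub is mine, and it relies on the EPUB container convention that mimetype is the first entry in the archive and is stored without compression.

```python
import os
import zipfile


def build_epub(ebook_dir, epub_path):
    """Zip ebook_dir (containing mimetype, META-INF, OEBPS) into epub_path."""
    with zipfile.ZipFile(epub_path, "w") as epub:
        # mimetype goes first and is stored uncompressed.
        epub.write(
            os.path.join(ebook_dir, "mimetype"),
            "mimetype",
            compress_type=zipfile.ZIP_STORED,
        )
        # Everything else is added with normal deflate compression.
        for top in ("META-INF", "OEBPS"):
            for root, _dirs, files in os.walk(os.path.join(ebook_dir, top)):
                for name in sorted(files):
                    full = os.path.join(root, name)
                    arcname = os.path.relpath(full, ebook_dir).replace(os.sep, "/")
                    epub.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)


# Example call, using the directory produced by the xsltproc command above:
# build_epub("ebook", "gradle-user-guide.epub")
```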

Friday, May 13, 2016

How to make @timestamp using GMT when using Fluentd, Elasticsearch and Kibana?

My log is a JSON one-liner output by a Node.js application. It has a field called “time”, which is in GMT.
{ "req": {}, "time":"2016-05-12T19:18:38.123Z" }
I want to keep the timestamp in GMT in Kibana, but it was not as straightforward as I thought. It took me a couple of hours to make the timestamp work correctly with Fluentd, Elasticsearch and Kibana.
I use in_tail and fluent-plugin-elasticsearch to parse the log and load into Elasticsearch, and I search the logs using Kibana.
Here is my fluentd config file.
<source>
  @type tail
  format json

  read_from_head true
  path <path>/debug.log
  pos_file /var/run/td-agent/pos/debug.log.pos

  keep_time_key true
  time_key time
  time_format "%FT%T.%L%z"

  refresh_interval 10s

  tag debug
</source>
<match debug>
  @type elasticsearch
  hosts                my-es-server-1,my-es-server-2

  logstash_format      true
  logstash_prefix      debug
  utc_index            true

  time_key             time
  time_key_format      %FT%T.%L%z
</match>
  • keep_time_key, time_key and time_format are necessary in in_tail. time_key defaults to time, so Fluentd takes the event time from that field of your JSON message, and keep_time_key true keeps the field in the record.
    • If you don’t set keep_time_key, the field time will be removed, and the timestamp will be in the timezone of the host where td-agent is running.
    • If you don’t give time_format, the default time parser cannot parse this format because the time has milliseconds, and your @timestamp will be wrong.
  • In elasticsearch:
    • You need to set time_key. The plugin copies time to @timestamp, so @timestamp has the exact same UTC string as time.
    • time_key_format is used to parse the time, which then generates the logstash index name when logstash_format and utc_index are both true. So an index name like debug-2016.05.12 matches the times in your log.
  • In Kibana, you might see the timestamp shown in your local timezone, like “PDT”. Go to “Settings -> Advanced -> dateFormat:tz” and change the default value “Browser” to “GMT”, so that all timestamps are shown in GMT.
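To see why the UTC settings matter for the index name, the parsing can be reproduced outside Fluentd. This Python sketch only illustrates the idea and is not how the plugins are implemented: TIME_FORMAT is the Python spelling of %FT%T.%L%z, the function names are mine, the fixed UTC-7 offset stands in for PDT, and parsing a trailing "Z" with %z needs Python 3.7 or later.

```python
from datetime import datetime, timedelta, timezone

# Python equivalent of the Ruby format %FT%T.%L%z used in the config above.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_log_time(value):
    """Parse the "time" field of the JSON log line into an aware datetime."""
    return datetime.strptime(value, TIME_FORMAT)


def index_name(prefix, when, tz=timezone.utc):
    """Build a logstash-style daily index name for the given timezone."""
    return prefix + "-" + when.astimezone(tz).strftime("%Y.%m.%d")


t = parse_log_time("2016-05-12T19:18:38.123Z")
print(index_name("debug", t))

# Shortly after midnight UTC, a local-time index would land on the previous day.
late = parse_log_time("2016-05-13T02:00:00.000Z")
pdt = timezone(timedelta(hours=-7))  # assumed fixed offset standing in for PDT
print(index_name("debug", late), index_name("debug", late, pdt))
```

With UTC everywhere, an event and its index always fall on the same calendar day as the time string in the log.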

Monday, May 9, 2016

Spark Cassandra Connector and DataFrame

When you write a DataFrame to a Cassandra table, be careful when using SaveMode.Overwrite. In spark-cassandra-connector-1.6.0-M2, it calls TRUNCATE $keyspace.$table. See the code in CassandraSourceRelation.scala.
I observed something weird when I used the following code to write a DataFrame to a cluster running Cassandra 2.1.8:
  .options(Map("table" -> table, "keyspace" -> keyspace))
After the scheduled Spark job finishes, the table is empty in cqlsh when I run select * from keyspace.table limit 10. The result is the same if I change the consistency level to QUORUM, or even ALL. After some time, the query does return the results.
If I start the job manually from the command line, however, the query returns the results most of the time.
The CQL documentation for TRUNCATE says that setting the consistency level to ALL is required:
Note: The consistency level must be set to ALL prior to performing a TRUNCATE operation. All replicas must remove the data.
I don’t think the consistency level is changed before calling TRUNCATE $keyspace.$table in spark-cassandra-connector. The default consistency level is LOCAL_QUORUM. That might be the root cause.

Tuesday, May 3, 2016

How to resolve spark-cassandra-connector's Guava version conflict in spark-shell

In my post How to resolve spark-cassandra-connector Guava version conflicts in Yarn cluster mode, I explained how to resolve the Guava version issue in YARN cluster mode. This post covers how to do it in spark-shell.
First, when you start spark-shell with --master yarn, you actually run in yarn-client mode. Unfortunately, my method for YARN cluster mode won’t work there. You may still get an exception like the one below:
Caused by: java.lang.NoSuchMethodError:;Lcom/google/common/util/concurrent/FutureFallback;Ljava/util/concurrent/Executor;)Lcom/google/common/util/concurrent/ListenableFuture;
What’s wrong? If you log on to the data node and check the container launch script for your YARN application, you will find that guava-16.0.1.jar is the first one in the classpath.
Here is the trick: you need to add the Guava jar to --files in your command:
spark-shell \
  --master yarn \
  --driver-class-path <local path of guava-16.0.1.jar> \
  --conf spark.executor.extraClassPath=./guava-16.0.1.jar \
  --jars <local path of guava-16.0.1.jar> \
  --files <local path of guava-16.0.1.jar>
Sounds weird? Do this test and you will understand why. Run the spark-shell command and, when you see the prompt, don’t do anything. Log on to the data nodes where your application’s Spark executors are running and look at the application’s cache. What do you find?
# ls /grid/0/yarn/nm/usercache/bwang/appcache/application_1459869234031_5503/container_e45_1459869234031_5503_01_000004/
container_tokens               __spark__.jar          tmp
Where are the jars listed in --jars? The answer is that those jars are not copied until you start some action on an RDD or DataFrame in the spark-shell. Unfortunately, by then the executor’s JVM has already started, and it might already have loaded the older version of Guava.
If you add the Guava jar to --files, the jar is copied to the executor’s container, and guava-16.0.1.jar is chosen over the older version of Guava.

Friday, April 15, 2016

How to resolve spark-cassandra-connector Guava version conflicts in Yarn cluster mode

When you use spark-cassandra-connector, you will run into “Guava version conflicts” when you submit your job in YARN cluster mode. spark-cassandra-connector usually uses a newer Guava version, 16.0.1, which has some methods that cannot be found in older versions of Guava, e.g., 11.0.2. It is a big headache to resolve this problem.
Here is how you can resolve it without building anything special.
I think everyone already has the idea: put guava-16.0.1.jar before guava-11.0.2.jar in the classpath. But how can you achieve this when you run in YARN cluster mode?
Your Hadoop cluster might already have the Guava jar. If you use CDH, try this:
find -L /opt/cloudera/parcels/CDH -name "guava*.jar"
If you would like to use that jar, you can resolve this problem by adding:
  --master yarn-cluster
  --conf spark.driver.extraClassPath=<path of guava-16.0.1.jar>
  --conf spark.executor.extraClassPath=<path of guava-16.0.1.jar>
extraClassPath allows you to prepend jars to the classpath.
If you cannot find that version of Guava in your cluster, you can ship the jar yourself:
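The reason prepending works is that the JVM loads a class from the first classpath entry that contains it. The Python sketch below mimics that first-match rule for Guava jars only. It is a simplification that ignores wildcards and directories, the function name is mine, and the CDH jar path in the example is hypothetical.

```python
import re

# Matches a jar file name such as guava-16.0.1.jar at the end of a path.
GUAVA_JAR = re.compile(r"guava-(\d+(?:\.\d+)*)\.jar$")


def guava_version_loaded(classpath):
    """Return the version of the first Guava jar on a ':'-separated classpath."""
    for entry in classpath.split(":"):
        match = GUAVA_JAR.search(entry)
        if match:
            return match.group(1)
    return None


# Hypothetical location of the older jar shipped with the cluster.
hadoop_cp = "/opt/cloudera/parcels/CDH/jars/guava-11.0.2.jar"
print(guava_version_loaded("./guava-16.0.1.jar:" + hadoop_cp))
print(guava_version_loaded(hadoop_cp + ":./guava-16.0.1.jar"))
```

The two calls differ only in ordering, which is exactly what extraClassPath controls.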
  --master yarn-cluster
  --conf spark.driver.extraClassPath=./guava-16.0.1.jar
  --conf spark.executor.extraClassPath=./guava-16.0.1.jar
  --jars <path of guava-16.0.1.jar>
With --jars, you tell Spark where to find the jar, so you need to provide its full path. When Spark starts in YARN cluster mode, the jar is shipped to the container on the NodeManager and ends up in the current working directory where the executor starts, so in extraClassPath you only need to refer to it relative to the current directory.
If you use CDH, all Hadoop jars are automatically added to the classpath when you run a YARN application. You can see this in the container launch script while your job is running.
Here is how you can find it:
  • Find the host where one of the executors is running
  • Run this command find -L /yarn -path "*<app_id>*" -name "launch*"
There is also a YARN configuration, yarn.application.classpath. If you like, you can prepend an entry for Guava there.
Note that spark-submit has some messy conventions:
  • --jars is separated by commas “,”
  • extraClassPath is separated by colons “:”
  • --driver-class-path is a classpath too, so it is separated by colons “:”
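Because the separators are easy to mix up, it can help to build the arguments programmatically. This is a small Python sketch with a hypothetical helper name. It covers only --jars (comma-separated) and the two extraClassPath settings (colon-separated), and the jar path in the example is a placeholder.

```python
def spark_submit_args(jars, extra_classpath):
    """Build spark-submit arguments with the right separator for each option."""
    classpath = ":".join(extra_classpath)  # classpath entries use ":"
    return [
        "--jars", ",".join(jars),  # --jars uses ","
        "--conf", "spark.driver.extraClassPath=" + classpath,
        "--conf", "spark.executor.extraClassPath=" + classpath,
    ]


args = spark_submit_args(
    jars=["/path/to/guava-16.0.1.jar"],  # placeholder path
    extra_classpath=["./guava-16.0.1.jar"],
)
print(" ".join(args))
```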
I was in a hurry when I wrote this post, so I might have missed something or assumed you know a lot. If something is not clear, let me know and I will fix it.

Friday, January 29, 2016

Using Hadoop distcp to copy files from an SFTP server

Here are the steps I use to copy files from an SFTP server to HDFS with "hadoop distcp":
  • Clone hadoop-filesystem-sftp at
  • There is a bug in hadoop-filesystem-sftp that may prevent distcp from running correctly when file names contain special characters that should be escaped, e.g. ":". The fix is very simple. You can find line 331 in, and encode the sftpFile.filename.
      for (SFTPv3DirectoryEntry sftpFile : sftpFiles) {
        // Encode the name so characters such as ":" are escaped (needs java.net.URLEncoder)
        String filename = URLEncoder.encode(sftpFile.filename, "UTF-8");
        if (!"..".equals(filename) && !".".equals(filename)) {
          fileStats.add(getFileStatus(sftpFile.attributes, new Path(path, filename).makeQualified(this)));
        }
      }
  • hadoop-filesystem-sftp uses ganymed-ssh-2, which only supports authentication by password or key file.
  • Using a password might be easy, but your password for the SFTP server will be public because it is shown in the MapReduce job configuration.
  • Here is how to set up passwordless SSH. You need permission to log on to the SFTP server. Create an SSH key pair using "ssh-keygen", and make sure you don't overwrite your current key pair in $HOME/.ssh:
    $ ssh-keygen -f ${distcp_ssh}/id_rsa
    $ ssh-keygen -F sftp-server-name -f ${distcp_ssh}/known_hosts
  • Copy the public key to the SFTP server.
    • Copy ${distcp_ssh} to all data nodes and the client node. On the data nodes, make the directory readable only by yarn.
    • On the client node, make the directory readable by the user who runs "hadoop distcp".
    • The reason for this setup is that hadoop-filesystem-sftp uses ${user.home} as the default location for the key file and known_hosts. More importantly, on each data node the task runs as yarn, not as the user who ran the command. Unfortunately, this means someone can write a MapReduce job to grab your id_rsa key file and gain access to the SFTP server.
  • hadoop distcp -D fs.sftp.user=username -D fs.sftp.key.file=${distcp_ssh}/id_rsa -D fs.sftp.knownhosts=${distcp_ssh}/known_hosts -libjars hadoop-filesystem-sftp-0.0.1-SNAPSHOT-jar-with-dependencies.jar sftp://sftp-server/src-path hdfs://namenode/target-path
  • WARNING: Don't use this method unless you have to.
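To make the effect of the file name fix concrete, here is a rough Python analogue of the URLEncoder.encode call. It is only an illustration and not part of hadoop-filesystem-sftp: quote_plus behaves similarly to Java's URLEncoder for the characters shown here, but the two are not identical for every character, and the helper name is mine.

```python
from urllib.parse import quote_plus, unquote_plus


def encode_filename(name):
    """Percent-encode a remote file name, similar to URLEncoder.encode(name, "UTF-8")."""
    return quote_plus(name, encoding="utf-8")


# ":" is escaped, while "." and ".." are left unchanged, so the dot checks still work.
for name in ["report:2016.csv", ".", ".."]:
    print(name, "->", encode_filename(name))

# The original name can be recovered by decoding.
assert unquote_plus(encode_filename("report:2016.csv")) == "report:2016.csv"
```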