My Tech Notes: August 2013

Friday, August 16, 2013

Fedora 19 XBMC Autologin in XFCE

Create user xbmc and set password
```
useradd -g media xbmc
passwd xbmc 
```
Log on as xbmc and choose XBMC as the session instead of the Xfce session.
Then modify /etc/lightdm/lightdm.conf. I added the following to the [LightDM] section:
```
autologin-user=xbmc
autologin-session=XBMC
```
and the following to the [SeatDefaults] section:
```
greeter-show-manual-login=false
autologin-user=xbmc
autologin-session=XBMC
```
Not sure if I need to repeat user/session in both sections, but it works.
It seems that you have to give xbmc a password; otherwise autologin doesn't work.

Wednesday, August 14, 2013

Scala and Java collection interoperability in MapReduce job

In Scala, you can import scala.collection.JavaConversions._ to make collections interoperable between Scala and Java, for example:

scala.collection.Iterable <=> java.lang.Iterable

Usually I prefer the Scala collection API because it is concise and powerful. But be careful: this may not work in all cases. I ran into this problem when I wrote a MapReduce job in Scala:

```
// do something with the first element of values
values.drop(1).foreach { v =>
  ...
}
```

The code tries to handle the first element and the rest differently. It worked perfectly in the combiner but failed in the reducer, even though both use values.drop(1).foreach. The reason is, I believe, that the iterable in the reducer is backed by a file whose position cannot move back. Reading the first element already advances the position, and drop(1) then advances it once more, so two elements are actually dropped.
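To make the failure mode concrete, here is a minimal sketch that runs without Hadoop. `SinglePass` is a hypothetical stand-in for the reducer's values, and it assumes that the reducer hands back the same one-pass iterator every time `iterator()` is called. The sketch uses the explicit `asScala` from `scala.collection.JavaConverters` instead of the implicit `JavaConversions`, only so that it also compiles on newer Scala versions.

```scala
import java.{lang => jl, util => ju}
import scala.collection.JavaConverters._

object SinglePassDemo {
  // Hypothetical stand-in for the reducer's values (an assumption, not Hadoop code):
  // every call to iterator() returns the SAME iterator, so consumed elements are gone.
  final class SinglePass[A](underlying: ju.Iterator[A]) extends jl.Iterable[A] {
    override def iterator(): ju.Iterator[A] = underlying
  }

  def sample(): jl.Iterable[Int] =
    new SinglePass(List(1, 2, 3, 4).asJava.iterator())

  // Fragile: head and drop(1) each ask the Iterable for an iterator.
  def fragile(values: jl.Iterable[Int]): (Int, List[Int]) = {
    val scalaValues = values.asScala
    val first = scalaValues.head           // consumes one element
    val rest = scalaValues.drop(1).toList  // skips one MORE element
    (first, rest)
  }

  // Safer: ask for the iterator exactly once and walk it by hand.
  // Assumes at least one value, as is the case for a reduce call.
  def safe(values: jl.Iterable[Int]): (Int, List[Int]) = {
    val it = values.iterator().asScala
    val first = it.next()
    (first, it.toList)
  }
}
```

With the values 1, 2, 3, 4, `fragile` returns `(1, List(3, 4))` and the 2 is silently lost, while `safe` returns `(1, List(2, 3, 4))`.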

Don't call filesystem.close in Hadoop

Recently the MapReduce jobs in my project suddenly started failing. It turned out that a colleague had added a line of code that closes the filesystem:

```
val path = new Path("/user/bewang/data")
val fs = path.getFileSystem(conf)
fs.mkdirs(path)
fs.close() // the line that broke the jobs
```

Nothing seems wrong. We are taught to clean up the mess we create: after using a resource, close it. Isn't that good practice? Unfortunately, it doesn't work in the Hadoop world. The Hadoop client manages all connections to the cluster. By default, the FileSystem object you get back is a cached instance shared with other code in the same JVM, so if you call fs.close(), the connections to the cluster are broken for everyone and you cannot do anything after that. Don't call close; let Hadoop handle it.
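The effect can be illustrated with a toy model that runs without Hadoop. Everything in it is hypothetical: `ToyFs` and the cache are stand-ins of my own, not Hadoop's FileSystem API, and the sketch only assumes that a client hands out one shared handle per cluster URI.

```scala
import java.io.IOException
import scala.collection.mutable

object SharedHandleDemo {
  // Toy stand-in for a filesystem handle; NOT Hadoop's FileSystem class.
  final class ToyFs(val uri: String) {
    private var open = true
    def mkdirs(path: String): Boolean = {
      if (!open) throw new IOException("Filesystem closed")
      true
    }
    def close(): Unit = { open = false }
  }

  // Toy cache: one shared handle per URI (an assumption of this sketch).
  private val cache = mutable.Map.empty[String, ToyFs]
  def get(uri: String): ToyFs = cache.getOrElseUpdate(uri, new ToyFs(uri))

  // Returns true if code that runs later can still use its handle.
  def laterCallerWorks(closeAfterUse: Boolean): Boolean = {
    cache.clear()
    val mine = get("toy://cluster")
    mine.mkdirs("/user/bewang/data")
    if (closeAfterUse) mine.close()   // "cleaning up" after myself
    val theirs = get("toy://cluster") // same cached object as `mine`
    try theirs.mkdirs("/user/bewang/data")
    catch { case _: IOException => false }
  }
}
```

In this model `laterCallerWorks(closeAfterUse = false)` is true and `laterCallerWorks(closeAfterUse = true)` is false: the second caller receives the handle that the first one closed.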

Tuesday, August 13, 2013

Output a stream into multiple files in the specified percentages.

I recently finished a project that writes JDBC results randomly into multiple files in specified percentages. For example, I want to generate three files that hold 10%, 20%, and 70% of the total rows respectively, with the file for each row chosen at random. The data is dumped from Hive/Impala through JDBC, and the result could have millions of rows.

The problem seems easy. The first algorithm that came to mind is: generate a random number between 0 and 1 for each row; if the value falls in 0-0.1, write the row to the first file; if it falls in 0.1-0.3, write it to the second file; and if it is larger than 0.3, write it to the third file. This method works but, unfortunately, is not perfect. The problem is that the output may not strictly satisfy the percentage requirement. The random numbers generated in Java/Scala are uniformly distributed, but that doesn't mean that out of 1000 rows exactly 100 numbers fall between 0 and 0.1, 200 between 0.1 and 0.3, and 700 between 0.3 and 1.0. There may be errors of 3% or 5%.
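A minimal sketch of this first approach, assuming the 10%/20%/70% split from the example. The names `NaiveSplit`, `pick`, and `counts` are made up for illustration, and file indices are 0-based.

```scala
import scala.util.Random

object NaiveSplit {
  // Map a random number in [0, 1) to a file index using the
  // cumulative thresholds 0.1 and 0.3.
  def pick(u: Double): Int =
    if (u < 0.1) 0 else if (u < 0.3) 1 else 2

  // Count how many of n rows land in each of the three files.
  def counts(n: Int, rnd: Random): Vector[Int] = {
    val c = Array.fill(3)(0)
    for (_ <- 0 until n) {
      val file = pick(rnd.nextDouble())
      c(file) += 1
    }
    c.toVector
  }
}
```

Calling `NaiveSplit.counts(1000, new Random())` a few times should give counts that add up to 1000 but only hover around 100, 200, and 700.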

The next algorithm I came up with is:

1. Output the rows into a temporary file; the total number of rows n is then known.
2. Generate an array with 0.1*n 1s, 0.2*n 2s, and 0.7*n 3s, and shuffle it.
3. Read the temporary file line by line, and write each line to the file indicated by the corresponding number in the array.
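The array from the second step could be built roughly like this. It is a sketch under my own assumptions: the name `ShuffledLabels` is made up, and when n is not a multiple of 10 the third file simply takes whatever is left after rounding down the first two.

```scala
import scala.util.Random

object ShuffledLabels {
  // One label per row: 1, 2 or 3 names the target file.
  def labels(n: Int, rnd: Random): Vector[Int] = {
    val ones = n / 10            // 10% of the rows, rounded down
    val twos = n / 5             // 20% of the rows, rounded down
    val threes = n - ones - twos // the remainder goes to the third file
    rnd.shuffle(
      Vector.fill(ones)(1) ++ Vector.fill(twos)(2) ++ Vector.fill(threes)(3))
  }
}
```

Row i of the temporary file then goes to the file named by `labels(i)`. The array has one entry per row, which is exactly the memory cost this approach suffers from.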

This method produces files that satisfy the percentage requirement, but it is definitely bad because it needs a temporary file and possibly a huge array in memory.

I finally figured out a better way that needs neither a temporary file nor a large array: buffer 100 rows and, once the buffer is full, shuffle the rows and write them to the files according to the percentages. Usually there are still rows in the buffer when the stream is closed. You cannot simply write them according to the percentages, because the errors from each buffer dump can accumulate into a large number. To make the number of rows in each file strictly satisfy the requirement, you need to handle the last buffered rows carefully. Because the total number of rows and the number of rows in each file are integers, you cannot always hit the specified percentages exactly. Is there a way to choose x1, ..., xn such that x1+...+xn = total and x1/total, ..., xn/total best approximate the specified percentages?
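One possible answer to that last question is largest-remainder rounding: round every quota down, then hand the leftover rows to the files with the largest fractional parts. The sketch below combines it with the 100-row buffer. It is not necessarily what my project does, and it assumes whole-number percentages that sum to 100, so every full buffer splits exactly and only the last, partial buffer needs rounding. `PercentSplit`, `apportion`, and `split` are names made up for this sketch.

```scala
import scala.util.Random

object PercentSplit {
  val BufferSize = 100 // with whole-number percentages, a full buffer splits exactly

  // Largest-remainder rounding: the result sums to `total`, and each count
  // is within one row of its exact quota total * percent / 100.
  def apportion(total: Int, percents: Seq[Int]): Vector[Int] = {
    require(total >= 0 && percents.forall(_ >= 0) && percents.sum == 100)
    val scaled = percents.map(p => total.toLong * p) // exact quota times 100
    val floors = scaled.map(s => (s / 100).toInt)
    val leftover = total - floors.sum
    // Files with the largest remainders get one extra row each.
    val bump = scaled.indices.sortBy(i => -(scaled(i) % 100)).take(leftover).toSet
    scaled.indices.map(i => floors(i) + (if (bump(i)) 1 else 0)).toVector
  }

  // counts(i) copies of file index i, in random order.
  private def shuffledLabels(counts: Seq[Int], rnd: Random): Vector[Int] =
    rnd.shuffle(counts.zipWithIndex.flatMap { case (c, i) => Vector.fill(c)(i) }.toVector)

  // Streams rows to `write(fileIndex, row)`, holding at most BufferSize rows.
  // Returns the number of rows written to each file.
  def split[A](rows: Iterator[A], percents: Seq[Int], rnd: Random)
              (write: (Int, A) => Unit): Vector[Int] = {
    var written = Vector.fill(percents.size)(0)
    rows.grouped(BufferSize).foreach { buf =>
      // Target counts for everything seen so far, minus what is already written.
      val target = apportion(written.sum + buf.size, percents)
      val quota = target.zip(written).map { case (t, w) => t - w }
      buf.zip(shuffledLabels(quota, rnd)).foreach { case (row, file) => write(file, row) }
      written = target
    }
    written
  }
}
```

For 1234 rows and percentages 10, 20, 70, `apportion` gives 123, 247, and 864 rows. The exact quotas are 123.4, 246.8, and 863.8, so the two leftover rows go to the second and third files.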