Tuesday, August 13, 2013

Output a stream into multiple files in the specified percentages.

I recently finished a project that writes JDBC results randomly into multiple files in specified percentages. For example, I want to generate three files that hold 10%, 20%, and 70% of the total rows respectively, and the file chosen for each row is picked at random. The data is dumped from Hive/Impala through JDBC, and the result can have millions of rows.

The problem seems easy. The first algorithm that jumped into my mind was: generate a random number between 0 and 1 for each row; if the value falls in [0, 0.1), write the row to the first file; if it falls in [0.1, 0.3), write it to the second file; otherwise write it to the third file. This method works, but unfortunately not perfectly. The problem is that the output rows may not strictly satisfy the percentage requirement. The random numbers generated in Java/Scala are uniformly distributed, but that doesn't mean that out of 1000 draws there will be exactly 100 between 0 and 0.1, 200 between 0.1 and 0.3, and 700 between 0.3 and 1.0. The actual counts can easily be off by a few percent.
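That first approach can be sketched as below (a minimal illustration of the idea; the class and method names are mine, and the loop stands in for iterating over the real JDBC result set):

```java
import java.util.Random;

public class NaiveSplit {
    // Cumulative upper bounds for the 10%, 20%, 70% split.
    static final double[] BOUNDS = {0.1, 0.3, 1.0};

    // Pick a file index (0, 1, or 2) for one row.
    static int pickFile(Random rnd) {
        double r = rnd.nextDouble();           // uniform in [0, 1)
        for (int i = 0; i < BOUNDS.length; i++) {
            if (r < BOUNDS[i]) return i;
        }
        return BOUNDS.length - 1;              // defensive; not reached
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        int[] counts = new int[3];
        for (int row = 0; row < 1000; row++) {
            counts[pickFile(rnd)]++;           // in the real code: write the row out
        }
        // The counts are only *close* to 100/200/700, not exact.
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);
    }
}
```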

The next algorithm I came up with was:

  • Output the rows into a temporary file; then the total number of rows n is known. 
  • Generate an array with 0.1*n 1s, 0.2*n 2s, and 0.7*n 3s, and shuffle it. 
  • Read the temporary file line by line, and write each row to the file indicated by the corresponding number in the array.

This method generates rows that satisfy the percentage requirements, but it is definitely bad because it needs a temporary file and maybe a huge array in memory.
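The shuffled label array from the steps above could be built like this (a sketch with hypothetical names; rounding of n*percentage is glossed over by giving the remainder to the last file):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LabelArray {
    // Build a shuffled list of file labels (0, 1, 2) for n rows at 10%/20%/70%.
    static List<Integer> labels(int n) {
        List<Integer> labels = new ArrayList<>(n);
        int n0 = (int) Math.round(0.1 * n);
        int n1 = (int) Math.round(0.2 * n);
        for (int i = 0; i < n0; i++) labels.add(0);
        for (int i = 0; i < n1; i++) labels.add(1);
        while (labels.size() < n) labels.add(2);  // remainder goes to the third file
        Collections.shuffle(labels);
        return labels;
    }

    public static void main(String[] args) {
        // Row i read back from the temporary file goes to file labels.get(i).
        List<Integer> l = labels(1000);
        System.out.println(l.size());
    }
}
```

The memory problem is obvious from the code: for millions of rows, `labels` alone holds millions of boxed integers, on top of the temporary file on disk.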

I finally figured out a better way that needs neither a temporary file nor a large array: just buffer 100 rows; once the buffer is full, shuffle the rows and write them to the files according to the percentages. There are usually still rows left in the buffer when the stream is closed, and you cannot simply write them according to the percentages, because the rounding errors from each buffer dump can accumulate to a large number. To make the number of rows in each file satisfy the requirement strictly, you need to handle the last buffered rows carefully. Because the total number of rows is an integer, and so are the numbers of rows in each file, you cannot hit the specified percentages exactly. Is there a way to choose x1 + ... + xn = total such that x1/total, ..., xn/total best approximate the specified percentages?
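One standard answer to that closing question is the largest-remainder method: take the floor of each exact share total*p_i, then hand the leftover units to the shares with the biggest fractional parts. The sketch below (my names, not the project's code; rows are dealt to in-memory lists rather than real file writers) applies it to each buffer flush, including the final partial one:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BufferedSplit {
    static final double[] PCT = {0.1, 0.2, 0.7};

    // Largest-remainder apportionment: returns integers that sum to exactly
    // `total`, each as close as possible to total * PCT[i].
    static int[] apportion(int total) {
        int n = PCT.length;
        int[] counts = new int[n];
        double[] frac = new double[n];
        int assigned = 0;
        for (int i = 0; i < n; i++) {
            double exact = total * PCT[i];
            counts[i] = (int) Math.floor(exact);
            frac[i] = exact - counts[i];
            assigned += counts[i];
        }
        // Give each leftover unit to the share with the largest remainder.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(frac[b], frac[a]));
        for (int k = 0; k < total - assigned; k++) counts[order[k]]++;
        return counts;
    }

    // Shuffle the buffered rows, then deal them out according to an exact
    // apportionment of the buffer's size.
    static void flush(List<String> buffer, List<List<String>> files) {
        Collections.shuffle(buffer);
        int[] counts = apportion(buffer.size());
        int pos = 0;
        for (int i = 0; i < counts.length; i++) {
            files.get(i).addAll(buffer.subList(pos, pos + counts[i]));
            pos += counts[i];
        }
        buffer.clear();
    }
}
```

With a buffer of 100 rows and these particular percentages, every full flush is exact, so only the final partial buffer contributes any rounding at all; a fully strict scheme would apportion the grand total and subtract the rows already written, but per-buffer apportionment already keeps each file within one row of its target here.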
