Friday, November 30, 2012

Bad performance of Hive metastore for tables with a large number of partitions

Just found this article: Batch fetching - optimizing object graph loading

We have some tables with 15K ~ 20K partitions. If I run a query that scans a lot of partitions, Hive can take more than 10 minutes just to submit the mapred job.

The problem is caused by ObjectStore.getPartitionsByNames, which the Hive semantic analyzer calls when it prunes partitions. This method sends a lot of queries to our MySQL database to retrieve ALL information about the partitions. Because each MPartition and MStorageDescriptor is converted into a Partition and StorageDescriptor, every field is accessed during the conversion; in other words, even fields that have nothing to do with partition pruning, such as BucketCols, are loaded. In our case, about 10 queries are sent to the database for each partition, and each query may take 40ms.
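
Roughly, the conversion follows this pattern (a sketch, not the actual ObjectStore code; the table names in the comments are my reading of the metastore's MySQL schema):

    // Illustrative sketch: every getter on the JDO object may trigger a
    // lazy load, i.e. its own SQL query, during the conversion.
    Partition convertToPart(MPartition mpart) {
        Partition part = new Partition();
        part.setValues(mpart.getValues());          // query against PARTITION_KEY_VALS
        part.setSd(convertToSd(mpart.getSd()));     // queries against SDS, CDS, COLUMNS_V2, ...
        part.setParameters(mpart.getParameters());  // query against PARTITION_PARAMS
        return part;
    }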

This is the well-known ORM N+1 query problem, but it makes for a really bad user experience.

Actually, if we assembled the Partition objects manually, we would only need about 10 queries for each group of partitions (the default fetch size is 300). In our environment, that works out to only about 40 seconds for 30K partitions: 30K / 300 = 100 groups, and 100 * 10 queries * 40ms = 40s.

I tried this approach:

  1. Fetch MPartition with a fetch group and fetch_size_greedy, so that one query gets each MPartition's primary fields and leaves its MStorageDescriptor cached.
  2. Collect all the descriptors into a list "msds", then run another query for MStorageDescriptor with a filter like "msds.contains(this)", so all the cached descriptors are refreshed in one query instead of N queries (see the sketch after this list).
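
Sketched in code, assuming the javax.jdo API (the fetch-group name, query filters, and class names are illustrative, not Hive's actual code):

    import java.util.Collection;
    import java.util.List;
    import javax.jdo.FetchPlan;
    import javax.jdo.PersistenceManager;
    import javax.jdo.Query;

    class FetchSketch {
        // Step 1: fetch group + greedy fetch size, so one query pulls the
        // primary MPartition fields and leaves the descriptors cached.
        @SuppressWarnings("unchecked")
        List<MPartition> fetchPartitions(PersistenceManager pm, Collection<String> names) {
            pm.getFetchPlan().addGroup("partitionPrimaryFields");
            pm.getFetchPlan().setFetchSize(FetchPlan.FETCH_SIZE_GREEDY);
            Query q = pm.newQuery(MPartition.class, "names.contains(partitionName)");
            q.declareParameters("java.util.Collection names");
            return (List<MPartition>) q.execute(names);
        }

        // Step 2: refresh all cached descriptors with a single contains() query,
        // instead of one query per descriptor.
        @SuppressWarnings("unchecked")
        List<MStorageDescriptor> refreshDescriptors(PersistenceManager pm,
                Collection<MStorageDescriptor> msds) {
            Query q = pm.newQuery(MStorageDescriptor.class, "msds.contains(this)");
            q.declareParameters("java.util.Collection msds");
            return (List<MStorageDescriptor>) q.execute(msds);
        }
    }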

This works well for 1-1 relations, but not for 1-N relations like MPartition.values; I didn't find a way to populate those fields in just one query.

Because the JDO mapping doesn't work well for the conversion (MPartition -> Partition), I'm wondering whether it is worth doing something like this:

  1. Query each table (PARTITIONS, SDS, etc.) directly in SQL.
  2. Assemble the Partition objects by hand (a rough sketch follows this list).
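
Something like the following, assuming plain JDBC against the metastore's MySQL tables (column names should be double-checked against the actual schema; only the first query is shown):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    class DirectSqlSketch {
        // Rough sketch: one query per metastore table, stitched together by id,
        // instead of letting the ORM fire several queries per partition.
        Map<Long, Partition> loadPartitions(Connection conn, long tblId) throws SQLException {
            Map<Long, Partition> parts = new LinkedHashMap<>();
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT PART_ID, PART_NAME, SD_ID FROM PARTITIONS WHERE TBL_ID = ?")) {
                ps.setLong(1, tblId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        parts.put(rs.getLong("PART_ID"), new Partition());
                    }
                }
            }
            // Then one query each against PARTITION_PARAMS, SDS, etc., attaching
            // the rows to the Partition objects assembled above.
            return parts;
        }
    }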

This is a hack and the code will be really ugly, but I didn't find JDO support for "FETCH JOIN" or "batch fetch".

2 comments:

  1. JDO mapping offers complete flexibility in this respect, and you can obviously specify whichever fields you want fetched. Equally obvious is that any such fetch is a hint to the *implementation* to try to fetch those fields if sensible with that datastore.

    With an RDBMS it is not so sensible to even try to fetch a multivalued field of an object: it means large joins across multiple tables, returning rows with duplicated columns. The JPA spec doesn't even mention being able to specify a multivalued field in the SELECT clause of JPQL (for this very reason). There are other ways of getting collection information. JDO itself has no such limitation; it is just down to how a particular implementation provides some feature, and how the people who made use of that implementation wrote their code.

    Obviously the performance of the DataNucleus JDO/JPA implementation is comparable with the best ORMs in this respect when used sensibly, see http://s3-eu-west-1.amazonaws.com/presentations2012/50_presentation.pdf

  2. Hi Andy,

    I rewrote the Hive MetaStore code with EclipseLink's @BatchFetch. It IS much faster (6x) than JDO in this case, because the number of queries sent to the database is limited to 10 instead of 300*10 for JDO. To make it simple: MPartition has a collection called parameters, and when 300 partitions are retrieved from the DBMS, the Hive metastore DOES need to load their parameters, so lazy loading doesn't help here. To load the parameters, JDO has to send 300 queries to the RDBMS, one to retrieve the parameters of each MPartition. But with @BatchFetch, EclipseLink collects all the loaded MPartitions' ids, sends one query to the RDBMS to retrieve all the parameters, and then constructs the collections in all the loaded MPartition objects. @BatchFetch definitely works more efficiently.

    class MPartition {
        ...
        @BatchFetch
        Map parameters;
        ...
    }
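
    For context, a fuller version of that mapping might look like the sketch below (field names and types are simplified assumptions, not the code I actually committed; BatchFetchType.IN matches the id-collecting behavior described above):

    import java.util.Map;
    import javax.persistence.ElementCollection;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import org.eclipse.persistence.annotations.BatchFetch;
    import org.eclipse.persistence.annotations.BatchFetchType;

    @Entity
    public class MPartition {
        @Id
        private long id;

        // EclipseLink gathers the ids of all MPartition rows loaded by the
        // original query and issues a single IN (...) query for their
        // parameters, instead of one query per partition.
        @BatchFetch(BatchFetchType.IN)
        @ElementCollection
        private Map<String, String> parameters;
    }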
