Recently I wrote a post for Bad performance of Hive meta store for tables with large number of partitions. I did tests in our environment. Here is what I found:
- Don't configure a hive client to access remote MySQL database directly as follows. The performance is really bad, especially when you query a table with a large number of partitions.
javax.jdo.option.ConnectionURL jdbc:mysql://mysql_server/hive_meta javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver javax.jdo.option.ConnectionUserName hive_user javax.jdo.option.ConnectionPassword password
- On database server, use the same configuration as above
- Start the hive metasore service
hive --service metastore # If use CDH yum install hive-metastore /sbin/service hive-metastore start
hive.metastore.uris thrift://mysql_server:9083
ERROR conf.HiveConf: Found both hive.metastore.uris and javax.jdo.option.ConnectionURL Recommended to have exactly one of those config keyin configurationThe reason is: when Hive does partition pruning, it will read a list of partitions. The current metastore implementation uses JDO to query the metastore database:
- Get a list of partition names using db.getPartitionNames()
- Then call db.getPartitionsByName(List<Strin> partNames). If the list is too large, it will load in multiple times, 300 for each load by default. The JDO calls like this
- For one MPartition object.
- Send 1 query to retrieve MPartition basic fields.
- Send 1 query to retrieve MStorageDescriptors
- Send 1 query to retrieve data from PART_PARAMS.
- Send 1 query to retrieve data from PARTITION_KEY_VALS.
- ...
- Totally 10 queries for one MPartition. Because MPartition will be converted into Partition before send by, all fields will be populated
- If one query takes 40ms in my environment. And you can calculate how long does it take for thousands partitions.
- Using remote Hive metastore service, all those queries happens locally, it won't take that long for each query, so you can get performance improved significantly. But there are still a lot of queries.
I also wrote ObjectStore using EclipseLink JPA with @BatchFetch. Here is the test result, it will at least 6 times faster than remote metastore service. It will be even faster.
Partitions | JDO Remote MySQL |
Remote Service
| EclipseLink Remote MySQL |
10 | 6,142 | 353 | 569 |
100 | 57,076 | 3,914 | 940 |
200 | 116,216 | 5,254 | 1,211 |
500 | 287,416 | 21,385 | 3,711 |
1000 | 574,606 | 39,846 | 6,652 |
3000 | 132,645 | 19,518 |
No comments:
Post a Comment