Thursday, December 27, 2012

Have to read all output from a Ruby PTY

I wrote a Ruby script that calls ssh-copy-id to distribute my public key to a list of hosts. It took me a while to make it work. The code is simple: read the password, then call ssh-copy-id for each host to copy the public key. Of course, it doesn't handle every scenario, such as the key already being copied or a wrong password. The tricky parts are "cp_out.readlines" and "Process.wait(pid)": if you don't read all the data from cp_out (try commenting out cp_out.readlines), the spawned process never returns.
#!/usr/bin/env ruby
require 'rubygems'

require 'pty'
require 'expect'
require 'io/console'

hosts_file = ARGV[0] || "hosts"

print "Password:"
password = $stdin.noecho(&:gets)
password.chomp!
puts

$expect_verbose = true
File.foreach(hosts_file) do |host|
  host.chomp!
  print "Copying id to #{host} ... "
  begin
    PTY.spawn("ssh-copy-id #{host}") do |cp_out, cp_in, pid|
      begin
        pattern = /#{host}'s password:/
        cp_out.expect(pattern, 10) do |m|
          cp_in.puts(password) # printf would mangle a password containing '%'
        end
        cp_out.readlines
      rescue Errno::EIO
      ensure
        Process.wait(pid)
      end
    end
  rescue PTY::ChildExited => e
    puts "Exited: #{e.status}"
  end
  status = $?
  if status.success?
    puts "Done!"
  else
    puts "Failed with exit code #{status.exitstatus}!"
  end
end
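The drain-then-wait pattern above can be reduced to a small reusable sketch; run_on_pty is a hypothetical helper name, not part of the original script:

```ruby
require 'pty'

# Spawn a child on a PTY, drain everything it writes, then reap it.
# Skipping the drain step can leave the child blocked on a full PTY
# buffer, so Process.wait would never return.
def run_on_pty(cmd)
  output = ''
  PTY.spawn(cmd) do |reader, writer, pid|
    begin
      # Read until the child closes its side of the PTY; on Linux this
      # shows up as Errno::EIO rather than a clean EOF.
      loop { output << reader.readpartial(4096) }
    rescue EOFError, Errno::EIO
      # All output has been read.
    ensure
      Process.wait(pid)
    end
  end
  [output, $?]
end

out, status = run_on_pty("echo hello")
puts out              # PTYs translate "\n" into "\r\n"
puts status.success?  # true
```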

Tuesday, December 11, 2012

Puppet require vs. include vs. class

According to the Puppet reference for include and require:

  • Both include and require are functions;
  • Both include and require will "Evaluate one or more classes";
  • Neither include nor require can handle parameterized classes;
  • require is a superset of include, because it also "adds the required class as a dependency";
  • require could cause a "nasty dependency cycle";
  • require is "largely unnecessary"; see the Puppet language guide:
    Puppet also has a require function, which can be used inside class definitions and which does implicitly declare a class, in the same way that the include function does. This function doesn’t play well with parameterized classes. The require function is largely unnecessary, as class-level dependencies can be managed in other ways.
  • We can include a class multiple times, but cannot declare a class multiple times.
    class inner {
      notice("I'm inner")
    
      file {"/tmp/abc":
        ensure => directory
      }
    }
    
    class outer_a {
      # include inner
      class { "inner": }
    
      notice("I'm outer_a")
    }
    
    class outer_b {
      # include inner
      class { "inner": }
    
      notice("I'm outer_b")
    }
    
    include outer_a
    include outer_b
    
    Duplicate declaration: Class[Inner] is already declared in file /home/bewang/temp/puppet/require.pp at line 11; cannot redeclare at /home/bewang/temp/puppet/require.pp:18 on node pmaster.puppet-test.com
    
  • You can safely mix include with a resource-like declaration as long as the declaration comes first: the first two examples below pass, but you cannot declare class inner after outer_a or outer_b has already included it, as in the third one:
    class inner {
    }
    
    class outer_a {
      include inner
    }
    
    class outer_b {
      include inner
    }
    
    class { "inner": }
    include outer_a
    include outer_b
    
    class { "inner": }
    class { "outer_a": }
    class { "outer_b": }
    
    include outer_a
    include outer_b
    class { "inner": } # Duplicate declaration error
    
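The "adds the required class as a dependency" point can be sketched as follows; this is a minimal illustration, and the class names ntp and myapp are mine, not from the examples above:

    class ntp {
      package { "ntp": ensure => installed }
    }

    class myapp {
      # require declares Class[ntp] (like include) and additionally makes
      # the surrounding class depend on it, roughly the same as:
      #   include ntp
      #   Class["ntp"] -> Class["myapp"]
      require ntp
    }

    include myapp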

Thursday, December 6, 2012

Hive Metastore Configuration

Recently I wrote a post about the bad performance of the Hive metastore for tables with a large number of partitions. I ran tests in our environment. Here is what I found:

  • Don't configure a Hive client to access the remote MySQL database directly, as in the following hive-site.xml snippet. The performance is really bad, especially when you query a table with a large number of partitions.

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://mysql_server/hive_meta</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive_user</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>password</value>
    </property>
  • You must start the Hive metastore service on the same server where the Hive MySQL database is running.
    • On the database server, use the same configuration as above.
    • Start the Hive metastore service:

      hive --service metastore

      # If using CDH:
      yum install hive-metastore
      /sbin/service hive-metastore start

    • On the Hive client machine, use the following configuration:

      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://mysql_server:9083</value>
      </property>

    • Don't worry if you see this error message:

      ERROR conf.HiveConf: Found both hive.metastore.uris and javax.jdo.option.ConnectionURL Recommended to have exactly one of those config key in configuration
The reason is that when Hive does partition pruning, it reads a list of partitions. The current metastore implementation uses JDO to query the metastore database:
  1. Get a list of partition names using db.getPartitionNames().
  2. Call db.getPartitionsByNames(List<String> partNames). If the list is too large, it is loaded in multiple batches, 300 per batch by default. The JDO calls look like this:
    • For one MPartition object:
    • Send 1 query to retrieve the MPartition basic fields.
    • Send 1 query to retrieve the MStorageDescriptor.
    • Send 1 query to retrieve data from PART_PARAMS.
    • Send 1 query to retrieve data from PARTITION_KEY_VALS.
    • ...
    • In total, about 10 queries for one MPartition. Because each MPartition is converted into a Partition before being sent back, all fields must be populated.
  3. In my environment one query takes about 40 ms; you can calculate how long that adds up to for thousands of partitions.
  4. With a remote Hive metastore service, all those queries happen locally on the database server, so each query takes far less time and performance improves significantly. But there are still a lot of queries.
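As a rough sanity check of those numbers (assuming the ~10 queries per partition and ~40 ms per query described above):

```ruby
# Back-of-the-envelope estimate of partition-listing cost over JDO,
# using the ~10 queries per MPartition and ~40 ms per query from above.
QUERIES_PER_PARTITION = 10
MS_PER_QUERY = 40

[100, 1000, 3000].each do |partitions|
  total_ms = partitions * QUERIES_PER_PARTITION * MS_PER_QUERY
  puts "#{partitions} partitions: ~#{total_ms / 1000} s"
end
# 1000 partitions comes out to ~400 s, the same order of magnitude as
# the ~574 s measured for JDO against remote MySQL in the table below.
```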

I also rewrote ObjectStore using EclipseLink JPA with @BatchFetch. Here are the test results: it is at least 6 times faster than the remote metastore service against the same remote MySQL database, and could be made faster still.

Query times in milliseconds:

Partitions   JDO Remote MySQL   Remote Service   EclipseLink Remote MySQL
10           6,142              353              569
100          57,076             3,914            940
200          116,216            5,254            1,211
500          287,416            21,385           3,711
1000         574,606            39,846           6,652
3000         -                  132,645          19,518