Friday, January 29, 2016

Using Hadoop distcp to copy files from an SFTP server

Here are the steps I use to copy files from an SFTP server to HDFS with "hadoop distcp":
  • Clone hadoop-filesystem-sftp at
  • There is a bug in hadoop-filesystem-sftp that can prevent distcp from running correctly when file names contain special characters that should be escaped, e.g. ":". The fix is simple: find line 331 in, and encode sftpFile.filename:
      // Requires: import java.net.URLEncoder;
      for (SFTPv3DirectoryEntry sftpFile : sftpFiles) {
        // Encode the name so that characters such as ":" are escaped.
        String filename = URLEncoder.encode(sftpFile.filename, "UTF-8");
        if (!"..".equals(filename) && !".".equals(filename)) {
          fileStats.add(getFileStatus(sftpFile.attributes, new Path(path, filename).makeQualified(this)));
        }
      }
  • Using a password is easier, but your SFTP password becomes public because it is shown in the MapReduce job configuration.
  • hadoop-filesystem-sftp uses ganymed-ssh-2, which supports authentication only by password or key file.
  • Here is how to set up passwordless SSH; you need permission to log on to the SFTP server. Create an SSH key pair using "ssh-keygen". Make sure you don't overwrite your current key pair in $HOME/.ssh.
    $ ssh-keygen -f ${distcp_ssh}/id_rsa
    $ ssh-keygen -F sftp-server-name -f ${distcp_ssh}/known_hosts
  • Copy the public key to the SFTP server.
    • Copy ${distcp_ssh} to all data nodes and the client node, and make the directory readable only by yarn.
    • On the client node, also make this directory readable by the user who runs "hadoop distcp".
    • The reason for doing this is that hadoop-filesystem-sftp uses ${user.home} as the default location for the key file and known_hosts. More importantly, on each data node the task runs as yarn rather than as the user who ran the command. Unfortunately, this means someone can write a MapReduce job to grab your id_rsa key file and gain access to the SFTP server.
  • Run distcp:
    $ hadoop distcp -D fs.sftp.user=username -D fs.sftp.key.file=${distcp_ssh}/id_rsa -D fs.sftp.knownhosts=${distcp_ssh}/known_hosts -libjars hadoop-filesystem-sftp-0.0.1-SNAPSHOT-jar-with-dependencies.jar sftp://sftp-server/src-path hdfs://namenode/target-path
  • WARNING: Don't use this method unless you have to.
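
To see what the encoding fix in the listing loop does to a file name, here is a small standalone sketch. It only exercises java.net.URLEncoder and URLDecoder from the JDK; the class name, the helper methods and the sample file name are made up for illustration and are not part of hadoop-filesystem-sftp. It also shows that "." and ".." pass through the encoder unchanged, so the check in the loop still works on the encoded name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class SftpNameEncodingDemo {
    // Encode a remote file name with the same call the patched loop uses.
    static String encodeName(String rawName) throws UnsupportedEncodingException {
        return URLEncoder.encode(rawName, "UTF-8");
    }

    // Recover the original remote name from an encoded one.
    static String decodeName(String encodedName) throws UnsupportedEncodingException {
        return URLDecoder.decode(encodedName, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Hypothetical file name containing ":" characters.
        String raw = "export_2016-01-29T10:15:00.csv";
        String encoded = encodeName(raw);

        System.out.println(encoded);                          // export_2016-01-29T10%3A15%3A00.csv
        System.out.println(decodeName(encoded).equals(raw));  // true

        // "." and ".." are left untouched by the encoder.
        System.out.println(encodeName("."));                  // .
        System.out.println(encodeName(".."));                 // ..
    }
}
```

Note that URLEncoder performs form encoding, so a space in a file name becomes "+" rather than "%20". If your file names contain spaces, it is worth checking how they come out on the HDFS side.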

1 comment:

  1. Does this work to copy from Windows SFTP Server to Linux Server(where HDFS lies)?