Friday, January 29, 2016

Using Hadoop distcp to copy files from an SFTP server

Here are the steps I follow to use hadoop distcp to copy files from an SFTP server to HDFS:
There is a bug in hadoop-filesystem-sftp that can stop distcp from running correctly when file names contain special characters that need escaping, e.g. “:”. The fix is very simple: at line 331 of SFTPFileSystem.java, URL-encode sftpFile.filename.
for (SFTPv3DirectoryEntry sftpFile : sftpFiles) {
  // Encode the raw file name so characters like ":" don't break Path parsing.
  String filename = URLEncoder.encode(sftpFile.filename, "UTF-8");
  // Skip the "." and ".." directory entries.
  if (!"..".equals(filename) && !".".equals(filename))
    fileStats.add(getFileStatus(sftpFile.attributes, new Path(path, filename).makeQualified(this)));
}
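As a quick sanity check, this small standalone snippet (the class name and sample file name are mine, not from the patch) shows what the encoding does to a problematic file name:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeDemo {
  public static void main(String[] args) throws UnsupportedEncodingException {
    // ":" is escaped to "%3A", so Hadoop no longer mistakes it for a scheme separator.
    System.out.println(URLEncoder.encode("report:2016-01-29.csv", "UTF-8"));
    // prints: report%3A2016-01-29.csv
  }
}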
Using a password would be easier, but your SFTP password would effectively become public, because it shows up in the MapReduce job configuration. hadoop-filesystem-sftp uses ganymed-ssh-2, which only supports authentication by password or key file.
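For illustration only, a password-based run would look something like the line below (I am assuming the property name fs.sftp.password; check the property names your version of hadoop-filesystem-sftp actually reads). Anyone who can view the job configuration can then read the password:

$ hadoop distcp -D fs.sftp.user=username -D fs.sftp.password=secret -libjars hadoop-filesystem-sftp-0.0.1-SNAPSHOT-jar-with-dependencies.jar sftp://sftp-server/src-path hdfs://namenode/target-path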
Here is how to set up passwordless SSH (you need permission to log on to the SFTP server). Create an SSH key pair with ssh-keygen, making sure you don’t overwrite your existing key pair in $HOME/.ssh, and name the private key id_rsa so it matches the distcp command below. Then copy the server’s entry from your own $HOME/.ssh/known_hosts into ${distcp_ssh}/known_hosts (ssh-keygen -F prints it; connect to the server once beforehand so the entry exists).

$ ssh-keygen -f ${distcp_ssh}/id_rsa
$ ssh-keygen -F sftp-server-name >> ${distcp_ssh}/known_hosts
Copy the public key to the SFTP server.
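One way to do that, assuming ssh-copy-id is installed and username is your account on the server:

$ ssh-copy-id -i ${distcp_ssh}/id_rsa.pub username@sftp-server-name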
Copy ${distcp_ssh} to all data nodes and to the client node. On the data nodes, make the directory readable only by the yarn user. On the client node, make it readable by the user you use to run hadoop distcp.
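A minimal sketch of the permissions, assuming your tasks run as the yarn user and you submit distcp as myuser (both are placeholders; adjust the owners to your cluster):

On each data node:
$ chown -R yarn ${distcp_ssh}
$ chmod 700 ${distcp_ssh}
$ chmod 600 ${distcp_ssh}/id_rsa ${distcp_ssh}/known_hosts

On the client node:
$ chown -R myuser ${distcp_ssh}
$ chmod 700 ${distcp_ssh}
$ chmod 600 ${distcp_ssh}/id_rsa ${distcp_ssh}/known_hosts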
The reason is: hadoop-filesystem-sftp defaults to ${user.home} for the key file and known_hosts paths, and on each data node the task runs as the yarn user rather than the user who started the command. Unfortunately, this also means someone could write a MapReduce job to grab your id_rsa key file and gain access to the SFTP server.
$ hadoop distcp -D fs.sftp.user=username \
    -D fs.sftp.key.file=${distcp_ssh}/id_rsa \
    -D fs.sftp.knownhosts=${distcp_ssh}/known_hosts \
    -libjars hadoop-filesystem-sftp-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    sftp://sftp-server/src-path hdfs://namenode/target-path
WARNING: Don’t use this method unless you have to.

1 comment:

  1. Does this work to copy from a Windows SFTP server to a Linux server (where HDFS lies)?
