I'm trying to load gzipped files from a directory on a remote machine onto the HDFS of my local machine. I want to be able to read the gzipped files from the remote machine and pipe them directly into the HDFS on my local machine. This is what I've got on the local machine:
ssh remote-host "cd /files/wanted; tar -cf - *.gz" | tar -xf - | hadoop fs -put - "/files/hadoop"
This apparently copies all of the gzipped files from the remote path to the directory where I run the command, and puts a single empty file into HDFS (presumably because the middle tar -x extracts to disk and writes nothing to stdout). The same thing happens without the tar stage:
ssh remote-host "cd /files/wanted; cat *.gz" | hadoop fs -put - "/files/hadoop"
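My hunch is that the middle tar -x is the problem: it extracts to disk and sends nothing down the pipe, so the put reads empty input. A quick local check (hypothetical temp dir, no Hadoop needed) seems to bear this out; with -O, tar extracts to stdout instead, and bytes actually flow through:

```shell
# Hypothetical local repro: without -O, `tar -x` writes to disk and the pipe
# stays empty; with -O it extracts member contents to stdout, so the
# downstream command actually receives data.
tmp=$(mktemp -d)
printf 'one' | gzip > "$tmp/a.gz"
printf 'two' | gzip > "$tmp/b.gz"
(cd "$tmp" && tar -cf - *.gz) | tar -xOf - | gzip -dc   # prints: onetwo
```

So tar -xOf - in the original command would at least push bytes into hadoop fs -put -, although everything would land concatenated into a single HDFS file.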
Just for shits and giggles, to see if I was maybe missing something simple, I tried the following on my local machine:
tar -cf - *.gz | tar -xf - -C tmp
This did what I expected: it took all of the gzipped files in the current directory and put them into the existing tmp directory.
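For the record, that local test can be reproduced end to end with throwaway files (hypothetical temp dirs):

```shell
# Pack every .gz in src/, then unpack the stream into an existing tmp/ dir.
work=$(mktemp -d)
mkdir "$work/src" "$work/tmp"
printf 'x' | gzip > "$work/src/a.gz"
printf 'y' | gzip > "$work/src/b.gz"
(cd "$work/src" && tar -cf - *.gz) | tar -xf - -C "$work/tmp"
ls "$work/tmp"   # lists a.gz and b.gz
```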
Then with the Hadoop part on the local machine:
cat my_file.gz | hadoop fs -put - "/files/hadoop"
This also did what I expected: it put my gzipped file into /files/hadoop on HDFS.
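A related local check (hypothetical temp files): concatenated gzip members still form one valid stream, so piping cat *.gz into a single put would at least yield one decompressible HDFS file, just not separate files:

```shell
# Two gzip members concatenated decode as one stream with gzip -dc.
d=$(mktemp -d)
printf 'hello\n' | gzip > "$d/a.gz"
printf 'world\n' | gzip > "$d/b.gz"
cat "$d"/*.gz | gzip -dc   # prints hello, then world
```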
Is it not possible to pipe multiple files into HDFS?
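For what it's worth, the workaround I'm leaning toward, since hadoop fs -put - appears to write exactly one destination file per invocation, is one put per remote file. This is an untested sketch; remote-host and the paths are the same hypothetical ones as above:

```shell
# Hypothetical workaround: stream each remote .gz into its own HDFS file,
# one `hadoop fs -put -` call per file, since a single put writes one file.
REMOTE=remote-host
SRC=/files/wanted
DST=/files/hadoop
put_all() {
  ssh "$REMOTE" "ls $SRC/*.gz" | while read -r f; do
    ssh "$REMOTE" "cat '$f'" | hadoop fs -put - "$DST/$(basename "$f")"
  done
}
```

Calling put_all would copy each file without staging anything locally, at the cost of one ssh connection per file.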