Gentoo Forums
*FASTEST* copy of a few million small files around 5.5TB
njcwotx
Guru


Joined: 25 Feb 2005
Posts: 587
Location: Texas

PostPosted: Fri Sep 30, 2016 10:58 pm    Post subject: *FASTEST* copy of a few million small files around 5.5TB Reply with quote

I did a little research and the consensus is all over the board on this.


I have to transfer files from an SMB share on a Windows server (NTFS) to a Linux host (ext4). It's millions of small files, about 5.5TB total. Even on a 10Gig network, it's copying about 20GB/hr; I estimate over 10 days for the copy.

To start, we mounted the share on the Linux host, and I think dev is using rsync to move the files.

Doing a little research, I see some suggestions of using tar via ssh (I have done this in the past but don't remember how fast it was). Others suggest using netcat or scp.

10+ days is a long time to run a copy. I wonder if anyone has crossed this bridge before :)
_________________
Drinking from the fountain of knowledge.
Sometimes sipping.
Sometimes gulping.
Always thirsting.
chithanh
Developer


Joined: 05 Aug 2006
Posts: 2158
Location: Berlin, Germany

PostPosted: Fri Sep 30, 2016 11:17 pm    Post subject: Reply with quote

I think SMB is the bottleneck here.
Can't you run rsync on the Windows server directly?

scp is definitely slower than rsync if you have small files.
njcwotx
Guru


Joined: 25 Feb 2005
Posts: 587
Location: Texas

PostPosted: Sat Oct 01, 2016 12:00 am    Post subject: Reply with quote

Dev did a mount -t cifs on Linux, then a local rsync. So you think an rsync from Windows directly would be faster?
_________________
Drinking from the fountain of knowledge.
Sometimes sipping.
Sometimes gulping.
Always thirsting.
Hu
Moderator


Joined: 06 Mar 2007
Posts: 21635

PostPosted: Sat Oct 01, 2016 12:25 am    Post subject: Reply with quote

mount -t cifs means you are using SMB/CIFS to transport the files from the Windows machine to the Linux kernel, which then exposes them to rsync as a regular filesystem. If you run an rsync daemon on the Windows machine, you can use the rsync protocol, which may be more efficient.
chithanh
Developer


Joined: 05 Aug 2006
Posts: 2158
Location: Berlin, Germany

PostPosted: Sat Oct 01, 2016 8:52 am    Post subject: Reply with quote

You could also run the rsync daemon on the Linux host and an rsync client on the Windows machine, which might be easier to set up.
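For reference, a minimal sketch of what that daemon setup could look like. The module name `scans`, the path `/data/scans`, and the host name `linux-host` are all hypothetical placeholders; adjust for the real share:

```shell
# /etc/rsyncd.conf on the Linux host (hypothetical module and path):
#
#   [scans]
#       path = /data/scans
#       read only = false
#
# Start the daemon on the Linux host:
rsync --daemon
# From the Windows side (e.g. cwRsync), push over the rsync protocol;
# the rsync:// URL (or double-colon syntax) selects the daemon protocol
# instead of a remote shell:
rsync -a --progress ./scans/ rsync://linux-host/scans/
```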
toralf
Developer


Joined: 01 Feb 2004
Posts: 3922
Location: Hamburg

PostPosted: Sat Oct 01, 2016 11:15 am    Post subject: Reply with quote

The fastest method IMO is something like
Code:
tar -cjpf - ./ | ssh user@host "cd foo/bar && tar -xjpf -"
A nifty side effect of ssh is that the data isn't corrupted during transfer - but maybe run an rsync afterwards to verify it.
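On a 10Gig LAN the bzip2 step itself can easily become the bottleneck, so a variant of the same pipeline without compression may be worth trying (host and paths are placeholders):

```shell
# Pack on the fly, stream over ssh, unpack on the remote side.
# No -j here: raw tar keeps the CPU out of the critical path on a fast LAN.
tar -cpf - ./ | ssh user@host "cd /foo/bar && tar -xpf -"
```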
frostschutz
Advocate


Joined: 22 Feb 2005
Posts: 2977
Location: Germany

PostPosted: Sat Oct 01, 2016 11:19 am    Post subject: Reply with quote

Sometimes it's faster to just carry the hard disks over :lol:

Sending the raw filesystem image over could be faster than tar. It depends.

Seriously though, rsync is not the worst solution by far. It might lack speed, but if there is any problem during the transfer, rsync resumes easily where the others do not.
Naib
Watchman


Joined: 21 May 2004
Posts: 6051
Location: Removed by Neddy

PostPosted: Sat Oct 01, 2016 11:32 am    Post subject: Reply with quote

frostschutz wrote:
Sometimes it's faster to just carry the hard disks over :lol:

Sending the raw filesystem image over could be faster than tar. It depends.

Seriously though, rsync is not the worst solution by far. It might lack speed, but if there is any problem during the transfer, rsync resumes easily where the others do not.
Exactly :) a flashdrive strapped to a pigeon is still very fast :)
If you really need to do it via the network, remember compression should make it faster, so: compress --> netcat --> uncompress
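That pipeline might look something like the following sketch. The port number, host name, and destination path are placeholders, and gzip -1 is chosen to trade compression ratio for speed:

```shell
# On the receiver, listen first (traditional netcat syntax; some netcat
# variants drop the -p):
#   nc -l -p 9000 | gzip -d | tar -xpf - -C /dest
# On the sender, pack, compress lightly, and ship the stream:
tar -cpf - ./ | gzip -1 | nc receiver-host 9000
```

Note that netcat adds no integrity checking or encryption of its own, which is why this belongs on a trusted network only.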
_________________
Quote:
Removed by Chiitoo
szatox
Advocate


Joined: 27 Aug 2013
Posts: 3137

PostPosted: Sun Oct 02, 2016 8:15 pm    Post subject: Reply with quote

Quote:
Dev did a mount -t cifs on Linux. Then a local rsync.

Don't do that. Ever.
This way rsync is not aware that the transfer goes over the network and will not attempt to play it smart.
Quote:
So u think an rsync from Windows direct is faster?

Yes.

Quote:
tar -cjpf - ./ | ssh user@host "cd foo/bar && tar -xjpf -"
+1 for tar. You may consider filtering it through gzip (or pigz, to load more of your CPUs) if your network limits you.
Also, I have sometimes sent the stream through netcat. That is particularly useful if you lack the processing power for heavy number crunching (though I don't recommend doing it over a public network).
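A sketch of that gzip filtering over ssh; since pigz is a drop-in parallel replacement for gzip, swapping it in is just a name change. Host and destination path are placeholders:

```shell
# Compress the tar stream on the way out, decompress on the remote end:
tar -cpf - ./ | gzip -1 | ssh user@host "cd /dest && gzip -d | tar -xpf -"
# With pigz on both ends, e.g.:
#   tar -cpf - ./ | pigz -p 8 | ssh user@host "cd /dest && pigz -d | tar -xpf -"
```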
toralf
Developer


Joined: 01 Feb 2004
Posts: 3922
Location: Hamburg

PostPosted: Sun Oct 02, 2016 9:07 pm    Post subject: Reply with quote

szatox wrote:
Quote:
tar -cjpf - ./ | ssh user@host "cd foo/bar && tar -xjpf -"
+1 for tar. You may consider filtering it through gzip (or pigz, to load more of your CPUs)
-j calls bzip2, doesn't it?
njcwotx
Guru


Joined: 25 Feb 2005
Posts: 587
Location: Texas

PostPosted: Fri Oct 14, 2016 9:17 pm    Post subject: Reply with quote

Well, I got back to work, found an older Windows VM with the same type of file set, and tested rsync from Windows; the copy moved very fast from that box to Linux. Didn't try the reverse in test, though.

Showed this to my boss and the dev guy. They stopped their copy and we ran the rsync from the prod Windows host, only to find it still crawled terribly copying that same set of files. It does not appear to be any faster, although looking at a network perf graph the transfer does appear to be smoother; overall we calculate it's going to take a very long time.

These are scanned images: millions of kilobyte-sized files and a few larger ones scattered in there, in an immense folder tree. Some things just take forever.
_________________
Drinking from the fountain of knowledge.
Sometimes sipping.
Sometimes gulping.
Always thirsting.
1clue
Advocate


Joined: 05 Feb 2006
Posts: 2569

PostPosted: Sat Oct 15, 2016 1:10 am    Post subject: Reply with quote

Doesn't tar have a maximum file size? You might have to segment the archive. Note I'm not talking about making a tar file on the source system and then sending it over.

If you have a 10Gbps secure (non-Internet) network, then I think ssh is unnecessary. It adds encryption, which some CPUs aren't so fast at.

But +1 on something to combine all the small files into one or more big files, preferably compressed, and then sending the big file.

Edit: GNU tar has a maximum file size of 8GB on the compressed side.
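If a size limit were a concern, the tar stream can be cut into fixed-size pieces with split and reassembled with cat, without ever materializing one giant archive. A sketch with a hypothetical 100G chunk size:

```shell
# Produce archive.part.aa, archive.part.ab, ... each at most 100G:
tar -cpf - ./ | split -b 100G - archive.part.
# Reassemble in shell-glob (alphabetical) order and unpack:
cat archive.part.* | tar -xpf -
```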
frostschutz
Advocate


Joined: 22 Feb 2005
Posts: 2977
Location: Germany

PostPosted: Sat Oct 15, 2016 1:20 am    Post subject: Reply with quote

1clue wrote:
Edit: GNU tar has a maximum file size of 8GB on the compressed side.


I think you confused that with one of the archaic tar formats ...
russK
l33t


Joined: 27 Jun 2006
Posts: 665

PostPosted: Sat Oct 15, 2016 1:49 am    Post subject: Reply with quote

Are these machines in close physical proximity? I just wonder if sneakernet with a USB 3.0 drive would be faster in the long run.

Regards
Akkara
Bodhisattva


Joined: 28 Mar 2006
Posts: 6702
Location: &akkara

PostPosted: Sat Oct 15, 2016 5:48 am    Post subject: Reply with quote

1clue wrote:
But +1 on something to combine all the small files into one or more big files, preferably compressed, and then sending the big file.


If you have a 10Gb/s connection, I advise against compression. I often find that tar -j can't even saturate a 1Gb connection. tar -z is faster, but I don't know whether it is fast enough not to be the limiting factor in your transfer.
_________________
Many think that Dilbert is a comic. Unfortunately it is a documentary.
Naib
Watchman
Watchman


Joined: 21 May 2004
Posts: 6051
Location: Removed by Neddy

PostPosted: Sat Oct 15, 2016 8:39 am    Post subject: Reply with quote

Akkara wrote:
1clue wrote:
But +1 on something to combine all the small files into one or more big files, preferably compressed, and then sending the big file.


If you have a 10Gb/s connection, I advise against compression. I often find that tar -j can't even saturate a 1Gb connection. Tar -z is faster, but I don't know if it is fast enough to not be the limiting factor in your transfer.
Unless there are literally millions of small files. Then the file access becomes the overhead, not the overall size. This includes the host system and the request to read, BUT also the target machine, where the OS needs to be asked for a filesystem entry for every file.

I ran into this the other day when I tried to copy a 10GB directory with half a million files... With it tarred, it was quicker. Sure, you still need to read and write a table entry during the tar and untar, BUT you don't have network handshaking in the way slowing it down.
_________________
Quote:
Removed by Chiitoo
1clue
Advocate


Joined: 05 Feb 2006
Posts: 2569

PostPosted: Sat Oct 15, 2016 6:34 pm    Post subject: Reply with quote

@frostschutz, I recently tried to make an archive of large database backups and ran into a size limitation, going onto an xfs filesystem. I'm not sure now exactly what limit I hit, but there is one.

@russK, sneakernet is clearly better if that's an option.

@Akkara, the 10Gbps connection depends on actual throughput. 10GbE is highly sensitive to tuning and hardware. Worst case, some 10Gbps Ethernet connections get not much more than 3-4Gbps. That said, not all hardware is equal for disk access or compression.

For example, I have an i7 920 and an Atom C2758. The Atom has hardware support for encryption and compression, and it is faster than the i7 at both, maintaining ~2.5Gbps compressing or encrypting, and more than 2Gbps for both combined. When doing both encryption and compression, the i7 struggles along at 1Gbps. For things like compilation, the i7 beats the pants off the Atom, as you would expect.