using p2p to solve mirror bottleneck for the SOURCES?

brazilian_joe · Tux's lil' helper Joined: 14 Mar 2003 Posts: 99

Hey, I have already read some threads with people suggesting a p2p system of fetching prebuilt packages to speed up installation, and others arguing about how unfeasible it is due to USE flags and many different architectures etc. But how about the sources we fetch while emerging a package? I have faced 6Kbps download speed, having a 600 Kbps connection, and this wastes quite some time, esp. if we want a security update (or kde). Once the file is on my harddisk, I wouldn't mind uploading it to other gentoo users. I don't know if it would be using (insert favorite p2p) or some distributed filesystem or its own protocol, but these are some desirability factors:

- Ability to download chunks from many places at once, since we want the file to arrive as fast as possible. Is there a distributed filesystem that does this? p2p filesharing have this.

- Ability do cap upload speed. I need my bandwidth for myself, if my portage sharing maxes out my upload speed, i wont use it. It could be emerge sensitive, so that when I am not emerging anything, it has a low cap, but when I emerge something, it can push up the cap or even uncap it. Maybe even (advanced) selectively pushing up the cap depending on the source size.

- People are concerned on security and data integrity. Someone could have altered sources in his/her machine, maliciously or otherwise (ebuild developer, corrupted download, evil cracker of doom). There should be a system to check the data chunks as they are downloaded, and leaving out someone whose files dont match the expected 'pattern, checksum or whatever'.

I cant code, but I can think and suggest stuff, maybe I will find some time to learn c/c++ soon. Is someone actually doing some thing about it, or discussing the actual implementation for this?

Braempje · l33t Joined: 31 Jan 2003 Posts: 748

There are ideas about using bittorrent...
Btw: just emerge mirrorselect, and you'll have much faster mirrors. I don't know which mirrors you use for the moment, but out here in Europe there are plenty of mirrors that are way faster than ibiblio or oregonstate...
_________________
Dictionary of the Flemish Sign Language - Woordenboek Vlaamse Gebarentaal

panserg · Apprentice Joined: 16 Apr 2003 Posts: 188

brazilian_joe · Tux's lil' helper Joined: 14 Mar 2003 Posts: 99

I am no C/C++ coder. I develop websites. Though, I am studying CS, and will develop these skills soon. But I really believe emerge it the best tool for mantaining a system up-to-date. Still, there is room for improvement. I only have this idea and think how it should work, but there are people better than me who can improve/correct those concepts and hopefully implement it. I will elaborate a little more on my idea:

Suppose gentoo.org has a service similar (maybe equal) to bittorrent's tracker, it has info on the packages, their 'magic number', and (note: I dont know bittorrent's internal details) suppose we have 'standard chunk size' such as 64k (wild guess). Each chunk will also be uniquely identified with its 'magic number', like a checksum (dont know if checksums can be faked, it does not matter now, just follow the idea). So, I (1) emerge sync and then I know the file's and chunks' 'magic numbers'. the 'magic numbers' would have to be a)unique and b)a function of the chunks (is it mathematically possible?). (2)I 'emerge fubar' and receive from the trusted source (gentoo.org or a mirror, we would have to have strict rules for these mirrors so that they can also be trusted) a list of users with this 'fubar' source file, and try to get some chunks from them.
(3)The emerge starts connecting to the other peers.
If the other peer gives me a "whole file" magic number different from gentoo.org (which is my trusted source), I wont even bother trying to download from him.
(4)Every chunk which is downloaded is checked, so if it fails it is discarded. The trusted source is notified of both peers which chunks are correct and peers which chunks are wrong, so that it sorts them out (each file would have an independent 'peer sorting').
This way, i dont have to delete that 20MB file and download it all over again. Just the failed chunk will be refetched.
It would have to work from behind firewalls, without user intervention, and any source I download becomes instantly available for others (at least on the default configuration), preferrably even while I am still downloading .
Maybe bittorrent already does this, maybe even better. i dont know, it is jut a crazy idea. Is this just 'reinventing the wheel'? Bit Torrent solves all this? It would still need the glue to become a portage backend. Expert advice welcome. Should I post a wishlist/bug with all this? I think we can talk more about it here in the forum, and if something matures from it, we can file a bug.

ferringb · Retired Dev Joined: 03 Apr 2003 Posts: 357

panserg · Apprentice Joined: 16 Apr 2003 Posts: 188

Braempje · l33t Joined: 31 Jan 2003 Posts: 748

I personally think this thread is a replica of this
thread... Rsync is a problem, but I'm starting to think that it's just a matter of someone putting loads of time in a new solution, and then trying to convince all gentoo developpers...
I personally have some ideas, but I just don't have the time. Maybe in a couple of months, I'll let you know

_________________
Dictionary of the Flemish Sign Language - Woordenboek Vlaamse Gebarentaal

brazilian_joe · Tux's lil' helper Joined: 14 Mar 2003 Posts: 99

The problem I see is simple in its concept: when I 'emerge kde' it takes a loong time trickling at 10kbps when I have 600kbps connection. And when my config'ed server goes down, I have to play with the config and change the mirror to emerge the app. but, if instead the files are distributed in a p2p fashion, and my box downloads from multiple sources, the files come to me faster. That is the heart of the issue I proposed when I started this thread. The objective is to find a solution/improvement , at least in concept, so that the system can become more resilient against failures (there has been one recently), and that the updates can get to the user faster.

ferringb · Retired Dev Joined: 03 Apr 2003 Posts: 357

Braempje · l33t Joined: 31 Jan 2003 Posts: 748

As far as I'm concerned you just have bad mirrors. I download everything at a speed of at least 300 kB/s. Just try finding a mirror in your neighbourhood, there's always a server around...
_________________
Dictionary of the Flemish Sign Language - Woordenboek Vlaamse Gebarentaal

ferringb · Retired Dev Joined: 03 Apr 2003 Posts: 357

panserg · Apprentice Joined: 16 Apr 2003 Posts: 188

I see there are people suffering from bad mirrors and from inability to choose proper mirrors.

As I mentioned already, without p2p there is no way to improve the situation. Mirrorselect measures a network, while mirrors are slow becase they are overloaded. But, if people select a better mirror and altogether switch to it - the new one will be dead in no time. We need a way when emerge will choose the mirror from the list *dynamically* after getting the health report of briefly measured performance of all mirror servers and statically published on the web. That's the way to distribute the load smoothly.

Perhaps, emerge should adopt smarter way of measuring the current download performance and switching from mirror to mirror. However, how will it help with "emerge rsync" without rsyncing to temporary stage-area? If "emerge rsync" will be aborted you'll end up with a broken tree - what could be worse? Syncing the tree must be smarter.

Another way to improve the situation is to adapt something better than rsync, something transaction-log-replication based. And even with CVS it is possible (loginfo, commitinfo and similar hooks). But unfortunately many developers are addicted to rsync. So, no way we can change anything here.

I may look like I am wasting time of busy people. However, once the popularity of Gentoo will grow the problem we discuss today will kill many mirrors and it will make many new Gentooers very frustrated. I suggest it's better to think and to do something about it now rather then when it will be too late.

All other distros do not have such a problem. Most of other Linux distros are not source code based and/or they don't rsync. FreeBSD user base is very small, besides, more stable FreeBSDers do not sync twice a day as we do

ferringb · Retired Dev Joined: 03 Apr 2003 Posts: 357

Genone · Posted: Wed Jun 04, 2003 9:15 pm Post subject:

Sorry, haven't read all posts so maybe this is old information, but if my memory is right carpaski is working on some bittorrent stuff for portage (or wanted to). On the other hand, my memory is very vague :wink:

But at least on the -dev list there were some positive dev-opinions on that issue.