Gentoo Forums
using p2p to solve mirror bottleneck for the SOURCES?

 
brazilian_joe
Tux's lil' helper
Joined: 14 Mar 2003
Posts: 99

PostPosted: Mon Jun 02, 2003 5:11 pm    Post subject: using p2p to solve mirror bottleneck for the SOURCES? Reply with quote

Hey, I have already read some threads with people suggesting a p2p system for fetching prebuilt packages to speed up installation, and others arguing about how unfeasible that is due to USE flags, the many different architectures, etc. But how about the sources we fetch while emerging a package? I have faced 6Kbps download speeds on a 600Kbps connection, and this wastes quite some time, especially if we want a security update (or kde). Once the file is on my harddisk, I wouldn't mind uploading it to other gentoo users. I don't know whether it would use (insert favorite p2p here), some distributed filesystem, or its own protocol, but these are some desirable features:

- Ability to download chunks from many places at once, since we want the file to arrive as fast as possible. Is there a distributed filesystem that does this? p2p filesharing apps have it.

- Ability to cap the upload speed. I need my bandwidth for myself; if my portage sharing maxes out my upload speed, I won't use it. It could be emerge-sensitive, so that when I am not emerging anything it has a low cap, but when I emerge something it can push up the cap or even uncap it. Maybe even (advanced) selectively push up the cap depending on the source size.

- People are concerned about security and data integrity. Someone could have altered the sources on his/her machine, maliciously or otherwise (ebuild developer, corrupted download, evil cracker of doom). There should be a system to check the data chunks as they are downloaded, leaving out anyone whose files don't match the expected 'pattern, checksum or whatever'.

I can't code, but I can think and suggest stuff; maybe I will find some time to learn C/C++ soon. Is someone actually doing something about this, or discussing an actual implementation?
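The "emerge-sensitive cap" idea in the second point could be sketched as a token bucket whose allowed rate changes with emerge activity (the class name and the idle/busy rates are invented for illustration; nothing like this exists in portage):

```python
import time

class UploadLimiter:
    """Token-bucket rate limiter: a sketch of the 'emerge-sensitive cap' idea."""

    def __init__(self, idle_rate=8 * 1024, busy_rate=64 * 1024):
        self.idle_rate = idle_rate   # bytes/sec allowed while not emerging
        self.busy_rate = busy_rate   # bytes/sec allowed while emerging
        self.emerging = False        # flipped by emerge start/stop
        self.tokens = 0.0
        self.last = time.monotonic()

    @property
    def rate(self):
        # the cap is pushed up automatically while an emerge is running
        return self.busy_rate if self.emerging else self.idle_rate

    def throttle(self, nbytes):
        """Block until `nbytes` may be uploaded under the current cap."""
        now = time.monotonic()
        # refill the bucket, allowing at most one second of burst
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes > self.tokens:
            time.sleep((nbytes - self.tokens) / self.rate)
            self.tokens = 0.0
        else:
            self.tokens -= nbytes
```

A sharing daemon would call `throttle(len(chunk))` before sending each chunk, and flip `emerging` when emerge starts or finishes.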
Back to top
View user's profile Send private message
Braempje
l33t
Joined: 31 Jan 2003
Posts: 748

PostPosted: Mon Jun 02, 2003 5:18 pm    Post subject: Reply with quote

There are ideas about using bittorrent...
Btw: just emerge mirrorselect and you'll have much faster mirrors. I don't know which mirrors you're using at the moment, but out here in Europe there are plenty of mirrors that are way faster than ibiblio or oregonstate...
_________________
Dictionary of the Flemish Sign Language - Woordenboek Vlaamse Gebarentaal
panserg
Apprentice
Joined: 16 Apr 2003
Posts: 188

PostPosted: Mon Jun 02, 2003 5:38 pm    Post subject: Reply with quote

Braempje wrote:
There are ideas about using bittorrent...


1. I support it. We, gentoo distro users, are not a static system. We are unpredictable. P2P is the only way to adapt to that.

Braempje wrote:
Btw: just emerge mirrorselect, and you'll have much faster mirrors.


2. The fastest route to a mirror doesn't mean the fastest responsiveness of the mirror. I constantly run into the situation where I have to roll back my mirror settings after mirrorselect screws them up by choosing an absolutely unresponsive mirror. That's perhaps because mirrorselect doesn't actually measure the CPU and I/O load of the mirror server itself. It measures ICMP echo, but rsync doesn't use ICMP: it's much more than that.

3. As an alternative to P2P, I suggest abandoning rsync in favor of another protocol that transfers a real difference, not a whole snapshot. Or rsync should be used by emerge more intelligently.

Are there any portage developers reading this thread? Can you do us a favor and clarify your plans?
brazilian_joe
Tux's lil' helper
Joined: 14 Mar 2003
Posts: 99

PostPosted: Mon Jun 02, 2003 6:32 pm    Post subject: My Plans Reply with quote

I am no C/C++ coder; I develop websites. I am studying CS, though, and will develop those skills soon. But I really believe emerge is the best tool for keeping a system up-to-date. Still, there is room for improvement. I only have this idea and thoughts on how it should work, but there are people better than me who can improve/correct these concepts and hopefully implement them. I will elaborate a little more on my idea:

Suppose gentoo.org has a service similar (maybe identical) to bittorrent's tracker: it has info on the packages and their 'magic numbers', and (note: I don't know bittorrent's internal details) suppose we have a 'standard chunk size' such as 64k (wild guess). Each chunk will also be uniquely identified by its own 'magic number', like a checksum (I don't know if checksums can be faked; it doesn't matter now, just follow the idea). So, I (1) emerge sync and then I know the files' and chunks' 'magic numbers'. The 'magic numbers' would have to be a) unique and b) a function of the chunks (is that mathematically possible?). (2) I 'emerge fubar' and receive from the trusted source (gentoo.org or a mirror - we would have to have strict rules for these mirrors so that they can also be trusted) a list of users with this 'fubar' source file, and try to get some chunks from them.
(3)The emerge starts connecting to the other peers.
If the other peer reports a "whole file" magic number different from gentoo.org's (which is my trusted source), I won't even bother trying to download from him.
(4) Every chunk that is downloaded is checked, so if it fails it is discarded. The trusted source is notified both of peers whose chunks are correct and of peers whose chunks are wrong, so that it can sort them out (each file would have an independent 'peer sorting').
This way, I don't have to delete that 20MB file and download it all over again. Just the failed chunk will be refetched.
It would have to work from behind firewalls, without user intervention, and any source I download becomes instantly available to others (at least in the default configuration), preferably even while I am still downloading.
Maybe bittorrent already does this, maybe even better. I don't know, it is just a crazy idea. Is this just 'reinventing the wheel'? Does BitTorrent solve all this? It would still need the glue to become a portage backend. Expert advice welcome. Should I post a wishlist/bug with all this? I think we can talk more about it here in the forum, and if something matures from it, we can file a bug.
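A rough sketch of the per-chunk checking described above (the hash choice and chunk size are illustrative, and none of this is how any portage code works today; BitTorrent does something very similar with SHA-1 over fixed-size pieces):

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # the "standard chunk size" guessed at above

def chunk_digests(data, size=CHUNK_SIZE):
    """Compute the per-chunk 'magic numbers' for a source tarball."""
    return [hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

def verify_chunk(chunk, expected_digest):
    """Check one downloaded chunk against the digest from the trusted source."""
    return hashlib.sha1(chunk).hexdigest() == expected_digest

def assemble(chunks, expected_digests, refetch):
    """Keep good chunks, refetch only bad ones -- no 20MB re-download."""
    out = []
    for i, (chunk, want) in enumerate(zip(chunks, expected_digests)):
        while not verify_chunk(chunk, want):
            chunk = refetch(i)  # ask another peer for just chunk i
        out.append(chunk)
    return b"".join(out)
```

On the "mathematically possible" question: a cryptographic hash is a deterministic function of the chunk, and although distinct chunks with the same hash must exist, deliberately finding one is considered computationally infeasible, which is what makes the scheme workable.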
ferringb
Retired Dev
Joined: 03 Apr 2003
Posts: 357

PostPosted: Mon Jun 02, 2003 8:22 pm    Post subject: Reply with quote

panserg wrote:
3. As an alternative to P2P, I suggest abandoning rsync in favor of another protocol that transfers a real difference, not a whole snapshot. Or rsync should be used by emerge more intelligently.

Eh? Unless I'm misunderstanding what you're stating, rsync *does* take the difference (delta) between the user's portage tree and the master tree - i.e. non-snapshot. Emerge w/ rsync already is pretty much as decent as you can get it, although cvsup could possibly do better.

I'd be curious what complaints you have against rsync, aside from the serious hammering the rsync servers take from a processor standpoint. Also, I'd wonder what you'd use if rsync weren't the de facto choice...
panserg
Apprentice
Joined: 16 Apr 2003
Posts: 188

PostPosted: Mon Jun 02, 2003 9:19 pm    Post subject: Reply with quote

ferringb wrote:
I'd be curious what complaints you have against rsync


Do "emerge rsync" and, once it finishes, do it again. On the second run, measure the traffic going over the network and explain why it is so big, given that nothing really should be synchronized.

I guess rsync will check *EVERY* file for its timestamp before getting a difference. 50,000 files - how many timestamps are transferred? And is it only file timestamps, or directory listings as well? How many folders to list and transfer?

I would prefer real replication, where the main server keeps a serialized transaction log and a replica receives transactions in the same order as on the main server, each time downloading only the new ones (not yet applied on the replica). In other words, I want "emerge sync" to check the serial number and, if it's the same, stop, as there is no need for any replication. The whole discussion about how often we should rsync could be obsolete with correct (transaction-log based) replication.

I understand that with a CVS repository it would be a challenge (but still possible using commit/loginfo hooks), while Bitkeeper might not be an option as it's not exactly free. I wonder whether Aegis or OpenCM or other (more advanced) SCM software was considered instead of CVS+rsync.
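The serial-number check described above could look roughly like this (the log format, the fetch_log RPC, and apply_op are all invented for illustration; portage has nothing like this today):

```python
APPLIED = []

def apply_op(op, path):
    """Placeholder for the actual filesystem update on the replica."""
    APPLIED.append((op, path))

def sync(client_serial, fetch_log):
    """Replay only the transactions the replica hasn't applied yet.

    `fetch_log(after)` is a hypothetical RPC returning the master's log
    entries with serial numbers greater than `after`, in commit order:
    [(serial, op, path), ...].
    """
    entries = fetch_log(after=client_serial)
    if not entries:
        return client_serial      # serials match: nothing to replicate
    for serial, op, path in entries:
        apply_op(op, path)        # e.g. add/update/delete one ebuild
        client_serial = serial    # record progress after each entry,
                                  # so an aborted sync can resume here
    return client_serial
```

Because progress is recorded per entry, a second `sync` with an up-to-date serial transfers nothing, and an interrupted one resumes instead of re-downloading the tree.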
Braempje
l33t
Joined: 31 Jan 2003
Posts: 748

PostPosted: Tue Jun 03, 2003 4:59 am    Post subject: Reply with quote

I personally think this thread is a replica of this thread... Rsync is a problem, but I'm starting to think it's just a matter of someone putting loads of time into a new solution and then trying to convince all the gentoo developers...
I personally have some ideas, but I just don't have the time. Maybe in a couple of months - I'll let you know :)
brazilian_joe
Tux's lil' helper
Joined: 14 Mar 2003
Posts: 99

PostPosted: Tue Jun 03, 2003 4:54 pm    Post subject: Reply with quote

The problem I see is simple in concept: when I 'emerge kde' it takes a loong time, trickling in at 10kbps when I have a 600kbps connection. And when my config'ed server goes down, I have to play with the config and change the mirror before I can emerge the app. But if instead the files are distributed in a p2p fashion, and my box downloads from multiple sources, the files come to me faster. That is the heart of the issue I raised when I started this thread. The objective is to find a solution/improvement, at least in concept, so that the system becomes more resilient against failures (there has been one recently) and updates get to users faster.
ferringb
Retired Dev
Joined: 03 Apr 2003
Posts: 357

PostPosted: Tue Jun 03, 2003 6:26 pm    Post subject: Reply with quote

panserg wrote:
Do "emerge rsync" and, once it finishes, do it again. On the second run, measure the traffic going over the network and explain why it is so big, given that nothing really should be synchronized.

Well, think about what rsync is doing (and this is what makes it good sh*t) - first off, it must compactly represent what the client has, and the server must do the same. It does this by generating checksums of the data, with the client uploading them to the server... from there, the server figures out what must be sent back.
Basically, while you know it doesn't need to be synchronized because you just sync'ed the two, rsync doesn't, and I think you'd be hard pressed to find/create a setup that could know that. Yes, in a scenario where you're attempting to rsync immediately after a successful update, it's not optimal, because it checks everything.
On the other hand, when you're not doing something atypical, it does quite well, since it can identify only what has changed. Pros and cons, I guess...
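The core of that exchange can be sketched roughly like this (grossly simplified: real rsync also computes a cheap rolling checksum at every byte offset and confirms matches with a strong hash, rather than only comparing at block boundaries; the block size and hash choice here are illustrative):

```python
import hashlib

BLOCK = 700  # rsync picks a block size per file; this value is illustrative

def signature(data):
    """What the client sends: one checksum per block of its copy."""
    return {hashlib.md5(data[i:i + BLOCK]).hexdigest(): i
            for i in range(0, len(data), BLOCK)}

def delta(server_data, client_sig):
    """What the server sends back: block references where the client
    already has the data, literal bytes where it doesn't."""
    out = []
    for i in range(0, len(server_data), BLOCK):
        block = server_data[i:i + BLOCK]
        digest = hashlib.md5(block).hexdigest()
        if digest in client_sig:
            out.append(("ref", client_sig[digest]))  # client has this block
        else:
            out.append(("raw", block))               # must send literally
    return out
```

Even in the "nothing changed" case, the client still has to compute and upload a signature for every file - which is exactly the traffic panserg measured on the second run.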

panserg wrote:
I would prefer to have a real replication, when the main server keeps a serialized transaction log and a replica recieves transactions in a same order as on the main server each time downloading only new (not applied on the replica yet) ones. In other words, I want "emerge sync" to check the serial number and if it's the same - stop as no need for any replication. Whole discussion about how often should we rsync could be obsolete with a correct (transaction-log based) replication.

It just dawned on me while trying to respond to your post: one could probably cut down on the overhead by rsync'ing against a log that details what files have been inserted into the tree - I say rsync since the log would be large, and rsync could easily identify only what has changed (typically the tail of it). From there, identify where the last 'emerge sync' left off in the list, and work your way to the tail of the file. Alternatively, break the log into multiple files, by month/day of updates. I could see such a setup working nicely, since w/ the log info you're not dealing w/ each file, just a log of the changes to the tree. Much, much less is required to identify what has changed/been added.

As for doing some type of serialized stamp on the portage tree, that method (imo) would fail/go worst-case under too many common conditions - what if someone is syncing up and the connection dies? If you're breaking the portage tree updates into individual states via a serial id method, then for a failed sync you'd have to either A) roll back all changes made so you match a known serial id, or B) come up w/ some method to identify a tree that is mostly a certain id, but isn't completely in that state.
That's kind of a cruddy description (low on time), but I hope you see what I mean...


Last edited by ferringb on Tue Jun 03, 2003 7:15 pm; edited 1 time in total
Braempje
l33t
Joined: 31 Jan 2003
Posts: 748

PostPosted: Tue Jun 03, 2003 6:50 pm    Post subject: Reply with quote

As far as I'm concerned you just have bad mirrors. I download everything at a speed of at least 300 kB/s. Just try finding a mirror in your neighbourhood, there's always a server around...
ferringb
Retired Dev
Joined: 03 Apr 2003
Posts: 357

PostPosted: Tue Jun 03, 2003 7:13 pm    Post subject: Reply with quote

brazilian_joe wrote:
The problem I see is simple in concept: when I 'emerge kde' it takes a loong time, trickling in at 10kbps when I have a 600kbps connection.

Well, rather than switching to a different distribution method, why not just modify emerge so that it kills the connection attempt and tries a different mirror if the download speed isn't satisfactory? Via wget we can already specify a maximum download rate; we could attempt to enforce a minimum download rate too.
This would likely require modification to wget, but it seems a simpler solution than attempting to get a p2p setup going.
brazilian_joe wrote:
And when my config'ed server goes down, I have to play with the config and change the mirror before I can emerge the app.

Specifying multiple mirrors in GENTOO_MIRRORS ought to accomplish what you're talking about - the only downside is you may have to suffer the timeout period. The functionality is there, so why not just tweak the timeout values/mirrors to deal w/ that? E.g., GENTOO_MIRRORS has your personal mirror specified first, followed by a listing of the other mirrors it should try should yours fail...
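For instance, something along these lines in /etc/make.conf (the mirror URLs are placeholders, and the exact default FETCHCOMMAND differs between portage versions - treat this as a sketch, not the shipped defaults):

```shell
# /etc/make.conf -- mirror URLs below are placeholders.
# First mirror is preferred; emerge falls down the list on failure.
GENTOO_MIRRORS="http://my.local.mirror/gentoo http://distfiles.gentoo.org"

# Fail over faster: give up on a stalled connection after 15 seconds
# and retry each URI only twice before moving to the next mirror.
FETCHCOMMAND="/usr/bin/wget --timeout=15 --tries=2 -O \${DISTDIR}/\${FILE} \${URI}"
RESUMECOMMAND="/usr/bin/wget -c --timeout=15 --tries=2 -O \${DISTDIR}/\${FILE} \${URI}"
```

With a short `--timeout`, the "suffer the timeout period" cost of a dead first mirror shrinks from minutes to seconds.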
brazilian_joe wrote:
if instead the files are distributed in a p2p fashion, and my box downloads from multiple sources, the files come to me faster. That is the heart of the issue I raised when I started this thread. The objective is to find a solution/improvement, at least in concept, so that the system becomes more resilient against failures (there has been one recently) and updates get to users faster.

In terms of resilience against failure, the multiple mirror setup is fairly tolerant. As I recall, it was just oregon state that defib'd- unless I'm completely on crack, I don't recall ever (personally) being affected by it, and if I was my setup probably fell back to the purdue mirror I sometimes use.
panserg
Apprentice
Joined: 16 Apr 2003
Posts: 188

PostPosted: Wed Jun 04, 2003 1:18 am    Post subject: Reply with quote

I see there are people suffering from bad mirrors and from inability to choose proper mirrors.

As I mentioned already, without p2p there is no way to improve the situation. Mirrorselect measures the network, while mirrors are slow because they are overloaded. But if people all select the better mirror and switch to it together, the new one will be dead in no time. We need emerge to choose a mirror from the list *dynamically*, based on a health report of briefly measured performance of all mirror servers, published statically on the web. That is the way to distribute the load smoothly.
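A sketch of that dynamic selection, assuming a hypothetical published health feed mapping each mirror URL to a load figure between 0.0 (idle) and 1.0 (saturated) - no such feed exists today, and pick_mirror is an invented name:

```python
import random

def pick_mirror(health_report):
    """Choose a mirror at random, weighted inversely by its reported load.

    `health_report` is the hypothetical published feed: {mirror_url: load}.
    Weighting (rather than always taking the least-loaded mirror) avoids
    the stampede where everyone switches to the same "best" mirror at once.
    """
    weights = {url: 1.0 - load for url, load in health_report.items()
               if load < 1.0}          # skip saturated mirrors entirely
    if not weights:
        raise RuntimeError("all mirrors report saturation")
    total = sum(weights.values())
    r = random.uniform(0, total)
    for url, w in weights.items():
        r -= w
        if r <= 0:
            return url
    return url  # floating-point slack: fall through to the last mirror
```

Each client rolls its own weighted die, so the aggregate load spreads smoothly instead of piling onto one server.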

Perhaps emerge should adopt a smarter way of measuring the current download performance and switching from mirror to mirror. However, how will that help with "emerge rsync" without rsyncing to a temporary staging area first? If "emerge rsync" is aborted, you'll end up with a broken tree - what could be worse? Syncing the tree must be smarter.

Another way to improve the situation is to adopt something better than rsync, something based on transaction-log replication. Even with CVS it is possible (loginfo, commitinfo and similar hooks). But unfortunately many developers are addicted to rsync, so there's no way we can change anything here.

It may look like I am wasting busy people's time. However, as Gentoo's popularity grows, the problem we are discussing today will kill many mirrors and make many new Gentooers very frustrated. I suggest it's better to think about it and do something now rather than when it's too late.

Other distros do not have this problem. Most other Linux distros are not source-based and/or they don't rsync. The FreeBSD user base is very small, and besides, the more stable FreeBSDers do not sync twice a day as we do :)
ferringb
Retired Dev
Joined: 03 Apr 2003
Posts: 357

PostPosted: Wed Jun 04, 2003 3:32 am    Post subject: Reply with quote

panserg wrote:
As I mentioned already, without p2p there is no way to improve the situation. Mirrorselect measures the network, while mirrors are slow because they are overloaded. But if people all select the better mirror and switch to it together, the new one will be dead in no time. We need emerge to choose a mirror from the list *dynamically*, based on a health report of briefly measured performance of all mirror servers, published statically on the web. That is the way to distribute the load smoothly.

Seems like overkill to specifically measure each mirror at each emerge attempt (or every x attempts, or via stats on some page) - why not just use a round-robin DNS method? I may have the terminology/concept wrong on this since I've never personally worked with it, but something akin to how they do the portage mirror setup: basically contacting a central point that sends you off to a random mirror.
For the normal user who doesn't hardcode a mirror, the load gets distributed... the downside is I think there is a limit to the number of sites a round-robin setup can direct people to. At the very least, in the interim before attempting a p2p setup, it would lessen the mirror-killing you're describing.
Related: are the mirrors really getting as hammered as you say? Aside from the occasional issue, I haven't really gotten the impression that the mirrors were getting seriously overloaded - a good working-over, yes, but I wonder if it's at the critical level you imply.
panserg wrote:
However, how will that help with "emerge rsync" without rsyncing to a temporary staging area first? If "emerge rsync" is aborted, you'll end up with a broken tree - what could be worse? Syncing the tree must be smarter.

Another way to improve the situation is to adopt something better than rsync, something based on transaction-log replication. Even with CVS it is possible (loginfo, commitinfo and similar hooks). But unfortunately many developers are addicted to rsync, so there's no way we can change anything here.

Could you explain what you mean by the whole 'rsyncing to a temporary stage-area' thing? Are you referring to the issues I mentioned w/ doing the whole serial id thing?
I'd also be curious exactly what you have in mind by 'transaction-log-replication based' - for example, how?
Personally, and people are free to tear it to shreds, the traffic for an 'emerge sync' could be lessened by syncing against a log of portage additions/deletions/changes. What I could see doing, and aside from the changes needed I can't see any problems with it, is this:
1) Create a log of portage changes, with a possible timestamp for each change. Have emerge sync first rsync (or whatever method) against said file.
2) Check the portage tree against the log, downloading/updating as needed based on where the client's portage tree is in comparison to the official gentoo tree.

The downside to this is possibly what you were talking about - it assumes the tree is in a valid state, i.e. it completely matches up to a certain point in the change log. One could, and I'd think there is a better method, either maintain some record of what tree changes were successfully completed (a commit log), allowing a crashed update to the tree to be completed. Or...
Have portage keep track of some initial state and work its way through the log, verifying that each transaction was completed - if not, apply the change.
In hindsight, the only thing I'm unsure of in your quote is the portion about replication.
Also, I wouldn't call it an rsync addiction - rsync is an efficient *general* method for doing versioning over a network connection. In the general case it works quite well; that isn't saying it is the most *optimal* method for this specific case though... so there likely could be a better method for grabbing the portage tree, but right now rsync works well since nothing else has been implemented.
panserg wrote:
All other distros do not have such a problem. Most of other Linux distros are not source code based and/or they don't rsync. FreeBSD user base is very small, besides, more stable FreeBSDers do not sync twice a day as we do :)

In terms of being source-based/rsyncing, most other distros are released purely in version snapshots, while gentoo/debian aren't really released in versions (speaking of the apps, not major changes like a gcc release that spurs gentoo 1.4 or whatnot).
Soo... how do debian and freebsd do their updating/searching? For debian, it always seemed like you queried the server whenever attempting something, which (imo) would be a heavier load on the servers than what we're doing currently - if for every package we wanted to install we had to query a server, that server would be brought to its knees quite quickly.

I also wonder how you'd do this p2p setup - I get that it's to be p2p based, but I'm asking for specifics on how you'd set such a thing up.
Genone
Retired Dev
Joined: 14 Mar 2003
Posts: 9530
Location: beyond the rim

PostPosted: Wed Jun 04, 2003 9:15 pm    Post subject: Reply with quote

Sorry, I haven't read all the posts, so maybe this is old information, but if my memory is right carpaski is working on some bittorrent stuff for portage (or wanted to). On the other hand, my memory is very vague :wink:
But at least on the -dev list there were some positive dev opinions on that issue.