Gentoo Forums
Hardware for Octave on Gentoo

grosmano
n00b
Joined: 02 Jul 2012
Posts: 12

Posted: Tue Jan 15, 2013 6:26 am    Post subject: Hardware for Octave on Gentoo

Hello,

I am considering getting a new desktop computer to replace the one I am currently using, which now runs Octave programs 24 hours a day. I would therefore like to select the new motherboard and processor with this application in mind. The rest is conventional office use, so I suppose I don't need to take it into account: no gaming or multimedia, and no 3D rendering either.

From what I could observe, I am limited not by RAM but by the speed of the CPU. So I suppose that in my case the higher the CPU frequency, the faster the calculation will be. What I don't know is whether some other parameter might become the limiting factor: maybe some feature of the processor or the motherboard, like a specific bus speed?
I am also not sure whether it is a good idea in my case to increase the number of cores. From what I understand, it should not especially help a single demanding program run faster. Maybe it could become useful if I run several of them in parallel? This might be an option when I want to vary some input parameter of the program.

Thanks for reading this post. A simple confirmation of these assumptions, or corrections on what I should really focus on, would be greatly appreciated.

Hypnos
Advocate
Joined: 18 Jul 2002
Posts: 2861
Location: Omnipresent

Posted: Tue Jan 15, 2013 6:43 am

That's the usual starting point for scientific computation: parallelize what can be parallelized, have the fastest CPU core you can afford for each process, and install enough RAM for all of them with the fastest bus speed you can afford.
_________________
Personal overlay | Simple backup scheme

aCOSwt
Moderator
Joined: 19 Oct 2007
Posts: 2389
Location: Hilbert space

Posted: Tue Jan 15, 2013 8:03 am    Post subject: Re: Hardware for Octave on Gentoo

grosmano wrote:
I am also not sure if it is a good idea in my case to multiply the number of cores, from what i understood it should not especially help a single demanding program to run faster.

It will depend on the libraries your single demanding program is linked to.

As an example, blas is likely to be one of them.
If you choose the threaded blas-atlas implementation, then the more cores you have, the faster your single demanding program will run.
Because this one is now masked on Gentoo, you may well be relying on blas-reference, which AFAIK is not threaded, so increasing the number of cores won't help much.
I use libshogun with Octave, and the same goes for shogun, depending on how you built it (specifically, on the setting of --enable-hmm-parallel).

I basically agree with Hypnos (parallelize as much as you can). If you can't, then your single demanding program won't benefit from more cores. If you can, then it will benefit significantly.

EDIT: In any case, prefer the CPUs offering the largest caches.
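
The advice to parallelize as much as possible is often summarized by Amdahl's law: the achievable speedup is capped by the part of the run time that stays serial. The following is only a minimal Python sketch added for illustration, not something posted in this thread; the function name and the example fractions are assumptions.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the run time scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative fractions only; measure your own program to get real ones.
    for fraction in (0.0, 0.5, 0.9):
        for cores in (2, 4, 8):
            print(f"parallel={fraction:.1f} cores={cores}: "
                  f"x{amdahl_speedup(fraction, cores):.2f}")
```

With a parallel fraction of 0 (for instance a program linked only against non-threaded libraries), extra cores give no speedup at all under this model.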

grosmano
n00b
Joined: 02 Jul 2012
Posts: 12

Posted: Tue Jan 15, 2013 7:59 pm

Thank you Hypnos and aCOSwt! So I will focus first on CPU frequency, bus speed and cache size. Then, depending on the number of cores, I would run multiple copies of the same program with different parameters.

aCOSwt wrote:
grosmano wrote:
I am also not sure if it is a good idea in my case to multiply the number of cores, from what i understood it should not especially help a single demanding program to run faster.

It will depend on the libraries your single demanding program is linked to.

As an example, blas is likely to be one of them.
If you preferred the threaded blas-atlas implementation then the more cores you get, the faster your single demanding program is.
Because this one is now masked on Gentoo, you might well rely on the blas-reference which AFAIK is not threaded => Increasing the number of cores won't help much.
I use the libshogun with octave and the same goes with shogun depending on how you built it (depending on the setting of --enable-hmm-parallel)

[...]

I didn't know it was possible to build threaded libraries; it should be very interesting for me in some cases. Indeed, blas-reference is the one currently installed on my system. So it seems the next step would be to check, among Octave's dependencies, which libraries provide the functions I need most and see whether some of them can be built with an --enable-hmm-parallel-like parameter. All this is rather new to me, so I will first use the non-threaded versions and parallelize "by hand", but I will start looking into all this and try to switch in the future.
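
Parallelizing "by hand", that is, running several copies of the same program with different parameters, can be scripted. The following Python sketch is one possible way to do it and is not taken from the thread. The Octave function name `mytest` is hypothetical, and the `octave --eval` invocation is an assumption about how the program would be started; a Python stand-in command is used so the sketch runs without Octave installed.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def octave_command(param):
    """Hypothetical command line: calls an assumed Octave function mytest(param)."""
    return ["octave", "--eval", f"mytest({param})"]


def demo_command(param):
    """Stand-in command so the sketch runs without Octave installed."""
    return [sys.executable, "-c", f"print({param} ** 2)"]


def run_sweep(params, make_command, workers):
    """Run one external process per parameter, at most `workers` at a time."""
    def run_one(param):
        done = subprocess.run(make_command(param), capture_output=True, text=True)
        return param, done.returncode, done.stdout.strip()

    # Threads only wait on the child processes; the real work runs in the
    # children, so each one can occupy its own core.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, params))


if __name__ == "__main__":
    for param, code, out in run_sweep([1, 2, 3, 4], demo_command, workers=2):
        print(param, code, out)
```

Passing `octave_command` instead of `demo_command` would launch Octave, and setting `workers` to the number of cores keeps one process per core.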

krinn
Advocate
Joined: 02 May 2003
Posts: 3937

Posted: Wed Jan 16, 2013 12:18 am

Well, speaking of cores, Intel and AMD have added a frequency boost that kicks in when other cores are idle. So these days you should really think about how you will use your cores. If you use them for multiple runs of your program, fine: that will do better than one core. But if you keep the other cores busy with pointless work, you will actually lose speed.
You should look for a CPU that supports this (i7 or newer). For AMD I don't really know, but it should be easy to find which AMD CPUs have the feature too.
Intel calls it Turbo Boost.
AMD calls it Turbo Core.
http://www.pcauthority.com.au/Feature/173700,pc-building-intels-turbo-boost-vs-amds-turbo-core.aspx

grosmano
n00b
Joined: 02 Jul 2012
Posts: 12

Posted: Thu Jan 17, 2013 12:59 am

Thank you krinn, I will pay attention to this too.

aCOSwt
Moderator
Joined: 19 Oct 2007
Posts: 2389
Location: Hilbert space

Posted: Thu Jan 17, 2013 6:49 am

grosmano wrote:
I didn't know about this possibility to build threaded libraries, it should be very interesting for me in some cases. Indeed, blas-reference is the one which is currently installed on my system.

So what you can easily do (if you already have a multicore CPU) to find out whether, in your particular case, you can expect a significant gain from more cores is:

1/ Unmask the sci-libs/lapack-atlas and sci-libs/blas-atlas packages.
(They are masked in the profile because of some "fragile build and runtime behaviour" that I have never experienced.
If you don't know how to unmask such a package, just ask.)

2/ emerge sci-libs/lapack-atlas
(Warning! Because of the code's optimization process, you *must* have CPU throttling disabled when emerging: set your governor to performance.
Additionally, the build time is quite long: not far from 3 hours on my Core 2 Duo.)

3/ eselect blas list and eselect lapack list should now show several possible choices for those libs.
For both, eselect the threaded atlas choice.

Now start your single demanding program and draw your conclusions!
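
Before starting the long ATLAS build in step 2, it may help to confirm that every core really uses the performance governor. The following Python sketch is an illustration added here, not part of the original instructions. It assumes the kernel exposes cpufreq through the usual sysfs files under /sys/devices/system/cpu; the function names are made up for this example.

```python
import glob
import os


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU exposing a cpufreq governor."""
    governors = {}
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        cpu_name = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as handle:
            governors[cpu_name] = handle.read().strip()
    return governors


def all_performance(governors):
    """True only if at least one CPU was found and all use 'performance'."""
    return bool(governors) and all(g == "performance" for g in governors.values())


if __name__ == "__main__":
    found = read_governors()
    for cpu, governor in found.items():
        print(cpu, governor)
    print("all cores on performance governor:", all_performance(found))
```

If no governor files are found (for example when cpufreq is not enabled in the kernel), the check reports False rather than guessing.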

grosmano
n00b
Joined: 02 Jul 2012
Posts: 12

Posted: Mon Jan 21, 2013 11:49 pm

aCOSwt wrote:

So what you can easily do (if you already get a multicore CPU) in order to know if, in your particular case, you can expect any significant win by increasing the number of cores is :

1/ Unmask the sci-libs/lapack-atlas and sci-libs/blas-atlas packages
(They are masked in profile because of some "fragile build and runtime behaviour" that I never experimented.)
If you don't know how to unmask such a package, just ask)

2/ emerge sci-libs/lapack-atlas
(Warning ! Because of the optimization process of the code, you *must* have cpu throttling disabled when emerging. (set your governor to performance)
Additionally, the build time is quite huge (not far from 3 hours on my core II duo)

3/ eselect blas list and eselect lapack list should now show several possible choices for those libs.
On both, eselect the threaded atlas choice.

Now start your single demanding program and conclude !

Here are the results I obtained after following your instructions on a Core 2 Duo and running a few tests with two typical programs I use. I made sure that only one session was open, with Octave running, and that xdm was stopped.

The first test was an iterative process in which a 100 x 100 array is scanned in random order. At each iteration every value in the array is changed as a function of its neighbours' values. This involves mainly tests and value assignments, and I did not see any improvement in this case.

The second test requires the inversion of several 2,000 x 2,000 complex matrices. Here I was impressed: with atlas-threads for blas and atlas for lapack, it takes 45% of the time it takes with reference selected for both (1h15 vs 2h45; unlike for blas, I don't have atlas-threads in the eselect list for lapack). This surprised me, because I thought dividing the time by the number of cores would be an ideal, unattainable case, yet the ratio is actually higher than 2, which is really good news. The biggest surprise had come earlier, when I had forgotten to change the link to the lapack library: with atlas-threads for blas and reference for lapack, the same calculation takes only 22 min, a decrease of more than 85%! On the other hand, the first test is significantly slower in this configuration (+25%), so there is a strong interaction between the two settings.
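
The percentages quoted for the second test can be rechecked from the three reported times. The following Python sketch only redoes that arithmetic; the helper names are made up for this illustration, and the times are the ones given in the post (2h45, 1h15 and 22 min).

```python
def minutes(hours, mins):
    """Convert an h:mm duration to minutes."""
    return hours * 60 + mins


def compare(baseline_min, new_min):
    """Return (fraction of baseline time, speedup factor, percent decrease)."""
    fraction = new_min / baseline_min
    return fraction, baseline_min / new_min, (1.0 - fraction) * 100.0


if __name__ == "__main__":
    reference = minutes(2, 45)        # reference blas + reference lapack
    atlas_both = minutes(1, 15)       # atlas-threads blas + atlas lapack
    atlas_blas_only = minutes(0, 22)  # atlas-threads blas + reference lapack
    for label, t in (("atlas blas + atlas lapack", atlas_both),
                     ("atlas blas + reference lapack", atlas_blas_only)):
        fraction, speedup, decrease = compare(reference, t)
        print(f"{label}: {fraction:.0%} of reference time, "
              f"x{speedup:.1f} speedup, {decrease:.0f}% decrease")
```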

aCOSwt
Moderator
Joined: 19 Oct 2007
Posts: 2389
Location: Hilbert space

Posted: Tue Jan 22, 2013 7:12 am

grosmano wrote:
Here i was impressed because with atlas-threads for blas and atlas for lapack it takes 45 % of the time it takes with reference selected for both (time values are 1h15 vs 2h45

Hey... in the end... you might not need to buy any new hardware after all... :D
grosmano wrote:
It is surprising to me because i thought decreasing the time by the number of cores would have been an ideal and unattainable case but the ratio is actually higher than 2

Be careful with this quick interpretation. Of course the ratio is > 2, because threaded blas-atlas is not simply blas-reference's code made threaded.
It is much more than that. ATLAS implements a huge number of optimizations over blas-reference, and not all of the gain you see comes from splitting the work over your 2 cores.

An additional interesting experiment would be to run the same test (the second one) again, this time selecting the non-threaded atlas library for blas, and compare.
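
The experiment proposed above amounts to splitting the total speedup into two factors: one from the better library and one from threading. A minimal Python sketch of that bookkeeping follows. It is an illustration only; the function name is invented and the timings in the demo are hypothetical, not measurements from this thread.

```python
def split_speedup(t_reference, t_optimized_serial, t_optimized_threaded):
    """Split the total speedup into a library factor and a threading factor."""
    library_factor = t_reference / t_optimized_serial
    threading_factor = t_optimized_serial / t_optimized_threaded
    return library_factor, threading_factor, library_factor * threading_factor


if __name__ == "__main__":
    # Hypothetical timings in minutes, NOT measurements from this thread.
    lib, thr, total = split_speedup(100.0, 40.0, 25.0)
    print(f"library x{lib:.2f}, threading x{thr:.2f}, total x{total:.2f}")
```

A threading factor close to 1 would mean that nearly all of the gain comes from the library's optimizations rather than from the second core.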

grosmano
n00b
Joined: 02 Jul 2012
Posts: 12

Posted: Thu Jan 24, 2013 12:44 am

aCOSwt wrote:
An additional interesting experiment would be to make the same test (The second one) but now selecting the non-threaded-atlas library for blas and compare.

I tried and obtained the same result (~3% faster with the non-threaded atlas). It seems the gain I observe is due only to the improvements in the library itself, and I get no benefit from threading. In a way, this makes the difference between blas-reference and blas-atlas even bigger. I think I will run it again, twice in parallel, with the non-threaded atlas and check whether the time per matrix inversion stays in the same range or becomes about twice as long or more. If it is twice or more, I guess I could conclude that another parameter is limiting, like maybe cache size.

aCOSwt wrote:

Hey... finally... you might not been in a need to buy some new hardware... :D

This was actually my computer at home, whereas the one I will replace is in my office (slightly higher frequency and single core). But it will already help, since I often do old-school parallelizing and run calculations on the home machine as well. In the meantime I think I will keep the reference libraries at work, since I will be running "type one" programs there almost exclusively. (No, no, I am not looking for excuses to change the hardware anyway :) )

grosmano
n00b
Joined: 02 Jul 2012
Posts: 12

Posted: Sun Jan 27, 2013 8:42 pm

Here are some new data obtained after a few more "type 2" tests. First, the test I was considering above: two matrix inversions in parallel with unthreaded atlas. This took 1.9 times as long as a single inversion, which is not >= 2 but close enough for me to believe the limit is some parameter other than the number of operations the CPU can perform in a given time.
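
One possible reading of the 1.9x figure is that the two inversions compete for cache and memory bandwidth rather than for arithmetic units. A rough size estimate supports checking this: a complex double-precision value (Octave's default numeric type) takes 16 bytes, so the matrix size follows directly. The Python sketch below is an illustration under stated assumptions; the 4 MiB cache size is a placeholder, not the size of the CPU used in this thread.

```python
BYTES_PER_COMPLEX_DOUBLE = 16  # two 8-byte doubles: real and imaginary parts


def matrix_bytes(n, bytes_per_element=BYTES_PER_COMPLEX_DOUBLE):
    """Memory needed to hold one dense n x n matrix."""
    return n * n * bytes_per_element


def fits_in_cache(n, cache_bytes, copies=1):
    """True if `copies` dense n x n complex matrices fit in the given cache."""
    return copies * matrix_bytes(n) <= cache_bytes


if __name__ == "__main__":
    assumed_cache = 4 * 1024 * 1024  # ASSUMPTION: 4 MiB; check your own CPU
    size = matrix_bytes(2000)
    print(f"2000 x 2000 complex: {size / 2**20:.1f} MiB")
    print("fits in assumed cache:", fits_in_cache(2000, assumed_cache))
```

A single 2,000 x 2,000 complex matrix needs 64,000,000 bytes, far more than a cache of a few MiB, so both processes would have to stream data from main memory.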

So I tried again with an 8 times smaller matrix:
- A single run with unthreaded atlas is then 100 times faster. I have no particular reason to believe inversion time would be proportional to matrix size, but this huge difference makes me believe the limit I observed before is not yet reached at this matrix size.
- Two runs in parallel with unthreaded atlas still take the same time as one, instead of taking twice as long. This tends to confirm that the previous limit is not reached.
- But a single run with threaded atlas also takes the same time. So it seems to me that either something is wrong with my system (kernel config? USE flags? anything else?), so that threading is somehow not supported, or there is a second physical limit that prevents the two cores from improving performance.
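
On how inversion time should scale with size: dense matrix inversion through LU factorization costs on the order of n^3 floating-point operations for an n x n matrix, so the time is not expected to be proportional to the matrix size. The Python sketch below assumes pure cubic scaling and ignores cache and overhead effects. It is not clear from the post whether "8 times smaller" refers to the dimension or to the number of elements, so both readings are computed; the function names are invented for this illustration.

```python
def speedup_if_dimension_shrinks(factor, exponent=3.0):
    """Expected time ratio when the matrix dimension n is divided by `factor`."""
    return factor ** exponent


def speedup_if_element_count_shrinks(factor, exponent=3.0):
    """Expected time ratio when the element count n*n is divided by `factor`."""
    dimension_factor = factor ** 0.5
    return dimension_factor ** exponent


if __name__ == "__main__":
    print("dimension / 8:", speedup_if_dimension_shrinks(8))
    print("elements / 8:", round(speedup_if_element_count_shrinks(8), 1))
```

Under this model the two readings give factors of 512 and roughly 22.6, and the reported factor of about 100 lies between them.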

juantxorena
Apprentice
Joined: 19 Mar 2006
Posts: 197
Location: The Shire

Posted: Wed Jan 30, 2013 10:01 pm

A comment: although I agree that the masking of blas-atlas and lapack-atlas was a bit rushed, the science overlay already has the future portage version of it. It is a single package, sci-libs/atlas, and a newer version than the masked ones. Since you seem to be building a scientific computer, you may be interested in the science overlay as a whole.
_________________
I cannot write English very well. Please, correct any mistake so that I can improve.

grosmano
n00b
Joined: 02 Jul 2012
Posts: 12

Posted: Sat Feb 02, 2013 10:07 pm

juantxorena wrote:
A comment: although I agree that the masking of blas-atlas and lapack-atlas was a bit rushed, in the science overlay there is the future portage version of it. Is a single package, sci-libs/atlas, and is a newer version than the masked ones. Since you seem to be building a scientific computer, you may be interested in the science overlay as a whole.

Thank you for your comment! I built sci-libs/atlas from the science overlay with the threads USE flag enabled and observed a further improvement over the unmasked lapack-atlas and blas-atlas (about 25% faster for the last kind of test I made; I haven't tried the others so far). This time I can see atlas-threads in both the lapack and blas eselect lists. I still don't observe any significant difference between running a single matrix inversion and running two at the same time, but there is an improvement anyway.

I didn't know about this "threads" USE flag, so it is not set for the whole system. I suppose that could be a reason why I don't see a difference :oops:

juantxorena
Apprentice
Joined: 19 Mar 2006
Posts: 197
Location: The Shire

Posted: Sat Feb 02, 2013 10:45 pm

I've found this document on the ATLAS webpage about threading: http://math-atlas.sourceforge.net/timing/newThr395/index.html
It seems that threading helps only for big problems and is counter-productive for small ones, which may be your case.

Another suggestion: since your work seems to depend heavily on the blas and lapack libraries, you may want to try other implementations. In portage (in the science overlay in particular) there are: the reference ones, mkl for Intel processors, acml for AMD processors, atlas, and goto (gotoblas2 in portage). In my experience, mkl is a PITA to install; I don't know about acml; and goto, according to some sources, is the fastest of them all. With goto you have to use lapack-reference, but there won't be a speed penalty: lapack makes heavy use of the blas functions, and its speed depends directly on the speed of the blas library. Even atlas uses the reference version of most of the lapack functions.
_________________
I cannot write English very well. Please, correct any mistake so that I can improve.

krinn
Advocate
Joined: 02 May 2003
Posts: 3937

Posted: Sun Feb 03, 2013 11:20 am

You could try prll (it was in the portage tree last time I checked); it does wonders for parallel processes.
http://forums.gentoo.org/viewtopic-t-813383-highlight-prll.html

mir3x
Tux's lil' helper
Joined: 02 Jun 2012
Posts: 91

Posted: Mon Feb 04, 2013 1:02 pm

To check whether your system is configured correctly, you can download the Scientific Linux live image (I'm not 100% sure Octave is included in the live image, but it should be; link -> http://ftp1.scientificlinux.org/linux/scientific/livecd/62/x86_64/) and run one more test.
_________________
Installation aborted to prevent system self-destruction

grosmano
n00b
Joined: 02 Jul 2012
Posts: 12

Posted: Mon Mar 04, 2013 11:23 pm

Sorry, I am replying very late. I tried prll; it indicated that only one CPU is used even when I run something like "prll -c 2 octave relative_path_to/mytest", and I don't really understand why. I will stop spending time on this issue on my laptop for a while, though, since I should receive the new computer soon (most likely an Intel Xeon E5-1620 on a DX79TO). I will update this thread as soon as tests have been run on that one!
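
When a tool reports a single CPU, one thing worth ruling out is whether the operating system itself exposes more than one to user processes (a kernel built without SMP support, for instance, would show only one). This is a guess at a useful check, not a diagnosis of prll's behaviour. The Python sketch below uses only the standard library; the function name is invented for this illustration.

```python
import os


def cpu_report():
    """Return (CPUs the OS reports, CPUs this process is allowed to run on)."""
    total = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):
        # Available on Linux: the set of CPUs this process may be scheduled on.
        allowed = len(os.sched_getaffinity(0))
    else:
        # Fallback for platforms without affinity support.
        allowed = total
    return total, allowed


if __name__ == "__main__":
    total, allowed = cpu_report()
    print(f"CPUs reported: {total}, usable by this process: {allowed}")
```

If both numbers are 1 on a dual-core machine, the kernel configuration would be the first place to look.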

All times are GMT