Gentoo Forums
sparc32(sun4m)-SMP, experience or problems with pthreads.
Ferris
Retired Dev


Joined: 13 Jan 2003
Posts: 426
Location: N. Virginia (USA)

Posted: Thu Jun 26, 2003 6:17 pm    Post subject: sparc32(sun4m)-SMP, experience or problems with pthreads.

I am putting this here right now for comments or feedback. I am not yet
close to having much of an analysis.

I am experiencing a problem which arises in a threaded
application on a SPARCstation 20 (SS20) SMP system characterized as
===
Linux 2.4.21-sparc-r1 #1 SMP Wed Jun 25 14:11:11 UTC 2003 sparc sun4m Texas Instruments, Inc. - SuperSparc-(II) GNU/Linux
===
This problem is NOT present on a U2 identified as
===
Linux 2.4.20-sparc-r0 #3 SMP Fri Jan 3 15:56:09 UTC 2003 sparc64 sun4u TI UltraSparc II (BlackBird) GNU/Linux
===

Libraries, etc are at same release level, and the software involved is
the same (source, built separately on each system.)

Software involved is a parallel simulation package (DaSSF) which uses
MPI (MPICH) for achieving parallel execution across processors or
systems. The particular test is a little simulation which can use 2 processors if they are available.

1. Using 1 or 2 processors on U2 works fine (threaded or not);
2. Using 1 processor on U2, 1 processor on SS20 works fine;
3. Using 1 processor on SS20 works fine;
4. Using 2 processors on SS20 (unthreaded, like 2 systems) seems to
work fine; BUT
5. Using 2 processors on SS20 (threaded) gives unpredictable results:
a. SegFault is the favorite;
b. Correct results are the next favorite;
c. Flat-out wrong results are possible, like
"if(time_delay > 0) {waitFor(time_delay);}
...
ERROR waitFor negative delay"

Some observations:

1. The reason for kernel 2.4.21 is that 2.4.20 exhibits the same behavior;
2. I cannot duplicate this with a simple-minded multithreaded test which
uses 100% of both CPUs and synchronizes every now and then;
3. I have not found any mention of such problems regarding MPI or
DaSSF, but the number of people using either on Sparc/Linux is
about 1, counting me.
4. MPI does not use threads, and it is the multiprocessor engine. I
believe it is using sockets. So I believe, but have not confirmed, that
the failing instance is an example of two threads synchronizing with
pthreads and communicating with sockets. (I don't have a very strong
belief, though.)

I am guessing that I am seeing a problem related to the subject-line
environment, but I say this only because it is such an easy target to
blame. :wink:

Has anyone seen or heard about anything like this? Comments?
Suggestions? Criticisms?

Thanks,
Ferris

Posted: Thu Jun 26, 2003 6:29 pm

Oh, yes, let me add another reason I suspect SMP+threads: When MPI
configures itself, it looks for java to see if it can use it to build some optional pieces. If on the SS20 it finds
java version "1.3.1"
Java(TM) 2 Runtime Environment, Standard Edition (build Blackdown-1.3.1-02b-FCS)
Classic VM (build Blackdown-1.3.1-02b-FCS, native threads, sunwjit)

then either the configure itself or, more likely, the build eventually hangs
forever in javac (at least, 8+ hours of CPU time overnight for a single compile
seems like forever, since it takes no significant time at all on the U2).

And javac certainly looks like it's pretty heavily threaded when it
runs. (1.4.1 same results).
Ferris

Posted: Fri Jun 27, 2003 7:43 pm

OK, now I can duplicate this pretty consistently with a three-thread program
consisting of two independent threads talking every now and then with
a main thread. I believe that the synchronization is correct, at least it
seems to be on paper, and the program runs correctly forever on my
U2. (On the Ultra2, it "redlines" the CPUs at Load > 2.0; the SS20 runs
at about 1.5, with a lot more system overhead).

[Thread-1: x = sin(x); (local-then-global) lots of times
           y = cos(y); (global-then-local) lots of times
 Thread-2: x = sin(x); (global-then-local) lots of times
           y = cos(y); (local-then-global) lots of times
 Main    : x = sin(x); (local) lots of times
           y = cos(y); (local) lots of times
           Every(lots of times) sync-up and make sure global sin&cos
           agree with my local sin & cos
 (so we are computing -0- and -0.739085...- very inefficiently)
 Uses 6 mutexes]

I am not at all sure enough of my own test case to submit a bug report
yet, especially since I don't know where the breakdown lies. I am pretty
sure, though, that no matter how out of sync these threads get,
they should not seg fault every 3sec - 3min or so. :?:

Anyone actually know something about this stuff?
Ferris

Posted: Sat Jun 28, 2003 4:24 pm

Now we can call it a bug. I am 99.44% sure the following is sufficient on
SS20-SMP to produce random SegFaults:

1. More than 1 thread;
2. Both (or more) threads doing floating point math (-lm) stuff at the same
time.

Thus:


a. 2 threads & main just syncing & reporting status now & then: OK;
b. 2 threads doing lots of 'd=random();' & syncing with main are OK;
c. 2 threads both using the same math function (sin, cos) will 100%
SegFault.
d. 3 or more threads *seems* OK
None of these cases ever fails on the U2.

So, I'll file a bug report, although I still have no idea who is going wrong
here (after all, by my own estimation there is a 0.56% chance it is *me* :wink: )
Ferris

Posted: Sun Jun 29, 2003 3:54 pm

For those of you who have been hanging on this for the next installment,
it continues at bug #23649...