Well yes, thanks for *that. Actually, I see 2-4 files uploaded when reporting/updating (probably 2 each: a "started" and a "finished" entry for both).
But I'm actually most curious about the "Missing Checkpoints File/Directory" message - obviously *called by the app, yet reported as "Not found/missing".
And the *presence of the "checkpoint_debug" diag flag - the intuition that a problem exists, and can be diagnosed and potentially repaired, is what's on my plate here.
Yeah, the question about this "missing checkpoint" pops up a lot. But it's answered by, right? Do you have an idea for a less confusing way for the app to report this?
MrS
Scanning for our furry friends since Jan 2002
All this talk about info messages about checkpoints is a red herring regarding the task run times. They are not errors.
If your card really is slower now you'll have to look for other causes:
1. Is it running at the same clock rate as before?
2. Is the rest of the machine running at the same clock rates as before?
3. Is the machine running the same types of tasks other than BRP6 as before?
To break the messages and their meaning down, this is how I understand it:
1. BOINC starts a new task for the first time; there is no checkpoint file, so the app writes an informational message to stderr saying so. <-- Normal
2. You run through the tasks and, from your previous logs, the app checkpoints normally. <-- Also normal
3. The main analysis is completed and the app moves over to sorting out the results; this is completed so fast that no checkpoint is needed, hence the message that no checkpoint was written, while it also presents some other statistics on "dirty SumSpec pages." <-- Also normal
4. The app then proceeds to start the 2nd bundled task and the whole thing repeats itself. <-- As the second bundled task is really a new task, it does not have a checkpoint, and a message saying so is written.
In the log from message 140881 you can see these messages:
[13:43:39][4924][INFO ] Output file: '../../projects/einstein.phys.uwm.edu/PM0021_001B1_104_0_0' already exists - skipping pass
[13:43:40][4924][INFO ] Continuing work on ../../projects/einstein.phys.uwm.edu PM0021_001B1_105.bin4 at template no. 293663
The first tells you that the 1st bundled task is already done and the app moves on to the second bundled task; the second message tells you that the app is continuing work from a checkpoint.
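To make the behaviour described above concrete, here is a minimal C/C++ sketch of that kind of checkpoint check - not the actual BRP6 source; the file name, format and log wording are invented for illustration:

#include <cstdio>

/* Returns the template index to resume from; 0 means "start from scratch". */
static long load_checkpoint(const char* path)
{
    FILE* f = std::fopen(path, "r");
    if (!f) {
        /* Not an error: a brand-new task (or the 2nd bundled task) simply
           has no checkpoint file yet, so only an info message is logged. */
        std::fprintf(stderr, "[INFO ] No checkpoint %s found - starting from the beginning\n", path);
        return 0;
    }
    long template_no = 0;
    if (std::fscanf(f, "%ld", &template_no) != 1)
        template_no = 0;   /* unreadable checkpoint: redo this pass */
    std::fclose(f);
    std::fprintf(stderr, "[INFO ] Continuing work at template no. %ld\n", template_no);
    return template_no;
}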
Thanks Holmis (and MrS)-
All this talk about info messages about checkpoints is a red herring regarding the task run times. They are not errors.
=
I agree with the "red herring" analogy - and thanks for the analysis of the messages
=
If your card really is slower now you'll have to look for other causes:
= Actually it's the app that's running slower - not my underrated card --
GTX 960 SC, 2048 MB GDDR5, with 8 multiprocessors (CUs) / DirectCompute = 5.2 (shaders); it's a Maxwell GM206 chip.
The detect routine in the app is a bit thin, IMHO.
And (not bashing here) the aging CUDA 3.2 (again, IMHO) is the primary bottleneck.
= My card is limited only by 2 things: 1) it's running on a PCIe 2.0 x16 bus; and 2) my CPU is not a Hyperthreading model - which, if it were, would activate Maxwell's Unified Memory and improve CPU/memory and GPU communication across the bus/I-O.
=
1. Is it running at the same clock rate as before? <- Yes, stable 1404.8 MHz core clock; 1752.8 MHz memory clock; memory used = 301 MB; load = 82%; 1.2060 V; temp = 64 C; avg TDP = 58%
+ avg CPU usage: Lasso reports avg 3-4% throughout the run
2. Is the rest of the machine running at the same clock rates as before? <- Yes, i5 2500 - 4 cores/4 threads at 3.3 GHz (SpeedStep off and Turbo Boost on), cores stable at 58 C under 100% load -- and app process priority set to Above Normal, high I/O, normal memory - actually Bitsum Highest in Lasso.
3. Is the machine running the same types of tasks other than BRP6 as before? <- Yes - against/with 4 SETI v7 CPU tasks (AVX) with avg 24% CPU usage
+ all 4 cores active on both sites.
I run Parkes BRP6 with the stock config of 0.2 CPUs + 1 NVIDIA GPU (GPU/CUDA apps are suspended at SETI).
=
So again, thanks for the looks and the app-idiosyncrasy analyses and explanations.
All good in furthering my knowledge of "HOW things work" (systemically) and how we interact *with them.
Kudos to the Devs and Admins
Quote:
And (not bashing here) the aging CUDA 3.2 (again, IMHO) is the primary bottleneck.
No, not really. The Devs are looking into using newer build environments, but so far the benefit of CUDA 5.5 has only been in the single-digit percentage range, if I remember correctly.
Quote:
My card is limited only by 2 things: 1) it's running on a PCIe 2.0 x16 bus; and 2) my CPU is not a Hyperthreading model - which, if it were, would activate Maxwell's Unified Memory and improve CPU/memory and GPU communication across the bus/I-O.
The 16x PCIe 2.0 is perfectly fine with the new app. Even slower connections work nicely now. The old app used to be far more talkative and suffered from slower PCIe connections.
And HT would not magically speed up your GPU at Einstein. It would help keep the GPU busy, though, if all CPU cores are crunching something else (as is the case in your system).
Unified memory is something which has to be used by the app explicitly or at least by the compiler.
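To illustrate that last point with a rough sketch (not Einstein@Home code): "unified" (managed) memory is an explicit allocation choice made in the application, and the call for it, cudaMallocManaged(), only exists from CUDA 6.0 onwards - so it is nothing a CPU feature such as Hyper-Threading could switch on, and an app built against CUDA 3.2 or 5.5 keeps doing explicit copies:

#include <cuda_runtime.h>
#include <cstdlib>

void explicit_copies(size_t n)   /* what a CUDA 3.2 / 5.5 app does today */
{
    float* h_buf = (float*)std::malloc(n * sizeof(float));
    float* d_buf = NULL;
    cudaMalloc((void**)&d_buf, n * sizeof(float));
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
    /* ... kernel launches ... */
    cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    std::free(h_buf);
}

void managed_memory(size_t n)    /* needs CUDA 6.0+ and explicit app support */
{
    float* buf = NULL;
    cudaMallocManaged((void**)&buf, n * sizeof(float));
    /* ... host code and kernels can both dereference buf;
           the driver migrates pages between CPU and GPU as needed ... */
    cudaFree(buf);
}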
Quote:
I run Parkes BRP6 with the stock config of 0.2 CPUs + 1 NVIDIA GPU
You could increase your Einstein throughput by running 2 WUs (0.2 CPU + 0.5 GPU) concurrently. This might also help avoid any idle time which may occur because all CPU cores are busy.
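For anyone wanting to try that, the usual way is an app_config.xml in the Einstein@Home project directory (BOINC 7.x). A sketch only - the <name> element below is my guess at the BRP6 app name, so check client_state.xml on your own host for the exact string:

<app_config>
  <app>
    <name>einsteinbinary_BRP6</name>  <!-- verify against client_state.xml -->
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>      <!-- two tasks share one GPU -->
      <cpu_usage>0.2</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

Re-read the config files from the BOINC Manager (or restart the client) for it to take effect.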
MrS
Scanning for our furry friends since Jan 2002
Quote:
My card is limited only by 2 things: 1) it's running on a PCIe 2.0 x16 bus; [...]
The 16x PCIe 2.0 is perfectly fine with the new app. Even slower connections work nicely now. The old app used to be far more talkative and suffered from slower PCIe connections.
Indeed.
Run times on my Host
Asus P8 MB, Z77 Express chipset, Windows 7
CPU: Intel i5-3570K CPU @ 3.8 GHz
GPU 0: Intel HD 4000
GPU 1: NVIDIA GTX 750Ti PCIe 3 x 16
GPU 2: NVIDIA GTX 750Ti PCIe 2 x 4 <--
BRP6 (Parkes PMPS XT v1.52)
Concurrency: 1 * 1 GPU:
GPU 0: ~8:45:00
Concurrency: 2 @ 0.5 CPUs + 0.5 GPUs:
GPU 1: ~4:00:00
GPU 2: ~4:05:00 <--
Only 5 minutes more!
Jürgen.
I'll get back to you on that stuff, MrS and DF1DX - and share my own comparative findings on these points of discussion.
=
I just received (UPS):
ASUS Z97 M Plus mobo
Intel Core i7-4790K Devil’s Canyon Quad-Core 4.0GHz LGA 1150 BX80646I74790K Desktop Processor Intel HD Graphics 4600
Haswell, Hyper-Threading, 8 threads
Intel 730 Series SSDSC2BP240G4R5 2.5" 240GB SATA 6Gb/s MLC
=
Building this out tomorrow; then a fresh Win7 DVD install plus 226 Windows Updates; appropriate new drivers and tunings; then data migration from the former SSD to get my data, apps and BOINC.
Plan to connect the iGPU to my monitor via DVI or HDMI for desktop graphics and crunch only with a fully enabled GTX 960 SC.
Remount the older SSD after the data transfer, put my 750 Ti SC back into this former host, retune everything and make it a full-time cruncher.
=
Wet Memorial Day weekend here---I'll just be building this
=
So after a few days of crunching SETI and Einstein to produce comparative samples, I'll share what differences the new config yields and confirm/deny the parameters of my original (former) hypotheses.
Have a good weekend y'all!
I wish to state how happy my GTX660 is with the beta CUDA 5.5 app. Its times are consistently below 4 hrs running 3 at a time vs almost 5 hrs with CUDA 3.2. The first 4 failed with a total run time of less than 30 secs. Since then everything has validated and my RAC has jumped by 10k.
Hi,
this thread is already a bit older - do you still need results?
I should have enough data from my crunching machine in 1 or 2 weeks. It's a GTX 650TI running on a Celeron G530, bus is PCIe 2.
First results show that the runtime is about 20% faster compared to "Binary Radio Pulsar Search (Parkes PMPS XT) v1.52 (BRP6-cuda32-nv270)".
If results are still needed, I would provide them properly once I have enough data.
In an earlier message in this thread, Bikeman indicated that he had enough information to validate the success of the optimizations he designed into the new BRP6 app. This had nothing to do with a change in the version of CUDA, which is a much more recent development and is unrelated to the previous algorithm optimizations.
Quote:
I should have enough data from my crunching machine in 1 or 2 weeks. It's a GTX 650TI running on a Celeron G530, bus is PCIe 2.
First results show that the runtime is about 20% faster compared to "Binary Radio Pulsar Search (Parkes PMPS XT) v1.52 (BRP6-cuda32-nv270)".
There was a separate thread for recording the improvements (or lack thereof) for NVIDIA GPUs (only) as a result of the change from CUDA32 to CUDA55. If you want to comment or post results, you should use it instead of this one. The consensus seems to be that Kepler and later series do benefit whilst Fermi and earlier don't. On this basis, your figure of 20% for a 650Ti seems about right. This message posted in the CUDA55 thread actually provides data for a 650Ti showing a ~19% improvement. There is also a link there to earlier data from the BRP5 -> BRP6 -> BRP6-Beta transitions (all using the old CUDA32).
Quote:
If results are still needed, I would provide them properly once I have enough data.
It's entirely up to you. I get the feeling that the results and comments in the CUDA55 thread support what the Devs were expecting, so I assume they aren't really looking for further confirmation. However, don't let that stop you :-). It's always good to see the results that people get :-).
Cheers,
Gary.