Refused: result already reported as success

MAGIC Quantum M...

Joined: 18 Jan 05

Posts: 1960

Credit: 1517589201

RAC: 1740952

10 Aug 2015 18:44:57 UTC

Topic 198186

(moderation:

)

8/10/2015 6:42:55 AM | Einstein@Home | Completed result PM0040_01211_382_0 refused: result already reported as success

I just got 5 in a row on the same host doing the BRP6 GPU tasks.

Never noticed these before.

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

Refused: result already reported as success

10 Aug 2015 20:53:48 UTC

Message 133239

(moderation:

)

It happens when Boinc reports completed tasks to the server and the servers acknowledgment is somehow lost so Boinc doesn't know the report got through. Boinc then tries to report the tasks again but as they've already been reported the server refuses the report and tells Boinc that they're already reported as success. Boinc now gets the message and deletes the tasks.

No harm done except for some extra network traffic and some extra log messages.
Everything else should play out normally.

MAGIC Quantum M...

Joined: 18 Jan 05

Posts: 1960

Credit: 1517589201

RAC: 1740952

Well for some reason it was

11 Aug 2015 6:19:25 UTC

Message 133240 in response to message 133239

(moderation:

)

Well for some reason it was doing that with tasks that were not due until the 19th

So after I posted that I just decided to reset the project and I have it running again with about just 10 tasks yet when I check that one http://einsteinathome.org/host/12001188/tasks it still says it has 47 in progress.

I thought Boinc would remove those old one soon after a reset.

And none of those on the *In progress * list are actually the ones it has now.

The ones it has now all say due on the 24th

And it keeps getting more *resent* tasks now when it is set to not get more.

And the 2 tasks it just finished said they were already reported as errors but none of these errors are on the hosts error list or invalid list.

Not even sure I should let it keep running since the project reset didn't work the usual way and now each finished task gets "already reported as error" and "already reported as success" on the log page but nothing on the "tasks for computer" page.

Or just let it run 24/7 doing all these tasks and keep getting these "resent" tasks until some day getting rid of all of them.........which makes no sense.

I have had these 6 hosts running for years now and never seen this.

Even the valids it has listed make no sense because the computer was down from June 11 until August 8th

(no problems with any other projects)

Oy vey

edit: I just took a look at this other host and ALL those aborted tasks are put on this one
http://einsteinathome.org/host/12001170/tasks&offset=40&show_names=1&state=0&appid=0

It also says BOTH computers were created on the 3rd.

The one having the problems was changed to Windows 10 on the 3rd but the other one has not been changed and has been running for several years with no problem.

And when I check that same host over at vLHC it says it was created on 25 Mar 2014, which is because I did a clean install of Windows 7 that day just to make sure it continued running perfect.

That makes no sense either.

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

Because of this projects old

11 Aug 2015 7:37:28 UTC

Message 133241 in response to message 133240

(moderation:

)

Because of this projects old server software it will resend lost tasks even though you have selected "No new tasks" in Boinc and not requesting new work. Not much you can do about that.

Have you copied Boinc's data directory between these host recently?

The last contact log for host#12001188 just states resending 12 tasks but the log for host#12001170 looks strange:

Quote:

2015-08-11 07:06:10.7645 [PID=32449] Request: [USER#xxxxx] [HOST#12001170] [IP xxx.xxx.xxx.54] client 7.4.42
2015-08-11 07:06:10.7680 [PID=32449] [handle] [HOST#12001170] [RESULT#512281141] [WU#223363158] got result (DB: server_state=5 outcome=4 client_state=0 validate_state=0 delete_state=0)
2015-08-11 07:06:10.7680 [PID=32449] [CRITICAL] [HOST#12001170] [RESULT#512281141] [WU#223363158] got result from wrong host; expected [HOST#12001188]
2015-08-11 07:06:10.7686 [PID=32449] [CRITICAL] [HOST#12001170] [RESULT#512281141] [WU#223363158] Allowing result because same USER#3123
2015-08-11 07:06:10.7686 [PID=32449] [handle] cpu time 0.000000 credit/sec 0.005236, claimed credit 0.000000
2015-08-11 07:06:10.7686 [PID=32449] [handle] [RESULT#512281141] [WU#223363158]: client_state 6 exit_status 200; setting outcome ERROR
2015-08-11 07:06:10.7686 [PID=32449] [handle] [HOST#12001170] [RESULT#512279204] [WU#223363397] got result (DB: server_state=4 outcome=0 client_state=0 validate_state=0 delete_state=0)
2015-08-11 07:06:10.7686 [PID=32449] [CRITICAL] [HOST#12001170] [RESULT#512279204] [WU#223363397] got result from wrong host; expected [HOST#12001188]
2015-08-11 07:06:10.7694 [PID=32449] [CRITICAL] [HOST#12001170] [RESULT#512279204] [WU#223363397] Allowing result because same USER#3123
2015-08-11 07:06:10.7694 [PID=32449] [handle] cpu time 0.000000 credit/sec 0.005236, claimed credit 0.000000
2015-08-11 07:06:10.7694 [PID=32449] [handle] [RESULT#512279204] [WU#223363397]: client_state 6 exit_status 200; setting outcome ERROR
2015-08-11 07:06:10.7694 [PID=32449] [handle] [HOST#12001170] [RESULT#512291510] [WU#224381224] got result (DB: server_state=4 outcome=0 client_state=0 validate_state=0 delete_state=0)
2015-08-11 07:06:10.7694 [PID=32449] [CRITICAL] [HOST#12001170] [RESULT#512291510] [WU#224381224] got result from wrong host; expected [HOST#12001188]
2015-08-11 07:06:10.7699 [PID=32449] [CRITICAL] [HOST#12001170] [RESULT#512291510] [WU#224381224] Allowing result because same USER#3123
2015-08-11 07:06:10.7699 [PID=32449] [handle] cpu time 0.000000 credit/sec 0.005236, claimed credit 0.000000
2015-08-11 07:06:10.7699 [PID=32449] [handle] [RESULT#512291510] [WU#224381224]: client_state 6 exit_status 200; setting outcome ERROR
2015-08-11 07:06:10.7699 [PID=32449] [handle] [HOST#12001170] [RESULT#512291506] [WU#224381222] got result (DB: server_state=4 outcome=0 client_state=0 validate_state=0 delete_state=0)
2015-08-11 07:06:10.7699 [PID=32449] [CRITICAL] [HOST#12001170] [RESULT#512291506] [WU#224381222] got result from wrong host; expected [HOST#12001188]
2015-08-11 07:06:10.7705 [PID=32449] [CRITICAL] [HOST#12001170] [RESULT#512291506] [WU#224381222] Allowing result because same USER#3123
2015-08-11 07:06:10.7705 [PID=32449] [handle] cpu time 0.000000 credit/sec 0.005236, claimed credit 0.000000
2015-08-11 07:06:10.7705 [PID=32449] [handle] [RESULT#512291506] [WU#224381222]: client_state 6 exit_status 200; setting outcome ERROR
2015-08-11 07:06:10.7705 [PID=32449] [handle] [HOST#12001170] [RESULT#512267793] [WU#223363122] got result (DB: server_state=4 outcome=0 client_state=0 validate_state=0 delete_state=0)
2015-08-11 07:06:10.7705 [PID=32449] [CRITICAL] [HOST#12001170] [RESULT#512267793] [WU#223363122] got result from wrong host; expected [HOST#12001188]
2015-08-11 07:06:10.7710 [PID=32449] [CRITICAL] [HOST#12001170] [RESULT#512267793] [WU#223363122] Allowing result because same USER#3123
2015-08-11 07:06:10.7710 [PID=32449] [handle] cpu time 0.000000 credit/sec 0.005236, claimed credit 0.000000
2015-08-11 07:06:10.7710 [PID=32449] [handle] [RESULT#512267793] [WU#223363122]: client_state 6 exit_status 200; setting outcome ERROR
2015-08-11 07:06:10.7710 [PID=32449] [handle] [HOST#12001170] [RESULT#512291512] [WU#224381225] got result (DB: server_state=4 outcome=0 client_state=0 validate_state=0 delete_state=0)
2015-08-11 07:06:10.7710 [PID=32449] [CRITICAL] [HOST#12001170] [RESULT#512291512] [WU#224381225] got result from wrong host; expected [HOST#12001188]
2015-08-11 07:06:10.7715 [PID=32449] [CRITICAL] [HOST#12001170] [RESULT#512291512] [WU#224381225] Allowing result because same USER#3123
2015-08-11 07:06:10.7715 [PID=32449] [handle] cpu time 0.000000 credit/sec 0.005236, claimed credit 0.000000
2015-08-11 07:06:10.7716 [PID=32449] [handle] [RESULT#512291512] [WU#224381225]: client_state 6 exit_status 200; setting outcome ERROR

MAGIC Quantum M...

Joined: 18 Jan 05

Posts: 1960

Credit: 1517589201

RAC: 1740952

RE: Because of this

11 Aug 2015 17:40:48 UTC

Message 133242 in response to message 133241

(moderation:

)

Quote:

Because of this projects old server software it will resend lost tasks even though you have selected "No new tasks" in Boinc and not requesting new work. Not much you can do about that.

Have you copied Boinc's data directory between these host recently?

The last contact log for host#12001188 just states resending 12 tasks but the log for host#12001170 looks strange:

No never have copied between hosts here before.

I just checked and it still is running and completing and still no credits or even errors other than on the event log saying "Refused: result already reported as success"

Only having this problem running GPU's here on this one and all of them run BRP6,vLHC,Atlas,and CMS-dev with no problems.

Never heard of this happening here before.

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7383561687

RAC: 2059175

RE: And it keeps getting

11 Aug 2015 18:22:17 UTC

Message 133243 in response to message 133240

(moderation:

)

Quote:

And it keeps getting more *resent* tasks now when it is set to not get more.

And what setting is that? I don't think there is such a setting. (about not getting resends). Other than not contacting the server at all, of course.

MAGIC Quantum M...

Joined: 18 Jan 05

Posts: 1960

Credit: 1517589201

RAC: 1740952

RE: RE: And it keeps

11 Aug 2015 18:27:10 UTC

Message 133244 in response to message 133243

(moderation:

)

Quote:

Quote:
And it keeps getting more *resent* tasks now when it is set to not get more.

And what setting is that? I don't think there is such a setting. (about not getting resends). Other than not contacting the server at all, of course.

.... set to not get more tasks

MAGIC Quantum M...

Joined: 18 Jan 05

Posts: 1960

Credit: 1517589201

RAC: 1740952

2015-08-11 18:44:08.5128

11 Aug 2015 18:59:48 UTC

Message 133245

(moderation:

)

2015-08-11 18:44:08.5128 [PID=32605] Request: [USER#xxxxx] [HOST#12001170] [IP xxx.xxx.xxx.54] client 7.4.42
2015-08-11 18:44:08.5141 [PID=32605] [handle] [HOST#12001170] [RESULT#512293394] [WU#224382161] got result (DB: server_state=4 outcome=0 client_state=0 validate_state=0 delete_state=0)
2015-08-11 18:44:08.5142 [PID=32605] [CRITICAL] [HOST#12001170] [RESULT#512293394] [WU#224382161] got result from wrong host; expected [HOST#12001188]
2015-08-11 18:44:08.5146 [PID=32605] [CRITICAL] [HOST#12001170] [RESULT#512293394] [WU#224382161] Allowing result because same USER#3123
2015-08-11 18:44:08.5146 [PID=32605] [handle] cpu time 1403.603000 credit/sec 0.005236, claimed credit 7.348642
2015-08-11 18:44:08.5148 [PID=32605] [handle] [RESULT#512293394] [WU#224382161]: setting outcome SUCCESS
2015-08-11 18:44:08.5191 [PID=32605] [send] effective_ncpus 4 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2015-08-11 18:44:08.5191 [PID=32605] [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2015-08-11 18:44:08.5191 [PID=32605] [send] Not using matchmaker scheduling; Not using EDF sim
2015-08-11 18:44:08.5191 [PID=32605] [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2015-08-11 18:44:08.5191 [PID=32605] [send] CUDA: req 0.00 sec, 0.00 instances; est delay 0.00
2015-08-11 18:44:08.5191 [PID=32605] [send] work_req_seconds: 0.00 secs
2015-08-11 18:44:08.5191 [PID=32605] [send] available disk 86.69 GB, work_buf_min 864000
2015-08-11 18:44:08.5191 [PID=32605] [send] active_frac 0.999964 on_frac 0.996171 DCF 0.610334
2015-08-11 18:44:08.5242 [PID=32605] Sending reply to [HOST#12001170]: 0 results, delay req 60.00
2015-08-11 18:44:08.5252 [PID=32605] Scheduler ran 0.016 seconds

And on the other

2015-08-11 18:36:27.2597 [PID=29595] Request: [USER#xxxxx] [HOST#12001188] [IP xxx.xxx.xxx.54] client 7.4.42
2015-08-11 18:36:27.2617 [PID=29595] [handle] [HOST#12001188] [RESULT#512245972] [WU#224362352] got result (DB: server_state=5 outcome=3 client_state=6 validate_state=2 delete_state=0)
2015-08-11 18:36:27.2617 [PID=29595] [CRITICAL] [HOST#12001188] [RESULT#512245972] [WU#224362352] result already over [outcome=3 validate_state=2]: result already reported as error
2015-08-11 18:36:27.2617 [PID=29595] [handle] [HOST#12001188] [RESULT#512245746] [WU#224362240] got result (DB: server_state=5 outcome=3 client_state=6 validate_state=2 delete_state=2)
2015-08-11 18:36:27.2617 [PID=29595] [CRITICAL] [HOST#12001188] [RESULT#512245746] [WU#224362240] result already over [outcome=3 validate_state=2]: result already reported as error
2015-08-11 18:36:27.2618 [PID=29595] [send] effective_ncpus 4 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2015-08-11 18:36:27.2618 [PID=29595] [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2015-08-11 18:36:27.2618 [PID=29595] [send] Not using matchmaker scheduling; Not using EDF sim
2015-08-11 18:36:27.2618 [PID=29595] [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2015-08-11 18:36:27.2618 [PID=29595] [send] CUDA: req 0.00 sec, 0.00 instances; est delay 0.00
2015-08-11 18:36:27.2618 [PID=29595] [send] work_req_seconds: 0.00 secs
2015-08-11 18:36:27.2618 [PID=29595] [send] available disk 89.49 GB, work_buf_min 0
2015-08-11 18:36:27.2618 [PID=29595] [send] active_frac 0.999827 on_frac 0.810380 DCF 0.940879
2015-08-11 18:36:27.2671 [PID=29595] [debug] [HOST#12001188] MSG(high) Completed result PM0040_011A1_332_1 refused: result already reported as error
2015-08-11 18:36:27.2672 [PID=29595] [debug] [HOST#12001188] MSG(high) Completed result PM0040_011A1_108_0 refused: result already reported as error
2015-08-11 18:36:27.2672 [PID=29595] Sending reply to [HOST#12001188]: 0 results, delay req 60.00
2015-08-11 18:36:27.2682 [PID=29595] Scheduler ran 0.013 seconds

There 2 have a slightly different CPU a different GPU and a different OS

The only thing the same is they sit in the same room

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

This seems like a true

11 Aug 2015 19:21:28 UTC

Message 133246 in response to message 133242

(moderation:

)

This seems like a true mystery to me.

In the contact log for host#12001188 recorded at 11 Aug 2015 18:36:27 UTC the following tasks are reported:
RESULT#512245972 belonging to WU#224362352
RESULT#512245746 belonging to WU#224362240

Both are rejected as "result already reported as error". Looking through the task list for host#12001188 those WU are nowhere to be found.

Here's when it gets interesting, looking through the task list for host#12001170 those WU can be found with the following info:

Quote:

512245972 224362352 10 Aug 2015 18:45:27 UTC 11 Aug 2015 1:21:55 UTC Aborted by user 0.00 0.00 0.00 --- Binary Radio Pulsar Search (Parkes PMPS XT) v1.57 (BRP6-Beta-cuda55)
512245746 224362240 10 Aug 2015 18:45:26 UTC 11 Aug 2015 1:05:50 UTC Aborted by user 0.00 0.00 0.00 --- Binary Radio Pulsar Search (Parkes PMPS XT) v1.57 (BRP6-Beta-cuda55)

So they were already reported and probably with messages like this in the contact log:
2015-08-11 18:44:08.5142 [PID=32605] [CRITICAL] [HOST#12001170] [RESULT#512293394] [WU#224382161] got result from wrong host; expected [HOST#12001188]
2015-08-11 18:44:08.5146 [PID=32605] [CRITICAL] [HOST#12001170] [RESULT#512293394] [WU#224382161] Allowing result because same USER#3123

How does host#12001170 manage to report tasks that was scheduled to host#12001188?
And why does the server accept them? Shouldn't the server discard the report?
Maybe even issue a new host number as something is clearly wrong?

To clear up this mess I'd probably set both host to "No new tasks", abort all tasks belonging to Einstein, hit the "Update" button on the projects tab to report the aborted tasks and when the communication was done remove the project from both hosts. Then I'd add the project back on one of them checking what's going on in the task list here on the website for that host and the event log in Boinc, once satisfied that it's working as it should I'd add the project to the other host again and once again keep a close eye on it.

Edit:
Some further thoughts/questions:
Could it be that the tasks where first scheduled to host#12001188 but when the report came in from host#12001170 the server moved the tasks over in the online database?
How can two different hosts even know anything about each others task unless some fancy editing of client_stat.xml and copying of data is going on?

MAGIC Quantum M...

Joined: 18 Jan 05

Posts: 1960

Credit: 1517589201

RAC: 1740952

It is a mystery to me too

11 Aug 2015 20:06:05 UTC

Message 133247

(moderation:

)

It is a mystery to me too Holmis

So far I just did the *reset the project* yesterday on the one with the problem and that sure didn't work.

So maybe I should try aborting each task and then starting over as you mentioned.

I never change the files on any of my 6 hosts here and the other ones still work.

The only way they are connected is the wireless modem and a D-link and those 2 are not even plugged in the same box (one into the modem and the other into the D-link and a couple are plugged in that way so that sure isn't the reason)

So in about an hour that host will finish its 24hr vLHC tasks so then I won't mind going as far as reinstalling Boinc and starting over with all the files I would have to d/l for the projects on my slow dsl.

Thanks Holmis and wish me luck !!

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

Don't think a full

11 Aug 2015 20:21:38 UTC

Message 133248 in response to message 133247

(moderation:

)

Don't think a full reinstallation is necessary. Removing the project should remove all files associated with it and aborting and reporting the tasks before removing should clear everything on the server.

If it doesn't work one could always edit the client_state.xml and reduce the "rpc_seqno" for the project so that the server issues a new host number.

Either way I do wish you good luck! =)

Refused: result already reported as success

Forums › Cruncher's Corner

Comment viewing options

Forums › Cruncher's Corner