
Updated flow for running checkers
January 10, 2018 04:01PM
I have been working for two days on updating the flow for running checkers.

As it works today, the system can run hundreds of checkers simultaneously, or at least it starts them. The problem is that the dedicated virtual machine only has 4 cores and 8 GB of RAM (it was 4 GB a few days ago), so it was quite easy to OOM (out of memory) the system.

The new design is that when the user clicks the "Run checker" button, they are added to a queue, waiting for a slot. I assume the normal case will be that a slot is available right away, or at least within a few seconds.

Below the profile-name input box there will be a few progress lines, so that the user can see that something is happening. It will look like this:

[Screenshot of the progress lines; image not included]

It's still a work in progress. From what I can see in the development environment, it works fine now. But it does not yet check that the user actually is first in the queue before accepting a run of a checker. There are some other "minor" things to fix as well. But the queue system itself is in place, and so is the UI.

While changing this behavior I realized that we will now know more about the resources available in the virtual machine, so we are increasing the max execution time from 30 seconds to 60 seconds. Memory usage will stay at 1 GB for now, but it's not impossible that it will change in the future.

I have not yet tested this, and I am unsure whether I need to change the code in more places than I have done (I found 1 place). You will notice when it's live, since the UI will then look like the screenshot above.

There might be places in the documentation that need to be updated regarding the 60 seconds. If you see such a place, feel free to link it in this thread.

One user can only queue for one slot at a time. If a user has several tabs open and has pressed Run checker in all of them, they will all show the same queue position. One of the tabs will be able to claim the slot, and the others should be added last in the queue again. Since we are currently running the checkers without checking the queue, this isn't implemented yet, but this is the plan.

Paying members of Project-GC will steal slots in the queue, so a non-paying user can actually go from spot 4 to 6, for example. The idea is to put script developers before the paying members as well, to make sure developing is as smooth as possible. It's on the list, but not the highest priority. I have a feeling it won't be necessary.

With the new code and the current configuration, 8 checkers can run simultaneously, sharing the 4 cores and 8 GB of RAM (plus some swap).
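
For readers who want a concrete picture of the slot idea, here is a minimal sketch in Python. It is not the actual Project-GC code: the class and method names are made up for illustration, and the rule that position 0 means "a slot is free for you right now" is an assumption based on the progress lines shown later in the thread.

```python
from collections import deque

MAX_RUNNING = 8  # from the post: 8 checkers may run at the same time


class SlotQueue:
    """FIFO waiting line in front of a fixed number of run slots (hypothetical)."""

    def __init__(self, max_running=MAX_RUNNING):
        self.max_running = max_running
        self.waiting = deque()  # requests waiting for a slot, oldest first
        self.running = set()    # requests currently holding a slot

    def enqueue(self, request_id):
        """Add a request and return its queue position."""
        self.waiting.append(request_id)
        return self.position(request_id)

    def position(self, request_id):
        """Assumed meaning: 0 = a free slot is available for this request."""
        index = list(self.waiting).index(request_id)
        free_slots = self.max_running - len(self.running)
        return max(0, index - free_slots + 1)

    def start_next(self):
        """Move the oldest waiting request into a slot, if one is free."""
        if self.waiting and len(self.running) < self.max_running:
            request_id = self.waiting.popleft()
            self.running.add(request_id)
            return request_id
        return None

    def finish(self, request_id):
        """Release the slot held by a finished (or timed out) checker run."""
        self.running.discard(request_id)
```

With max_running set to 2, a third request gets position 1 and drops to position 0 as soon as one of the two running checkers calls finish().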
Re: Updated flow for running checkers
January 11, 2018 03:16AM
Sounds good! Could this maximum execution time value be made available to scripts too, so we don't have to hard-code it in long-running scripts that hunt for better solutions and exit when the running time has almost expired?
Re: Updated flow for running checkers
January 11, 2018 09:28AM
pieterix Wrote:
-------------------------------------------------------
> Sounds good! Could this maximum execution time
> value be made available to scripts too, so we
> don't have to hard-code this value in long running
> scripts that hunt for better solutions and exits
> when the running time is almost expired?

I will add an array named environmentSettings (in addition to today's config). It will include:
  • cli - If true, there is no need to return an example log, HTML output or similar. Only the ok parameter is relevant.
  • maxMemoryUsage - In bytes, I think.
  • maxExecutionTime - In seconds, quite sure. :)

Example: args[1].environmentSettings.maxMemoryUsage
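
To illustrate how a long-running script could use such a value instead of a hard-coded limit, here is a rough sketch in Python. It is only an illustration of the pattern, not checker code: the function name, the safety_margin parameter and the way the value is passed in are all assumptions, and max_execution_time simply stands in for environmentSettings.maxExecutionTime (assumed to be in seconds).

```python
import time


def search_with_budget(candidates, score, max_execution_time, safety_margin=5.0):
    """Return the best candidate found before the time budget runs out.

    max_execution_time stands in for environmentSettings.maxExecutionTime
    (assumed seconds). safety_margin is a hypothetical reserve that leaves
    time for producing the final output.
    """
    deadline = time.monotonic() + max_execution_time - safety_margin
    best, best_score = None, float("-inf")
    for candidate in candidates:
        if time.monotonic() >= deadline:
            break  # stop early and report the best result found so far
        value = score(candidate)
        if value > best_score:
            best, best_score = candidate, value
    return best, best_score
```

The point of the pattern is that the loop checks the clock on every iteration, so raising the limit from 30 to 60 seconds needs no change in the script itself.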
Re: Updated flow for running checkers
January 11, 2018 11:05AM
60 seconds is a long time to wait for a runaway script :)

Would it be easy to add a "manifest" to the Tag that limits the time-out and maybe the memory needed?

This information could also be used for prioritizing scripts when the running queue is full. It is better to run faster scripts first.
Re: Updated flow for running checkers
January 11, 2018 11:15AM
arisoft Wrote:
-------------------------------------------------------
> It is long time to wait 60 seconds for a runaway
> script :)
>
> Would it easy to add a "manifest" in to the Tag,
> which limits the time-out and maybe needed
> memory?
>
> This information could also be used for
> prioritizing scripts when the running queue if
> full. It is better to run faster scripts first.

Are you mostly thinking from a script developer's perspective, where the code doesn't finish due to an endless loop bug (or similar)?

One workaround for that would be to simply allow script developers to skip the queue, for example.
Re: Updated flow for running checkers
January 11, 2018 11:18AM
ganja1447 Wrote:
-------------------------------------------------------
> arisoft Wrote:
> -------------------------------------------------------
> > It is long time to wait 60 seconds for a runaway
> > script :)
> >
> > Would it easy to add a "manifest" in to the Tag,
> > which limits the time-out and maybe needed
> > memory?
> >
> > This information could also be used for
> > prioritizing scripts when the running queue if
> > full. It is better to run faster scripts first.
>
> Are you mostly thinking from a script developers
> perspective? When the code doesn't finish due to
> an endless loop bug (or similar)?
>
> One workaround to that would be to just allow
> script developers to skip the queue for example.

Having seen the code live (which it is now), I am also considering changing the behavior so that a user can execute more than one checker at the same time. That would also help with your issue. It will require some rewriting of the queue though, since the user-id is the key now. I would have to generate a token, send it back to the client and pass that around. It shouldn't be too hard though, especially now that I have all the code in my head.
Re: Updated flow for running checkers
January 11, 2018 11:33AM
> Are you mostly thinking from a script developers
> perspective? When the code doesn't finish due to
> an endless loop bug (or similar)?

At first I thought of endless loops. Sometimes I forget to add x = x + 1 in a while loop etc.

Then I thought further: what happens when all cores are running massive 60 second scripts at the same time?
It could be possible to reserve some cores for short jobs only, to prevent long delays if the script is tagged as fast.
You could use a default of 30 seconds for all scripts which do not have the manifest.

For example, almost any script which scans the data only once will always run in less than 10 seconds and could be tagged with a 10 sec time-out. The default 30 sec would be ok for most scripts, and only time-consuming goal-seeking scripts would have to use the maximum 60 sec time-out. And if the queue is empty, why not allow even more in that case?
Re: Updated flow for running checkers
January 11, 2018 11:38AM
arisoft Wrote:
-------------------------------------------------------
> > Are you mostly thinking from a script developers
> > perspective? When the code doesn't finish due to
> > an endless loop bug (or similar)?
>
> At the beginning I though endless loops. Sometimes
> I forget to add x = x + 1 in a while loop etc.
>
> Then I though further, what happens when all cores
> are running massive 60 second script at the same
> time.
> It could be possible to spare some cores to short
> jobs only, to prevent long delays if the script it
> tagged to be fast.
> You could use default 30 seconds to all scripts
> which does not have the manifest.
>
> For example almost any script which scan the data
> only once, will always run faster than 10 seconds
> and could be tagged to 10 sec time-out. The
> default 30 sec would be ok for most scripts and
> only time consuming goal seeking scripts have to
> use the maximum 60 sec time-out. And if the queue
> is empty, why not to allow even more in that case?


I see your point, but I don't want to attack a problem that I haven't actually seen exist. Also, scripts can be affected by external factors like database replication lag and such. I don't think it's wise to assume a script ends in less than 10 seconds just because it almost always does.

My personal guess is that the system won't be so busy that users normally will have to queue. If the queue time is less than 5 seconds in 95% of the cases, I don't see a reason to make it more complex than necessary. It all comes down to prioritizing. There are so many things that need to be done, fixed, implemented and so on, and so few resources to use.
Re: Updated flow for running checkers
January 11, 2018 12:24PM
ganja1447 Wrote:
> I see your point, but I don't want to attack a
> problem that I haven't seen actually exist.

I have the same feeling: there is no need to try prioritizing jobs before there is a problem to solve.

During development it can easily happen that endless loops occur many times in a short time span. If a new checker is started before the previous one has ended, then in the worst case the developer can start a new checker up to 6 times in a 60 second time span. Maybe that is tolerable "abuse", since it happens very rarely, if there is no easy way to forcefully kill the previous task when a new checker is queued by the same user.
Re: Updated flow for running checkers
January 11, 2018 12:42PM
On the other hand, I don't think waiting in the queue for 60 seconds is catastrophic when running a checker, as long as it's a rare exception.

One could compare with how it was before checkers. In the best case, you had to load the statistics somewhere, look up the numbers, write them down in a log and so forth. The worst case was spending hours, or days, trying to figure out whether one fulfills it.

That is from a user's perspective. For a script developer, I really would like the queue time to be 0-3 seconds in > 99% of the cases.
Re: Updated flow for running checkers
January 11, 2018 06:04AM
Good design, I think. Will the queuing mechanism allow for multiple VMs to service the requests?
Related question: how many PGC points do you get for sponsoring a VM?
Re: Updated flow for running checkers
January 11, 2018 09:33AM
TravelingGeek Wrote:
-------------------------------------------------------
> Good design I think. Will the queuing mechanism
> allow for multiple VMs to service the request?
> Related question, how many PGC points do you get
> for sponsoring a VM ?

It will not, though patching that part wouldn't be the hardest thing. As it is now, it would be easier to grow the VM. 4 cores isn't much since we have 8-16 cores on most servers. Running off-site wouldn't be feasible since the machines need database access.

The fact is that I am not 100% satisfied with how the custom written queue system for this turned out. It may very well be that I will rewrite it at a later point, or at least tweak it. But I am looking forward to testing it live. Hopefully there will be a release today.
Re: Updated flow for running checkers
January 11, 2018 12:24PM
Based on feedback and my own intuition I have now updated the queue system again.

When the client adds itself to a queue, it gets a token. That token makes it unique in the queue instead of the user-id. This means one user can queue several tabs.
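
As a rough illustration of keying the queue on a token rather than the user-id, here is a small Python sketch. It is not the real implementation: the TokenQueue class, its method names and the token format are assumptions made for the example.

```python
import secrets
from collections import deque


class TokenQueue:
    """Queue whose entries are identified by a random token (hypothetical)."""

    def __init__(self):
        self.entries = deque()  # (token, user_id) pairs, oldest first

    def join(self, user_id):
        """Queue one request and hand back the token that identifies it."""
        token = secrets.token_hex(16)  # random, so two tabs never share a key
        self.entries.append((token, user_id))
        return token

    def position(self, token):
        """Return the 0-based position of a token, or None if it is unknown."""
        for index, (entry_token, _user_id) in enumerate(self.entries):
            if entry_token == token:
                return index
        return None
```

Calling join() twice with the same user_id gives two independent entries, which is what lets several tabs from one user wait side by side.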

I could implement a max number of slots per user if necessary, but I doubt it will be needed. The upside is that script developers won't be locked out from the queue by a job running for max execution time. Just clicking Run checker will create a new token and queue again.

If you as script developers notice that you have wait time in the queue (Current position > 0) a bit too often, please notify me (here, or via pgc support) and I will implement a way for you guys to skip the queue.

The reason I haven't done it yet is that it can (visually) increase the position for other users. I don't really care if a non-paying member notices that they are bumped from position 3 to 4, since it says that paying members are prioritized. But I don't want the paying members to notice that they are falling back in the queue. I would then have to implement better position reporting (which actually lies). In reality I don't think it will be a noticeable issue.
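
The position effect can be shown with a small Python sketch. This is only a model under assumptions: the function names are hypothetical, and the ordering rule (paying members go behind other paying members but ahead of everyone else) is my reading of the description, not the actual code.

```python
def insert_by_priority(queue, token, paying):
    """Insert (token, paying) so paying entries stay ahead of non-paying ones."""
    entry = (token, paying)
    if not paying:
        queue.append(entry)  # non-paying requests always go last
        return
    # Skip past the paying entries already queued, keeping FIFO order among them.
    index = 0
    while index < len(queue) and queue[index][1]:
        index += 1
    queue.insert(index, entry)


def position(queue, token):
    """Return the 0-based position of a token in the queue."""
    return [entry_token for entry_token, _paying in queue].index(token)


queue = []
insert_by_priority(queue, "free-1", paying=False)
insert_by_priority(queue, "free-2", paying=False)
before = position(queue, "free-2")
insert_by_priority(queue, "pay-1", paying=True)
after = position(queue, "free-2")
```

In this model a paying member never moves backwards when another paying member joins, but they would if a third, higher tier (such as script developers) were inserted ahead of them, which is the case described above.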
Re: Updated flow for running checkers
January 11, 2018 07:36PM
What should happen if there is a syntax error etc. in the script?

I noticed that the "Running checker" countdown continues forever. It is a bit misleading because the error status is below the visible screen area on my display.
Re: Updated flow for running checkers
January 11, 2018 08:05PM
arisoft Wrote:
-------------------------------------------------------
> What shoud happen if there is syntax error etc. in
> the script?
>
> I noticed that "Running checker" continues
> "countdown" forever. It is a bit misleading
> because the error status is below the visible
> screen area on my display.

I added a JS line in the development environment which will mark those progress steps as complete, i.e. freeze the timer and remove the text-muted classes. Those 4 lines should end up the same as if there were an error. Easiest fix.
Re: Updated flow for running checkers
January 11, 2018 09:07PM
Something special happened.
50 seconds after starting, I got "Error Queue token not available".
After reloading it is working again.
Re: Updated flow for running checkers
January 12, 2018 07:45AM
arisoft Wrote:
-------------------------------------------------------
> Somethin special happened.
> After 50 seconds from starting, I got "Error Queue
> token not available"
> After reloading it is working again.


I have seen similar things today. A few times this morning I noticed the web being insanely slow. Not sure why, but it's like the httpd isn't responding.

The user has 10 seconds to claim the slot after it has been made available; after that the token is thrown away. If 50 seconds pass and you get a token not available message back, it seems to me that it took > 10 seconds from when the slot became available until the web browser managed to connect to the httpd and run the first bit of code. It really should take 0-1 seconds.
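
The claim rule described above can be modelled roughly like this in Python. It is a sketch, not the server code: the claim function, the offers dictionary and the returned strings are assumptions, and timestamps are passed in explicitly to keep the example deterministic.

```python
CLAIM_WINDOW_SECONDS = 10  # from the post: 10 seconds to claim an offered slot


def claim(offers, token, now):
    """Try to claim an offered slot.

    offers maps token -> time (in seconds) at which the slot was offered.
    The token is removed either way, so it can never be claimed twice.
    """
    offered_at = offers.pop(token, None)
    if offered_at is None:
        return "Queue token not available"  # unknown or already discarded
    if now - offered_at > CLAIM_WINDOW_SECONDS:
        return "Queue token not available"  # the client arrived too late
    return "ok"
```

Under this model a client that reaches the server 12 seconds after the offer gets the same message as one with an unknown token, which matches the symptom when the httpd is slow to respond.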

I would of course like to find out why the httpd isn't responding as it should, if that is the case.
Re: Updated flow for running checkers
January 11, 2018 11:15PM
> I added a JS line in the development environment
> which will put the mark those progress steps as
> complete. Ie, freeze the timer and remove the
> text-muted classes. Those 4 lines should end up
> the same as if their were an error. Easiest fix.

There is a side effect. Timer is now running faster and goes to negative numbers. Maybe there is more than one timer counting at the same time after an error message?

Current position in queue: 0
Running checker, this might take up to 60 seconds. Countdown: -444
Retrieving additional information.
Re: Updated flow for running checkers
January 12, 2018 07:41AM
arisoft Wrote:
-------------------------------------------------------
> > I added a JS line in the development environment
> > which will put the mark those progress steps as
> > complete. Ie, freeze the timer and remove the
> > text-muted classes. Those 4 lines should end up
> > the same as if their were an error. Easiest fix.
>
> There is a side effect. Timer is now running
> faster and goes to negative numbers. Maybe there
> is more than one timer counting at the same time
> after an error message?
>
> Current position in queue: 0
> Running checker, this might take up to 60 seconds.
> Countdown: -444
> Retrieving additional information.

Correct. There will be multiple timers running. If a second timer is started, the id of the first one is lost, so the first one can no longer be stopped. My fix is still in the development environment, not live. This will be fixed with the next release.
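
The general pattern for avoiding a lost timer id is to keep a single handle and cancel it before starting a new timer. The sketch below models that in Python rather than in the page's JavaScript; FakeScheduler is a stand-in for the browser's interval functions, and all names are hypothetical.

```python
class FakeScheduler:
    """Stand-in for an interval API; only tracks which timers are active."""

    def __init__(self):
        self._next_id = 0
        self._active = set()

    def set_interval(self):
        self._next_id += 1
        self._active.add(self._next_id)
        return self._next_id  # the handle needed to stop this timer later

    def clear_interval(self, handle):
        self._active.discard(handle)

    def active_count(self):
        return len(self._active)


class Countdown:
    """Keeps one timer handle so a restart never leaks the previous timer."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.handle = None

    def start(self):
        self.stop()  # cancel the old timer before its handle is overwritten
        self.handle = self.scheduler.set_interval()

    def stop(self):
        if self.handle is not None:
            self.scheduler.clear_interval(self.handle)
            self.handle = None
```

Calling start() twice leaves exactly one active timer, so the countdown cannot speed up or run past zero because of a forgotten second timer.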
Re: Updated flow for running checkers
January 12, 2018 09:09AM
Got a new error after a long wait.

Unknown error (script error 6, 5a587199be3896.44321106)

Reloading sorted it out.
Re: Updated flow for running checkers
January 12, 2018 09:21AM
arisoft Wrote:
-------------------------------------------------------
> Got a new error after a long wait.
>
> Unknown error (script error 6,
> 5a587199be3896.44321106)
>
> Reloading sorted it out.

That's the old error 5 which got renumbered.

I'll go ahead and make a patch to store additional information. It's a case that should not happen, hence the cryptic error. It's the last check, and it means some data from the checker is missing.