PR# 19533 100% CPU usage and not responding webserver

Problem Report Summary
Submitter: andersoxie
Category: Other
Priority: High
Date: 2019/03/13
Class: Bug
Severity: Critical
Number: 19533
Release: 18.11
Confidential: No
Status: Analyzed
Responsible:
Environment: Ubuntu
Synopsis: 100% CPU usage and not responding webserver

Description
After running for several days, the server stops responding. When I look at the server, I see that the process is using 100% of one of the server's CPUs.

I have attached a log file that might give some indication of the problem. It contains some lines with the error

Internal error in WEB_SOCKET.do_send (conn, a_opcode=8, a_message) 

but of course that might not be related to the problem.
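The `a_opcode=8` in this error may be a useful clue. In the WebSocket protocol (RFC 6455), opcode 8 is the Close frame. These errors therefore appear to occur while the server sends a close frame, perhaps to a client that has already disconnected. That is a guess, not a diagnosis.

Below is a minimal Python sketch that counts these errors per opcode in the attached log. It assumes each error occupies a single line formatted exactly as quoted above. The script name `scan_do_send.py` is hypothetical.

```python
import re
import sys
from collections import Counter

# WebSocket frame opcodes as defined in RFC 6455, section 5.2.
OPCODE_NAMES = {
    0x0: "continuation",
    0x1: "text",
    0x2: "binary",
    0x8: "close",
    0x9: "ping",
    0xA: "pong",
}

# Assumption: each error sits on one log line, formatted as quoted in this report.
ERROR_RE = re.compile(r"Internal error in WEB_SOCKET\.do_send \(conn, a_opcode=(\d+)")


def count_do_send_errors(lines):
    """Count WEB_SOCKET.do_send internal errors per WebSocket opcode name."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            opcode = int(match.group(1))
            counts[OPCODE_NAMES.get(opcode, f"unknown({opcode})")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_do_send.py bsharptodo.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for name, n in count_do_send_errors(log).most_common():
            print(f"{name}: {n}")
```

If nearly all errors turn out to be `close`, that points at the connection-closing path rather than at normal message traffic.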

At the moment I have no further information that could help the investigation. Is there something I can add to the server code to get more information?

It has happened about 10 times now.
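One option that requires no change to the server code is an external watchdog. It probes the site periodically and records the moment it stops answering, so each failure can be matched against the server log.

Below is a minimal Python sketch using only the standard library. The URL, probe interval, timeout and the log file name `watchdog.log` are all assumptions to adapt.

```python
import http.client
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone


def probe(url, timeout=10.0):
    """GET url once; return (answered, seconds). Any HTTP status counts as answered."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read(1)  # read one byte so a server that stalls mid-response times out
        answered = True
    except urllib.error.HTTPError:
        answered = True  # e.g. 404 or 500: the server replied, so it is not hung
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        answered = False  # refused, TLS failure, malformed reply, or timed out
    return answered, time.monotonic() - start


def watch(url, interval=60.0, timeout=10.0, log_path="watchdog.log"):
    """Probe url forever and append a UTC-stamped line on every UP/DOWN change."""
    previous = None
    while True:
        answered, seconds = probe(url, timeout)
        if answered != previous:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            state = "UP" if answered else "DOWN"
            with open(log_path, "a", encoding="utf-8") as log:
                log.write(f"{stamp} {state} {url} ({seconds:.1f}s)\n")
            previous = answered
        time.sleep(interval)
```

Run it from another machine, for example `watch("https://your-server.example/")`, where the URL is a placeholder. The first `DOWN` line then gives the time of failure. If the probing machine does not trust the server's TLS certificate, every probe will report `DOWN`.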


To Reproduce

										
Problem Report Interactions
From:andersoxie    Date:2019/05/16    Status: Analyzed
FYI: I made some minor additions to my implementation at the same time as I sent you the code for this case, and since that change I have not had the 100% CPU problem (yet).

From:andersoxie    Date:2019/04/29    Status: Analyzed
Attached are the libraries that I may have modified.

Attachment: github_eiffel_liraries.7z     Size:161660

From:jfiat_es    Date:2019/04/29    Status: Analyzed
Hi Anders,

I tried to compile it with 18.11, getting the external code from IRON or GitHub (mostly from Larry Rix's repositories).
However, the compiler complains about VD01, i.e. an issue with concurrency capabilities.
Do you have local changes to those libraries, or something else?
I think I could fix the capabilities and get it to compile, but before doing that, I wanted to be sure I had not missed any step needed to build this project.

From:andersoxie    Date:2019/04/21    Status: Analyzed
Although I do not know how to let you run the server, I am attaching the source code and some other files.

I have removed a lot of files that I either do not think you need or that are too big to attach.

The code is in the middle of development, so please ignore things that look strange; for example, I am trying to learn SVG. I have removed some JS files, for example svg.js, svg.min.js and svg.connectable.js.

You may be able to build it.

Attachment: Wunderlistreplacement.7z     Size:36052

From:andersoxie    Date:2019/04/20    Status: Analyzed
As extra information, I have noticed that if I just leave the server running without contacting it, it continues to perform well. One instance of the server has been running for several weeks without any problems.
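Since an idle server reportedly stays healthy, the failure seems tied to traffic. One way to try to provoke it on a test instance, rather than waiting days in production, is to generate sustained load.

Below is a minimal Python sketch that sends many concurrent GET requests and counts how many went unanswered. It only exercises plain HTTP requests. The WebSocket traffic that appears in the log is not covered, and the hang may depend on it. The URL and the request counts are assumptions.

```python
import http.client
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout):
    """GET url once; True if the server answered with any HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
        return True
    except urllib.error.HTTPError:
        return True  # an error status is still an answer
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False  # refused, malformed reply, or timed out


def soak(url, total=1000, workers=8, timeout=10.0):
    """Send `total` GET requests over `workers` threads; return (answered, failed)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: fetch(url, timeout), range(total)))
    answered = sum(results)
    return answered, total - answered
```

Calling, for example, `soak("https://test-instance.example/", total=10000)` in a loop while watching the server's CPU would show whether request volume alone is enough to trigger the problem. The URL is a placeholder.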

From:andersoxie    Date:2019/04/16    Status: Analyzed
I will look into it over the weekend. One problem is that I am using the neo4j.com graph database as part of my service, which makes it difficult to send you something that works.

From:jfiat_es    Date:2019/04/15    Status: Analyzed
Hi Anders,

We found a potential issue with SCOOP and the evaluation of once features combined with process/region impersonation.
We'll be working to fix this SCOOP issue; it may or may not be the cause of your trouble. I have not yet been able to reproduce your issue, so I cannot guarantee that this would fix it.
For this SCOOP issue, turning off the GC would be a workaround; however, if your application uses a lot of objects, this will not be a good solution in the long term.

You could also try using "thread" concurrency. If you do not use SCOOP directly yourself, the EiffelWeb standalone can be compiled and executed with any concurrency mode.

What would help us reproduce your issue is a reproducible example, or even a virtual machine image that reproduces it.
If possible, providing your source code would help, so that we could try to reproduce the issue locally.
So any solution that would allow us to reproduce the issue would really help, especially if it can take a few days of
....
Output truncated.

From:andersoxie    Date:2019/03/19    Status: Analyzed
I am aware that this is difficult to pinpoint from your side. I will try to investigate more, and I think it would be especially good if you could help me add some tracing (debug prints) to the code so that we can find the problem. I will try to answer your questions below.

A few questions:
1) Do you have a similar issue without SSL support?

I have not tried that. It is difficult to test since I do not know how it is triggered, and I do not want to expose my site without SSL.

2) Do you use SCOOP or thread concurrency mode? Do you have the same issue with either mode?

I am using SCOOP.

3) You said it happened about 10 times. Over what period? 10 times a day, a week, ... since your server was deployed?

It takes at least several days after deployment before it fails. I am not sure what triggers it: the execution time, the number of requests, or some special combination of requests.

4) When this occurs, do you know whether the server had been running for a long time, or does this not matter?

....
Output truncated.

From:jfiat_es    Date:2019/03/18    Status: Analyzed
Hi Anders,

A few questions:
1) Do you have a similar issue without SSL support?
2) Do you use SCOOP or thread concurrency mode? Do you have the same issue with either mode?
3) You said it happened about 10 times. Over what period? 10 times a day, a week, ... since your server was deployed?
4) When this occurs, do you know whether the server had been running for a long time, or does this not matter?
    From the log, we have the pseudo id of the incoming request #.#, and it seems the first error appears quickly.
5) Do you know when the 100% CPU issue happened? To resolve it, did you have to restart the server?
6) By any chance, do you have a system that I can run to reproduce the issue and debug it?
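Question 5 asks when the CPU spike starts. One way to capture that is to sample the server process's CPU time periodically and log when it approaches one full core. The report's environment is Ubuntu, so `/proc/<pid>/stat` is available.

Below is a minimal Python sketch. The PID, the sampling interval and the 90% threshold are assumptions to adapt.

```python
import os
import time

TICKS_PER_SECOND = os.sysconf("SC_CLK_TCK")  # kernel clock ticks per second (Linux)


def cpu_ticks(stat_line):
    """Return utime + stime, in clock ticks, from the text of /proc/<pid>/stat."""
    # Field 2 (the command name) is in parentheses and may contain spaces,
    # so split after the last ')'; index 0 is then field 3 (state).
    fields = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 in proc(5)
    return utime + stime


def read_ticks(pid):
    """Read the cumulative CPU ticks (all threads) of a running process."""
    with open(f"/proc/{pid}/stat", encoding="utf-8", errors="replace") as f:
        return cpu_ticks(f.read())


def cpu_percent(pid, interval=5.0):
    """Average CPU use of pid over interval seconds; 100.0 means one full core."""
    before = read_ticks(pid)
    time.sleep(interval)
    return 100.0 * (read_ticks(pid) - before) / TICKS_PER_SECOND / interval


def log_busy(pid, interval=60.0, threshold=90.0):
    """Print a UTC timestamp each time pid averages at least threshold% CPU."""
    while True:
        percent = cpu_percent(pid, interval)
        if percent >= threshold:
            stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
            print(f"{stamp} pid={pid} cpu={percent:.0f}%", flush=True)
```

Running `log_busy(12345)`, where `12345` stands in for the server's PID, gives a timestamp that can be compared with the request ids in the server log around the onset.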

From:andersoxie    Date:2019/03/13
Attachments for problem report #19533

Attachment: bsharptodo.log     Size:122131