PR# 14588 System that runs out of memory goes into infinite loop instead of crashing
Problem Report Summary
Submitter: prestoat2000
Category: Runtime
Priority: Medium
Date: 2008/07/16
Class: Bug
Severity: Serious
Number: 14588
Release: 6.2.75753
Confidential: No
Status: Analyzed
Responsible: ted_eiffel
Environment: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.8.1.13) Gecko/20080328 Firefox/2.0.0.13
Solaris 9 on SPARC
Synopsis: System that runs out of memory goes into infinite loop instead of crashing
Description
A system calls a recursive routine, which creates some STRING objects and then recurses, passing the newly created STRINGs. There is no limit to the depth of recursion, so system execution will run out of memory (this is on 32-bit Solaris SPARC). After the system is finalized, execution should crash with an "out of memory" error, or at least it should crash even if it can't report the error properly due to lack of memory. Instead, it goes into what appears to be an infinite loop, except that it seems to be making progress very slowly, which cannot be correct because no memory is ever freed. In the attached example, I have turned GC off, but the same problem occurs if GC is on (taking a *lot* longer). The machine on which I ran this has 96 GB of main memory, so it is only virtual memory that is being exhausted, hence no thrashing. This problem may be partly or completely due to known issues previously reported, such as bug 13714 (Many calls to `epush' in runtime do not check for failure).
To Reproduce
Problem Report Interactions
We just ran into this problem in production. We were running a 32-bit application which needed more than 2 GB of virtual memory. The application memory use got near 2 GB and then it kept running for more than 60 hours without reporting "out of memory", though we know that it needed more memory. We finally killed it, compiled it as a 64-bit application, and re-ran it. This time the app used about 5 GB memory at its maximum and finished in about 1.5 hours. Please consider trying to improve the "out of memory" behavior for 6.5.
It is slower when the GC is on because, before trying to coalesce memory, we perform a full collection, and in this case that collection is effectively a no-op, which makes things much slower. I agree that where a change introduces new observable behavior, it should be selectable via MEMORY. But if we can make it faster without changing the behavior seen by the end user, then we can simply implement it.
Now that we understand the behavior, we should think about potential solutions for awhile before making any changes. Also, the behavior is similar but it gets *much* slower if GC is on. I just turned off GC in my test case to make it easier to investigate the problem because system execution finishes more quickly with GC off. I suggest that any solutions that change the current behavior be enabled via environment variables and/or features in class MEMORY (i.e., leave the default behavior as it is, to avoid causing problems for applications that already work fine).
Ok, I figured out the problem. When there is no more memory available, we go down another path of runtime execution which, at each failed allocation, tries to coalesce the memory, and of course each time it fails to find any space. When this happens, it can still allocate some memory in the C memory area, but for each single allocation it walks through all the Eiffel memory to see if there is an available piece of memory we could get. This is why it then gets very slow. We could improve the behavior either by preventing the search in the Eiffel list when GC is disabled and the last search was unsuccessful, or by not looking in the C memory at all in this situation, but the latter might be bad (since we have quite a few available bytes in there).
Yes, I believe we are seeing the same behavior. I'm not sure this is a bug in one sense, since the program ultimately produces the correct output (an "out of memory" exception). It seems that the runtime tries really hard to keep going when there isn't much memory, which may be the right thing to do. But from a user point of view, it is rather disconcerting that program execution slows down so much when almost out of available memory. Maybe there is no better solution than the current implementation. Or perhaps there could be some way to tune how hard the memory allocation routines try to coalesce chunks when low on memory.
While testing the updated test, I found that it spent most of its time in `chunk_coalesc' before actually reaching the out-of-memory problem. I observed this because after the 7941st iteration it ran much slower (about 1 s per iteration, whereas it did the first 7941 in less than 6 s) until reaching iteration 8291, where it got the out-of-memory exception, printed the stack trace, and exited immediately. After I removed the call to `chunk_coalesc' done via `full_coalesc_unsafe' in `malloc_free_list', the program executes quickly and fails quickly. So I'm not sure we are observing the same behavior here.
If you use the revised version (test2_rev1.e), which is nearly identical to the original except that it sets a maximum amount of memory and turns GC off, you should still see that it takes a long time to report "out of memory". I recommend the revised version since it won't cause any thrashing but still demonstrates the problem.
I've tried your original program and it crashed very quickly on my machine because the stack size is set to 8 MB, so it got a stack overflow. If I remove the stack size limit, then after allocating all the memory I could on my SPARC machine (less than 3 GB) I got this truss output:
brk(0x6EB39360) = 0
brk(0x6EF39360) = 0
brk(0x6EF39360) = 0
brk(0x6F339360) Err#11 EAGAIN
brk(0x6EF39360) = 0
brk(0x6F339360) Err#11 EAGAIN
but then it got stuck for a very long time with low CPU activity (around 0.77%). I'm now trying again with a C debugger attached to see where it is, but I think it is most likely traversing the whole memory during a GC cycle.
Sorry for the confusion. I mean that after adding the `set_max_mem' call the test does not take a long time to complete. I don't think the problem shows up with a revised version of eweasel test exec107. The revised version of test2.e (attached as test2_rev1.e) reproduces the problem for me on Solaris 10 on SPARC. Finalize with it and test2.ecf file from the original report, discarding assertions. Run it with arguments "50000000 100000". It runs for about 8 minutes (from memory) before reporting an "out of memory" exception. Removing the "collection_off" line causes it to take several hours.
What do you mean by "So adding `set_max_mem' causes test to behave identically for me."? Does it mean that it takes a long time before reporting out of memory, or just that it works as well as if `set_max_mem' was not present? I'm trying to build a test that reproduces the long delay but so far I cannot reproduce it.
I'm not sure what you mean by "same behavior". If I modify TEST in exec107 to do set_max_mem (50_000_000) it still passes - actually, it fails, but only because of the messages "Cannot allocate memory: too much in comparison with maximum allowed!", which have now been removed. So adding `set_max_mem' causes the test to behave identically for me. I should also emphasize that execution of the test case in the original report does *not* (apparently) cause an infinite loop. But it takes *much* longer than expected before an "out of memory" exception is raised.
If you apply the `set_max_mem' change to eweasel test#exec107, do you observe the same behavior? I'm thinking of adding this to `exec107' as well, since otherwise the test has to be made manual, or we need to change it to limit the amount of memory the process can ask from the OS.
I just realized that there is an easy way to investigate this strange behavior that does not require a system with a large amount of memory. Just add the following as the first line of the root creation procedure:
set_max_mem (50_000_000)
Then it will never allocate more than about 50 MB but the same unfriendly behavior (continuing to run but *very* slowly and eventually raising an out of memory exception) will occur. I made this change and confirmed that the system will run for a long time before it finally reports "out of memory".
The machine where I ran the original test has 96 GB main memory. A 32-bit program runs out of memory when it hits 2 GB so I don't think any swapping was involved. I also saw that when memory use got up to about 2 GB, the system was still using 100% of one CPU. So it is actively trying to do something but is not thrashing. I compiled the latest run-time via ./quick_configure and then linked the system (compiled with 6.3.74123) with the newest run-time. It did not improve things. If I get dtrace permission on an appropriate machine, I can get the stack trace for every malloc failure, which might be useful information.
Now that I reread your post, I realize that you may have hit the buffer overflow bug I just fixed in the building of the exception object. It is possible that after that the runtime code went into an infinite loop because memory was corrupted. We should try with the very latest runtime to see if this is still happening.
It is not because malloc is failing that we will fail. When this happens, we run a GC collection which traverses all the available memory to see if we can reclaim some Eiffel objects to make some room. If that fails, then we explode the scavenge zone. When doing this, we can accommodate some more allocations until we need a new block of memory, which is most likely to fail, and then we might throw the out-of-memory exception. But because part of the memory has been swapped to disk, performance is usually very bad, and in my experience it takes a very, very long time to get the exception trace. This is why Ted reported that he had to wait a long time. On this issue, I was thinking that we could have a mode in the runtime to limit the allocated memory to what the OS can actually handle physically. That way you will never thrash your system.
I ran the system in the original report under truss and confirmed that malloc is indeed failing:
/1@1: -> libc:malloc(0x400028, 0x1, 0x13af4, 0x13800)
/1: brk(0x7F50EFE0) = 0
/1: brk(0x7F90EFE0) Err#12 ENOMEM
/1@1: <- libc:malloc() = 0
In the output above, we see a call to malloc requesting 0x400028 bytes. Malloc calls `brk', which fails with an ENOMEM error. Malloc then returns 0 (null). So malloc is failing, yet the program continues to run long after this failure before it finally reports an "out of memory" error.
I agree - if malloc does not fail there is not much you can do. However, I believe that malloc may be failing in this case. I think you should fix all of the following bugs and then re-check the behavior of this test case (especially 13714):
13714 Many calls to `epush' in runtime do not check for failure
13690 Multiple unchecked memory allocations in run-time
13752 Compute_eif_type_id returns 0 if memory allocation failure
14586 Eget may return too small area if `extend' has to get urgent chunk
14585 Exget can return address past end of chunk if stack_extend has to get urgent chunk
14581 Last urgent chunk cannot be used
Dear David, this seems to me to be an OS issue. If `malloc' does not fail, we cannot do much. On Windows, I got a stack overflow before the no-memory exception. Only when I extend the stack size to a very large value does the program apparently run into an infinite loop. The OS eventually becomes very slow, but there is no crash. I think in that situation the OS may consider it still possible to allocate memory, just very slowly.
Actually, the program finally crashed with "No more memory", but it took 2.5 hours to do so. The program used essentially all the memory in less than 2 minutes, then took another 2.5 hours until it reported "Out of memory". Not very friendly behavior.