PR# 14588 System that runs out of memory goes into infinite loop instead of crashing
Problem Report Summary
Submitter: prestoat2000
Category: Runtime
Priority: Medium
Date: 2008/07/16
Class: Bug
Severity: Serious
Number: 14588
Release: 6.2.75753
Confidential: No
Status: Analyzed
Responsible: ted_eiffel
Environment: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.8.1.13) Gecko/20080328 Firefox/2.0.0.13
Solaris 9 on SPARC
Synopsis: System that runs out of memory goes into infinite loop instead of crashing
Description
A system calls a recursive routine, which creates some STRING objects and then recurses, passing the newly created STRINGs. There is no limit to the depth of recursion, so system execution will run out of memory (this is on 32-bit Solaris SPARC). After the system is finalized, execution should crash with an "out of memory" error, or at least it should crash even if it can't report the error properly due to lack of memory. Instead, it goes into what appears to be an infinite loop, except that it seems to be making progress very slowly, which cannot be correct because no memory is ever freed. In the attached example, I have turned GC off, but the same problem occurs if GC is on (taking a *lot* longer). The machine on which I ran this has 96 GB of main memory, so it is only virtual memory that is being exhausted, hence no thrashing. This problem may be partly or completely due to known issues previously reported, such as bug 13714 (Many calls to `epush' in runtime do not check for failure).
To Reproduce
Problem Report Interactions
We just ran into this problem in production. We were running a 32-bit application which needed more than 2 GB of virtual memory. The application memory use got near 2 GB and then it kept running for more than 60 hours without reporting "out of memory", though we know that it needed more memory. We finally killed it, compiled it as a 64-bit application, and re-ran it. This time the app used about 5 GB memory at its maximum and finished in about 1.5 hours. Please consider trying to improve the "out of memory" behavior for 6.5.
It is slower when the GC is on because, before trying to coalesce memory, we perform a full collection, and in this case that collection is effectively a no-op, which makes things much slower. I agree that where a change introduces new observable behavior, it should be selectable via MEMORY. But if we can make it faster without changing the behavior seen by the end user, then we can simply implement it.
Now that we understand the behavior, we should think about potential solutions for awhile before making any changes. Also, the behavior is similar but it gets *much* slower if GC is on. I just turned off GC in my test case to make it easier to investigate the problem because system execution finishes more quickly with GC off. I suggest that any solutions that change the current behavior be enabled via environment variables and/or features in class MEMORY (i.e., leave the default behavior as it is, to avoid causing problems for applications that already work fine).
Ok, I figured out the problem. When there is no more memory available, we go down another path of runtime execution which, at each failed allocation, tries to coalesce the memory, and of course each time it fails to find any space. When this happens, it can still allocate some memory in the C memory area, but for each single allocation it walks through all the Eiffel memory to see if there is an available piece of memory we could get. This is why it then gets very slow. We could improve the behavior either by preventing the search in the Eiffel list when GC is disabled and the last search was unsuccessful, or by not looking in the C memory at all in this situation, but the latter might be bad (since we have quite a few available bytes in there).
Yes, I believe we are seeing the same behavior. I'm not sure this is a bug in one sense, since the program ultimately produces the correct output (an "out of memory" exception). It seems that the runtime tries really hard to keep going when there isn't much memory, which may be the right thing to do. But from a user point of view, it is rather disconcerting that program execution slows down so much when almost out of available memory. Maybe there is no better solution than the current implementation. Or perhaps there could be some way to tune how hard the memory allocation routines try to coalesce chunks when low on memory.
While testing the updated test, I found that it spent most of its time in `chunk_coalesc' before actually reaching the out-of-memory problem. I observed this because after the 7941st iteration it ran much slower (about 1 s per iteration, whereas it did the first 7941 in less than 6 s) until reaching iteration 8291, where it got the out-of-memory exception, printed the stack trace, and exited immediately. After I removed the call to `chunk_coalesc' done via `full_coalesc_unsafe' in `malloc_free_list', the program executes quickly and fails quickly. So I'm not sure we are observing the same behavior here.
If you use the revised version (test2_rev1.e), which is nearly identical to the original except that it sets a maximum amount of memory and turns GC off, you should still see that it takes a long time to report "out of memory". I recommend the revised version since it won't cause any thrashing but still demonstrates the problem.
I've tried your original program and it crashed very quickly on my machine because the stack size is set to 8 MB, so it got a stack overflow. If I remove the stack size limit, then after allocating all the memory I could on my SPARC machine (less than 3 GB) I got this truss output:
brk(0x6EB39360) = 0
brk(0x6EF39360) = 0
brk(0x6EF39360) = 0
brk(0x6F339360) Err#11 EAGAIN
brk(0x6EF39360) = 0
brk(0x6F339360) Err#11 EAGAIN
but then it got stuck for a very long time with low CPU activity (around 0.77%). I'm now trying again with a C debugger attached to see where it is, but I think it is most likely traversing the whole memory during a GC cycle.
Sorry for the confusion. I mean that after adding the `set_max_mem' call the test does not take a long time to complete. I don't think the problem shows up with a revised version of eweasel test exec107. The revised version of test2.e (attached as test2_rev1.e) reproduces the problem for me on Solaris 10 on SPARC. Finalize with it and test2.ecf file from the original report, discarding assertions. Run it with arguments "50000000 100000". It runs for about 8 minutes (from memory) before reporting an "out of memory" exception. Removing the "collection_off" line causes it to take several hours.
What do you mean by "So adding `set_max_mem' causes test to behave identically for me."? Does it mean that it takes a long time before reporting out of memory, or just that it works as well as if `set_max_mem' was not present? I'm trying to build a test that reproduces the long delay but so far I cannot reproduce it.
I'm not sure what you mean by "same behavior". If I modify TEST in exec107 to do set_max_mem (50_000_000) it still passes - actually, it fails, but only because of the messages "Cannot allocate memory: too much in comparison with maximum allowed!", which have now been removed. So adding `set_max_mem' causes the test to behave identically for me. I should also emphasize that execution of the test case in the original report does *not* (apparently) cause an infinite loop. But it takes *much* longer than expected before an "out of memory" exception is raised.
If you apply the `set_max_mem' change to eweasel test#exec107, do you observe the same behavior? I'm thinking of adding this to `exec107' as well, since otherwise the test has to be made manual, or we need to change it to limit the amount of memory the process can ask from the OS.
I just realized that there is an easy way to investigate this strange behavior that does not require a system with a large amount of memory. Just add the following as the first line of the root creation procedure:
set_max_mem (50_000_000)
Then it will never allocate more than about 50 MB but the same unfriendly behavior (continuing to run but *very* slowly and eventually raising an out of memory exception) will occur. I made this change and confirmed that the system will run for a long time before it finally reports "out of memory".
The machine where I ran the original test has 96 GB main memory. A 32-bit program runs out of memory when it hits 2 GB so I don't think any swapping was involved. I also saw that when memory use got up to about 2 GB, the system was still using 100% of one CPU. So it is actively trying to do something but is not thrashing. I compiled the latest run-time via ./quick_configure and then linked the system (compiled with 6.3.74123) with the newest run-time. It did not improve things. If I get dtrace permission on an appropriate machine, I can get the stack trace for every malloc failure, which might be useful information.
Now that I reread your post, I realize that you may have hit the buffer overflow bug I just fixed in the building of the exception object. It is possible that after that the runtime code went into an infinite loop because memory was corrupted. We should try with the very latest runtime to see if this is still happening.
It is not because malloc is failing that we will fail. When this happens, we run a GC collection which traverses all the available memory to see if we can reclaim some Eiffel objects to make some room. If that fails, then we explode the scavenge zone. When doing this, we can accommodate some more allocations until we need a new block of memory, which is most likely to fail, and then we might throw the out-of-memory exception. But because part of the memory has been swapped to disk, performance is usually very bad, and in my experience it takes a very, very long time to get the exception trace. This is why Ted reported that he had to wait a long time. On this issue, I was thinking that we could have a mode in the runtime to limit the allocated memory to what the OS can actually handle physically. That way you will never thrash your system.
I ran the system in the original report under truss and confirmed that malloc is indeed failing:
/1@1: -> libc:malloc(0x400028, 0x1, 0x13af4, 0x13800)
/1: brk(0x7F50EFE0) = 0
/1: brk(0x7F90EFE0) Err#12 ENOMEM
/1@1: <- libc:malloc() = 0
In the output above, we see a call to malloc requesting 0x400028 bytes. Malloc calls `brk', which fails with an ENOMEM error. Malloc then returns 0 (null). So malloc is failing, yet the program continues to run long after this failure before it finally reports an "out of memory" error.
I agree - if malloc does not fail there is not much you can do. However, I believe that malloc may be failing in this case. I think you should fix all of the following bugs and then re-check the behavior of this test case (especially 13714):
13714 Many calls to `epush' in runtime do not check for failure
13690 Multiple unchecked memory allocations in run-time
13752 Compute_eif_type_id returns 0 if memory allocation failure
14586 Eget may return too small area if `extend' has to get urgent chunk
14585 Exget can return address past end of chunk if stack_extend has to get urgent chunk
14581 Last urgent chunk cannot be used
Dear David, this seems to me to be an OS issue. If `malloc' does not fail, we cannot do much. On Windows, I got a stack overflow before the no-memory exception. Only when I extend the stack size to a very large value does the program apparently run into an infinite loop. The OS eventually becomes very slow, but there is no crash. I think in that situation the OS may consider it still possible to allocate memory, just very slowly.
Actually, the program finally crashed with "No more memory", but it took 2.5 hours to do so. The program used essentially all the memory in less than 2 minutes, then took another 2.5 hours until it reported "Out of memory". Not very friendly behavior.