PR# 16593 Re-finalizing after VSRT(1) error causes panic or seg fault
Problem Report Summary
Submitter: prestoat2000
Category: Compiler
Priority: Medium
Date: 2010/01/27
Class: Bug
Severity: Serious
Number: 16593
Release: 6.5.81777
Confidential: No
Status: Closed
Responsible:
Environment: Mozilla/5.0 (X11; U; SunOS i86pc; en-US; rv:1.9.1.6) Gecko/20100104 Firefox/3.5.6
OpenSolaris snv_131 on x86
Synopsis: Re-finalizing after VSRT(1) error causes panic or seg fault
Description
I have a reproducible test case that crashes the compiler (either estudio or using "ec -loop") with either a seg fault (*not* a call on Void target) and exception trace or with a run-time panic and no exception trace. This is with 6.5 on OpenSolaris x86. The test case is a fairly large system using precompiled Vision2. The bug only seems to be reproducible in one estudio or "ec -loop" session.

I finalize the system, discarding assertions. Then I change the root class and creation procedure from {DELTA}.make to {ARRAY}.make and re-finalize. When the compiler reports a VSRT(1) error, I change the root creation procedure back to {DELTA}.make and resume compilation. The compiler always crashes in degree -3.

When I have inlining enabled in the config file, the compiler panics and there is no exception trace. However, a core file is created. The stack in that core file starts with:

    t@1 (l@1) signal SEGV (no mapping at the fault address) in epop at 0xaaf4cf0
    0x0aaf4cf0: epop+0x0098: movl 0x0000000c(%edi),%eax
    (dbx) bt
    bt: not found
    (dbx) where
    current thread: t@1
    =>[1] epop(0xd9bc0bc, 0x46479, 0x18, 0xab026c6), at 0xaaf4cf0
      [2] map_reset(0x1, 0x0, 0x2, 0xab04b15), at 0xab0277b
      [3] edclone(0xddaf330, 0xddaf330, 0x8043608, 0x99317be), at 0xab04d38

The second argument to `epop' is the number of items to be popped off the stack and it is ridiculously large (0x46479). It looks like something is getting corrupted.

If I finalize without inlining (and without the -verbose flag to "ec -loop"), I get a segmentation fault in degree 3. I'm not aware of any bugs that crash the compiler in such an extreme manner. Before I try to reproduce this on a smaller example, could you tell me whether this sounds like any known bug?
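To put the `epop' argument in perspective, the hexadecimal value from frame [1] of the dbx trace can be converted to decimal. This is only a small arithmetic illustration; the variable name is made up for the example and is not taken from the run-time sources.

```python
# Second argument passed to epop in frame [1] of the dbx trace,
# i.e. the number of items the run-time was asked to pop.
# (pop_count is a hypothetical name used only in this illustration.)
pop_count = 0x46479

print(pop_count)  # prints 287865
```

A request to pop several hundred thousand items in a single call is consistent with the submitter's suspicion that the value had been corrupted before `epop' was reached.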
To Reproduce
Problem Report Interactions
Thanks for fixing this so quickly. I believe this is a long-standing bug. I confirmed that the crash occurs in both 6.4 and 6.3 on Solaris x86. It may have been there forever. I also believe that the bug only occurs when re-finalizing within the same session. We run a new ec session each time we re-finalize before releasing to production, which is probably why we never hit this bug. Since we used 6.3 for a year without problems, I think we can safely wait for a fix in 6.6.
It is now fixed at rev#82211. As far as I can tell, this crash will not occur when quitting between two finalizations, and thus I'm not too worried about the impact it might have on old releases of EiffelStudio. I will nonetheless patch 6.5 in case we need to redo a 6.5 release for other reasons.
I found the corruption location. Here is the actual stack trace:

    Degree -3: Generating Optimized Code
    ISE Eiffel: Session aborted
    Exception tag: valid_index

    ecb: system execution failed.
    Following is the set of recorded exceptions:
    -------------------------------------------------------------------------------
    Class / Object      Routine                Nature of exception           Effect
    -------------------------------------------------------------------------------
    PACKED_BOOLEANS     put @1                 valid_index:
    <0000000008FB5188>                         Precondition violated.        Fail
    -------------------------------------------------------------------------------
    EIFFEL_HISTORY      mark_used @2
    <0000000008FB5148>                         Routine failure.              Fail
    -------------------------------------------------------------------------------
    ATTRIBUTE_BL        generate_access_on_type @29
    <0000000002D9BC18>                         Routine failure.              Fail
    .... (output truncated)
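The trace above shows a `valid_index` precondition rejecting a call to `put` on a PACKED_BOOLEANS object. As a rough illustration of why such a check matters, here is a minimal Python sketch of a bit-packed boolean array with the same kind of guard. The class below is a hypothetical model written for this note, not the EiffelStudio class, and its 0-based indexing is an assumption.

```python
class PackedBooleans:
    """Hypothetical bit-packed boolean array (NOT the EiffelStudio class)."""

    def __init__(self, count):
        self.count = count
        # One bit per entry, rounded up to whole bytes.
        self.area = bytearray((count + 7) // 8)

    def valid_index(self, i):
        # Assumption: indices are 0-based in this sketch.
        return 0 <= i < self.count

    def put(self, value, i):
        # Models the precondition: reject the write instead of
        # touching storage outside the array.
        if not self.valid_index(i):
            raise IndexError("valid_index: precondition violated")
        byte, bit = divmod(i, 8)
        if value:
            self.area[byte] |= 1 << bit
        else:
            self.area[byte] &= ~(1 << bit) & 0xFF

    def item(self, i):
        if not self.valid_index(i):
            raise IndexError("valid_index: precondition violated")
        byte, bit = divmod(i, 8)
        return bool(self.area[byte] & (1 << bit))


used = PackedBooleans(16)
used.put(True, 3)          # in range: accepted

try:
    used.put(True, 16)     # one past the end: rejected by the check
    rejected = False
except IndexError:
    rejected = True
```

With the check in place, the out-of-range write is reported at the point of the call. In compiled code built without such a check, the same write would land outside the array, which would fit the varied and unrelated-looking crashes reported elsewhere in this thread.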
It was indeed not the cause of the crash. Still investigating.
I'm certainly concerned by that, but as far as I can tell now, it might not be the cause of the bug: although we do access a SPECIAL area out of bounds, we only read from it, and only to read an integer value. The wrong read only causes a computation to return an integer value one more than required, meaning there are holes in our register allocation, but I really doubt it causes this kind of crash. I will know more in 15 minutes, as I'm retesting with the fix included. We do run eweasel once in a while with assertions enabled, but not always.
Thanks for looking into this so quickly. After you have confirmed that the fix works, could you please tell me whether I have to worry about this bug for normal day-to-day work with 6.5? The compiler appears to work just fine in most cases, but I'm concerned that it could produce incorrect output so that finalized programs don't work right. Is this possible or likely? I don't know whether this would have caught the problem, but you might consider running the eweasel tests before a final release with a version of the compiler that has preconditions enabled on all classes. Or maybe you already do this.
I think I found the problem. It was introduced in rev#81484 of the compiler. I'm testing with the fix included before closing.
Just to say that I was able to reproduce the crash on Windows 64-bit too. I'll debug it further today.
After further experimentation, I have confirmed that none of the following seem to affect whether a crash occurs (I tried with each option enabled or disabled):

    Inlining
    Exception trace
    Check for Void target
    Dead code removal
    No precompilation used

Since the crash occurs when using "ec -loop", I assume this is not a multi-threading bug, since I don't think multithreading is used in this case. I also found that the crash occurs if the first and third compiles are finalizes and the middle one is a melt. And my instructions for reproducing the crash show that it occurs when there are no compilation errors in any of the 3 compiles.

Here is another crash I got while running under dbx.

    t@1 (l@1) signal SEGV (no mapping at the fault address) in (unknown) at 0xaaef1cb
    0x0aaef1cb: _get_exit_frame_monitor+0x2663da7: movl %edx,(%esi)
    (dbx) where
    current thread: t@1
    [1] 0xaaef1cb(0xacea7e0, 0x80466f8, 0xaaef0e4, 0x11e6e2b1), at 0xaaef1cb
    [2] 0xaaef0e4(0x11e6e2b1, 0x52, .... (output truncated)
Here is one more exception trace I got. It seems that by varying things a little, I can get all kinds of different exception traces. It definitely looks like something is getting corrupted.

    -------------------------------------------------------------------------------
    ARRAY               item @1                Segmentation violation:
    <000000000E87212C>                         Operating system signal.      Fail
    -------------------------------------------------------------------------------

In this case, I didn't change the .ecf file. I simply finalized the system (using ec -loop and without assertions or inlining), then changed the root class so that it didn't reference any other classes except ARRAY and STRING. I re-finalized in the same ec session. Then I changed the root class back to its original contents and re-finalized again. This resulted in the attached trace (stack17.txt).

To reproduce the crash: Unpack attached tar file, creating a directory "bug". Change bug/bug.ecf l .... (output truncated)
The crash is also reproducible on Solaris SPARC 32-bit.
Although it is probably not useful, here is one of the exception traces I got (when compiling with "ec -loop" without -verbose and without inlining) after the re-finalize. Top of the stack looks like:

    -------------------------------------------------------------------------------
    FEATURE_SERVER      item @6
    <000000000E873404>  (From COMPILER_SERVER) Feature call on void target.  Fail
    -------------------------------------------------------------------------------

In this case, there is a call on Void target. Another time I got a seg fault with a completely different trace. Full trace attached.