PR# 16593 Re-finalizing after VSRT(1) error causes panic or seg fault

Problem Report Summary

Submitter: prestoat2000

Category: Compiler

Priority: Medium

Date: 2010/01/27

Class: Bug

Severity: Serious

Number: 16593

Release: 6.5.81777

Confidential: No

Status: Closed

Responsible:

Environment: Mozilla/5.0 (X11; U; SunOS i86pc; en-US; rv:1.9.1.6) Gecko/20100104 Firefox/3.5.6 OpenSolaris snv_131 on x86

Synopsis: Re-finalizing after VSRT(1) error causes panic or seg fault

Description

I have a reproducible test case that crashes the compiler
(either estudio or using "ec -loop") with either a seg fault
(*not* a call on Void target) and exception trace or with a
run-time panic and no exception trace.  This is with 6.5 on
OpenSolaris x86.  The test case is a fairly
large system using precompiled Vision2.  The bug only seems to
be reproducible within a single estudio or "ec -loop" session.

I finalize the system, discarding assertions.  Then I change the
root class and creation procedure from {DELTA}.make to {ARRAY}.make
and re-finalize.  When the compiler reports a VSRT(1) error,
I change the root creation procedure back to {DELTA}.make and resume
compilation.  The compiler always crashes in degree -3.

When I have inlining enabled in the config file, the compiler panics
and there is no exception trace.  However, a core file is created.
The stack in that core file starts with:

t@1 (l@1) signal SEGV (no mapping at the fault address) in epop at 0xaaf4cf0
0x0aaf4cf0: epop+0x0098:	movl     0x0000000c(%edi),%eax
(dbx) bt
bt: not found
(dbx) where
current thread: t@1
=>[1] epop(0xd9bc0bc, 0x46479, 0x18, 0xab026c6), at 0xaaf4cf0 
  [2] map_reset(0x1, 0x0, 0x2, 0xab04b15), at 0xab0277b 
  [3] edclone(0xddaf330, 0xddaf330, 0x8043608, 0x99317be), at 0xab04d38 

The second argument to `epop' is the number of items to be popped
off the stack and it is ridiculously large (0x46479).  It looks like
something is getting corrupted.
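
As a rough illustration of why that argument looks corrupt, the sketch below converts the value from the trace to decimal and applies the kind of bounds check a stack pop could make. The `checked_pop` helper and the example stack depth are hypothetical and chosen only for illustration; this is not the runtime's `epop'.

```python
# Value of the second argument to epop, copied from the dbx trace above.
NB_ITEMS_FROM_TRACE = 0x46479


def checked_pop(stack, nb_items):
    """Pop nb_items entries from a list-based stack, refusing impossible counts.

    Hypothetical helper for illustration only; the real runtime operates
    on raw memory rather than on a Python list.
    """
    if nb_items < 0 or nb_items > len(stack):
        raise ValueError(
            f"cannot pop {nb_items} items from a stack of depth {len(stack)}"
        )
    if nb_items:
        # Guarded because stack[-0:] would select the whole list.
        del stack[-nb_items:]
    return stack


if __name__ == "__main__":
    print(NB_ITEMS_FROM_TRACE)  # decimal form of 0x46479
    small_stack = list(range(16))  # assumed depth, purely illustrative
    try:
        checked_pop(small_stack, NB_ITEMS_FROM_TRACE)
    except ValueError as exc:
        print("rejected:", exc)
```

A count in the hundreds of thousands is far beyond any plausible number of entries to pop in one call, which is consistent with the suspicion that the value was overwritten before `epop' ran.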

If I finalize without inlining (and without the -verbose flag to 
"ec -loop"), I get a segmentation fault in degree 3. 

I'm not aware of any bugs that crash the compiler in such an extreme manner.
Before I try to reproduce this on a smaller example, could you tell me
whether this sounds like any known bug?

To Reproduce

Problem Report Interactions

From:prestoat2000 Date:2010/01/30 Download

Thanks for fixing this so quickly.  I believe this is a long-standing
bug.  I confirmed that the crash occurs in both 6.4 and 6.3 on
Solaris x86.  It may have been there forever.  I also believe that
the bug only occurs when re-finalizing within the same session.
We run a new ec session each time we re-finalize before releasing
to production, which is probably why we never hit this bug.

Since we used 6.3 for a year without problems, I think we can safely wait 
for a fix in 6.6.

From:manus_eiffel Date:2010/01/30 Status: Closed Download

It is now fixed at rev#82211. As far as I can tell this crash will not occur when quitting between two finalizations, so I'm not too worried about the impact it might have on old releases of EiffelStudio. I will nonetheless patch 6.5 in case we need to redo a 6.5 release for other reasons.

From:manus_eiffel Date:2010/01/29 Download

I found the corruption location. Here is the actual stack trace:

Degree -3: Generating Optimized Code
ISE Eiffel: Session aborted
Exception tag: valid_index

ecb: system execution failed.
Following is the set of recorded exceptions:

-------------------------------------------------------------------------------
Class / Object      Routine                Nature of exception           Effect
-------------------------------------------------------------------------------
PACKED_BOOLEANS     put @1                 valid_index:
<0000000008FB5188>                         Precondition violated.        Fail
-------------------------------------------------------------------------------
EIFFEL_HISTORY      mark_used @2
<0000000008FB5148>                         Routine failure.              Fail
-------------------------------------------------------------------------------
ATTRIBUTE_BL        generate_access_on_type @29
<0000000002D9BC18>                         Routine failure.              Fail
....
Output truncated, Click download to get the full message
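
The `valid_index' violation in PACKED_BOOLEANS.put is consistent with a write past the end of a bit array, which a compiler built with preconditions enabled reports as an exception rather than letting the write proceed. A minimal sketch of that idea, assuming a zero-based, fixed-size bit array (the class below is hypothetical and is not the compiler's PACKED_BOOLEANS):

```python
class PackedBooleans:
    """Fixed-size bit array with an explicit index precondition.

    Illustrative only: the layout and zero-based indexing are assumptions.
    """

    def __init__(self, count):
        self.count = count
        self._bits = bytearray((count + 7) // 8)

    def valid_index(self, i):
        return 0 <= i < self.count

    def put(self, value, i):
        # Precondition check. Without it, an unchecked implementation
        # could write outside the storage reserved for the array.
        if not self.valid_index(i):
            raise IndexError(f"valid_index violated for index {i}")
        byte, bit = divmod(i, 8)
        if value:
            self._bits[byte] |= 1 << bit
        else:
            self._bits[byte] &= ~(1 << bit) & 0xFF

    def item(self, i):
        if not self.valid_index(i):
            raise IndexError(f"valid_index violated for index {i}")
        byte, bit = divmod(i, 8)
        return bool(self._bits[byte] & (1 << bit))


if __name__ == "__main__":
    flags = PackedBooleans(10)
    flags.put(True, 9)        # last valid index
    try:
        flags.put(True, 10)   # one past the end
    except IndexError as exc:
        print("caught:", exc)
```

In the sketch the out-of-range `put` is rejected up front; with the check removed, the same call would touch storage that does not belong to the array.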

From:manus_eiffel Date:2010/01/29 Download

It was indeed not the cause of the crash. Still investigating.

From:manus_eiffel Date:2010/01/29 Download

I'm certainly concerned by that, but as far as I can tell now it might not be the cause of the bug. Although we do access a SPECIAL area out of bounds, we only read from it, and only to read an integer value. The wrong read just makes a computation return an integer value one larger than required, meaning there are holes in our register allocation, but I really doubt it causes this kind of crash.

I will know more in 15 minutes, as I'm retesting with the fix included.

We do run eweasel once in a while with assertions enabled but not always.

From:prestoat2000 Date:2010/01/29 Download

Thanks for looking into this so quickly.  After you have confirmed that the fix
works, could you please tell me whether I have to worry about this bug for
normal day-to-day work with 6.5?  The compiler appears to work just fine in most
cases but I'm concerned that it could produce incorrect output so that finalized
programs don't work right.  Is this possible or likely?

I don't know whether this would have caught the problem, but you might
consider running the eweasel tests before a final release with a version of the compiler
that has preconditions enabled on all classes.  Or maybe you already do this.

From:manus_eiffel Date:2010/01/29 Download

I think I found the problem. It was introduced in rev#81484 of the compiler. I'm testing with the fix included before closing.

From:manus_eiffel Date:2010/01/29 Status: Analyzed Download

Just to say that I was able to reproduce the crash on Windows 64-bit too. I'll debug it further today.

From:prestoat2000 Date:2010/01/28 Download

After further experimentation, I have confirmed that none of the following
seem to affect whether a crash occurs (I tried with each option enabled
or disabled):

   Inlining
   Exception trace
   Check for Void target
   Dead code removal
   No precompilation used

Since the crash occurs when using "ec -loop", I assume this is not a
multi-threading bug, since I don't think multithreading is used in this case.
I also found that the crash occurs if the first and third compiles are
finalizes and the middle one is a melt.  And my instructions for reproducing
the crash show that it occurs when there are no compilation errors in
any of the 3 compiles.

Here is another crash I got while running under dbx.

t@1 (l@1) signal SEGV (no mapping at the fault address) in (unknown) at 0xaaef1cb
0x0aaef1cb: _get_exit_frame_monitor+0x2663da7:	movl     %edx,(%esi)
(dbx) where
current thread: t@1
  [1] 0xaaef1cb(0xacea7e0, 0x80466f8, 0xaaef0e4, 0x11e6e2b1), at 0xaaef1cb 
  [2] 0xaaef0e4(0x11e6e2b1, 0x52,
....
Output truncated, Click download to get the full message

Attachment: stack18.txt Size:3838

From:prestoat2000 Date:2010/01/27 Download

Here is one more exception trace I got.  It seems that by varying things
a little, I can get all kinds of different exception traces.  It
definitely looks like something is getting corrupted.

-------------------------------------------------------------------------------
ARRAY               item @1                Segmentation violation:      
<000000000E87212C>                         Operating system signal.      Fail
-------------------------------------------------------------------------------

In this case, I didn't change the .ecf file.  I simply finalized the
system (using ec -loop and without assertions or inlining), then changed
the root class so that it didn't reference any other classes except
ARRAY and STRING.  I re-finalized in the same ec session.  Then I changed
the root class back to its original contents and re-finalized again.
This resulted in the attached trace (stack17.txt).

To reproduce the crash:

Unpack attached tar file, creating a directory "bug".
Change bug/bug.ecf l
....
Output truncated, Click download to get the full message

Attachment: refinalize_bug.tar.bz2 Size:1387560

Attachment: stack17.txt Size:4912

From:prestoat2000 Date:2010/01/27 Download

The crash is also reproducible on Solaris SPARC 32-bit.

From:prestoat2000 Date:2010/01/27 Download

Although it is probably not useful, here is one of the exception traces
I got (when compiling with "ec -loop" without -verbose and without
inlining) after the re-finalize.  Top of the stack looks like:

-------------------------------------------------------------------------------
FEATURE_SERVER      item @6                                             
<000000000E873404>  (From COMPILER_SERVER) Feature call on void target.  Fail
-------------------------------------------------------------------------------

In this case, there is a call on Void target.  Another time I got
a seg fault with a completely different trace.  Full trace attached.

Attachment: stack16.txt Size:4609