PR# 19069 [RJ] HASH_TABLE.correct_mismatch incorrectly decodes newer versions.

Problem Report Summary
Submitter: axarosenberg
Category: EiffelBase
Priority: Medium
Date: 2015/05/11
Class: Bug
Severity: Serious
Number: 19069
Release: 15.01.9.6535
Confidential: No
Status: Open
Responsible:
Environment: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Synopsis: [RJ] HASH_TABLE.correct_mismatch incorrectly decodes newer versions.

Description
HASH_TABLE.correct_mismatch will interpret a storable with a newer version as a 5.4 storable.  The solution is to use the indexing clause "storable_version".
   Randy

Here is the last email sent:

See [RJ]colored responses below.

From: Emmanuel Stapf [mailto:manus@eiffel.com] 
Sent: Monday, May 11, 2015 06:18 AM
To: Randy John
Subject: RE: B_LAST bug?

Hi Randy,

Just to clarify the issue, can you configurm that the bug is in the version of HASH_TABLE of EiffelStudio 7.0 in compatible mode (i.e. non-void-safe code) which doesn’t know about the new version of HASH_TABLE in regular mode (i.e. void-safe code)?  [RJ] Yes, but I don't see the point.  Even today HASH_TABLE assumes that anything that it doesn't recognize is 5.4.  I used C:\EStudio\64bit\7.0.8.8074\compatible\library\base\elks\structures\table\hash_table.e.

This is definitely a problem in the way we think about storables, we always focus on retrieving old stuff, and not the other way around.  )?  [RJ]  Exactly.

I agree that using `version’ in the note clause would be best to identify our storables from now on. This was added in version 6.6, so it would also have worked with 7.0. The issue is that to fix your particular issue we still need to fix this in 7.0. )?  [RJ] No.  Any fix that you come up with would involve a recompile.  We may as well just use 15.01 (or whatever).  I merely want HASH_TABLE changed to avoid any future problems.  So we are in a situation where no matter how well we fix `correct_mismatch’ in storable this would not help in retrieving the storable in 7.0. The other complexity is that for a while both version of HASH_TABLEs have evolved independently but luckily for us this is not so much of an issue as compatible is almost not used anymore.  )?  [RJ] Except for barnacles like my example, we no longer use compatible.

If we move on to using the note clause for versioning, I think that if `correct_mismatch’ encounters a newer version, it will have to raise an exception and will not try to retrieve the newer version of HASH_TABLE. Does this sound ok to you?  )?  [RJ] Yes, but this really scares me.  You must be very, very, very careful to only change the version if a true breaking change has occurred.  Consider this example.  You add "name: STRING -- Fancy name of this table." to HASH_TABLE.  There is no need to change the version number since this is not a breaking change.  Older code can simply ignore 'name'.  In the next version you change it to "name: ARRAY[STRING]".  This is a terrible situation.  You need to break the previous version (which needs a STRING and has no code to deal with an ARRAY) but you should not break the versions before that one (no attribute called 'name').   I am envisioning a horribly complicated system whereby the new code gives instructions on how old code should interpret the storable.  For instance, given that version 1 has no attribute called 'name', version 2 uses "name: STRING" and version 3 uses "name: ARRAY[STRING]", in version 3:

                name: ARRAY[STRING]
                                note
                                                storable_version 1
                                                                transient
                                                end
                                attribute
                                end

Another way would be to put a 'once' function in MISMATCH_CORRECTOR:

                transient_rules: HASH_TABLE[HASH_TABLE[BOOLEAN, STRING], STRING]

The outer table would be indexed by class name, the inner by attribute name.  If the value is True, the attribute would be ignored.  Of course, the table would have to be loaded.  We load a configuration file at the start of each program so that would be simple for us.

OK, those are both awful.  What I am really saying is, don't break our code!

As for request #2, I’ll have a look to prevent out of bounds array by adding a new ECF setting.

Regards,
Manu 

From: Randy John [mailto:rjohn@axarosenberg.com] 
Sent: Saturday, May 09, 2015 02:34
To: manus@eiffel.com
Cc: Randy John
Subject: RE: B_LAST bug?

OK, I finally got it working.  It's not really a B_LAST problem so much as it is memory corruption.

I made a copy of full_sweep called full_sweep_test.  I modified it so it would test the last object in the chunk to see if B_LAST was set.  By placing the call in various places I finally pinned down the problem to egc_correct_mismatch.  The object being corrected was always a HASH_TABLE.  In my system one program creates a storable with hash_table_version_64 and passes it to another that expects hash_table_version_57 (7.0.8.8074 compatible).  Here is a snippet from the old HASH_TABLE.correct_mismatch (it's nearly the same today):

                                                -- Fortunately enough we can simply compare the counts of
                                                -- `deleted_marks' and `keys'. If they are the same it is 5.5 or 5.6,
                                                -- otherwise it is 5.4 or older.
                                if deleted_marks.count /= keys.count then
                                                l_old_deleted_marks := deleted_marks
                                                create deleted_marks.make_empty (keys.count)
                                                deleted_marks.copy_data (l_old_deleted_marks, 0, 0, l_old_deleted_marks.count)
                                end

There is an assumption that when the counts do not match, the storable must be 5.4 or older.  However, in 91731, HASH_TABLE.make was changed from this:

                                                create keys.make (l_size + 1)
                                                create deleted_marks.make (l_size + 1)

to this:

                                                create keys.make_empty (n + 1)
                                                create deleted_marks.make_filled (False, n + 1)

In the first version, the counts were the same, now they are not.  "Counts do not match" no longer implies 5.4 or older!

Going back to correct_mismatch we see that 'deleted_marks' is created with size "keys.count" (which was small in my case) but the number of items copied is "l_deleted_marks.count" (which is large).  The call to 'copy_data' uses 'put' (changed to 'force' in 91731) and this wrote right past the end of the SPECIAL and into the next object.  If the next object was the last in the chunk, B_LAST could be cleared.  The change to 'force' does nothing since it cannot change 'capacity'.

================================================

Request 1:

I believe that this bug still exists today, but I haven't tested.  Oh, not now.  But as soon as  a new version of HASH_TABLE is released (say, hash_table_version_99), all current executables will suffer from this problem because they don't know about hash_table_version_99 and will check to see if it is an older version.  This is inefficient and potentially dangerous.  Now I know what you are saying.  All Randy has to do is recompile all of his programs and the problem will go away.  No!  As you can see I'm still using programs compiled with 7.0!

The first problem is the test "not mismatch_information.has ("hash_table_version_64")".  What you really want to know is, is the version less than 64.  But what you are getting is, is the version NOT EQUAL to 64.  Future storable versions will cause problems.  You should add a storable_version indexing clause so you can start doing a proper "less than" test (that doesn't break existing code, does it?).  I envision the first two lines of 'correct_mismatch' to be:
                if mismatch_information.stored_version = Void or else mismatch_information.stored_version < "64" then
                                if not mismatch_information.has ("hash_table_version_64") then

You still need the test for "hash_table_version_64" to support existing storables.  You can't remove the attribute 'hash_table_version_64' because existing programs don't know to look for 'stored_version'.  If this is done the next problem will go away.

The second problem is that your test for 5.4 is flawed.  You could check 'mismatch_information' to see if there is any attribute starting_with("hash_table_version_").  This works well but you would have to check all of the keys in 'mismatch_information'.

Thirdly, you should check if the capacity is OK before calling 'copy_data'.  You could simply use this as a 5.4 test.  After all, if the test fails, what would you do?  Skip the copy, resize properly or fail outright.  If you choose to skip, it's really the same as a 5.4 test.

As usual, please don't break our existing storables!

================================================

Request 2:

We have talked about this before.  I am still wishing that I could build a finalized exe with just the precondition for SPECIAL.put turned on.  Depending upon the severity of the performance hit, I think we would run like this in production all of the time.  In my example, the program ran for years without problem so initial testing with assertions on proves nothing except that it won't crash on the first day it is released.  Actually, crashing is good.  I'm worried about corruption that simply changes some data but doesn't crash.  When I finalized my program and kept assertions, I recovered from the failure, closed the socket to the client that sent the storable and continued execution.  That's behavior that I like.

================================================

After you respond I will submit a proper bug report.

Randy

p.s.  This all would have been a lot easier if I had turned on assertions right from the start.  I just did and it found the problem straight away.  But as soon as I saw that B_LAST was wrong I jumped to the conclusion (again) that the run-time was at fault.  The other thing is that stupid HASH_TABLE.  It, and several other large attributes, aren't needed in the storable.  All that was needed was an INTEGER and a DATE, but somebody decided that a descendant with 8 MB of data would be a good idea.  I'll have to do something about that.

From: Emmanuel Stapf [mailto:manus@eiffel.com] 
Sent: Tuesday, May 05, 2015 11:49 AM
To: Randy John
Subject: RE: B_LAST bug?

Sorry about the msi stuff, there is a way to force it but I don’t think it is worth it. You can find the non-MSI version at:

ftp://ftp.eiffel.com/pub/download/OLD_RELEASES/70/Eiffel70_gpl_88074-win64.7z

and to use just make sure that both ISE_EIFFEL and ISE_PLATFORM are defined.

Manu

From: Randy John [mailto:rjohn@axarosenberg.com] 
Sent: Tuesday, May 05, 2015 20:08
To: manus@eiffel.com
Subject: RE: B_LAST bug?

The trick is not running Eiffel Studio, but installing it.  I tried to install Eiffel70_gpl_88074-win64.msi.  It doesn't detect a C compiler and it won't let me continue unless I install a C compiler (or let it install MinGW - which I don't want to do).  I even fudged the registry to the point where the install saw the version 12 compiler, but it recognized it as 32-bit (it's the free version which cross compiles to 64-bit).  Is there a way to force it?
                Randy

From: Emmanuel Stapf [mailto:manus@eiffel.com] 
Sent: Tuesday, May 05, 2015 05:48 AM
To: Randy John
Subject: RE: B_LAST bug?

In theory, you do not need to reinstall a C compiler, EiffelStudio can use whatever you have. Just make sure to configure the C compiler before launching EiffelStudio as done in the launcher.

I’ll keep this bug in mind.

Manu

From: Randy John [mailto:rjohn@axarosenberg.com] 
Sent: Tuesday, May 05, 2015 01:40
To: manus@eiffel.com
Subject: RE: B_LAST bug?

Last week was the first time I've seen it and the program has run for years.  I'm giving up.  I can't install 7.0 because my C compiler is too new.  I can't install the 7.1 SDK because the install fails.  I can't install 7.0 on a machine with VC 9 because I don't have permission.  I've spent too much time and we have a work around.  Thanks for your help.
                Randy

From: Emmanuel Stapf [mailto:manus@eiffel.com] 
Sent: Monday, May 04, 2015 12:59 PM
To: Randy John
Subject: RE: B_LAST bug?

Thanks Randy. I do have that version of the compiler and thus all the matching source code available. We would just need whatever is required on your end to reproduce the issue. When was this detected?

Regards,
Manu

From: Randy John [mailto:rjohn@axarosenberg.com] 
Sent: Monday, May 04, 2015 20:46
To: manus@eiffel.com
Subject: RE: B_LAST bug?

It's going to be too difficult to do that.  In addition to finding an old Eiffel compiler, an old C compiler and old source code, the data was fixed to prevent this from happening.  Just keep this in the back of your brain.
                Randy

From: Emmanuel Stapf [mailto:manus@eiffel.com] 
Sent: Friday, May 01, 2015 12:52 AM
To: Randy John
Subject: RE: B_LAST bug?

Hi,

I searched all the logs up to rev 88000 which is when 7.0 was released. I could not find a reference to a bug with B_LAST. Do you know if you can recompile with 7.0.8.8074 and see if we can reproduce the bug? If so, it is going to be easy to see why it does not fail in the latest version.

Manu

From: Randy John [mailto:rjohn@axarosenberg.com] 
Sent: Friday, May 01, 2015 03:24
To: manus@eiffel.com
Subject: B_LAST bug?

Manu,
                Do you ever recall a bug that caused B_LAST to be clear instead of set?  I'm running a program built with 7.0.8.8074 which is clearly exhibiting this behavior.  I'm in coalesce and the memory after my object is not allocated (it's at the end of a C allocation).  I built it with 15.01 and can't reproduce the problem (although that is no guarantee).  Also, the object appears to be larger than the chunk.  I will be out Friday.
                Randy

The chunk appears to start at 172e0040, is type=1 and its size is 06666670 (I set EIF_CHUNK_SIZE to 06666666) but the size of the first object is 06666f70!

00000000`172e0038 46137374 04000000 00000001 00000000
00000000`172e0048 1d950040 00000000 10590040 00000000
00000000`172e0058 1d950040 00000000 10590040 00000000
00000000`172e0068 06666670 00000000 00000000 00000000
00000000`172e0078 19700b76 00000520 06666f70 80000000
00000000`172e0088 00000000 00000000 00000000 00000000

|-       0`172de000        0`172e0000        0`00002000 MEM_PRIVATE MEM_COMMIT  PAGE_READWRITE                     Stack [1780.c68; ~5]
*        0`172e0000        0`1d947000        0`06667000 MEM_PRIVATE MEM_COMMIT  PAGE_READWRITE                     <unclassified> 
*        0`1d947000        0`1d950000        0`00009000             MEM_FREE    PAGE_NOACCESS                      Free

To Reproduce

										
Problem Report Interactions