PR# 16820 External Compilation window has garbage due to invalid UTF-8 strings

Problem Report Summary
Submitter: prestoat2000
Category: EiffelStudio
Priority: High
Date: 2010/06/08
Class: Bug
Severity: Critical
Number: 16820
Release: 6.6.83355
Confidential: No
Status: Closed
Responsible: ted_eiffel
Environment: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.9.0.10) Gecko/2009042715 Firefox/3.0.10 Solaris 10 on SPARC
Synopsis: External Compilation window has garbage due to invalid UTF-8 strings

Description
When I freeze or finalize a trivial application in estudio with 6.6, the External
Compilation window has a bunch of lines consisting of repeated "X" characters.
In the xterm window from which estudio was started, I get many occurrences of:

(ec:16160): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()

This makes estudio unusable when C compilations are involved, since we can't see
what is happening.  We need to have this fixed before the Enterprise Edition release.

LANG environment variable is set to "C" in case that matters.  This problem does not
occur on 6.5.  Perhaps there is some other environment variable that is causing problems,
but I haven't changed anything as far as I know.

To Reproduce
Freeze or finalize with attached class and config file in estudio.
Examine External Compilation window contents.

Problem Report Interactions
From:prestoat2000    Date:2010/07/01    Status: Closed
This problem seems to be fixed in rev 83767, on both Solaris 9 and
Solaris 10.  Closing report.

From:ted_eiffel    Date:2010/06/25
There are some tests under the library folder. However, the tests have indeed not been well exercised on big-endian machines. (Some tests need to be commented out before testing, depending on which character sets are available for `iconv'.)

From:prestoat2000    Date:2010/06/25
With the patch applied to ENCODING_IMP, the test case gives the correct
output.  So it looks like making this change will fix the problem with invalid
UTF-8 strings in estudio.

I didn't test any other conversions - just 646 to UTF-32.  Someone should
write an eweasel test that converts various strings between different
encodings and compares `last_converted_stream' to different values,
depending on whether current platform is big endian or little endian, to
see if they are correct.
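
A minimal sketch of the kind of check suggested above, written in Python rather than as an eweasel test. Python's built-in codecs stand in for the Eiffel encoding library and `iconv`, and the helper names `expected_utf32` and `check_conversions` are hypothetical. The expected bytes are built by hand for each byte order, and the native byte order is taken from `sys.byteorder`.

```python
import struct
import sys

def expected_utf32(text, little_endian):
    """Build the expected BOM-less UTF-32 bytes by hand for one byte order."""
    fmt = "<I" if little_endian else ">I"
    return b"".join(struct.pack(fmt, ord(ch)) for ch in text)

def check_conversions(samples):
    """Return the (text, codec) pairs whose converted bytes did not match."""
    failures = []
    native_little = sys.byteorder == "little"
    for text in samples:
        # Explicit byte orders: no BOM, fixed layout on every platform.
        for codec, little in (("utf-32-le", True), ("utf-32-be", False)):
            if text.encode(codec) != expected_utf32(text, little):
                failures.append((text, codec))
        # Python's generic "utf-32" codec writes a BOM, then native-order data,
        # so the expected value depends on the current platform.
        bom = expected_utf32("\ufeff", native_little)
        if text.encode("utf-32") != bom + expected_utf32(text, native_little):
            failures.append((text, "utf-32"))
    return failures

print(check_conversions(["A", "Eiffel C/C++ Compilation Tool", "caf\u00e9"]))
```

An empty list means every conversion matched its hand-built expectation on the current platform.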


From:ted_eiffel    Date:2010/06/25
Dear David,

First of all, thank you very much for providing the detailed information. It is indeed not convenient for me to debug on a SPARC machine the way you do. I have attached a patch that should address the problem. Could you give it another try?

Attachment: encoding_endian_bug.patch     Size:5931

From:prestoat2000    Date:2010/06/24
This problem is due to a bug in the encoding library.  I have attached a test program that
demonstrates that conversion is not always done correctly.

The attached test case in test.e creates a source encoding "646" and a destination
encoding "UTF-32".  It then converts three different strings to the destination
encoding and displays the results, showing the individual codes.  You can
see that the codes are ridiculously large numbers.
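
The attached test.e is not reproduced here. As a rough illustration of why the codes come out so large, the following Python sketch (not the Eiffel test case; `read_codes` is a hypothetical helper) converts a 7-bit ASCII string to UTF-32 and then reads the code units in both the right and the wrong byte order.

```python
import struct

def read_codes(raw, byte_order):
    """Interpret raw as 32-bit code units; byte_order is '<' (little) or '>' (big)."""
    count = len(raw) // 4
    return list(struct.unpack(f"{byte_order}{count}I", raw))

source = b"Hi"                      # plain 7-bit ASCII ("646") input
text = source.decode("ascii")
raw_be = text.encode("utf-32-be")   # big-endian UTF-32, no BOM

correct = read_codes(raw_be, ">")   # read in the order it was written
swapped = read_codes(raw_be, "<")   # read in the opposite order

print(correct)  # [72, 105]
print(swapped)  # [1207959552, 1761607680]
```

Every swapped value lies far above 0x10FFFF, the largest valid Unicode code point, so none of them can be rendered as a character.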

On Solaris 10, all three strings are converted incorrectly (see attached output_solaris10).
On Solaris 9, only the first string is converted incorrectly (see output_solaris9).

I stepped through {ENCODING_IMP}.convert_to and verified that on Solaris 9
there is no byte order mark (BOM) for the second and third conversions, though
I don't know why.

I believe the bug is that the code of the first character is passed to
`bom_little_endian' as a NATURAL_32.  If the first character is a BOM,
its value must be 0xFEFF so the comparison with 0xFEFF in
`bom_little_endi
....
[Output truncated]

Attachment: test.e     Size:1458
Attachment: output_solaris10     Size:648
Attachment: test.ecf     Size:1153
Attachment: output_solaris9     Size:568

From:prestoat2000    Date:2010/06/23
I built a workbench estudio and tried to find out what was wrong using the
debugger.  I stopped in {ENCODING}.convert_to and found that the
source code page is 646 and destination code page is UTF-32.
Then I stepped into {ENCODING_IMP}.convert_to.  The call to set
`l_converted' by calling `pointer_to_string_32' appeared to work correctly.
I could look at `l_converted' in the debugger and it looked reasonable.

Then the code looked at the byte order mark (BOM) and decided the
string was little endian.  Since the destination code page was no-endian
and since the platform was not little endian, the code removed the BOM and called
`string_32_switch_endian'.  After this call, none of the characters were
valid Unicode characters.

So the problem seems to be that the routine thinks the value returned from
`iconv' and placed into a STRING_32 starts with a BOM that indicates little endian.
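
The following Python sketch illustrates one way such a misdetection can arise; it is an assumption about the general mechanism, not the code in ENCODING_IMP, and the names `decode_utf32` and `bom_as_native_integer` are hypothetical. A BOM written in the platform's own byte order reads back as the integer 0xFEFF on both little-endian and big-endian hosts, so that integer alone cannot identify a little-endian stream. Comparing the raw bytes can.

```python
import struct

BOM_LE = b"\xff\xfe\x00\x00"   # U+FEFF written little-endian
BOM_BE = b"\x00\x00\xfe\xff"   # U+FEFF written big-endian

def decode_utf32(raw, native):
    """Decode UTF-32 bytes; native ('<' or '>') is used only when no BOM is present."""
    if raw[:4] == BOM_LE:
        order, raw = "<", raw[4:]
    elif raw[:4] == BOM_BE:
        order, raw = ">", raw[4:]
    else:
        order = native
    codes = struct.unpack(f"{order}{len(raw) // 4}I", raw)
    return "".join(chr(c) for c in codes)

def bom_as_native_integer(raw, native):
    """Read the first code unit the way a native 32-bit load would."""
    return struct.unpack(f"{native}I", raw[:4])[0]

# Big-endian data read on a big-endian host.
raw_be = BOM_BE + "ok".encode("utf-32-be")
print(hex(bom_as_native_integer(raw_be, ">")))  # 0xfeff
print(decode_utf32(raw_be, ">"))                # ok

# Little-endian data read on a little-endian host yields the same integer.
raw_le = BOM_LE + "ok".encode("utf-32-le")
print(hex(bom_as_native_integer(raw_le, "<")))  # 0xfeff
```

Because `decode_utf32` inspects bytes rather than a native integer, it returns the same text for either stream regardless of the host's byte order.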

I don't yet know why it (mostly) works on Solaris 9.

From:prestoat2000    Date:2010/06/22
If you cannot reproduce this on your Solaris 9 host (i.e., if even
the first line of output from the first freeze is OK), then perhaps
you can build a workbench version of estudio for Solaris SPARC
and send it to me.  I can use the debugger, under your direction if
necessary, and hopefully you can determine the cause.

I haven't heard anything further about this bug and we need it fixed
before we can upgrade to 6.6.  I'm willing to wait a month (or even
two months), but I don't want to wait until 2011 to get a new version.

From:prestoat2000    Date:2010/06/18
Actually on Solaris 9, it seems that the first freeze shows garbage in the first
line and subsequent freezes do not show any garbage.

From:prestoat2000    Date:2010/06/18
I have now noticed that this problem occurs partially on Solaris 9 also.
But on Solaris 9 only the first line is garbage, which is the line:

   Eiffel C/C++ Compilation Tool - Version 6.5

(except that it shows up as unprintable characters).  I also get the lines like

   (ec:11050): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()

but just not as many.

So perhaps you can reproduce this on Solaris 9.

From:prestoat2000    Date:2010/06/15
The result is identical on both machines:

LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

The LANG environment variable was set to "C" on Solaris 10 host, but I unset it
and it didn't help.

From:manus_eiffel    Date:2010/06/15
Check what `locale' returns on both machines.

From:prestoat2000    Date:2010/06/15
I don't think this is a GTK version issue.  It works on Solaris 9 using the same GTK
libraries (2.14.4) but does not work on our Solaris 10 hosts.  I'm still trying to figure
out why.  If you have any ideas, let me know.  Could be due to an environment variable,
different version of some library (but *not* the GTK libraries, since the same ones
are being used on both platforms) or on some package that is installed on one
OS but not on the other.


From:manus_eiffel    Date:2010/06/15
In 6.6, we now translate the C compiler output from whatever encoding it is in to the proper encoding. Before, I believe, this conversion was not done. The reason for the change is that on Linux, console output is in UTF-8 by default, and some characters were written incorrectly because we assumed ASCII encoding.
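
As a hedged illustration of this kind of translation step (a Python sketch, not EiffelStudio's implementation; `to_display_utf8` and `is_valid_utf8` are hypothetical names), the console bytes are decoded from their source encoding and re-encoded as UTF-8 before display. Text widgets such as Pango layouts reject strings that are not valid UTF-8.

```python
def to_display_utf8(raw, source_encoding):
    """Decode raw console bytes from source_encoding and re-encode them as UTF-8.

    Undecodable bytes become U+FFFD instead of aborting the conversion.
    """
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")

def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

latin1_line = b"caf\xe9.c: warning"   # 0xE9 is 'é' in ISO-8859-1

print(is_valid_utf8(latin1_line))                                 # False
print(is_valid_utf8(to_display_utf8(latin1_line, "iso-8859-1")))  # True
```

The sketch only helps when the source encoding is identified correctly; a wrong byte-order decision in the conversion itself would still produce garbage.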

From:prestoat2000    Date:2010/06/15
It also works fine on OpenSolaris snv_134 on x86.  I don't know
what version of GTK and associated libraries are being used.

This could be a bug in GTK+ but we have been using this same
version for years.  This problem does not occur in 6.5.

Can you give me any hints about what might have changed between
6.5 and 6.6 that could possibly cause this problem?

From:manus_eiffel    Date:2010/06/14
I've also tried on Solaris 10 x86 64-bit with GTK 2.4.9 and it works fine there too.

From:manus_eiffel    Date:2010/06/14    Status: Analyzed
I've tried on Solaris 9 on 32-bit SPARC, and the output of the C compilation looks normal to me, with no Pango issue. The GTK version we are using is GTK 2.6.9, and I'm wondering whether this is a bug in the newer version of GTK you are using.

From:prestoat2000    Date:2010/06/08
Attachments for problem report #16820

Attachment: test.ecf     Size:898
Attachment: test.e     Size:59