PR# 16820 External Compilation window has garbage due to invalid UTF-8 strings
Problem Report Summary
Submitter: prestoat2000
Category: EiffelStudio
Priority: High
Date: 2010/06/08
Class: Bug
Severity: Critical
Number: 16820
Release: 6.6.83355
Confidential: No
Status: Closed
Responsible: ted_eiffel
Environment: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.9.0.10) Gecko/2009042715 Firefox/3.0.10
Solaris 10 on SPARC
Synopsis: External Compilation window has garbage due to invalid UTF-8 strings
Description
When I freeze or finalize a trivial application in estudio with 6.6, the External Compilation window has a bunch of lines consisting of repeated "X" characters. In the xterm window from which estudio was started, I get many occurrences of:
(ec:16160): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()
This makes estudio unusable when C compilations are involved, since we can't see what is happening. We need to have this fixed before the Enterprise Edition release. LANG environment variable is set to "C" in case that matters. This problem does not occur on 6.5. Perhaps there is some other environment variable that is causing problems, but I haven't changed anything as far as I know.
To Reproduce
Freeze or finalize with attached class and config file in estudio. Examine External Compilation window contents.
Problem Report Interactions
This problem seems to be fixed in rev 83767, on both Solaris 9 and Solaris 10. Closing report.
There are some tests under the library folder. However, the tests have not been exercised much on big-endian machines. (Some tests need to be commented out before running, depending on which character sets are available for `iconv'.)
With the patch applied to ENCODING_IMP, the test case gives the correct output. So it looks like making this change will fix the problem with invalid UTF-8 strings in estudio. I didn't test any other conversions - just 646 to UTF-32. Someone should write an eweasel test that converts various strings between different encodings and compares `last_converted_stream' to different values, depending on whether current platform is big endian or little endian, to see if they are correct.
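The kind of check suggested in the comment above can be sketched outside Eiffel. This is not an eweasel test and does not call the encoding library; it is a Python illustration that uses Python's own UTF-32 codecs as a stand-in for `iconv'. The helper names `native_units' and `check' and the sample strings are hypothetical. It converts ASCII strings to UTF-32 and compares the resulting 32-bit units, read in the platform's native byte order, with the expected character codes.

```python
import struct
import sys

# Hypothetical sample strings; any ASCII text would do.
SAMPLES = ["A", "Eiffel", "C compilation"]


def native_units(data: bytes) -> list[int]:
    """Read `data` as 32-bit units in the platform's native byte order."""
    count = len(data) // 4
    return list(struct.unpack(f"={count}I", data))


def check(sample: str) -> bool:
    """Compare converted units with the expected codes on this platform."""
    expected = [ord(c) for c in sample]
    # Python's "utf-32" codec writes a BOM followed by native-order data.
    with_bom = native_units(sample.encode("utf-32"))
    # The explicit-endian codecs write no BOM.
    explicit = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
    without_bom = native_units(sample.encode(explicit))
    return with_bom == [0xFEFF] + expected and without_bom == expected


results = {sample: check(sample) for sample in SAMPLES}
print(sys.byteorder, results)
```

A real test would run the same comparison against `last_converted_stream' for several source and destination encodings, on both big-endian and little-endian hosts.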
Dear David, First of all, thank you very much for providing the detailed information. It is indeed not convenient for me to debug on a SPARC machine the way you do. I have attached a patch which should address the problem. Could you give it another try?
This problem is due to a bug in the encoding library. I have attached a test program that demonstrates that conversion is not always done correctly. The attached test case in test.e creates a source encoding "646" and a destination encoding "UTF-32". It then converts three different strings to the destination encoding and displays the results, showing the individual codes. You can see that the codes are ridiculously large numbers. On Solaris 10, all three strings are converted incorrectly (see attached output_solaris10). On Solaris 9, only the first string is converted incorrectly (see output_solaris9). I stepped through {ENCODING_IMP}.convert_to and verified that on Solaris 9 there is no byte order mark (BOM) for the second and third conversions, though I don't know why. I believe the bug is that the code of the first character is passed to `bom_little_endian' as a NATURAL_32. If the first character is a BOM, its value must be 0xFEFF so the comparison with 0xFEFF in `bom_little_endi ....
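The role of the BOM value can be illustrated with a short Python sketch. This is not the ENCODING_IMP code, and the helper names `code_units' and `needs_swap' are hypothetical. It assumes a converter that emits big-endian UTF-32 with a leading BOM, and shows that the BOM reads as 0xFEFF when the 32-bit units are read in the matching byte order and as 0xFFFE0000 when they are read in the opposite order. Only the second case calls for a byte swap.

```python
import struct

BOM = 0x0000FEFF          # U+FEFF read in the matching byte order
BOM_SWAPPED = 0xFFFE0000  # the same four bytes read in the opposite order


def code_units(data: bytes, byte_order: str) -> list[int]:
    """Read `data` as 32-bit units; byte_order is '<' (little) or '>' (big)."""
    count = len(data) // 4
    return list(struct.unpack(f"{byte_order}{count}I", data[: count * 4]))


def needs_swap(units: list[int]) -> bool:
    """True only when the first unit is a byte-swapped BOM."""
    return bool(units) and units[0] == BOM_SWAPPED


# Simulated converter output (an assumption): BOM + "Hi" in big-endian UTF-32.
raw = "\ufeffHi".encode("utf-32-be")

as_big = code_units(raw, ">")     # read the way a big-endian host would
as_little = code_units(raw, "<")  # read the way a little-endian host would

print([hex(u) for u in as_big])     # ['0xfeff', '0x48', '0x69']
print([hex(u) for u in as_little])  # ['0xfffe0000', '0x48000000', '0x69000000']
print(needs_swap(as_big), needs_swap(as_little))  # False True
```

When the units are already in native order, the first unit equals 0xFEFF, so treating that value as a sign of the opposite byte order would trigger an unnecessary swap.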
I built a workbench estudio and tried to find out what was wrong using the debugger. I stopped in {ENCODING}.convert_to and found that the source code page is 646 and destination code page is UTF-32. Then I stepped into {ENCODING_IMP}.convert_to. The call to set `l_converted' by calling `pointer_to_string_32' appeared to work correctly. I could look at `l_converted' in the debugger and it looked reasonable. Then the code looked at the byte order mark (BOM) and decided the string was little endian. Since the destination code page was no-endian and since the platform was not little endian, the code removed the BOM and called `string_32_switch_endian'. After this call, none of the characters were valid unicode characters. So the problem seems to be that the routine thinks the value returned from `iconv' and placed into a STRING_32 starts with a BOM that indicates little endian. I don't yet know why it (mostly) works on Solaris 9.
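The effect of an unnecessary endian switch, as described in the comment above, can be reproduced with a small Python sketch. It does not use `string_32_switch_endian' itself; `swap_32' and `is_valid_code_point' are hypothetical helpers written for this illustration. The sketch byte-swaps the codes of a correctly converted ASCII string and checks whether the results are still valid Unicode code points.

```python
MAX_CODE_POINT = 0x10FFFF


def swap_32(value: int) -> int:
    """Reverse the four bytes of a 32-bit value."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")


def is_valid_code_point(value: int) -> bool:
    """Valid Unicode scalar value: in range and not a surrogate."""
    return value <= MAX_CODE_POINT and not (0xD800 <= value <= 0xDFFF)


text = "Eiffel"
correct = [ord(c) for c in text]          # codes as converted correctly
swapped = [swap_32(c) for c in correct]   # codes after a needless swap

print([hex(c) for c in correct])
print([hex(c) for c in swapped])
print(any(is_valid_code_point(c) for c in swapped))  # False
```

Every non-zero ASCII code moves into the top byte when swapped, so it becomes at least 0x01000000, far above the Unicode limit of 0x10FFFF. That is consistent with the very large codes seen in the test output.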
If you cannot reproduce this on your Solaris 9 host (i.e., if even the first line of output from the first freeze is OK), then perhaps you can build a workbench version of estudio for Solaris SPARC and send it to me. I can use the debugger, under your direction if necessary, and hopefully you can determine the cause. I haven't heard anything further about this bug and we need it fixed before we can upgrade to 6.6. I'm willing to wait a month (or even two months), but I don't want to wait until 2011 to get a new version.
Actually on Solaris 9, it seems that the first freeze shows garbage in the first line and subsequent freezes do not show any garbage.
I have now noticed that this problem occurs partially on Solaris 9 also. But on Solaris 9 only the first line is garbage, which is the line: Eiffel C/C++ Compilation Tool - Version 6.5 (except that it shows up as unprintable characters). I also get lines like (ec:11050): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text() but not as many. So perhaps you can reproduce this on Solaris 9.
The result is identical on both machines:
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
The LANG environment variable was set to "C" on the Solaris 10 host, but I unset it and it didn't help.
Check what `locale' returns on both machines.
I don't think this is a GTK version issue. It works on Solaris 9 using the same GTK libraries (2.14.4) but does not work on our Solaris 10 hosts. I'm still trying to figure out why. If you have any ideas, let me know. Could be due to an environment variable, different version of some library (but *not* the GTK libraries, since the same ones are being used on both platforms) or on some package that is installed on one OS but not on the other.
In 6.6, we now translate the C compiler output from whatever encoding it arrives in to the proper encoding. I believe this conversion was not done before 6.6. The reason for the change is that console output on Linux is UTF-8 by default, and some characters were displayed incorrectly because we assumed ASCII encoding.
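The motivation for the conversion can be shown with a short Python sketch. It is not the EiffelStudio code; the compiler message and the helper name `decode_output' are made up for illustration. It decodes the same UTF-8 bytes three ways: as UTF-8, as one character per byte, and as strict ASCII.

```python
def decode_output(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode console bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(encoding, errors="replace")


# Hypothetical compiler message containing curly quotes (U+2018, U+2019).
message = "warning: unused variable \u2018x\u2019"
raw = message.encode("utf-8")

as_utf8 = decode_output(raw)           # correct: honours the console encoding
byte_per_char = raw.decode("latin-1")  # one character per byte: garbled quotes

try:
    raw.decode("ascii")                # strict ASCII rejects the input
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(as_utf8 == message)                # True
print(len(byte_per_char) - len(as_utf8)) # 4 extra characters from the two quotes
print(ascii_ok)                          # False
```

Each curly quote occupies three bytes in UTF-8, so reading the output one byte at a time turns each quote into three unrelated characters.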
It also works fine on OpenSolaris snv_134 on x86. I don't know what version of GTK and associated libraries are being used. This could be a bug in GTK+ but we have been using this same version for years. This problem does not occur in 6.5. Can you give me any hints about what might have changed between 6.5 and 6.6 that could possibly cause this problem?
I've also tried on Solaris 10 x86 64-bit with GTK 2.4.9 and it works fine there too.
I've tried on Solaris 9 on SPARC 32-bit and the output of the C compilation looks normal to me, with no Pango issue. The GTK version we are using is GTK 2.6.9, and I'm wondering whether this is a bug in the newer version of GTK you are using.