A Complaint of ODF’s Asian Language Support

I have recently read about news about better support for ODF from Google. The author then went on to complain that neither Google nor Microsoft makes ‘it easy to use ODF as part of a workflow’. This reminds me that maybe I should write down a long-time complaint I have for ODF.

I have always loved open standards. However, there are not only open and proprietary standards, there are also good and bad standards. ODF looks pretty bad regarding Asian language support. It can be powerfully demonstrated by this image:

ODF Issue

If you are interested in it, you can download the document yourself. It simply contains four lines:

  • The first line has a left quotation mark, the English word ‘Test’, and the corresponding Chinese word. It looks OK.
  • The second line is a duplication of the first line, with an additional colon added at the beginning. It immediately changes the font of the left quotation mark.
  • The third line is a duplication of the second line, with the Chinese word removed. Both quotation marks are now using the default Western font ‘Times New Roman’.
  • The fourth line is a duplication of the third line, with the leading colon removed. Weirdly enough, the left quotation mark now uses the Chinese font. (This may be related to my using the Chinese OpenOffice version or Chinese Windows OS.)

Is it ridiculous that adding or removing a character can change how other characters are rendered? Still, I would not blog about it, if it had only been a bug in OpenOffice (actually I filed three bug reports back in 2006—more ancient than I thought—and this bug remains unfixed ). It actually seems a problem in the ODF standard. After extracting the content from the .ODT file (as a zip file), I can shrink the content of the document to these XML lines (content.xml with irrelevant contents removed and the result reformatted):

<office:font-face-decls>
<style:font-face
    style:name="Times New Roman"
    style:font-family-generic="roman"
    style:font-pitch="variable"/>
<style:font-face
    style:name="宋体"
    style:font-family-generic="system"
    style:font-pitch="variable"/>
</office:font-face-decls>
<office:automatic-styles>
<style:style
    style:name="P1"
    style:family="paragraph"
    style:parent-style-name="Standard">
<style:text-properties
    fo:font-size="12pt" fo:language="en" fo:country="GB"
    style:language-asian="zh" style:country-asian="CN"/>
</style:style>
</office:automatic-styles>
<office:body>
<office:text>
<text:p text:style-name="P1">“Test测试”</text:p>
<text:p text:style-name="P1">:“Test测试”</text:p>
<text:p text:style-name="P1">:“Test”</text:p>
<text:p text:style-name="P1">“Test”</text:p>
</office:text>
</office:body>

The problem is that instead of specifying a single language on any text, it specifies both a ‘fo:language’ and a ‘style:language-asian’. The designer of this feature definitely did not think carefully about the fact that many symbols exist in both Asian and non-Asian contexts and can often be rendered differently!

When I repeated the same process in Microsoft Word (on Windows), all text appeared correctly—Microsoft applications recognize which keyboard I use and which language it represents. Pasting as plain text introduced one error (as no language information is present). Even in that case, fixing the problem is easier. In OpenOffice I have to change the font manually, but in Microsoft Word I only need to specify the correct language (‘Office, this is English, not Chinese’). It is much more intuitive and natural.

I also analysed the XML in the resulting .DOCX file. Its styles.xml contained this:

<w:lang w:val="en-US" w:eastAsia="zh-CN" w:bidi="ar-SA"/>

So these are default languages. I had to use UK English and Traditional Chinese to force Word to specify the languages in the document explicitly. The embedded document.xml now contains content like the following:

<w:p>
<w:r>
<w:rPr>
<w:rFonts w:eastAsia="PMingLiU" w:hint="eastAsia"/>
<w:lang w:eastAsia="zh-TW"/>
</w:rPr>
<w:t>“</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:eastAsia="PMingLiU"/>
<w:lang w:val="en-GB" w:eastAsia="zh-TW"/>
</w:rPr>
<w:t>Test</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:eastAsia="PMingLiU" w:hint="eastAsia"/>
<w:lang w:eastAsia="zh-TW"/>
</w:rPr>
<w:t>測試”</w:t>
</w:r>
</w:p>
...
<w:p>
<w:r>
<w:rPr>
<w:rFonts w:eastAsia="PMingLiU"/>
<w:lang w:val="en-GB" w:eastAsia="zh-TW"/>
</w:rPr>
<w:t>“Test”</w:t>
</w:r>
</w:p>

We can argue the structure is somewhat similar (compare ‘w:val’ in <w:lang> with ‘fo:language’ and ‘fo:country’, and ‘w:eastAsia’ with ‘style:language-asian’ and ‘style:country-asian’), but the semantics are obviously different, and text of different languages is not mixed together. The English text has the language attribute <w:lang w:val="en-GB" w:eastAsia="zh-TW"/>, and the Chinese text has only <w:lang w:eastAsia="zh-TW"/>. It looks to me a more robust approach to processing mixed text.

Although it might be true that Microsoft lobbied strongly to get OOXML approved as an international standard, I do not think ODF’s openness alone is enough to make people truly adopt it.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s