Internationalisation Take 2 - Zend vs Cheap-o Arrays

Saturday, June 23rd, 2007

In response to my emails to the Zend Framework I18N list and my previous post, Thomas, the author of the Zend_Translate components, replied to the list:

> 2) gettext is a more expensive version of using the arrays backend.

No… it’s a less expensive version. What takes time is reading the original
source. Your processor is always faster than your harddrive.
It is better to do some computations than reading a bigger file. And mo
files are much smaller than the same sized array files.

This still seems wrong to me, so I’ve done a bit more analysis. I now have XDebug up and running in my portable environment, so I can see the costs in detail. One caveat: I’m running all of this from a USB key on a laptop that’s busy with a number of background tasks, so the measurements are not isolated. Because of this I’ll be looking at percentages of time in WinCacheGrind rather than absolute execution times.

To test this I generated two files. One is a .po file containing 1000 phrases, which I compiled to a .mo file. The other is a PHP file containing a PHP array with the same 1000 translations. The files were generated with a script, so the translations are rather simple:

From the .po file:

msgid "String 0"
msgstr "String Translated 0"

From the .php file:

'String 0' => 'String Translated 0',
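
The generator script itself isn't shown above, so here is a minimal sketch of what something equivalent might look like. The function names and output paths are my own assumptions, and the .po file would still need compiling to a .mo file afterwards (for example with msgfmt or poEdit).

```php
<?php
// Hypothetical generator: builds matching .po and PHP-array sources for
// the same set of test phrases. Function names and paths are illustrative.

function build_po($count)
{
    // Minimal header entry so the compiler knows the charset.
    $po = "msgid \"\"\nmsgstr \"\"\n\"Content-Type: text/plain; charset=utf-8\\n\"\n\n";
    for ($n = 0; $n < $count; $n++) {
        $po .= "msgid \"String $n\"\n";
        $po .= "msgstr \"String Translated $n\"\n\n";
    }
    return $po;
}

function build_php_array($count)
{
    $translations = array();
    for ($n = 0; $n < $count; $n++) {
        $translations["String $n"] = "String Translated $n";
    }
    // var_export() emits valid PHP source for the array.
    return "<?php\n\$LANGUAGE = " . var_export($translations, true) . ";\n";
}

$dir = sys_get_temp_dir();
file_put_contents($dir . '/test.en.po', build_po(1000));
file_put_contents($dir . '/test.en.php', build_php_array(1000));
// Compile the catalogue separately, e.g.: msgfmt -o test.en.mo test.en.po
```

Generating both files from the same loop keeps the two test inputs equivalent, so any difference in the profile comes from the loading code and not from the data.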

I then wrote a simple PHP file which translates 50 of these items. A reasonable enough test, I think. First, the translation using the supposedly fast Zend_Translate gettext adapter:

require_once 'Zend/Translate.php';
$i = new Zend_Translate('gettext', '/development/language/test.en.mo', 'en');
 
function _($s)
{
    global $i;
    $s = $i->_($s);
    echo($s."<br/>\n");
}
 
_('String 1');
_('String 2');
...

I then ran this file and loaded the cachegrind output into WinCacheGrind. 87.88% of the execution time was spent in Zend_Translate_Adapter_Gettext->_loadTranslationData. Performing translations took 1.99% of the time.

Next I used my PHP array and the Zend_Translate array backend:

require_once 'Zend/Translate.php';
require_once '/development/language/test.en.php';
$i = new Zend_Translate('array', $LANGUAGE, 'en');

(The rest of the file remains the same.) I then ran this and checked the output. _loadTranslationData took 78.36% of the time. Performing translations took 2.86% of the time.

My third test was just to use the test.en.php file and a simple translation function:

require_once '/development/language/test.en.php';
function _($s) {
    global $LANGUAGE;
    $s = (array_key_exists($s, $LANGUAGE)) ? $LANGUAGE[$s] : $s;
    return $s;
}

The first thing to note is that the Zend_Translate runs each took over 20ms, while this one, which doesn’t use Zend at all, took 2.8ms. The require_once statement accounted for 1.83% of that time. The rest was just 50 repetitions of a translation too fast to register in the profile.

So what do I draw from this? For simple translations, you can’t beat a very, very simple system with just an array of translations. I haven’t looked in any depth at the other services offered by Zend_Translate, though it does let you add multiple translations and translate into multiple languages. But do you have a use-case for that?

If your UI only needs to display in one language at a time, with its strings translated into that language, take the simple approach. It needs a little extension to support modular language files, but look at the PHPBB3 implementation and you can’t go far wrong. That loads modular translation files (just to keep that trivial require_once cost down), each of which is merged with array_merge back into a single translation array keyed by constants.

Fast.
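
To make the modular idea concrete, here is a rough sketch of the pattern described above. It is not PHPBB3's actual code: the function names load_language_module() and t(), the keys, and the file layout in the comment are all assumptions of mine. I've used t() rather than _() because _() is already taken when PHP's gettext extension is loaded.

```php
<?php
// Sketch of modular array-based translation. The function names, keys
// and file layout are hypothetical, not taken from PHPBB3.

$GLOBALS['LANGUAGE'] = array();

// Merge one module's phrases into the single shared translation array.
function load_language_module(array $phrases)
{
    $GLOBALS['LANGUAGE'] = array_merge($GLOBALS['LANGUAGE'], $phrases);
}

// Look up a key, falling back to the key itself when untranslated.
function t($key)
{
    return isset($GLOBALS['LANGUAGE'][$key]) ? $GLOBALS['LANGUAGE'][$key] : $key;
}

// In a real application each array would live in its own file and only
// be loaded by the code that needs it, e.g.
//   load_language_module(require 'language/en/common.php');
load_language_module(array('LOGIN' => 'Log in', 'LOGOUT' => 'Log out'));
load_language_module(array('POST_REPLY' => 'Post a reply'));

echo t('LOGIN'), "\n";        // Log in
echo t('POST_REPLY'), "\n";   // Post a reply
echo t('UNKNOWN_KEY'), "\n";  // UNKNOWN_KEY
```

Each page then pays only for the modules it actually loads, and a missing key degrades to showing the key instead of failing.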

My cachegrind files for your reading pleasure:
Zend_Translate - Array
Zend_Translate - Gettext
Non-Zend_Translate


Internationalisation

Thursday, June 21st, 2007

The web is global.

Lots of websites do not cope with this. They do not provide a user setting for the language and deliver their content in that language.

Clearly, this is bad. If you are producing an application, like Multiblog, then you need to make it international. The UI at least (content is a more thorny issue) needs to work in the user’s preferred language. Otherwise they will experience friction trying to use the confusing foreign thing.

There are a lot of ways to achieve this. Geeklog and PHPBB3 use arrays to translate content and allow the user to pick a language. Drupal uses the GNU GetText system. And there are other approaches.

Choosing the right approach and using it correctly is difficult. I’m currently experimenting with approaches for Multiblog and other projects, and at the moment I’m looking into the Zend Framework’s very interesting Zend_Translate class. This allows a number of different approaches, including both Array and GetText.

GetText appears to be the recommended choice. There are a number of free tools that can generate your translation files, as the translation files are not human readable. It’s fast and threadsafe. The Zend Framework Manual offers some advice on how to structure your translation files. There are several suggested methods, but, there is no suggestion of how to structure your translation modules.

The question I asked was “What’s the best practice?”, and no-one seems to know, so I guess I need to figure it out for myself from basic principles.

Now, GetText was written to provide internationalisation for the GNU software, including the core of the Linux OS. There, the GetText file is (I assume) parsed once at start-up and held in memory to translate as things go. Web applications are different: every page view is essentially a new start-up. That GetText translation source is going to be loaded hundreds or thousands of times, not just once when the web server boots.

So, if we want to get this right, we need to know what our best bet is. Do we want a monolithic all translations file, or do we want to modularise this file and load it as needed? Does it use the file like a database and seek things out, or does it parse the whole thing every time and process it internally?

I’ve done some simple testing. I produced a basic test catalogue with poEdit and compiled a mo file from it:

msgid ""
msgstr ""
"Project-Id-Version: Test Zend GetText\n"
"POT-Creation-Date: \n"
"PO-Revision-Date: 2007-06-21 12:20-0000\n"
"Last-Translator: THEMike \n"
"Language-Team: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"X-Poedit-Language: English\n"
"X-Poedit-Country: UNITED KINGDOM\n"
"X-Poedit-SourceCharset: utf-8\n"

msgid "This is a test."
msgstr "[Translated]This is a test.[/Translated]"

I then wrote a simple test harness PHP file which loads a Zend_Translate using gettext and translates a single line. Before performing a translation, I var_dump the Zend_Translate instance to see what’s in it:

<?php
/* Configuration: */
define('PATH_TO_ENGINE', '/development/engine/');
define('PATH_TO_LANGUAGE', '/development/language/');

/* Put engine on the include path */
$curPHPIncludePath = ini_get('include_path');
if (defined('PATH_SEPARATOR')) {
    $separator = PATH_SEPARATOR;
} else {
    // prior to PHP 4.3.0, we have to guess the correct separator ...
    $separator = ';';
    if (strpos($curPHPIncludePath, $separator) === false) {
        $separator = ':';
    }
}
if (ini_set('include_path', PATH_TO_ENGINE . $separator . $curPHPIncludePath) === false) {
    die('Buggered');
}

require_once 'Zend/Translate.php';
$t = new Zend_Translate('gettext', PATH_TO_LANGUAGE . 'test.en.mo', 'en');

/* Dump the instance before translating anything */
echo('<pre>');
var_dump($t);
echo("</pre><hr/>\n");
echo($t->_('This is a test.'));
?>

The result of the var_dump being:

object(Zend_Translate)#1 (1) {
  ["_adapter:private"]=>
  object(Zend_Translate_Adapter_Gettext)#2 (6) {
    ["_bigEndian:private"]=>
    bool(false)
    ["_file:private"]=>
    resource(21) of type (stream)
    ["_locale:protected"]=>
    string(2) "en"
    ["_languages:protected"]=>
    array(1) {
      ["en"]=>
      string(2) "en"
    }
    ["_options:protected"]=>
    array(1) {
      ["clear"]=>
      bool(false)
    }
    ["_translate:protected"]=>
    array(1) {
      ["en"]=>
      array(2) {
        [""]=>
        string(339) "Project-Id-Version: Test Zend GetText
POT-Creation-Date:
PO-Revision-Date: 2007-06-21 12:20-0000
Last-Translator: THEMike
Language-Team:
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Poedit-Language: English
X-Poedit-Country: UNITED KINGDOM
X-Poedit-SourceCharset: utf-8
"
        ["This is a test."]=>
        string(40) "[Translated]This is a test.[/Translated]"
      }
    }
  }
}

As you can see, before I’ve even made a translate call, the entire mo translation catalogue has been loaded into memory and parsed into a PHP array, which is then used for translation.
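
To give a feel for the work involved in turning a mo file into an array, here is a minimal sketch of a reader for the little-endian gettext .mo layout. This is not Zend_Translate's code. The function names are made up, and build_mo() exists only so the example has a catalogue to read without needing a file on disk.

```php
<?php
// Sketch only: reads a little-endian gettext .mo catalogue into a PHP
// array. Not Zend_Translate's implementation; names are hypothetical.

function parse_mo($data)
{
    if (substr($data, 0, 4) !== "\xde\x12\x04\x95") {
        throw new Exception('Not a little-endian .mo file');
    }
    // Header fields after the magic number: revision, string count, and
    // the offsets of the original and translation tables.
    $h = unpack('Vrevision/Vcount/VorigTable/VtransTable', substr($data, 4, 16));
    $catalogue = array();
    for ($n = 0; $n < $h['count']; $n++) {
        // Each table entry is a (length, offset) pair of 32-bit integers.
        $orig  = unpack('Vlen/Voff', substr($data, $h['origTable'] + 8 * $n, 8));
        $trans = unpack('Vlen/Voff', substr($data, $h['transTable'] + 8 * $n, 8));
        $msgid = substr($data, $orig['off'], $orig['len']);
        $catalogue[$msgid] = substr($data, $trans['off'], $trans['len']);
    }
    return $catalogue;
}

// Helper for the demo: builds a tiny .mo catalogue in memory.
function build_mo(array $pairs)
{
    ksort($pairs);                       // originals are stored sorted
    $count      = count($pairs);
    $origTable  = 28;                    // header is 7 x 4 bytes
    $transTable = $origTable + 8 * $count;
    $pos        = $transTable + 8 * $count;  // string data starts here
    $header = "\xde\x12\x04\x95"
            . pack('V6', 0, $count, $origTable, $transTable, 0, $pos);
    $origEntries = $transEntries = $strings = '';
    foreach ($pairs as $msgid => $msgstr) {
        $msgid = (string) $msgid;
        $origEntries .= pack('VV', strlen($msgid), $pos);
        $strings .= $msgid . "\0";
        $pos += strlen($msgid) + 1;
        $transEntries .= pack('VV', strlen($msgstr), $pos);
        $strings .= $msgstr . "\0";
        $pos += strlen($msgstr) + 1;
    }
    return $header . $origEntries . $transEntries . $strings;
}

$mo = build_mo(array(
    'This is a test.' => '[Translated]This is a test.[/Translated]',
));
print_r(parse_mo($mo));
```

Note that the whole catalogue is walked and copied into a PHP array before a single phrase is translated, which matches what the var_dump above shows.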

Clearly, this argues for a very modular translation system. I would want a core.lang.mo file for “common” translations used throughout the application, and then a controller.lang.mo file for each controller, containing that controller’s specific phrases and loaded only by that controller.

However, note that the catalogue ends up as a PHP array. Essentially, it seems the gettext translator is really a front-end loader for the array translator. So why not use the array translator?

The only downside I can see is that it’s harder to get non-programmers to generate valid PHP arrays when supplying your translations. Other than that, anything that PHP or its extensions do to optimise the compilation and processing of PHP code will kick in and should give you significantly improved performance. Put extra things on top of that, like the Zend Optimiser and so forth, and you have a compelling reason to use highly modular array-based translation.

The problem then remains getting valid PHP array files back from your translators, and frankly, that can be solved by writing a simple front end for your translators so they have a GUI to use.
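
As a rough illustration of the saving half of such a front end, the sketch below writes submitted strings out as a PHP array file using var_export(). The function name save_language_file(), the sample strings and the file name are all hypothetical, and a real tool would also need validation and access control.

```php
<?php
// Sketch: the saving half of a hypothetical translator front end.
// save_language_file() and the file name are illustrative only.

function save_language_file($path, array $translations)
{
    // var_export() handles quoting and escaping, so translators never
    // have to write PHP syntax by hand.
    $source = "<?php\n\$LANGUAGE = " . var_export($translations, true) . ";\n";
    return file_put_contents($path, $source) !== false;
}

// Strings as they might arrive from a web form, awkward quotes included.
$submitted = array(
    'Save'           => 'Enregistrer',
    "User's profile" => "Profil de l'utilisateur",
);

$path = sys_get_temp_dir() . '/test.fr.php';
save_language_file($path, $submitted);

include $path;   // defines $LANGUAGE
echo $LANGUAGE["User's profile"], "\n";
```

Because the file is generated, an apostrophe or stray quote in a translation cannot produce a syntax error in the language file.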
