This snapshot, taken on 07/09/2008, shows web content acquired for preservation by The National Archives. External links, forms and search may not work in archived websites and contact details are likely to be out of date.
 
 
W3C Internationalization (I18n) Activity: Making the World Wide Web truly world wide!


Contributors

If you own a blog with a focus on internationalization, and want to be added or removed from this aggregator, please get in touch with Richard Ishida at ishida@w3.org.

All times are UTC.


Planet I18n

Planet I18n aggregates posts from various blogs that talk about Web internationalization (i18n). While it is hosted by the W3C Internationalization Activity, the content of the individual entries represents only the opinion of their respective authors and does not reflect the position of the Internationalization Activity.

September 05, 2008

Hacklog: Blogamundo

The Unicode Receipt

Yesterday I dropped by one of my favorite bookstores, Tempo Books in Tenleytown, DC. They specialize in foreign language books, among other things, so being a language nut, I always walk out a bit richer (and poorer).

Anyway, yesterday I noticed that they had written “thank you” in several languages on the receipt:

Thank you - Merci - Gracias - Obrigado - Evcharisto - Grazie - Danke - Spasibo - Arigato Gozaimashita - Takk - Shukran

Four of these are normally written in another script:

Evcharisto ευχαριστώ
Spasibo Спасибо
Arigato Gozaimashita ありがとうございました
Shukran شكراً

It’s interesting to think about particular contexts where the ASCII legacy is very strong, and I think that Point of Sale machines are an example of this. I’m curious whether any Point of Sale systems out there use anything besides ASCII.

Maybe some day we’ll have receipts that can handle something like this:

by Patrick Hall at 05 September 2008 01:42 PM

September 03, 2008

Global By Design

Google Chrome and Simon: Separated at birth?

Is it just me or does the new Google Chrome icon remind you of the old Simon game of the 1980s? Yes, I know I’m dating myself here, but I do see a resemblance…

google chrome logo

Simon electronic game

by John Yunker at 03 September 2008 07:24 PM

September 02, 2008

Hacklog: Blogamundo

Translation in Wikimedia Projects

Some very interesting slides from Wikimania 2008 (argh, wish I had gone):

Image:Translation in Wikimedia Projects - Arria.pdf - Wikimedia Commons

(Click the document image to see the pdf.)

Lots of thought-provoking statistics in there. One that caught my eye particularly was that a majority of the interviewed translators on Wikimedia projects don’t translate at work.

Still, the number I myself have always wondered about is not revealed, and admittedly it would be very hard to work out. Namely: what percentage of the text on the various Wikipedias is translated from another Wikipedia?

Also, some interesting reactions to the survey here: All The Modern Things: Translation in Wikimedia Projects

by Patrick Hall at 02 September 2008 11:23 PM

Another Book Translation Lookup Example

Yet another example of trying to find translations. Here’s a case where Index Translationum seems to almost cover the whole number of languages that the author gives himself:

From a course listing at UC Berkeley we find: Professor Reich is author of seven books, including The Work of Nations (which has been translated into 17 languages)

Searching for that work in Index Translationum, I find 9 distinct languages, and 15 listed. At least that accounts for half of them.

Anyway, I have found plenty of examples where Index Translationum is incomplete. For instance, this Spanish translation of this book is missing.

Despite being something of a skeptic when it comes to the Semantic web, I must admit that translation referencing seems like a domain where it could help.

by Patrick Hall at 02 September 2008 05:00 PM

August 25, 2008

Hacklog: Blogamundo

Is it possible to mine translations from the Flickr API’s clusters?

I have too much to do at the moment to think much about this, but here ya go, half baked for your masticatory delectation:

I was talking to my homey Carlos about the possibility of getting useful translation information out of Flickr’s API.

In particular, I had been tipped off by a rather old article on Flickr’s i18n endeavors, which mentioned the possibility of finding translations for tags within automatic tag clustering:

…Flickr’s tag cluster analysis tools, which monitor which tags are commonly used in conjunction with other tags, can bridge gaps. For example, a Japanese user who types in the Japanese characters for “Tokyo” can click to see clusters of related tags, the top one of which is the English term “Tokyo.”

But while Japanese-language Flickr users evidently often add the “Tokyo” tag in English, the converse isn’t necessarily true, meaning that the tag cluster bridge in some cases runs only one way. Flickr’s “Tokyo” English tag cluster doesn’t include the Japanese characters for Tokyo as a common tag companion.

The idea of mining Flickr tags for translations is certainly intriguing, but as far as I can tell, the difficulty lies in the fact that the clusters that Flickr returns are hard to filter by language, let alone by meaning.

So yes, the cluster for photos tagged with tokyo (japan, night, shibuya, shinjuku, harajuku, street, 東京, 日本, people, city) contains “東京” (“Tokyo”), but it also contains “日本” (Nihon, “Japan”). I don’t see an obvious path to nailing down just which of those terms is the right translation from such a list, statistically or otherwise.
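To make the difficulty concrete, here is a minimal Python sketch. It assumes the cluster has already been fetched and is just a list of strings (the list is copied from the cluster quoted above; no Flickr API call is made), and it uses a crude, hypothetical `is_ascii_tag` test as a stand-in for real script detection.

```python
# Tags copied from the "tokyo" cluster quoted above (hard-coded; no API call).
cluster = ["japan", "night", "shibuya", "shinjuku", "harajuku",
           "street", "東京", "日本", "people", "city"]


def is_ascii_tag(tag):
    """Crude stand-in for script detection: True if every character is ASCII."""
    return all(ord(ch) < 128 for ch in tag)


def translation_candidates(tags):
    """Return the non-ASCII tags, i.e. possible translations of an English tag."""
    return [tag for tag in tags if not is_ascii_tag(tag)]


candidates = translation_candidates(cluster)
print(candidates)  # both 東京 and 日本 survive the filter
```

Filtering by script narrows ten tags down to two, but nothing in the cluster itself says which of the two means “Tokyo”.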

But I haven’t thought too hard, hoping that maybe someone else has!

by Patrick Hall at 25 August 2008 05:34 PM

August 22, 2008

Hacklog: Blogamundo

A plain-English description of a Computational Linguistics Thesis

This is a great idea:

Markus Dickinson, an Assistant Professor in Linguistics at U of Indiana, wrote up a primer on what his first publication was about for “non-linguistic friends and family.”

He explains a pretty complex thesis topic (finding errors in automated part-of-speech tagging) from the ground up, in a way that’s accessible to folks who might never have heard of Computational Linguistics.

I wonder if any other professorial types out there have done something similar? Hard sort of thing to search for.

by Patrick Hall at 22 August 2008 04:03 PM

August 20, 2008

Global By Design

Will .cn become the new .com?

I recently came across a chart of the most popular domain extensions, compiled by Stephane Van Gelder. Although I keep track of ccTLD registrations for the Country Codes of the World map, Stephane tracks all domains, including .com, .net, etc. And when I saw it I got to thinking…

Here’s a screen grab of the figures I want to focus on:

most popular domains

What makes this chart so interesting are the growth rates — .com is growing at 5% and .cn is growing at 18%. Granted, it’s easier to grow at 18% when you’ve only got 12 million registrations, compared with growing at 5% when you’ve got 76 million registrations.

But growth is growth and .cn is clearly on a roll.

And China has a lot of headroom for growth in terms of Web users and potential domain registrants. I am confident that .cn will reach 50 million registrations over the next 3 years.

At about that point in time, .com should be around 100 million registrants — in no danger of losing its number one status.

However, if the rate of growth of .com registrations were to decrease while the .cn rate of growth continues to increase, it’s reasonable to wonder whether we will one day see the number of .cn registrations surpass .com registrations.
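As a rough back-of-the-envelope check, the following Python sketch asks how long a catch-up would take if both rates simply stayed constant. That is a simplifying assumption (the scenario above has the rates diverging), the text here does not say what period the 5% and 18% figures cover, so the answer is in unnamed "periods", and the function names are made up for illustration.

```python
import math


def project(registrations, rate, periods):
    """Compound growth: registrations after `periods` periods at a constant rate."""
    return registrations * (1 + rate) ** periods


def periods_to_overtake(small, small_rate, large, large_rate):
    """Periods until the smaller, faster-growing registry catches the larger one.

    Solves small * (1 + small_rate)**n == large * (1 + large_rate)**n for n.
    """
    if small_rate <= large_rate:
        raise ValueError("the smaller registry must grow faster to catch up")
    return math.log(large / small) / math.log((1 + small_rate) / (1 + large_rate))


# Figures quoted in the post: .cn at 12 million growing 18%,
# .com at 76 million growing 5%.
n = periods_to_overtake(12e6, 0.18, 76e6, 0.05)
print(f".cn would catch .com after about {n:.1f} periods")
```

Under those assumptions the crossover is well over a dozen periods away, and any slowdown in .cn growth pushes it out further.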

I realize this is a far-fetched scenario.

After all, it’s reasonable to assume that companies that register .cn may also register .com — and the majority do just that.

But it’s certainly something to contemplate. And even if .cn never comes close to surpassing .com, the overall point I’d like to emphasize here is that .cn is now the world’s second most popular domain extension, and likely to remain that way for many years.

What do you think?

by John Yunker at 20 August 2008 01:46 AM

August 19, 2008

Global By Design

Taking Web forms global

A Japanese input Web form

Web form usability expert Luke Wroblewski provides a very handy article on the challenges of developing Web input forms that work in various countries.

Data input and output is where Web localization projects often sink or swim. And Web forms can give a global marketing director night sweats.

Luke stresses that if you can identify the user’s country before presenting the form, you’re in much better shape, because you can then provide a fully localized form. And this is why global navigation is so incredibly important to successful Web localization. If you can help your customer find his or her country Web site right from the start, everything else gets so much easier (for you and your customer).

Here’s the article.

by John Yunker at 19 August 2008 01:30 AM

August 18, 2008

Global By Design

The coming oil crunch (1979)

I’m a pack rat and I’m trying to rid myself of the habit.

But it was interesting to come across this copy of Newsweek magazine from 1979:

by John Yunker at 18 August 2008 01:47 AM

Olympics Web site adds two languages (at the wire)

A commenter on my post on the stunning lack of languages on the Olympics Web site (particularly when compared with Euro 2008) notes that two more languages were added recently: Spanish and Arabic.

Here are before and after shots of the language gateway.

August 6th:

August 14th:

What I find interesting is that these two languages were either added right when the Olympics began or possibly even a few days later.

by John Yunker at 18 August 2008 12:57 AM

August 15, 2008

W3C I18n Activity highlights

Updated tests & results: Language declarations

These tests examine whether language information is available for text processing when declared in various different ways.

The format of the tests was improved, and the 6th test page was dropped (dealing with language attributes on block elements) since it replicates tests elsewhere.

The results were rewritten to reflect behaviour of the latest major browsers.

Update: An error was fixed in test page 3 and three new test pages were added, to examine the effect of multiple language values in the meta element and precedence of language attribute and meta element. The tests were re-run and the results page updated. [search key: test-lang-decl]

by Richard Ishida at 15 August 2008 10:30 AM

August 14, 2008

Hacklog: Blogamundo

Scripts.txt - How to look up what writing system a Unicode character is from (uh, kind of)

Here’s an interesting corner of the Unicode standard you may not know about:

UAX #24: Unicode Script Property (1)

The Script property tells you (in the large Scripts.txt file) what script a particular letter (2) is in. The set of script names includes things like Han (Chinese characters, also used in Japanese and Korean), Ethiopic (scripts derived from Ge’ez, used in Ethiopia and Eritrea), Latin (you’re reading it), Cyrillic (Russian, Ukrainian, several Turkic languages, etc, etc), Greek (for Greek!), and so on. There are also a few specialized things like Unknown and Inherited, which we won’t get into because, uh, I haven’t bothered to understand how they are used yet.
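Each data line in Scripts.txt has the shape `code point or range ; script name # comment`. Here is a minimal Python sketch of reading that format. The embedded `SAMPLE` is a hand-abbreviated stand-in written in the same line format, not the real file, and the function names are invented for illustration; real use would load the full file from the Unicode Character Database.

```python
import bisect

# Hand-abbreviated sample in the Scripts.txt line format; an assumption for
# illustration only. Load the real file from the Unicode Character Database.
SAMPLE = """\
# code point(s) ; script name # comment
0041..005A    ; Latin # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # LATIN SMALL LETTER A..LATIN SMALL LETTER Z
0410..044F    ; Cyrillic # abbreviated range
3041..3096    ; Hiragana # HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE
30A1..30FA    ; Katakana # KATAKANA LETTER SMALL A..KATAKANA LETTER VO
4E00..9FA5    ; Han # abbreviated range
"""


def parse_scripts(text):
    """Parse Scripts.txt-style lines into a sorted list of (start, end, script)."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        codepoints, script = (field.strip() for field in line.split(";"))
        start, _, end = codepoints.partition("..")
        ranges.append((int(start, 16), int(end or start, 16), script))
    return sorted(ranges)


def script_of(char, ranges):
    """Look up the script of one character; 'Unknown' if no range covers it."""
    starts = [start for start, _, _ in ranges]
    i = bisect.bisect_right(starts, ord(char)) - 1
    if i >= 0 and ranges[i][0] <= ord(char) <= ranges[i][1]:
        return ranges[i][2]
    return "Unknown"


ranges = parse_scripts(SAMPLE)
print([script_of(ch, ranges) for ch in "aЯあア東"])
```

With the abbreviated sample, anything outside the six listed ranges comes back as "Unknown"; with the full file that answer would be reserved for code points the standard really does leave unclassified.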

But one application is pretty clear, from my point of view: referring to scripts can improve the accuracy of automatic language identification.

The mapping between scripts and languages is many-to-many, messy, and sometimes politically contentious.

Even so, referring to scripts can narrow down the possibilities. In a few instances there is a one-to-one mapping between script and language (Mandaic, Armenian, Vai…). A language identification system should obviously know about those.

More commonly, matching scripts can be used to narrow down how many the “many”s mean in “many-to-many.” If it’s got Cyrillic in it, it’s from a language which is (at least sometimes) written in Cyrillic.

In the case of my own language id library (still not really out of the gate yet, hold tight!), the algorithm I used pukes on Chinese and Japanese. Why? The kind of algorithm I used would require a boatload of text to get a reasonably general model of the huge set of Chinese characters. (3)

Referring to the script should help with this problem; if some large percentage of the letters in the text have the script property Han then it may be Chinese. If it contains Han, Katakana, Hiragana, and Latin, it’s probably Japanese.

And so on.
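That rule of thumb can be sketched in a few lines of Python. Python’s standard library does not expose the Script property directly, so this sketch approximates it from character names via `unicodedata.name`; the function names and the 0.5 threshold are arbitrary assumptions, and the test for Japanese is relaxed to "any kana at all" rather than requiring every script in the list.

```python
import unicodedata
from collections import Counter


def rough_script(char):
    """Approximate a character's script from its Unicode name (not Scripts.txt)."""
    name = unicodedata.name(char, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    for script in ("HIRAGANA", "KATAKANA", "LATIN", "CYRILLIC"):
        if name.startswith(script):
            return script.capitalize()
    return "Other"


def guess_cjk_language(text, han_threshold=0.5):
    """Apply the rule of thumb above; the 0.5 threshold is an arbitrary assumption."""
    counts = Counter(rough_script(ch) for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    if counts["Hiragana"] or counts["Katakana"]:
        return "probably Japanese"
    if counts["Han"] / total >= han_threshold:
        return "maybe Chinese"
    return "unknown"


print(guess_cjk_language("東京はとても大きい都市です"))
print(guess_cjk_language("北京是一个很大的城市"))
```

A check like this would only be a pre-filter: it narrows the candidate languages before a statistical model takes over, and it says nothing about the other languages written with Han characters.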

There are some other applications of the information in Scripts.txt, as well. (The docs mention font handling.) I’d be curious to know if anyone else out there is using it somehow.

1) That’s a “Unicode Standard Annex,” which means it’s an official part of the Unicode standard. (Unicode.org has a page describing the differences between Standard Annexes, Technical Standards, and Technical Reports.)

2) I use the word “letter” in a general, and rather uncommon, way. I realize that most technical writers eschew that term, but I use it precisely because… most technical writers eschew that term. I have an old blog post where the topic comes up: Infundibulum: Ruby and Unicode.

3) Interestingly, trying to debug that issue led me to realize that the UDHR in Japanese contains zero katakana characters!

by Patrick Hall at 14 August 2008 02:29 PM

August 09, 2008

Global By Design

Google provides a bit of multilingual Web site advice

It’s not much, but it’s a start. Basically, Google says you don’t necessarily have to register a country code for each country Web site (though it certainly helps). But if you can’t get a ccTLD, at least be sure to let Google know (via Webmaster Tools) how your country subdomains are organized so Google can effectively spider them.

I remain a big proponent of using ccTLDs. After all, Google is not the leading search engine in all countries, particularly China and Russia.

I’d love to see Google do more in helping Webmasters understand how to manage content across multiple country Web sites. There is great concern over hosting duplicate content (which Google penalizes) and continued questions about managing multiple languages within a specific country.

(thx to my brother for the heads up on this post)

by John Yunker at 09 August 2008 04:06 PM

August 08, 2008

W3C I18n Activity highlights

New tests & results: CSS encoding detection

These tests examine whether user agents follow the rules in CSS 2.1 about detecting the encoding of CSS style sheets. This is particularly important if your style sheet uses non-ASCII characters in such things as class names, content, or font names. [search key: test-encoding-detection]
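For orientation, CSS 2.1 gives a priority order for an external style sheet: the HTTP charset parameter, then a byte order mark and/or `@charset` rule, then metadata from the linking mechanism, then the encoding of the referring document or style sheet, and finally UTF-8. The Python sketch below is a simplified walk through that order, not a conforming implementation: the function name is made up, the two "referrer" steps are collapsed into one hypothetical parameter, and `@charset` is only recognised in ASCII-compatible encodings.

```python
import codecs
import re

# Byte order marks to check, longest first so UTF-32 is not mistaken for UTF-16.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# An @charset rule must be the very first thing in the style sheet.
CHARSET_RULE = re.compile(rb'^@charset "([^"]*)";')


def sniff_css_encoding(data, http_charset=None, referrer_charset=None):
    """Simplified walk through the CSS 2.1 priority order for a style sheet."""
    if http_charset:                       # 1. HTTP Content-Type charset
        return http_charset.lower()
    for bom, name in BOMS:                 # 2a. byte order mark
        if data.startswith(bom):
            return name
    match = CHARSET_RULE.match(data)       # 2b. @charset (ASCII-compatible only)
    if match:
        return match.group(1).decode("ascii", "replace").lower()
    if referrer_charset:                   # 3-4. linking mechanism / referrer
        return referrer_charset.lower()
    return "utf-8"                         # 5. default


print(sniff_css_encoding(b'@charset "ISO-8859-1";\n.caf\xe9 { color: red }'))
```

A style sheet with a non-ASCII class name and no declaration at all falls through to the last two steps, which is exactly the situation where user agents are most likely to disagree.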

by Richard Ishida at 08 August 2008 05:05 PM

Web Access Centre Blog

Beijing Olympic website Part Two: internationalisation (#080808)

With all eyes on Beijing for the 2008 Olympics I thought I’d publish a few observations of how well the official Beijing Olympic 2008 website works for international users. This post accompanies one I wrote about the accessibility of the Beijing 2008 website and flags where the crossovers exist between accessibility, localisation and internationalisation.

Quick disclaimer: my background is not in localisation and internationalisation but accessibility. My interest in the two comes from where I see issues that affect people with disabilities also affect international users. As with the accessibility review this is just a snapshot of a few observations and I’d love to hear more about your thoughts on the site, so please go ahead and post comments.

Before we start let’s have a quick look at what is meant by internationalisation (also known as i18n) and localisation (also known as l10n).

The W3C definition of internationalisation is:

“…the design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language”.

And the W3C definition of localisation is:

“…adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a “locale”)”.

This can be pretty confusing at first glance. The definitions seem to imply two different things: going global versus going local. But if you’re a website owner based in the UK who wants to expand into foreign markets then you have to take the step towards going global before you can make an entrance locally. In simple terms, for the purpose of this article, what these two definitions mean is that a web site template needs to be internationalised (i.e. the back end and architecture etc) in order to support localised content (text, colours, images, icons and so on).

Moving away from websites for a minute, another way of looking at it is an analogy of a car. If the car is internationalised then the body of the car is built in such a way that a steering wheel can be easily fitted on either the left or the right. If the car has not been internationalised however then it won’t be able to be customised to fit a steering wheel on either side, so the body of the car has to be built from scratch. Once the body of the car is right then you can focus on colours and fittings. And this is a key thing to think about with both internationalisation and accessibility in a website. If both are incorporated from the start you’ll have no costly and difficult retrofits down the line.

So how does Beijing 2008 fare? In no particular order, here is a summary of a handful of issues I focused on:

Presentation versus Content

Key to both an accessible site and an internationalised site is separating presentation from content i.e. using structural mark up to indicate headings, lists and data tables and CSS for layout and formatting fonts and styles etc. Returning to the difference between internationalisation and localisation flagged at the start of this article this basically means ensuring that your web content is constructed so that it can be painted and decorated in a way that suits its target audience.

Let’s look at an example of why semantics are so important. Chinese uses italics as a form of emphasis when printed but this doesn’t tend to look that great on web pages, so using “i” tags around ideographic text is not an ideal solution. Emphasis i.e. “em” however can be used so that a dot appears over the character being emphasised or a shaded box appears as its background.

Looking at the Beijing 2008 site structural markup has been used for content and CSS for layout which is good. This allows for much easier localisation of each language component and flexibility to style text in a way that works for any given language. There is a slight problem with the heading structure coded in the site though, as headings are not always coded as they should be, as noted in the headings section of Beijing 2008: part one accessibility. This is a problem because if the semantics are not right the content becomes less useful or usable both in terms of access for technologies such as screen readers, (an access technology that reads aloud content to visually impaired users), and how the site is translated.

Language coding

When working on sites that have multiple languages it is important to indicate what the main language of a page is, and any language changes within the content. This is necessary for a number of reasons:

  • Screen readers will be able to identify language changes.
  • Browsers will be able to display text properly and display text according to language-specific hyphenation and spacing rules.
  • Search engines are better able to index your pages. Google, for example, allows you to search by language.

The Beijing 2008 site is available in Chinese, English, French, Spanish and Arabic with some pages having some parts of the page written in a language other than the main language of the page. For example in the French site there may be words presented in English. This is not a problem providing the page has been coded so that the main language is indicated and any changes in language on that page have also been identified.

The main language of the page should always be declared on the html element at the top of the document. After that, any changes in the language on the page should also be marked up using the lang attribute. So for example, if you have an English site that includes the French text Au Revoir! on a page, the lang attribute for French must be used to indicate the change, i.e. lang="fr".

In the Beijing 2008 site the main language of each language version is correctly coded for the Chinese (xml:lang="zh-CN" lang="zh-CN"), English (xml:lang="en" lang="en") and French (xml:lang="fr-FR" lang="fr-FR") pages but changes in language within page content are not. So for example the top right hand links on the English page to 中文 and Français are not coded using lang="zh" and lang="fr" respectively. This could be fixed if the image links were given appropriate lang attributes and alternative text, or were replaced with text links with the language change identified.
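A quick way to see which language declarations a page actually carries is to list every element that has a lang attribute. The Python sketch below does this with the standard library’s `html.parser`; the `LangCollector` class and the sample page are hypothetical, written only to illustrate the markup pattern described above.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Record every element that carries a lang attribute, in document order."""

    def __init__(self):
        super().__init__()
        self.declared = []  # list of (tag, lang value)

    def handle_starttag(self, tag, attrs):
        lang = dict(attrs).get("lang")
        if lang:
            self.declared.append((tag, lang))


# Hypothetical page: English overall, with one French phrase marked up.
PAGE = """<html lang="en">
<head><title>Goodbye</title></head>
<body><p>As the French say, <span lang="fr">Au revoir!</span></p></body>
</html>"""

collector = LangCollector()
collector.feed(PAGE)
print(collector.declared)
```

On a page like the English Beijing 2008 home page as described above, a check of this kind would be expected to show the declaration on the html element but nothing for the 中文 and Français links.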

Links to alternative language versions

When people navigate to a site that is not in their native language one of the first things they will do is look for a link that takes them to a version of the site that is in their native language. As a website owner one of the key things you need to do therefore is make links to alternative language versions as clear and easy to find as possible. There are a number of ways that websites link to alternative language versions including drop down menus, images of linked flags and text links of the languages. Ideally however the best way to do this is to provide a link to the language version in the language of the target page. The Beijing 2008 site does just that at the top right of the page with images of text for English, 中文 and Français.

So why is this technique the preferred one? Let’s look at the example of a drop down menu with the text “Select language” first.

A drop down menu with country options and label

If you are a non English speaker landing on this site it’s unlikely that you will read and understand the text label “Please select your language” (not to mention the fact that the colour contrast is pretty poor) and then look inside the drop down.

The second option of using linked flags creates problems for speakers of the same language from different countries. Do you click on the Spanish flag if you’re from South America, the French flag if you are from Quebec in Canada? This is where localisation comes into play. You need to be sensitive to people’s origins and not bundle groups of people together who consider themselves very different. It’s not unusual to get a US and UK version of the same site after all.

One thing that did initially catch me out however was the lack of prominence of the language links and their presentation as images of text. The Olympics website is the ultimate in terms of a global website and I expected there to be a clear, obvious place where I could browse language options. Using images for text is never really advisable as they cannot be scaled up and enlarged for those of us who need larger font sizes, nor can the colour be changed if the contrast or colour combination is difficult to read.

Images and animations

How accepting people are of a website often comes down to how it looks: its use of images and animations. People from different cultures can have a very different perception of what good design is, and what they gravitate towards. In China for example people appreciate animated images more than people in Europe. Generally the Chinese are also less concerned with lengthy pages to scroll down and compact information whereas in the West we fear the fold and don’t want key links and text to be lost in the crowd.

When looking at the Beijing 2008 home page alone you are immediately presented with an animated image that takes up almost all of the top third of the screen and rotates constantly. There are also scrolling images further down as well as an animated image, and the page requires you to do quite a bit of scrolling. If part of localisation is about designing for a specific region or culture then in my opinion Beijing 2008 doesn’t quite hit the mark. After three years working on the web in Shanghai, China I’m pretty familiar with Chinese web design and the site definitely “feels” Chinese in origin. A Chinese look and feel is wholly appropriate as it is the Beijing Olympics but it could be done in a way that makes it also appealing to a Westerner’s eye. To illustrate what I mean look at the differences between the Yahoo! China website and the Yahoo! UK website. These sites both have the look and feel of Yahoo! but there are fundamental differences between them. The Chinese version contains more images and animated images, involves more scrolling and contains more information than the UK one.

The URL

When working on international sites with different language versions clear URLs are important so that users can figure out where they are. The URL, http://en.beijing2008.cn/, is not that intuitive for a user as it is neither easy to remember, tidy, nor logical. Obviously URLs are designed to be machine readable and read by browsers, but don’t underestimate the importance of unambiguous human readable URLs. When I look at this URL I see the “en” at the front which indicates to me that the site is in English but the fact that it ends in “cn” makes me wonder if it may be in Chinese. I’d prefer to see something like www.beijing2008.com/chinese or www.beijing2008.com/cn for the Chinese site followed by clear sub-directories. Del.icio.us and Flickr have unambiguous clear URLs. Looking at my page on del.icio.us you’ll see that it is constructed by the site domain followed by my online name http://del.icio.us/iheni. If searching for links on del.icio.us that I’ve posted about internationalisation then the logical construct is going to be http://del.icio.us/iheni/internationalization. A URL that is easy to work out reinforces the name and brand, and does wonders for making a site more memorable and usable when presenting content in multiple languages.

Recently ICANN announced that it was testing the possibility of internationalised domains in eleven non-Romanized languages (i.e. translating .com). It’s not clear at the time of writing as to when these would become established on the web, but it’s an interesting consideration for the Beijing 2008 site as well as the London 2012 site for that matter.

Conclusions

These really are just top level thoughts and there is much more that could be covered by someone far more qualified than I when commenting on the internationalisation and localisation of the site. That aside I think what is clear from looking at the Beijing Olympic site from the perspective of a user with disabilities, a mobile phone user or someone from a different linguistic or cultural background is that the site still has a long way to go before it can cater to the highest number of diverse audiences.

It makes me wonder if there is a standard set of practices that an Olympic website must follow, including the Web Content Accessibility Guidelines, Internationalisation Best Practices and Mobile Web Best Practices, that is handed down from the Olympic committee when each Olympic site is started. This would help build up a body of design knowledge for each new site, cutting down the work needed in each new build.

London will be the host of the next Olympics in 2012 and there are clearly lessons to be learnt, not least incorporating accessibility, internationalisation and mobile web best practices from the outset rather than slotting them in at vast cost after the site has been built. We’ve already seen the Sydney Olympic site fail in the courts due to accessibility; perhaps London 2012 can turn that around and be an exemplar of an accessible, internationalised site that renders well for mobile users.


by Henny at 08 August 2008 03:55 PM

Global By Design

Global by Design now in 25 languages

I read about a startup (via Techcrunch) recently called mloovi. The service leverages Google Translate to provide real-time translations of your blog feed. I’ve installed the widget over on the right and would love to know what people think.

My biggest concern is slow-loading Web pages. And, yes, I know the quality of the translation will leave plenty to be desired, but what I really like about the widget are the little RSS feed buttons. Just click the button and you can have translated feeds delivered to whatever feed reader you use.

What I don’t understand is the significance of the name “mloovi.” Am I missing something?

by John Yunker at 08 August 2008 03:16 AM

August 06, 2008

Hacklog: Blogamundo

Microsoft trademarked “i’m”?

Microsoft apparently has trademarked the word “i’m”.

Looks like a good cause and all (fundraising for charity), but surely the trademark wasn’t necessary (why can’t they just do what they’re doing without trying to trademark that word?)

And if there has ever been a trademark that’s unenforceable, this has got to be it.

by Patrick Hall at 06 August 2008 05:37 PM

August 05, 2008

Hacklog: Blogamundo

Computational, You Say?

Man, am I confused.

Could someone explain to me how the “Computational” Linguistics Olympiad has anything to do with computation?

Via Language Log I learned that Google, the NSA, Cambridge University Press, and a bunch of Universities all over the US and Canada ran a competition called the “North American Computational Linguistics Olympiad.” The winners get to go to something called the 6th International Olympiad in Linguistics.

Query 1: Man, why didn’t this kind of thing exist when I was in high school? (Oh yeah, computational linguistics on clay tablets is tiresome…)

Query 2: Do the questions in the first round problems (pdf) really have anything to do with computation… at all?

Since the pdf says that the questions are copyrighted, I can’t reproduce them here, but I can say that they are exactly the sorts of problems that I was given in my undergraduate linguistics classes. (The kinds of classes where people would get indignant if you suggested they make use of machine-readable dictionaries… “That’s cheating!” they countered. I kid you not). There isn’t a thing in there that can’t be done with a pencil and paper. And indeed, that seems to have been the point; the students were given 3 hours and no computer to complete their “computational” linguistics test.

I’m doubly confused because of just how illustrious the committee running the thing was: a Who’s-Who of numerically-informed linguistics. (Among them was Steven Abney, author of the best paper on why linguistics needs statistics that I have ever had the pleasure of discovering.)

So why are all the problems exercises in logic? There’s nary a number in sight.

We do not grok.

by Patrick Hall at 05 August 2008 11:19 PM

Is there something wrong with putting Unicode into Javascript source code?

As far as I can tell, it’s perfectly okay to put UTF-8 encoded Javascript directly into .js files. And yet, programmers seem to be wary of doing so, and prefer to use numerically encoded escapes. So for instance, you’ll see:

nWithTilde = String.fromCharCode(241)

instead of:

nWithTilde = "ñ"

Why?

Is there some good reason to avoid putting actual characters in, or is it just legacy ASCII-ism? It seems to me that if a language has good Unicode support, as Javascript does, there’s no reason not to take advantage of it!

And the same can be said of other scripting languages. Python and Ruby both allow UTF-8 in source files if the file is marked with this special comment:

# coding: utf-8

If you’re dealing with non-ASCII content (and your default assumption should be that you are), then directly adding the characters seems a lot more elegant to me.
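To make the comparison concrete, here is a minimal sketch in Python. It assumes Python 3, where source files are read as UTF-8 by default, so the coding comment is optional but harmless. The variable names are mine, purely for illustration. It checks that the literal character, a code-point lookup, and a numeric escape all give the same one-character string:

```python
# coding: utf-8
# Three ways to write the same one-character string.
literal = "ñ"                   # the actual character, typed directly
from_code_point = chr(241)      # Python's counterpart to String.fromCharCode(241)
from_escape = "\u00f1"          # numeric escape for U+00F1

# All three spellings should compare equal.
all_equal = literal == from_code_point == from_escape
print(all_equal)
```

If the three really are interchangeable, the choice comes down to readability, and the literal is the only one you can read without looking up a code chart.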

by Patrick Hall at 05 August 2008 04:02 PM

August 04, 2008

Hacklog: Blogamundo

Language Selection on LinkedIn

Here’s an interesting new addition to my collection of Language Choosing Widgets, on LinkedIn (as you can see, Spanish is the only alternative there so far):

I like the little gray-on-white world map as an icon for language choice. Usually those are represented as a globe.

Yes, I posted about that mundane fact.

That is all.

Nothing else to see here, move along.

by Patrick Hall at 04 August 2008 09:42 PM

Global By Design

Watch out ProZ, here comes Google Translation Center

Within the translation industry, ProZ is widely known as the leading public network of freelance translators and buyers of translation services.

But here comes Google…

According to Blogoscoped, Google is about to launch the Google Translation Center.

This is an exciting development, though I don’t expect everyone to suddenly ditch ProZ for Google. Why? Because much of the appeal of ProZ is the community, which Google does not appear to be trying to support. Still, freelancers will certainly want to investigate this potential new resource.

I’ve called out ProZ as one company under threat from Google Translation Center. But EVERY translation agency needs to keep a close eye on this service. It could be a threat. It could also end up being something translation agencies use themselves — instead of paid platforms from SDL. Naturally, for this to happen this new platform has a lot of evolving to do. Still, I can’t help but wonder.

There is no mention of whether or not Google will support machine translation and/or translation memory. I’m assuming they will.

I have LOTS of questions and this service isn’t even live yet. So we shall see what happens. But this is big news, no question.

I wrote a while back that the translation industry as we know it is over. The technologists have taken over and they’re bringing brute force computing and massive networks to the table to reduce costs and shorten time to market. This is just another sign of this macro trend.

What do you think? Is Google going to disrupt the translation industry or is this new platform going to fall flat?

(Thx Chris for the heads up!)

Update: I just read an insightful article on this Google service at GigaOm…

by John Yunker at 04 August 2008 04:29 PM

Kosovo requests ccTLD

Can an international crisis be started over a country code?

That’s what I can’t help but wonder when I read that Kosovo has requested its own country code domain.

Serbia (and its powerful backer Russia) do not accept Kosovo’s independence and are not going to be happy if Kosovo does get its own ccTLD. But ICANN may very well issue one if Kosovo meets certain criteria in the months ahead.

Keep in mind, you do not have to be a country in every sense of the word to have a country code. Antarctica (.aq) has one. So does Bouvet Island (.bv) — an uninhabited piece of land in the Atlantic.

by John Yunker at 04 August 2008 02:44 PM

W3C I18n Activity highlights

Updated tests & results: Language-dependent styling

These tests examine whether a user agent is capable of styling elements in XHTML 1.0 served as text/html using CSS selectors that examine the language declared in element attributes.

All tests have been rewritten, resulting in some small corrections and changes to the previous set of tests, and improvements to the page styling. Additional tests were added to check behavior with regard to the script tags in BCP 47. Also, a whole set of tests was developed for XHTML 1.1 (served as XML). The results page was rewritten to reflect the behavior of the latest versions of a number of major browsers (including IE8 beta, which now supports :lang). [search key: test-css-lang]

by Richard Ishida at 04 August 2008 11:46 AM

July 31, 2008

Hacklog: Blogamundo

More on How to Find Book Translations

That post about how to find the translations of books turned out to address a more difficult question than I had imagined. Surely, somewhere there was a giant database, which maps translations to originals?

Well, there is, sort of.

The big discovery for me was something called the Index Translationum, which was built by UNESCO. (I was tipped off to it by the very useful www.askusnow.info, a service where you can chat with librarians.)

For instance, there are 390 listings for a search for “The Lord of the Rings,” 442 for “Harry Potter” (yes, as a matter of fact, Harry Potter has been translated into Basque), and 24 for Philip K. Dick’s The Man in the High Castle. That last number would appear to be larger than the number of translations listed on www.philipkdick.com. It would also appear to be more than the 16 listed on LibraryThing. (click “Work Details”) It’s very important to note, however, that LibraryThing has the proper Unicode records of titles in non-Roman scripts, whereas the Index Translationum has a crummy, ill-defined transliteration system.

Thanks to Anirvan Chatterjee of BookFinder.com for his suggestions, which included the fact that LibraryThing tracks translations.

One final suggestion I’d throw in myself is simply to check Wikipedia. If the book is famous enough to have its own article, as The Man in the High Castle is, then the left-hand links often turn up several articles whose titles are probably the titles of translated works. In this case, the links turn up Człowiek z Wysokiego Zamku, Manden i den store fæstning, Das Orakel vom Berge, El hombre en el castillo, Le Maître du Haut Château, La svastica sul sole, האיש במצודה הרמה, Mannen i det höga slottet, and 高堡奇人…

So the bottom line is, there are a bunch of places to look. But we’re not in the age of a “translation lookup web service” or anything like that, yet.

Further suggestions welcome…

(I’d add in passing that it’s really surprising that online bookstores don’t make this sort of information available. A Brazilian customer, say, who searches for “The Man in the High Castle” might be that much more likely to buy a copy upon being informed that O homem do castelo alto exists…)

by Patrick Hall at 31 July 2008 03:36 PM

July 29, 2008

Global By Design

Map of the World Wide Web: Get ’em while they last

Map of 180 country code TLDs

I’ve got about a hundred copies of this map remaining and I’m offering them for $3 each for orders of 25 or $2 each for orders of 50 (plus postage).

The map normally sells for $12 each, so this is a nice discount — and a great way to get your whole office a copy of this useful map.

Here are more details of the map:
http://bytelevel.com/map/map_of_WWW.html

Please note that this is a smaller version of the poster now being sold. It is designed to fit on a cubicle wall and displays 180 ccTLDs.

If you’re interested in purchasing, please contact me.

by John Yunker at 29 July 2008 03:58 AM

Hacklog: Blogamundo

Cuil’s Unicode Support… Or lack thereof.

The blawgs are ablather in talk about a new search engine called Cuil.

I took a look, and personally I think it looks pretty nice. The layout is original, it seems quite fast, and while the results don’t seem to be as good as that (cough) other search engine’s, it strikes me as better than some other alternatives I’ve seen.

Except for the deal killer.

Exhibit A: Russian
Википедию - Cuil

Exhibit B: Bengali
উইকিপিডিয়া - Cuil

Exhibit C: Japanese
ウィキペディア - Cuil

Exhibit D: French
Wikipédia - Cuil
It ignores the diacritic and searches for “Wikipedia.” The first hit is http://en.wikipedia.org/wiki/Wikipedia.

Exhibit E: Chinese
維基百科 - Cuil
Huh, Chinese works. Go figure.

(I just randomly tried those languages.) Almost all of the above return “No results because of high load… Due to excessive load, our servers didn’t return results. Please try your search again.” Which is obviously not the case, because ASCII searches run okay.

In other words, Cuil pretty much doesn’t index anything but ASCII and, uh, Chinese. The 1970s called, they want their regular expression back…
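I have no idea what Cuil actually does internally, so take this as a guess at the kind of processing that would produce Exhibit D, not a description of their code. It is a minimal Python sketch of accent folding (decompose, then drop the combining marks), plus a plain ASCII check. The function names are hypothetical:

```python
import unicodedata

def fold_accents(text):
    """Decompose to NFD, then drop combining marks (e.g. the acute accent on e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def is_ascii_only(text):
    """True if every character falls in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)

# Two of the queries from the exhibits above.
for query in ["Wikipédia", "Википедию"]:
    folded = fold_accents(query)
    print(query, "->", folded, "| ASCII after folding:", is_ascii_only(folded))
```

Folding turns the French query into plain “Wikipedia,” which an ASCII-only index can serve, while the Russian query is still not ASCII afterwards. Note that this naive folding is only reasonable for Latin-script text; applied blindly it would also strip meaningful marks in other scripts.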

I’m sure more complaints, like this Twitter post about the lack of Vietnamese support, will bubble up…

Anyway, it’s not like building a search engine is easy or something! I hope a fix for this is in the works, and good luck to Cuil!

Oh brother. Putting the word “Cuil” into this post seems to have been a total spam trap… had to turn off comments on this post. Eh, I’ll just delete them.

by Patrick Hall at 29 July 2008 12:02 AM

July 26, 2008

Global By Design

China now leads in Internet users (and country codes)

The NY Times reports that China has surpassed the US in terms of Internet users. This comes via China’s state-controlled Internet Network Information Center. Here are the key numbers:

United States: 220 million Internet users, 70% penetration

China: 253 million Internet users, 19% penetration

For readers of this blog, this development is hardly news. But it’s significant nonetheless. After all, the US isn’t exactly going to catch back up in this regard. China wins the numbers game, at least when it comes to people.
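A rough back-of-the-envelope check shows why. The Python sketch below divides each user count by its penetration rate to get the population those figures imply. It assumes both percentages are measured against total population, which the numbers above don’t confirm, and the function name is mine:

```python
# Figures quoted above: (Internet users in millions, penetration as a fraction).
reported = {
    "United States": (220, 0.70),
    "China": (253, 0.19),
}

def implied_population(users_millions, penetration):
    """Population in millions implied by a user count and a penetration rate."""
    return users_millions / penetration

for country, (users, penetration) in reported.items():
    ceiling = implied_population(users, penetration)
    print(f"{country}: about {ceiling:.0f} million people implied")
```

On those assumptions the US tops out at roughly 314 million people, while China’s figures imply about 1.33 billion, so China still has around a billion people yet to come online.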

Here’s an interesting excerpt from the article:

Baidu, for instance, said on Thursday that its second-quarter net profit had jumped 81 percent. During that period, Baidu had a 63 percent share of China’s search engine market, while Google had about 26 percent, with Yahoo trailing far behind, according to iResearch, a market research firm based in Beijing.

Tencent, a popular site for social networking and gaming, now has a stock market value of $15 billion, making it one of the world’s most valuable Internet companies. In comparison, Amazon.com is valued at about $30 billion.

China also leads in having the world’s most popular country code (.cn).

by John Yunker at 26 July 2008 02:46 AM

July 24, 2008

Global By Design

Has Google hit a language ceiling?

Google announced that they now have 30 products available in 30 languages. And many of these products, such as Gmail and Adwords, now support 40 languages.

Here is a graph they published of the rate of growth of their language support. It’s a very impressive visual, but I found it potentially misleading.

Google’s 40-language graph

What is being displayed is not the total number of “unique” languages Google supports, just the total number of product/language combinations. And that’s an important detail.

Google is nowhere near supporting 1,400 different languages. Their search engine interface, which supports roughly 120 languages, represents the maximum number of languages the company supports. And this number has only increased by about 10 languages over the past two years.
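To show the difference between the two counts, here is a small Python sketch. The product names and language codes are made up for illustration and are not Google’s actual data:

```python
# Hypothetical data: which languages each product supports.
support = {
    "search": {"en", "fr", "de", "ja", "zu"},
    "mail": {"en", "fr", "de", "ja"},
    "video": {"en", "fr", "ja"},
}

# What a combinations graph counts: one for every (product, language) pair.
combinations = sum(len(languages) for languages in support.values())

# What "languages supported" usually means: distinct languages across all products.
unique_languages = len(set().union(*support.values()))

print(combinations, unique_languages)
```

By the same logic, 30 products in 30 languages each would plot as 900 combinations while covering only 30 distinct languages. The combination count grows every time an existing language is added to another product, even if no new language appears anywhere.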

The other Google applications appear to have peaked (for now) at between 40 and 43 languages.

To support 40 languages is remarkable. Based on my survey of 225 global Web sites in the 2008 Web Globalization Report Card, fewer than 10 companies support 40 or more languages (English excluded).

Still, it looks as if Google is now focused on getting its increasingly wide selection of software up to the 40-language mark rather than aggressively pushing into brand new languages. Gmail, for instance, now appears to be adding a language or two per year — rather than 10 to 20, which is the pace we’ve been seeing with YouTube and Blogger.

by John Yunker at 24 July 2008 03:27 PM

Hacklog: Blogamundo

How do you figure out what languages a book has been translated into?

I was watching an interview of an author this morning, and by way of introduction, the interviewer said that the author’s book had been “translated into 30 languages.”

That’s a standard phrase, but I wondered which 30 languages the book had been translated into.

And then I realized I have no idea how to find out the answer to that question.

Do you?

by Patrick Hall at 24 July 2008 12:24 PM


Contact: Richard Ishida (ishida@w3.org).