Monday, May 5, 2008

Google is moving to Unicode 5.1

Google has just begun supporting Unicode 5.1, less than one month after it was released. Support is now available in search, so speakers of languages such as Malayalam can search for words containing the characters that are new in Unicode 5.1.

Web pages can use a variety of character encodings, such as ASCII, Latin-1, Windows 1252, or Unicode. Most encodings can represent only a few languages, but Unicode handles anything from Chinese to French to Arabic. We have long used Unicode as the internal format for all the text we search: any other encoding is first converted to Unicode for processing. So we regularly update to each new version of Unicode (and related standards like CLDR and BCP 47) to make sure we are current. Unicode thus plays a key role in Google's mission.
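
As a rough illustration of what "converted to Unicode for processing" means, here is a minimal Python sketch. It is not Google's pipeline: the function name `to_unicode` and the sample pages are made up for this example, and it assumes each page's declared encoding is correct, which real web pages do not always guarantee.

```python
def to_unicode(raw: bytes, declared_encoding: str) -> str:
    """Decode raw page bytes into a Unicode string.

    Assumes declared_encoding is accurate (a simplification for this sketch).
    """
    return raw.decode(declared_encoding)


# Hypothetical sample pages, each stored in a different legacy or Unicode encoding.
pages = [
    ("français".encode("latin-1"), "latin-1"),
    ("\u201cquoted\u201d".encode("cp1252"), "cp1252"),  # curly quotes exist in Windows 1252
    ("中文 français عربي".encode("utf-8"), "utf-8"),
]

# After decoding, every page is the same kind of object: a Unicode string.
texts = [to_unicode(raw, enc) for raw, enc in pages]

for text in texts:
    print(text)
```

Once everything is a Unicode string, later steps such as tokenizing or matching do not need to know which encoding the page originally used.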

Just last December there was an interesting milestone on the web. For the first time, we found that Unicode was the most frequent encoding on web pages, overtaking both ASCII and the Western European encodings. By coincidence, it passed the two within 10 days of each other. What's more impressive than simply overtaking them is the speed with which this happened; take a look at the blue line in this graph.


You can see a long-term decline in pages encoded in ASCII (unaccented letters A through Z). More recently, there's been a significant drop in the use of encodings covering only Western European letters (ASCII and a few accented letters like Ä, Ç, and Ø). We're seeing similar declines in other language-specific encodings. Unicode, on the other hand, is showing a sharp increase in usage.
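
The difference in coverage between these encoding families can be checked directly. The sketch below is only an illustration with made-up sample strings: it uses Python's built-in `ascii`, `latin-1`, and `utf-8` codecs as stand-ins for the three groups discussed above, and the helper `can_encode` is a name invented for this example.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


samples = ["hello", "Äpfel", "中文"]
encodings = ["ascii", "latin-1", "utf-8"]

# For each sample, list the encodings that can represent it.
coverage = {s: [e for e in encodings if can_encode(s, e)] for s in samples}

for sample, usable in coverage.items():
    print(sample, "->", usable)
```

Plain English text fits in all three, the accented sample needs at least a Western European encoding, and the Chinese sample is representable only in the Unicode encoding.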

These figures are based on our indexing of web pages, and thus may vary somewhat from what other search engines find. However, the trends are pretty clear, and the continued rise in the use of Unicode makes it even easier to do the processing for the many languages that we cover.
