Find Out The Unicode Script Of A Character

Given a unicode character what would be the simplest way to return its script (as 'Latin', 'Hangul' etc)? unicodedata doesn't seem to provide this kind of feature.

Solution 1:

I was hoping someone's done it before, but apparently not, so here's what I've ended up with. The module below (I call it unicodedata2) extends unicodedata and provides script_cat(chr) which returns a tuple (Script name, Category) for a unicode char. Example:

# coding=utf8
import unicodedata2

print unicodedata2.script_cat(u'Ф')  # ('Cyrillic', 'L')
print unicodedata2.script_cat(u'の')  # ('Hiragana', 'Lo')
print unicodedata2.script_cat(u'★')  # ('Common', 'So')

The module: https://gist.github.com/2204527

Solution 2:

It seems to me that the Python unicodedata module contains tools for accessing the main file in the Unicode database but nothing for the other files: “The data in this database is based on the UnicodeData.txt file”

The script information is in the Scripts.txt file. It is of relatively simple format (described in UAX #44) and not horribly large (131 kilobytes), so you might consider parsing it in your program. Note that in the Unicode classification, there’s the “Common” script that contains characters used in different scripts, like punctuation marks.
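
As a rough illustration of the parsing step, here is a minimal Python 3 sketch. It assumes the usual data-line layout of Scripts.txt (a single code point or a `start..end` range, a semicolon, the script name, then a `#` comment). The names `parse_scripts` and `SAMPLE` are made up for this example, and the sample lines are a tiny abbreviated excerpt in the file's layout, not the real file.

```python
def parse_scripts(lines):
    """Return a sorted list of (start, end, script) from Scripts.txt-style lines."""
    ranges = []
    for line in lines:
        # Drop the trailing comment and surrounding whitespace.
        data = line.split('#', 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        code_points, script = (field.strip() for field in data.split(';'))
        # The first field is either a single code point or "start..end".
        start, _, end = code_points.partition('..')
        ranges.append((int(start, 16), int(end or start, 16), script))
    ranges.sort()
    return ranges


# Abbreviated sample (an assumption for this sketch); in practice you would
# pass an open Scripts.txt file object instead.
SAMPLE = """\
# Scripts.txt (excerpt)
0020          ; Common # Zs       SPACE
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0400..0481    ; Cyrillic # L& [130] CYRILLIC CAPITAL LETTER IE WITH GRAVE..CYRILLIC SMALL LETTER KOPPA
3041..3096    ; Hiragana # Lo  [86] HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE
""".splitlines()

for start, end, script in parse_scripts(SAMPLE):
    print('%04X..%04X %s' % (start, end, script))
```

According to the header of Scripts.txt, code points that are not listed there have the script value "Unknown", so a lookup built on this list needs a fallback for that case.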

Solution 3:

You can use ord to retrieve the numeric value of a character (it works on both unicode and byte strings of length 1).

The next step, unfortunately, is to test that value against the code point ranges of each script yourself. The data here may help: http://cldr.unicode.org/index/downloads
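
One possible way to do the range test is a binary search over a sorted table, sketched below in Python 3. The table `RANGES` and the function `script_of` are hypothetical names for this example, and the table holds only a handful of hand-picked ranges; a real one would be generated from the Unicode data files.

```python
import bisect

# Hypothetical toy table of (start, end, script), sorted by start.
RANGES = [
    (0x0041, 0x005A, 'Latin'),
    (0x0061, 0x007A, 'Latin'),
    (0x0400, 0x0481, 'Cyrillic'),
    (0x3041, 0x3096, 'Hiragana'),
    (0xAC00, 0xD7A3, 'Hangul'),
]
STARTS = [start for start, _, _ in RANGES]


def script_of(char, default='Unknown'):
    """Look up the script of a single character by binary search."""
    code_point = ord(char)
    # Index of the last range whose start is <= code_point.
    i = bisect.bisect_right(STARTS, code_point) - 1
    if i >= 0:
        start, end, script = RANGES[i]
        if code_point <= end:
            return script
    return default


print(script_of('\u0424'))  # Cyrillic (CYRILLIC CAPITAL LETTER EF)
print(script_of('\u306e'))  # Hiragana (HIRAGANA LETTER NO)
print(script_of('\ud55c'))  # Hangul (HANGUL SYLLABLE HAN)
print(script_of('\u2605'))  # Unknown, only because this toy table omits it
```

The binary search keeps each lookup logarithmic in the number of ranges, which matters once the table covers the whole of Unicode.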

Solution 4:

The only way I know of is unfortunately to get the Unicode code point with ord() and then use your own table (by using http://en.wikipedia.org/wiki/Unicode#Standardized_subsets and more). A preliminary conversion to some normal form may be in order, so as to handle the fact that a single "written" character can be expressed with different sequences of code points (the unicodedata module helps, here).
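
To make the normalization point concrete, here is a small Python 3 sketch using `unicodedata.normalize`. It shows one written character stored as two code points and the effect of converting it to NFC first; the variable names are only for illustration.

```python
import unicodedata

# CYRILLIC SMALL LETTER I followed by COMBINING BREVE: one written character,
# two code points.
decomposed = '\u0438\u0306'

# NFC composes the pair into CYRILLIC SMALL LETTER SHORT I (U+0439).
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed), [hex(ord(c)) for c in decomposed])  # 2 ['0x438', '0x306']
print(len(composed), [hex(ord(c)) for c in composed])      # 1 ['0x439']
```

After composition, `ord()` can be applied to the single resulting character, whereas it would raise an error on the two-code-point form.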

Solution 5:

Often it is enough to detect whether a character belongs to one particular script, and for that you can use unicodedata.name with prefix matching. For example, to find out whether a letter is Cyrillic, you can use:

import unicodedata

class CharacterNamePrefixTester(dict):
    def __init__(self, prefix):
        self.prefix = prefix
    def __missing__(self, key):
        # Called only on a cache miss; store the result for later lookups.
        self[key] = unicodedata.name(key, '').startswith(self.prefix)
        return self[key]

>>> cyrillic = CharacterNamePrefixTester('CYRILLIC ')
>>> cyrillic['й']
True
>>> cyrillic['a']
False

The dictionary is built lazily but the truth values are memoized so that future lookups of the same letter will be faster.
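
The same lazy memoization could also be sketched with `functools.lru_cache` instead of a dict subclass. This is an alternative in Python 3, not the approach used above, and `make_prefix_tester` is a hypothetical helper name.

```python
import functools
import unicodedata


def make_prefix_tester(prefix):
    """Return a cached predicate: does the character's name start with prefix?"""
    @functools.lru_cache(maxsize=None)
    def tester(char):
        # unicodedata.name() returns the default '' for unnamed characters.
        return unicodedata.name(char, '').startswith(prefix)
    return tester


is_cyrillic = make_prefix_tester('CYRILLIC ')
print(is_cyrillic('\u0439'))  # True (CYRILLIC SMALL LETTER SHORT I)
print(is_cyrillic('a'))       # False
# Does any character in the string look Cyrillic by name?
print(any(is_cyrillic(c) for c in 'hello \u043c\u0438\u0440'))  # True
```

Either variant only answers "is this character in script X" for a prefix you already know; it does not return the script name for an arbitrary character.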
