Not long ago, after a RegEx workshop I taught along with Nora Díaz, one of the
students approached me and asked: "How can you look for words in other alphabets?"
I answered that for single characters there is the Unicode escape sequence, but
I had no idea how to search for entire alphabets. I did a little research, and
this article answers the question.
When you look for (or filter for) a single character, Unicode is pretty
straightforward. For example, if you filter for \u00a9 in Trados, you will see
only the segments containing the copyright symbol (©). What about entire
alphabets? Here is what I found.
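As a rough illustration outside Trados, the same single-character filter can be sketched in Python, whose `re` module also accepts `\uXXXX` escapes. The segment list is made up for the example, and Trados itself may behave differently in the details.

```python
import re

# Hypothetical segments standing in for the rows of a translation editor.
segments = [
    "Copyright \u00a9 2020 Example Corp.",
    "All rights reserved.",
    "\u00a9 and \u00ae are different symbols.",
]

# \u00a9 is the copyright sign; search() finds it anywhere in a segment.
copyright_sign = re.compile(r"\u00a9")

filtered = [s for s in segments if copyright_sign.search(s)]
for s in filtered:
    print(s)
```

Only the first and third segments survive the filter, because the second one does not contain the symbol.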
Greek
Fortunately, the Unicode designers put the letters in a contiguous block.
Lowercase Greek letters are in positions 0x03B1 (α) through 0x03C9 (ω), and
uppercase letters are in positions 0x0391 (Α) through 0x03A9 (Ω). In Trados you could just filter for
[\u03b1-\u03c9\u0391-\u03A9]
or
[α-ωΑ-Ω]
just like you would
look for [a-zA-Z] in the Latin alphabet.
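To check that the escaped and the literal forms really are the same character class, here is a minimal Python sketch. The sample strings are invented, and Python's `re` module is used only as a stand-in for the regex engine in Trados.

```python
import re

# The same class, once with escapes and once with literal Greek letters.
greek_escaped = re.compile(r"[\u03b1-\u03c9\u0391-\u03a9]")
greek_literal = re.compile("[α-ωΑ-Ω]")

samples = [
    "The angle \u03b8 is small",     # contains lowercase theta
    "Resistance in ohms (\u03a9)",   # contains uppercase omega
    "No Greek here",
]

# One pair of booleans per sample: (escaped form matched, literal form matched).
results = [
    (bool(greek_escaped.search(s)), bool(greek_literal.search(s)))
    for s in samples
]
for text, pair in zip(samples, results):
    print(text, pair)
```

Both patterns agree on every sample, so which form to use is a matter of what is easier to type.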
One nifty thing about Unicode is that it includes both forms of the lowercase
sigma: ς (word-final) and σ. Uppercase has only one form of sigma, so there is a
“hole”, an unassigned code point (0x03A2), between rho and sigma. As a result,
the lowercase range is the same size as the uppercase range.
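The sigma "hole" can be verified with Python's standard `unicodedata` module. This is just a sketch to confirm the layout described above, not something you need in Trados.

```python
import unicodedata

lower = range(0x03B1, 0x03C9 + 1)  # lowercase alpha through omega
upper = range(0x0391, 0x03A9 + 1)  # uppercase alpha through omega

# Both ranges span the same number of code points.
print(len(lower), len(upper))

# Code points in the uppercase range with no character assigned (category "Cn").
holes = [hex(cp) for cp in upper if unicodedata.category(chr(cp)) == "Cn"]
print(holes)

# Both lowercase sigmas map to the single uppercase sigma.
same_upper = "\u03c2".upper() == "\u03c3".upper() == "\u03a3"
print(same_upper)
```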
Be advised that this method cannot find letters with diacritics, which fall
outside these ranges. So you cannot look for, say, alpha with a circumflex
accent, but if you are looking for the resistivity symbol (lowercase rho, ρ) or
the cosmological constant (uppercase lambda, Λ), you are covered.
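The limitation is easy to see in a small Python sketch. The last two lines show one possible workaround, Unicode decomposition (NFD), which is my own suggestion for scripts and not something I know Trados to offer.

```python
import re
import unicodedata

greek = re.compile(r"[\u03b1-\u03c9\u0391-\u03a9]")

alpha_tonos = "\u03ac"       # alpha with tonos (acute accent)
alpha_circumflex = "\u1fb6"  # alpha with perispomeni (circumflex)
rho = "\u03c1"
capital_lambda = "\u039b"

for ch in (alpha_tonos, alpha_circumflex, rho, capital_lambda):
    print(unicodedata.name(ch), bool(greek.search(ch)))

# NFD splits the accent off as a combining mark, leaving a plain alpha
# that the range can match.
decomposed = unicodedata.normalize("NFD", alpha_circumflex)
print(bool(greek.search(decomposed)))
```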
Hebrew
The Hebrew alphabet has 22 letters, and 5 of them have a different form when
written at the end of a word. In Unicode, the letters occupy the range 0x05d0 (alef, א) through 0x05ea (tav, ת). There are no separate uppercase forms.
So, you could filter
for
[\u05d0-\u05ea]
or
[א-ת]
Please notice that the first letter, alef (א), appears to the right of the
hyphen, apparently against the rule that the lower limit of a range is written
on the left. This is because Hebrew is written right to left, so in reading
order the alef does come before the hyphen.
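Writing the limits as escapes sidesteps the direction question entirely. The following Python sketch uses only escapes and the standard `unicodedata` module to count what the range contains; the sample word is my own choice.

```python
import re
import unicodedata

# Range limits as escapes: alef (U+05D0) through tav (U+05EA).
hebrew = re.compile(r"[\u05d0-\u05ea]")

code_points = range(0x05D0, 0x05EA + 1)
names = [unicodedata.name(chr(cp)) for cp in code_points]
finals = [n for n in names if "FINAL" in n]

print(len(names))   # code points in the range
print(len(finals))  # how many of them are word-final forms

# A word spelled with escapes: shin, lamed, vav, final mem ("shalom").
word = "\u05e9\u05dc\u05d5\u05dd"
all_match = all(hebrew.fullmatch(ch) for ch in word)
print(all_match)
```

The range holds 27 code points: the 22 letters plus the 5 final forms.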
Again, be advised that this range covers only the base letters. The vowel
points (niqqud) and other diacritical marks are separate characters outside it.
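In Hebrew, diacritics are usually stored as separate combining characters after the base letter. A small Python sketch, with a made-up sample string, shows what the range does and does not match:

```python
import re

hebrew = re.compile(r"[\u05d0-\u05ea]")

# Alef followed by the vowel point patah (U+05B7), a combining mark.
pointed = "\u05d0\u05b7"

matches = hebrew.findall(pointed)
print(len(pointed), len(matches))  # two code points, only one is matched

# The vowel point on its own is not in the range.
point_matches = hebrew.fullmatch("\u05b7") is not None
print(point_matches)
```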
Cyrillic
Cyrillic is one of the few widely used alphabets whose creation is attributed
to known individuals. It is used for Russian, Bulgarian, Belarusian and
Ukrainian, among many other languages; its users number in the hundreds of
millions across many countries. The bulk of
the Cyrillic letters are in Unicode positions 0x0400 through 0x044F (uppercase
and lowercase).
If you want to look for uppercase letters, search for
[\u0410-\u042F] or [А-Я]
For lowercase, filter
for
[\u0430-\u044F] or
[а-я]
And for the whole
alphabet
[\u0410-\u044F] or [А-я]
However, this range leaves out some letters, such as Ё (0x0401) and ё (0x0451).
If you want to play it safe, filter for
[\u0400-\u04ff] or
[Ѐ-ӿ]
This covers the whole Cyrillic block, from ye with grave (Ѐ) through kha with
stroke (ӿ).
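The difference between the narrow and the wide range shows up as soon as a letter outside А-я appears. Here is a minimal Python sketch; the three sample letters are my own picks.

```python
import re

basic = re.compile(r"[\u0410-\u044f]")       # А through я
full_block = re.compile(r"[\u0400-\u04ff]")  # the whole Cyrillic block

samples = {
    "zhe": "\u0436",          # ж, inside the basic range
    "yo": "\u0451",           # ё, outside the basic range
    "ukrainian_i": "\u0456",  # і, outside the basic range
}

# For each letter: (matched by basic range, matched by full block).
results = {
    label: (bool(basic.fullmatch(ch)), bool(full_block.fullmatch(ch)))
    for label, ch in samples.items()
}
for label, pair in results.items():
    print(label, pair)
```

For Russian text with ё, or for Ukrainian and Belarusian text, the wider range is the safer choice.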
Conclusions
Looking for strings of non-Latin characters may seem an intimidating task, but
thanks to ingenious Unicode design and a clever regex implementation, building
and understanding these regexes is not difficult. The only tricky part is that
many programs switch text direction when they detect a right-to-left character,
which makes moving the cursor around awkward. A workaround is to write the
range limits as Unicode escape sequences, which never change direction
themselves.
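If you would rather not look up code points by hand, a few lines of Python can produce the escaped form of a range from its two end characters. The helper name `escaped_range` is hypothetical, and the sketch assumes both characters fit in four hex digits.

```python
import re

def escaped_range(first: str, last: str) -> str:
    """Build a character class like [\\u05d0-\\u05ea] from two characters."""
    lo, hi = ord(first), ord(last)
    if lo > hi:
        raise ValueError("first must not come after last in Unicode order")
    if hi > 0xFFFF:
        raise ValueError("\\uXXXX escapes only cover U+0000 through U+FFFF")
    return f"[\\u{lo:04x}-\\u{hi:04x}]"

# Alef through tav, given here as escapes so the source stays left to right.
pattern = escaped_range("\u05d0", "\u05ea")
print(pattern)

# The generated pattern works as a normal regex.
found = bool(re.search(pattern, "\u05e9\u05dc\u05d5\u05dd"))
print(found)
```

The printed pattern can be pasted into a filter box without any right-to-left characters getting in the way.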