Wednesday, February 26, 2020

Regex for Non-Latin Alphabets

A guest post by Salvador Virgen


Not long ago, after a RegEx workshop I taught with Nora Díaz, one of the students approached me and asked: "How can you look for words in other alphabets?" I answered that for single characters there is the Unicode escape sequence, but I had no idea how to search for whole alphabets. I did a little research, and this article answers the question.

When you look for (or filter for) a single character, Unicode is pretty straightforward. For example, if you filter for \u00a9 in Trados, you will see only the segments containing the copyright symbol (©). What about entire alphabets? Here is what I found for three of them.
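To try the single-character filter outside Trados, here is a minimal sketch in Python, whose `re` module accepts the same \uXXXX escape. The list of segments is made up for illustration; it only stands in for the rows a translation editor would show.

```python
import re

# Hypothetical segments standing in for the rows of a translation editor.
segments = [
    "Copyright © 2020 Example Corp.",
    "All rights reserved.",
    "© and ® are different symbols.",
]

# \u00a9 is the Unicode escape for the copyright sign.
copyright_sign = re.compile(r"\u00a9")

# Keep only the segments that contain the character, as a display filter would.
matching = [s for s in segments if copyright_sign.search(s)]
print(matching)
```

Only the first and third segments survive the filter; the second contains no copyright sign.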

Greek


Fortunately, the Unicode designers placed the basic Greek letters in contiguous ranges. Lowercase Greek letters are in positions 0x03B1 (α) through 0x03C9 (ω) and uppercase letters are in positions 0x0391 (Α) through 0x03A9 (Ω). In Trados you could just filter for

[\u03b1-\u03c9\u0391-\u03A9]

or

[α-ωΑ-Ω]

just like you would look for [a-zA-Z] in the Latin alphabet.

One nifty thing about Unicode is that it includes both forms of the lowercase sigma: ς (used at the end of a word) and σ. The uppercase, which has only one form of sigma, has a “hole” (an unassigned position, 0x03A2) between rho and sigma, so the lowercase and uppercase ranges have the same size.

Be advised that this method cannot find letters with diacritics, which are outside these two ranges. So you cannot look for, say, alpha with a circumflex accent, but if you are looking for the resistivity symbol (lowercase rho, ρ) or the cosmological constant (uppercase lambda, Λ), you will be covered.
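The behavior described above can be checked with a short Python sketch. The function name `find_greek` and the sample strings are my own, chosen only for illustration; the test strings are written as escapes so that the exact code points are unambiguous.

```python
import re

# Same character class as above: lowercase alpha-omega, uppercase Alpha-Omega.
GREEK_LETTER = re.compile(r"[\u03b1-\u03c9\u0391-\u03a9]")

def find_greek(text):
    """Return every character of `text` that falls inside the two ranges."""
    return GREEK_LETTER.findall(text)

# Symbols in running text: lowercase rho (U+03C1) and uppercase lambda (U+039B).
symbols = find_greek("resistivity \u03c1, cosmological constant \u039b")
print(symbols)

# The word "logos" spelled lambda, omicron with tonos (U+03CC), gamma,
# omicron, final sigma (U+03C2). U+03CC lies outside both ranges.
logos = "\u03bb\u03cc\u03b3\u03bf\u03c2"
print(find_greek(logos))
```

The second call returns four letters instead of five: the accented omicron is skipped, while the final sigma is found because it sits inside the lowercase range.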

Hebrew


The Hebrew alphabet has 22 letters, and 5 of them have a different form when written at the end of a word. In Unicode the letters occupy positions 0x05d0 (alef, א) through 0x05ea (tav, ת), 27 positions in all. Hebrew has no separate uppercase forms.

So, you could filter for

[\u05d0-\u05ea]

or

[א-ת]

Notice that the first letter, alef (א), appears to the right of the hyphen, apparently against the rule that the lower limit of a range is written on the left. This is because Hebrew is a right-to-left script: the alef still comes before the hyphen in the stored text, and it is only displayed on the other side.

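Again, be advised that this range does not cover diacritics. In Hebrew the vowel points are separate combining characters that sit outside the range, so a pointed word will not match as a whole.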
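Here is a small Python sketch of the Hebrew range, again with made-up names and with the sample word written as escapes so that text direction cannot get in the way. It assumes the word "shalom" in two spellings: plain, and with vowel points, which are combining marks in the 0x05b0 to 0x05c7 area.

```python
import re

HEBREW_LETTER = re.compile(r"[\u05d0-\u05ea]")
HEBREW_WORD = re.compile(r"[\u05d0-\u05ea]+")

# Size of the range: 22 letters plus 5 final forms.
range_size = 0x05EA - 0x05D0 + 1

# "shalom": shin, lamed, vav, final mem.
plain = "\u05e9\u05dc\u05d5\u05dd"
# The same word with vowel points (qamats, shin dot, holam) added.
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"

# The base letters are still found one by one...
letters_in_pointed = HEBREW_LETTER.findall(pointed)
# ...but the pointed word is not a single run of characters in the range.
plain_is_word = HEBREW_WORD.fullmatch(plain) is not None
pointed_is_word = HEBREW_WORD.fullmatch(pointed) is not None

print(range_size, letters_in_pointed == list(plain), plain_is_word, pointed_is_word)
```

So a filter for a single Hebrew letter still hits pointed text, but a pattern that expects an unbroken run of letters stops at the first vowel point.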

Cyrillic


Cyrillic is the only alphabet of a widespread language whose creator is identified. Cyrillic is used for Russian, Bulgarian, Belarusian and Ukrainian, among many others; its users number in the hundreds of millions and live in many countries. The basic letters are in Unicode positions 0x0410 through 0x044F (uppercase first, then lowercase), inside the Cyrillic block that runs from 0x0400 through 0x04FF.

If you want to look for uppercase letters, search for

[\u0410-\u042F] or [А-Я]

For lowercase, filter for

[\u0430-\u044F] or [а-я]

And for the whole alphabet

[\u0410-\u044F] or [А-я]

However, if you want to play it safe (for example, to catch ё and Ё, which lie outside the А-я range), filter for

[\u0400-\u04ff] or [Ѐ-ӿ]

This covers the whole gamut, from ye with grave (Ѐ) through kha with stroke (ӿ).
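The difference between the narrow and the wide range is easy to see in a Python sketch. The helper `covered_by` and the three sample words are my own choices for illustration; the words are written as escapes to make the code points explicit.

```python
import re

# The 64 basic letters (uppercase then lowercase) versus the whole Cyrillic block.
RUSSIAN_BASIC = re.compile(r"[\u0410-\u044f]+")
CYRILLIC_BLOCK = re.compile(r"[\u0400-\u04ff]+")

def covered_by(pattern, word):
    """True if every character of `word` falls inside the pattern's range."""
    return pattern.fullmatch(word) is not None

moskva = "\u041c\u043e\u0441\u043a\u0432\u0430"  # Moskva: basic letters only
yolka = "\u0451\u043b\u043a\u0430"               # yolka: starts with yo (U+0451)
kyiv = "\u041a\u0438\u0457\u0432"                # Kyiv: contains Ukrainian yi (U+0457)

for word in (moskva, yolka, kyiv):
    print(word, covered_by(RUSSIAN_BASIC, word), covered_by(CYRILLIC_BLOCK, word))
```

Only the first word is fully covered by the narrow range; the other two contain a letter from the 0x0450 row, which the wide range picks up.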

Conclusions


Looking for strings of non-Latin characters may seem intimidating, but thanks to ingenious Unicode design and a clever Regex implementation, building and understanding these regexes is not difficult. The only tricky part is that many programs switch direction upon detecting a right-to-left character, which makes moving the cursor around awkward. A workaround is to write the range limits as Unicode escape sequences, which never change direction themselves.
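As a quick sanity check of that workaround, the sketch below (in Python, with names of my own choosing) builds the Hebrew range twice, once from escape sequences and once from the literal characters, and confirms that both accept exactly the same characters across the Hebrew block. The literal limits are produced with `chr()` so that the source code itself contains no right-to-left text.

```python
import re

# Range written with escape sequences: plain ASCII, never reordered on screen.
escaped = re.compile(r"[\u05d0-\u05ea]")

# The same range written with the literal characters alef and tav.
literal = re.compile("[" + chr(0x05D0) + "-" + chr(0x05EA) + "]")

# Compare the two patterns on every position of the Hebrew block (0x0590-0x05FF).
hebrew_block = [chr(cp) for cp in range(0x0590, 0x0600)]
same = all(
    bool(escaped.fullmatch(c)) == bool(literal.fullmatch(c))
    for c in hebrew_block
)
print(same)
```

The two patterns look different as text but behave identically, so the escaped form is a safe choice whenever the editor makes the literal form hard to type.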
