Wednesday, February 26, 2020

Regex for Non-Latin Alphabets

A guest post by Salvador Virgen


Not long ago, after a RegEx workshop I taught along with Nora Díaz, one of the students approached me and asked: "How can you look for words in other alphabets?" I answered that for single characters there is the Unicode escape sequence, but I had no idea how to do it for entire alphabets. I did a little research, and this article answers the question.

When you look for (or filter for) a single character, Unicode is pretty straightforward. For example, if you filter for \u00a9 in Trados, you will see only the segments containing the copyright symbol (©). What about entire alphabets? Here is what I found.

Greek


Fortunately, the Unicode designers put the letters in contiguous ranges. Lowercase Greek letters are in positions 0x03B1 (α) through 0x03C9 (ω) and uppercase letters are in positions 0x0391 (Α) through 0x03A9 (Ω). In Trados you could just filter for

[\u03b1-\u03c9\u0391-\u03A9]

or

[α-ωΑ-Ω]

just like you would look for [a-zA-Z] in the Latin alphabet.

One nifty thing about Unicode is that it includes both forms of the lowercase sigma: ς and σ. The uppercase alphabet has only one form of sigma, so there is a “hole”, an unassigned code point, between rho and sigma; this keeps the uppercase range the same size as the lowercase range.

Be advised that this method cannot find letters with diacritics, which are outside these ranges. So you cannot look for, say, alpha with a circumflex accent, but if you are looking for the resistivity symbol (lowercase rho, ρ) or the cosmological constant (uppercase lambda, Λ), you will be covered.
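To try the Greek range outside Trados, here is a minimal sketch in Python's re module. Python is my choice for illustration and is not the engine Trados uses; the sample string is made up. The letters are written as escapes so that the accented alpha is unambiguously the precomposed character U+03AC.

```python
import re

# Same character class as in the article: lowercase and uppercase Greek letters.
GREEK = re.compile(r"[\u03b1-\u03c9\u0391-\u03a9]")

# Hypothetical sample: rho (U+03C1), capital lambda (U+039B),
# and alpha with tonos (U+03AC), which lies outside both ranges.
sample = "Resistivity \u03c1, constant \u039b, accented \u03ac"

greek_found = GREEK.findall(sample)
print(greek_found)  # the accented alpha is not among the matches
```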

Hebrew


The Hebrew alphabet has 22 letters, 5 of which have a different form when written at the end of a word. The letters occupy Unicode positions 0x05d0 (alef, א) through 0x05ea (tav, ת). Hebrew has no separate uppercase forms.

So, you could filter for

[\u05d0-\u05ea]

or

[א-ת]

Please notice that the first letter, alef (א), appears to the right of the hyphen, apparently against the rule that the lower limit of a range should be written to the left. This is because Hebrew is a right-to-left script: in reading order, the alef still comes before the hyphen.

Again, be advised that this method cannot find letters with diacritics.
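The escaped form of the range sidesteps the right-to-left display issue entirely. The sketch below uses Python's re module (an assumption for illustration, not the Trados engine) and a made-up sentence containing the word shalom, spelled with escapes.

```python
import re

# The range from alef (U+05D0) through tav (U+05EA), one or more letters in a row.
HEBREW = re.compile(r"[\u05d0-\u05ea]+")

# Hypothetical sample: shin, lamed, vav, final mem ("shalom") inside English text.
sample = "The word \u05e9\u05dc\u05d5\u05dd means peace"

hebrew_words = HEBREW.findall(sample)

# Number of code points covered by the range: 22 letters plus 5 final forms.
letters_in_range = 0x05EA - 0x05D0 + 1
print(hebrew_words, letters_in_range)
```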

Cyrillic


Cyrillic is often cited as one of the few widespread alphabets whose creators are identified. Cyrillic is used for Russian, Bulgarian, Belarusian and Ukrainian, among many other languages; its users number in the hundreds of millions across many countries. The basic Cyrillic letters are in Unicode positions 0x0410 through 0x044F (uppercase and lowercase), inside the Cyrillic block, which spans 0x0400 through 0x04FF.

If you want to look for uppercase letters, search for

[\u0410-\u042F] or [А-Я]

For lowercase, filter for

[\u0430-\u044F] or [а-я]

And for the whole alphabet

[\u0410-\u044F] or [А-я]

However, if you want to play it safe, filter for

[\u0400-\u04ff] or [Ѐ-ӿ]

This covers the whole Cyrillic block, from ye with grave (Ѐ) through kha with stroke (ӿ).
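One concrete reason to play it safe: the letter ё (U+0451), used in Russian, sits just outside the А-я range. A small sketch in Python's re module (my choice for illustration, not the Trados engine) shows the difference between the two ranges.

```python
import re

BASIC = re.compile(r"[\u0410-\u044f]")  # А through я
BLOCK = re.compile(r"[\u0400-\u04ff]")  # the whole Cyrillic block

yo = "\u0451"  # ё, CYRILLIC SMALL LETTER IO

basic_hit = BASIC.search(yo) is not None
block_hit = BLOCK.search(yo) is not None
print(basic_hit, block_hit)
```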

Conclusions


Looking for strings of non-Latin characters could appear to be an intimidating task, but thanks to ingenious Unicode design and to a clever Regex implementation, building and understanding these regexes is not difficult. The only difficult part is that many programs switch directions upon detecting a right-to-left character, which makes moving the cursor around tricky. A workaround is to write the range limits as Unicode escape sequences, which never change direction themselves.



Sunday, November 17, 2019

Regular Expressions for Translators: Replacements




One of the first things that I wanted to learn when I first started looking into regular expressions was how to do replacements. In this article we will look at how regex replacements work and how we can use them in SDL Trados Studio.

As with any replacement operation, we must first find the string that we want to replace. To do this, we can use any regex built with metacharacters, anchors, character classes, special characters, quantifiers, and groups and ranges. In the replacement part, however, none of these regular expression elements are supported, and we can only use literal characters and substitutions consisting of a dollar sign followed by a number. The $ is the only special character that can appear both in a regex pattern and in a substitution, although with different meanings. In a regex, $ is an anchor that indicates the end of a string. In a replacement pattern, it indicates the beginning of a substitution.

Substitution elements such as $1 or $3 represent the capturing groups in the regular expression matched in the "Find" part of the operation, with groups being assigned consecutive numbers from left to right, starting with 1.

Let's look at a few examples.





Regex pattern: (\w+)our
Replacement pattern:  $1or

In our first example, we want to replace the British word ending "-our" with the American spelling "-or".



In the example above, the regex pattern matches any word character, one or more times, followed by the literal characters "our". This pattern will match each of the words in the sample string: behaviour, colour, humour, labour, neighbour and flavour, and their plural forms, and will allow us to replace them. The pattern to the left of "our" is in parentheses, indicating that it is a group. Since regex groups are automatically assigned consecutive numbers from left to right, this would be group 1. We place this part of our regex in a group for the specific purpose of using the matched contents in the replacement, by using its corresponding substitution element.

Now, in the replacement, we will use a substitution element, $1, which means that the contents of group 1 will be transferred to the replacement, plus the literal characters "or", the replacement for "our", to change the spelling from British to American.
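For readers who want to try this outside Trados, the same find-and-replace can be sketched in Python. This is an illustration under assumptions: Python's re module is not the Trados engine, and it writes the substitution element as \g<1> where Trados writes $1. The sample string is made up from the words listed above.

```python
import re

# Hypothetical sample built from the words mentioned in the article.
sample = "behaviour, colour, humour, labour, neighbour and flavours"

# (\w+) captures everything before "our"; \g<1> puts it back, followed by "or".
american = re.sub(r"(\w+)our", r"\g<1>or", sample)
print(american)
```

A pattern this simple will also rewrite words such as "four", so in real use it is worth reviewing each match before replacing.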

The animated gif below shows how each word is matched by the Find operation and then replaced.









Regex pattern: (\d+th)(\s)(October|November|December)
Replacement pattern:  $3 $1



In our second example, we want to change dates such as 20th November to November 20th. We will use a regex that has three groups:

Group 1: (\d+th)
One or more digits followed by the literal characters "th"
Substitution element: $1 

Group 2: (\s) 
A space
Substitution element: $2

Group 3: (October|November|December)
Any of these words
Substitution element: $3

In the replacement pattern, we need to place Group 3 ($3) at the beginning, followed by a space, followed by Group 1 ($1). Since the space is in its own capturing group, it could be represented by $2, but in this example I chose to enter a literal space, by pressing the spacebar, between $3 and $1, which works just as well. Note that what we can't do is use \s in the replacement pattern to enter a space. If we use \s in the replacement pattern, the literal characters "\s" will appear in the replacement text.
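The same swap can be tried in Python as a rough sketch. Python's re module is an assumption here, not the Trados engine; it writes the substitution elements as \3 and \1 instead of $3 and $1, and the sample sentence is made up.

```python
import re

pattern = r"(\d+th)(\s)(October|November|December)"

# Group 3 first, then a literal space, then group 1.
swapped = re.sub(pattern, r"\3 \1", "The meeting is on 20th November.")
print(swapped)
```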

Here's the replacement operation in action.








Regex pattern: (\d+th)(?:\s)(October|November|December)
Replacement pattern:  $2 $1

In the previous example, we used 3 capturing groups in the regex. If instead we place the space inside a non-capturing group, the numbers assigned to the groups change, and the replacement pattern is different.


Here, we also have 3 groups, but one is a non-capturing group, indicated by the ?: inside the parentheses.

Group 1: (\d+th)
One or more digits followed by the literal characters "th"
Substitution element: $1 

Group 2: (?:\s) 
A space
Substitution element: None, this is a non-capturing group, so its contents are not saved to be used later on.

Group 3: (October|November|December)
Any of these words
Substitution element: $2



Note: In this example, the space is placed inside a non-capturing group only to give an example of how non-capturing groups work, but actually we could use a regex that doesn't place the space in a group, with the same effect: (\d+th)\s(October|November|December).
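The renumbering can be checked with a short Python sketch (Python's re module is my assumption for illustration; it writes \2 and \1 where Trados writes $2 and $1). The compiled pattern reports how many capturing groups it has.

```python
import re

pattern = re.compile(r"(\d+th)(?:\s)(October|November|December)")

# Only two groups are numbered; the (?:\s) group is not counted.
capturing_groups = pattern.groups

# The month is now group 2, so the replacement uses \2 and \1.
swapped = pattern.sub(r"\2 \1", "20th November")
print(capturing_groups, swapped)
```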






We have said before that capturing groups are assigned consecutive numbers, starting with 1, that can be used later on in the replacement pattern. But what if we want to use the entire matched string in the replacement operation? In that case, we use $0. You may be wondering when you would need to do this. Consider the following example.


Regex pattern: (\d+,)?\d+\.\d+\scash
Replacement pattern: $$$0



In this example, we want to add a dollar sign in front of any instances of an amount followed by a space and the word cash.

The regex pattern matches the various amounts by matching one or more digits followed by a comma (this first part is made optional by placing the regex in a group and adding the ? quantifier, which means 0 or 1 times), followed by one or more digits followed by a point, followed by one or more digits, followed by a space and the word "cash".

Since the dollar sign is a special character in the replacement pattern, when we want to enter a literal dollar sign in the replacement, we must use $$. Thus, $$$0 means a dollar sign ($$) followed by the entire string ($0) that matches the regex.
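Here is a hedged Python equivalent, with two differences from Trados that are worth stating: in Python's re module the dollar sign is not special in the replacement, so a single $ is enough, and the entire match is written \g<0> rather than $0. The sample sentence is made up.

```python
import re

pattern = r"(\d+,)?\d+\.\d+\scash"

# Hypothetical sample with one amount that has a thousands separator and one that does not.
sample = "Pay 1,250.00 cash or 75.50 cash."

# A literal dollar sign followed by the entire matched string.
with_dollar = re.sub(pattern, r"$\g<0>", sample)
print(with_dollar)
```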

See the replacement in action below.


For this article, we've looked at replacements in the Editor view, but SDL Trados Studio also accepts regex replacements in the Translation Memories view and regex replacement syntax in the Verification settings.

With this, we have come to the end of this article. If you'd like to have a copy of my cheat sheet, you can download it here:




 Ask about the special SDL Trados Studio prices for Mexico!







Tuesday, November 12, 2019

Regular Expressions for Translators: Groups and Ranges

In this article, we'll talk about groups and ranges in regular expressions and how they can be used by translators in CAT Tools such as SDL Trados Studio. 


Let's have a look at these regex components and some of their applications. 





The previous post about quantifiers ends with a brief introduction of the function of the dot in regular expressions: a wildcard that represents any character.


Regex example: .*?,

Using a single dot will match any one character. Combining the dot with a quantifier will match more than one. This regex will match all the text up to and including the next comma, as shown below:



A word of caution about the dot in regex
While it may seem tempting to use the dot wildcard frequently, one must be aware of potential undesired results.

For example, imagine that we want to find all the text that comes between straight quotation marks so we can later replace the straight quotation marks with curly quotation marks. Using a regex such as ".+" (a straight quotation mark followed by anything, one or more times, followed by a straight quotation mark) would seem like an easy solution, but look at what can happen below:


Instead of getting two matches: "I will see you there" and "don't be late", we get a single match, from the very first quotation mark in the segment to the very last one.

These undesired results are not always evident when using a regular expression in the SDL Trados Studio display filter, for example, so it's always a good idea to test the regex in a regex tester such as regexstorm.net/tester, which I will use for the examples in this article.

Bonus tip: A better regex to find each separate instance of text inside straight quotes is "[^"]*".
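The difference between the two quotation mark regexes can be reproduced with a short Python sketch. Python's re module is an assumption for illustration, and the sample sentence is made up around the two quotations mentioned above.

```python
import re

sample = 'She said "I will see you there" and added "don\'t be late" before leaving.'

# Greedy dot: one long match from the first quotation mark to the last one.
greedy = re.findall(r'".+"', sample)

# Negated character set: each quotation is matched separately.
separate = re.findall(r'"[^"]*"', sample)

print(greedy)
print(separate)
```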








Regex example: col(o|ou)r

In regular expressions, the vertical bar or pipe character | indicates alternation. Using the pipe tells the regex to match everything to the left or everything to the right of the pipe, as shown here:


Here, the strings that match the regex colo|our are "colo" and "our". If we want to match "color" and "colour" instead, we need to use parentheses:



Look at the example below to see how alternation can be used to match any of the days of the week.


Now, you may notice that in this example, all the text to the left or right of the pipe is matched, whether it's a whole word or not. If what we want to do is match whole words only, we can use parentheses to create a group and then apply word boundaries to the entire group.
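As a sketch of the whole-word idea, here is a Python version with only two days in the alternation to keep it short. Python's re module and the sample text are assumptions for illustration.

```python
import re

sample = "Sunday brunch, closed Mondays"

# Without boundaries, "Monday" is found inside "Mondays".
loose = [m.group(0) for m in re.finditer(r"Monday|Sunday", sample)]

# With the alternation in a group and \b on both sides, only whole words match.
whole = [m.group(0) for m in re.finditer(r"\b(Monday|Sunday)\b", sample)]

print(loose)
print(whole)
```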


I use this regex in a verification rule in SDL Trados Studio to alert me about segments where the day of the week is not present in the source but is present in the target:








Regex example: (\d+,)+

Parentheses are used to create groups in regular expressions. Look at this example:


Here, the regex \d+, is matched twice, once by "123," and once by "456,". If instead we want to include both instances in a single match, we need to add parentheses to the expression and then add the + quantifier to the group.
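The effect of quantifying the group can be checked with a small Python sketch (Python's re module is my assumption for illustration; the sample string uses the numbers mentioned above).

```python
import re

sample = "123,456,789"

# Without a group: two separate matches.
singles = [m.group(0) for m in re.finditer(r"\d+,", sample)]

# With the group repeated by +: both instances fall into a single match.
grouped = [m.group(0) for m in re.finditer(r"(\d+,)+", sample)]

print(singles)
print(grouped)
```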



A group can also be a single character, as in the example below, where the ? quantifier (which means 0 or 1) is used to make the s character optional.



Lastly, groups have two purposes in regular expressions: to organize information and to capture the contents of the group. The captured information is "remembered" by the regex engine and can later be used for backreferences or substitutions.

Consider this example:


Here, the regex pattern has been organized into five groups, as shown in the table above, at the bottom of the regex tester window. Each group is assigned a consecutive number, so group 1 captures the character sequence 123, group 2 captures the first comma, group 3 captures the character sequence 456, group 4 captures the second comma and group 5 captures the character sequence 789.

In the replacement pattern, we can rearrange the groups by representing each group with the dollar sign followed by the group number. In our example, the groups have been rearranged to $5$2$3$4$1, resulting in the replacement string 789,456,123. 

Note: The same result can be achieved by using commas instead of the numbered groups $2 and $4, which would make the replacement pattern $5,$3,$1.
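A possible five-group pattern for this example is sketched below in Python. The exact regex from the screenshot is not reproduced in the text, so the pattern here is an assumption that fits the description; Python's re module writes \5 where Trados writes $5.

```python
import re

# Assumed pattern: number, comma, number, comma, number, each in its own group.
pattern = r"(\d+)(,)(\d+)(,)(\d+)"

rearranged = re.sub(pattern, r"\5\2\3\4\1", "123,456,789")

# Same result with literal commas instead of groups 2 and 4.
with_commas = re.sub(pattern, r"\5,\3,\1", "123,456,789")

print(rearranged, with_commas)
```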





In addition to regular groups, there is a less commonly used type of group, called passive or non-capturing. The only difference between a non-capturing group and a regular group is that a non-capturing group organizes the information contained in the group, but doesn't capture it, that is, the information in the group is not assigned a group number. 

Let's use the same example we used before to understand how this works. Instead of having five regular groups, the commas will now be placed inside non-capturing groups by using the following syntax: (?:,).


While we still have 5 groups organizing the information, two of them are non-capturing, so the number of groups available for substitutions (replacement operations) is reduced to three, as shown in the table at the bottom of the regex tester window.

With this, the replacement pattern to achieve the same result as in the previous example is now $3,$2,$1.
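The non-capturing variant can be sketched the same way. As before, the pattern is an assumption that fits the description, and Python's re module (not the Trados engine) writes \3 where Trados writes $3.

```python
import re

# The commas sit in non-capturing groups, so only the three numbers are numbered.
pattern = re.compile(r"(\d+)(?:,)(\d+)(?:,)(\d+)")

capturing_groups = pattern.groups
rearranged = pattern.sub(r"\3,\2,\1", "123,456,789")

print(capturing_groups, rearranged)
```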

While there aren't many use cases that come to mind for using non-capturing groups, they can come in handy when one wants to keep the number of capturing groups down to avoid having to keep track of too many group numbers.





Regex example: \d+[abz?*]

Placing characters inside square brackets means that any one of the characters in that set can be matched in that position; the order in which the characters are listed does not matter. Have a look at this example:


Note that when used inside a character set, most metacharacters (such as ? and * here) don't need escaping.
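A quick way to see the character set at work is the following Python sketch (Python's re module and the sample string are assumptions for illustration).

```python
import re

# Digits followed by exactly one of: a, b, z, a literal ?, or a literal *.
sample = "12a 34? 56* 78x 9z"

set_matches = re.findall(r"\d+[abz?*]", sample)
print(set_matches)  # "78x" is skipped because x is not in the set
```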





Regex example: \d+[^abz?*]

Adding a caret (^) immediately after the opening square bracket means that the characters included inside the square brackets should be excluded.


This is how the quotation mark regex mentioned earlier works: "[^"]*" matches a quotation mark followed by zero or more of any character except a quotation mark, followed by a quotation mark.








Lastly, let's have a look at these character ranges: lowercase letters, uppercase letters and digits.

Regex example: [A-Z][a-z]+

Lowercase and uppercase letters can be helpful when we need to specify case, for instance, when we want to find words that begin with a capital letter.



The sample regex here means one uppercase letter followed by one or more lowercase letters. But if this is the case, then how come the words "Añoranzas" and "Épicas" are not matched? The reason is that the ranges [A-Z] and [a-z] include only characters in the English alphabet. A solution to include other non-English letters is to add them to the character set:
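One way to write such an extended set is sketched below in Python. The accented letters chosen here are my assumption (the common Spanish ones) and would need adjusting for other languages; Python's re module is used only for illustration.

```python
import re

sample = "Las Añoranzas y Épicas"

# English-only ranges miss the words containing ñ and É.
basic = re.findall(r"[A-Z][a-z]+", sample)

# Hypothetical extended sets with Spanish accented letters added.
extended = re.findall(r"[A-ZÁÉÍÓÚÑ][a-záéíóúñ]+", sample)

print(basic)
print(extended)
```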


While these examples use the full range of letters in the English alphabet, it's also possible to limit the range. In the example below, by limiting the uppercase range to "A-I", the words "The" and "La" are excluded from the matches.




Regex example: [0-9.,/"'-]+

While we could say that [0-9] is basically the same as \d, the [0-9] character range offers a bit more flexibility, as we can easily throw in a few other characters into the range to help us cover a variety of number formats.


In this example, the regex matches numbers with decimals, commas, fractions, and dashes, without having to come up with any complex expressions. While this may not be the most elegant solution for someone writing code for a program, it certainly can be a time-saver for a translator wanting to filter segments.
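A minimal Python sketch of this catch-all number regex follows; Python's re module and the sample string are assumptions for illustration.

```python
import re

# Digits plus period, comma, slash, straight quotes and hyphen, one or more times.
NUMBERISH = re.compile("[0-9.,/\"'-]+")

sample = "Sizes: 1,250.75 and 3/4 and 10-12"

number_like = NUMBERISH.findall(sample)
print(number_like)
```

Because the set also contains punctuation, the regex will match a lone comma or hyphen as well, which is usually acceptable for filtering but worth keeping in mind.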

With this, we have come to the end of this article. If you'd like to have a copy of my cheat sheet, you can download it here:


Happy regexing!








Sunday, November 10, 2019

Regular Expressions for Translators: Quantifiers

While character classes allow us to match a variety of characters, quantifiers bring more power to regular expressions by allowing us to specify how many times those characters should be matched.


Download Nora’s Regular Expressions for Translators Cheat Sheet


Let's add each quantifier to a digit (\d) regex and look at the different results we get in a Find operation in SDL Trados Studio.


Regex: \d

The match for this regex is a single digit, as shown below.







Regex: \d*

Adding the * (zero or more) quantifier to \d gives us a different match: a series of consecutive digits.


However, because * also allows zero occurrences, in SDL Trados Studio this regex will also match at positions where there is no digit. Clicking "Find Next" in the example above causes this regex to match the comma.
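The zero-occurrence behavior can be observed in Python too, where it shows up as empty strings in the list of matches. This is a sketch with Python's re module (an assumption, not the Trados engine), and other engines may report empty matches slightly differently.

```python
import re

sample = "12,34"

# \d* can succeed with zero digits, so empty matches appear in the results.
star = re.findall(r"\d*", sample)

# \d+ requires at least one digit, so only the numbers are returned.
plus = re.findall(r"\d+", sample)

print(star)
print(plus)
```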







Regex: \d+

Adding the + (one or more) quantifier to \d means that the match must have at least one digit.


Clicking "Find Next" here will skip the comma and match the next number group.


Note that this simple regex will match a group of numbers even if they are part of a string containing other characters. Have a look at this example, where I've added the letters AFG to the number 879.







Regex: \d?

Adding the ? (zero or one) quantifier to \d means that the match will be either zero digits or one digit.


Because zero digits also counts as a match, clicking "Find Next" will step through each individual character in the segment, including the comma and letters.







Regex: \d{4}

A single number inside curly brackets is a quantifier that indicates that the preceding regex must be matched that exact number of times.


Clicking "Find Next" matches the next four-digit sequence in the segment.







Regex: \d{4,}

A number followed by a comma inside curly brackets indicates that the preceding regex must be matched at least that number of times.


Clicking "Find Next" gives us the following result.







Regex: \d{3,5}

Two numbers separated by a comma (no space) inside curly brackets indicate a range of times that the preceding regex must be matched. In this example, it will be 3, 4 or 5 times.
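The three curly-bracket quantifiers can be compared side by side in a short Python sketch. Python's re module and the sample string are assumptions for illustration.

```python
import re

sample = "12 345 6789 123456"

exactly_four = re.findall(r"\d{4}", sample)    # exactly 4 digits
four_or_more = re.findall(r"\d{4,}", sample)   # at least 4 digits
three_to_five = re.findall(r"\d{3,5}", sample) # 3, 4 or 5 digits

print(exactly_four)
print(four_or_more)
print(three_to_five)
```

Note that \d{4} and \d{3,5} match only the first part of the six-digit number, much as \d+ matched the digits inside a longer string in the earlier example.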

Running a Find operation with this regex in the active segment below matches the first 3-digit group it finds.



Clicking "Find Next" two more times gives the following results.






Greedy or Lazy?

Quantifiers are greedy by default, which means that they will match as many occurrences of the regex pattern as possible. Consider the example below.

Regex: .+-

The dot character is a wildcard for any character, so this regex will match any character one or more times, as many times as possible (greedy) followed by a dash.

In the active segment below, it looks like there are three instances of groups of characters (numbers) followed by a dash: 614-, 597- and 7855-, but in fact, the first two dashes are interpreted as "any character" due to the use of the greedy quantifier +, which keeps matching "any character" as many times as possible until the last dash is found.


Making the quantifier lazy by adding a ? will cause the expression to recognize the first dash as a dash and not as any character, which results in a different match:



Each of the quantifiers above can be made lazy by adding a ? to it, which will result in the expression being matched as few times as possible.
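Greedy and lazy matching can be compared with a small Python sketch. Python's re module is an assumption for illustration, and the sample number is made up to contain the three groups described above.

```python
import re

sample = "614-597-7855-1234"

# Greedy: .+ runs as far as it can, so the match ends at the last dash.
greedy = re.search(r".+-", sample).group(0)

# Lazy: .+? stops at the first dash it can, giving one match per group.
lazy = re.findall(r".+?-", sample)

print(greedy)
print(lazy)
```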

Finally, while most of the examples above are based on digits, quantifiers can be used with any other regular expressions.



