Sunday, November 17, 2019

Regular Expressions for Translators: Replacements

One of the first things that I wanted to learn when I first started looking into regular expressions was how to do replacements. In this article we will look at how regex replacements work and how we can use them in SDL Trados Studio.

As with any replacement operation, we must first find the string that we want to replace. To do this, we can use any regex built with metacharactersanchorscharacter classesspecial charactersquantifiers, and groups and ranges. In the replacement part, however, none of these regular expression elements are supported, and we can only use literal characters and substitutions consisting of a dollar sign followed by a number. The $ is the only special character that can appear both in a regex pattern or in a substitution, although with different meanings. In a regex, $ is an anchor that indicates the end of a string. In a replacement pattern, it indicates the beginning of a substitution.

Substitution elements such as $1 or $3 represent the capturing groups in the regular expression matched in the "Find" part of the operation, with groups being assigned consecutive numbers from left to right, starting with 1.

Let's look at a few examples.

Regex pattern: (\w+)our
Replacement pattern:  $1or

In our first example, we want to replace the British word ending "-our" with the American spelling "-or".

In the example above, the regex pattern matches any word character, one or more times, followed by the literal characters "our". This pattern will match each of the words in the sample string: behaviour, colour, humour, labour, neighbour and flavour, and their plural forms, and will allow us to replace them. The pattern to the left of "our" is in parentheses, indicating that it is a group. Since regex groups are automatically assigned consecutive numbers from left to right, this would be group 1. We place this part of our regex in a group for the specific purpose of using the matched contents in the replacement, by using its corresponding substitution element.

Now, in the replacement, we will use a substitution element, $1, which means that the contents of group 1 will be transferred to the replacement, plus the literal characters "or", the replacement for "our", to change the spelling from British to American.

The animated gif below shows how each word is matched by the Find operation and then replaced.

Regex pattern: (\d+th)(\s)(October|November|December)
Replacement pattern:  $3 $1

In our second example, we want to change dates such as 20th November to November 20th. We will use a regex that has three groups:

Group 1: (\d+th)
One or more digits followed by the literal characters "th"
Substitution element: $1 

Group 2: (\s) 
A space
Substitution element: $2

Group 3: (October|November|December)
Any of these words
Substitution element: $3

In the replacement pattern, we need to place Group 3 ($3) at the beginning, followed by a space, followed by Group 1 ($1). Since the space is in its own capturing group, it could be represented by $2, but in this example I chose to enter a literal space, by pressing the spacebar, between $3 and $1, which works just as well. Note that what we can't do is use \s in the replacement pattern to enter a space. If we use \s in the replacement pattern, the literal characters "\s" will appear in the replacement text.

Here's the replacement operation in action.

Regex pattern: (\d+th)(?:\s)(October|November|December)
Replacement pattern:  $2 $1

In the previous example, we used 3 capturing groups in the regex. If instead we place the space inside a non-capturing group*, then the numbers assigned to the groups would change, and the replacement pattern would be different.

Here, we also have 3 groups, but one is a non-capturing group, indicated by the ?: inside the parentheses.

Group 1: (\d+th)
One or more digits followed by the literal characters "th"
Substitution element: $1 

Group 2: (?:\s) 
A space
Substitution element: None, this is a non-capturing group, so its contents are not saved to be used later on.

Group 3: (October|November|December)
Any of these words
Substitution element: $2

Note: In this example, the space is placed inside a non-capturing group only to give an example of how non-capturing groups work, but actually we could use a regex that doesn't place the space in a group, with the same effect: (\d+th)\s(October|November|December).

We have said before that capturing groups are assigned consecutive numbers, starting with 1, that can be used later on in the replacement pattern. But what if we want to use the entire matched string in the replacement operation? In that case, we use $0. You may be wondering when you would need to do this. Consider the following example.

Regex pattern: (\d+,)?\d+\.\d+\scash
Replacement pattern: $$$0

In this example, we want to add a dollar sign in front of any instances of an amount followed by a space and the word cash.

The regex pattern captures the various amounts by matching one or more digits followed by a comma (this first part is made optional by placing the regex in a group and adding the ? quantifier, which means 0 or 1 times), followed by one ore more digits followed by a point, followed by one or more digits, followed by a space and the word "cash".

Since the dollar sign is a special character in the replacement pattern, when we want to enter a literal dollar sign in the replacement, we must use $$. Thus, $$$0 means a dollar sign ($$) followed by the entire string ($0) that matches the regex.

See the replacement in action below.

For this article, we've looked at replacements in the Editor view, but SDL Trados Studio also accepts regex replacements in the Translation Memories view and regex replacement syntax in the Verification settings.

With this, we have come to the end of this article. If you'd like to have a copy of my cheat sheet, you can download it here:

 ¡Pregunta por los precios especiales de SDL Trados Studio para México!