Nora Díaz on Translation, Teaching, and Other Stuff: Regular Expressions for Translators: Character Classes

In previous articles, we learned about metacharacters, anchors and special characters. In this article, we’ll talk about the character classes \s, \S, \d, \D, \w and \W.


First, let’s understand which characters are included in or excluded from each character class.

Let’s consider the following strings containing English and Spanish characters. This is not meant to be an exhaustive list of characters, but rather something we can use to understand the concept of character classes.

abcdefghijklmnñopqrstuvwxyz ABCDEFGHIJKLMNÑOPQRSTUVWXYZ
áéíóúÁÉÍÓÚ 1234567890-!"#$%&/()=?¡[¨*]{}´+,.;:_

To clearly visualize which characters are matched by each character class, I will use the above strings and a regex tester called Regex Hero, which alternates between yellow and orange highlighting to distinguish consecutive matches.

\s will match a single white space character: a space, a tab, a line feed (soft return) or a carriage return (hard return). In our sample strings there are 4 matches: a space, a tab and a soft return at the end of the first line, plus a space in the second line.

\S will match any character that is not a white space. As expected, the 4 matches from the previous example are excluded and each of the remaining characters is matched, for a total of 100 matches in our sample strings.
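To reproduce these counts outside Regex Hero, here is a minimal Python sketch. Python's re module is not the .NET engine behind Regex Hero and SDL Trados Studio, so treat it as an approximation; the two agree on the characters in our sample. Placing the tab at the end of the first line is an assumption, since only the number of matches is given above.

```python
import re
import unicodedata

# The article's two sample lines. Putting the tab at the end of line 1
# is an assumption; the soft return is represented by "\n".
LINE_1 = "abcdefghijklmnñopqrstuvwxyz ABCDEFGHIJKLMNÑOPQRSTUVWXYZ"
LINE_2 = "áéíóúÁÉÍÓÚ 1234567890-!\"#$%&/()=?¡[¨*]{}´+,.;:_"

# NFC normalization makes sure each accented letter is a single character.
SAMPLE = unicodedata.normalize("NFC", LINE_1 + "\t\n" + LINE_2)

whitespace = re.findall(r"\s", SAMPLE)      # one match per white space character
non_whitespace = re.findall(r"\S", SAMPLE)  # one match per other character

print(whitespace)           # [' ', '\t', '\n', ' ']
print(len(non_whitespace))  # 100
```

Together the two classes cover every character in the sample: 4 white space matches plus 100 other matches.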

\d will match any digit, so each of the 10 digits in the sample text is matched.

\D will match anything that is not a digit, so all of the other characters, including white spaces, are matched.
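The same check can be sketched in Python for \d and \D. As before, this uses Python's re module rather than the .NET engine, and the position of the tab in the sample is an assumption.

```python
import re
import unicodedata

# The article's sample lines; the tab at the end of line 1 is an assumption.
LINE_1 = "abcdefghijklmnñopqrstuvwxyz ABCDEFGHIJKLMNÑOPQRSTUVWXYZ"
LINE_2 = "áéíóúÁÉÍÓÚ 1234567890-!\"#$%&/()=?¡[¨*]{}´+,.;:_"
SAMPLE = unicodedata.normalize("NFC", LINE_1 + "\t\n" + LINE_2)

digits = re.findall(r"\d", SAMPLE)      # every digit, one match each
non_digits = re.findall(r"\D", SAMPLE)  # everything else, white space included

print("".join(digits))  # 1234567890
print(len(non_digits))  # 94
```

The 94 matches for \D are the 104 characters of the sample minus the 10 digits.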

\w will match any word character, which includes all the letters (accented letters such as ñ and á too), the digits and the underscore.

\W will match anything that is not matched by \w, including white spaces.
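Here is a hedged Python sketch of \w and \W on the same sample. Python 3, like the .NET engine, treats accented letters as word characters by default when matching text strings; the position of the tab in the sample is again an assumption.

```python
import re
import unicodedata

# The article's sample lines; the tab at the end of line 1 is an assumption.
LINE_1 = "abcdefghijklmnñopqrstuvwxyz ABCDEFGHIJKLMNÑOPQRSTUVWXYZ"
LINE_2 = "áéíóúÁÉÍÓÚ 1234567890-!\"#$%&/()=?¡[¨*]{}´+,.;:_"
SAMPLE = unicodedata.normalize("NFC", LINE_1 + "\t\n" + LINE_2)

word_chars = re.findall(r"\w", SAMPLE)      # letters, digits and the underscore
non_word_chars = re.findall(r"\W", SAMPLE)  # symbols, punctuation and white space

print(len(word_chars))      # 75
print(len(non_word_chars))  # 29
print("ñ" in word_chars, "_" in word_chars, "-" in non_word_chars)  # True True True
```

The 75 word characters are the 64 letters, the 10 digits and the underscore; the remaining 29 are the 25 other symbols plus the 4 white space characters.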

Note that each of these regexes matches a single character. Matching more than one instance at a time requires quantifiers, which we will learn about in a future article.
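A small Python illustration of this point, using a made-up string: a four-digit number produces four separate one-character matches rather than a single match.

```python
import re

text = "Invoice 2019"  # hypothetical text, for illustration only

# \d matches exactly one digit per match, so "2019" is found
# as four individual matches, not as one four-digit match.
matches = re.findall(r"\d", text)

print(matches)  # ['2', '0', '1', '9']
```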

The video below shows the regexes being used with the Find operation in SDL Trados Studio.

Application
Now let’s use these character classes combined with the start of segment and end of segment anchors and escaped metacharacters to see some possible use cases with the display filter in SDL Trados Studio.

Here's the unfiltered text I will use for these examples:

For this demonstration, instead of adding actual translations on the target side, I have just copied the source over and have added some superfluous spaces and tabs.

Example 1: Filter on target segments that start with a white space (space or tab)

Regex: ^\s

Example 2: Filter on target segments that end in a white space

Regex: \s$

Example 3: Filter on target segments that end in a number

Regex: \d$

Example 4: Filter on target segments that start with a word character

Regex: ^\w

Example 5: Filter on target segments that don't end in a word character

Regex: \W$

Example 6: Filter on target segments that end with a space followed by a period

Regex: \s\.$

Example 7: Filter on target segments that end in a "not white space" character followed by a question mark

Regex: \S\?$

Example 8: Filter on target segments that end in a digit followed by a "not word" character followed by a period

Regex: \d\W\.$
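The eight filters above can also be tried outside Studio with a short Python sketch. The segments below are hypothetical, invented only for illustration, and the filter_segments helper is an assumption about how a display filter behaves (keep a segment if the regex is found anywhere in it). Python's re module is not the .NET engine Studio uses, but these particular patterns behave the same way in both.

```python
import re

# Hypothetical target segments, invented for illustration only.
SEGMENTS = [
    " Leading space.",    # 0
    "Trailing space. ",   # 1
    "Chapter 12",         # 2
    "Is this correct ?",  # 3
    "Is this correct?",   # 4
    "See page 7 .",       # 5
    "\tTabbed start",     # 6
    "Total: 100%.",       # 7
]

# The regexes from Examples 1 to 8, keyed by example number.
FILTERS = {
    1: r"^\s",
    2: r"\s$",
    3: r"\d$",
    4: r"^\w",
    5: r"\W$",
    6: r"\s\.$",
    7: r"\S\?$",
    8: r"\d\W\.$",
}


def filter_segments(pattern, segments):
    """Return the indexes of the segments in which the pattern is found."""
    regex = re.compile(pattern)
    return [i for i, segment in enumerate(segments) if regex.search(segment)]


for example, pattern in FILTERS.items():
    print(f"Example {example}: {pattern} -> {filter_segments(pattern, SEGMENTS)}")
```

With these sample segments, Example 1 keeps segments 0 and 6 (leading space and leading tab), Example 7 keeps only segment 4, and Example 8 keeps segments 5 and 7, since both a space and a percent sign count as "not word" characters.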

As we can see, by combining character classes, anchors and escaped metacharacters, we can start enhancing our use of SDL Trados Studio's display filter.

Remember that in addition to the display filter, SDL Trados Studio accepts regular expressions in the Find and Replace dialog box, the segmentation rules and the verification settings.


Thursday, November 7, 2019
