Sunday, November 10, 2019

Regular Expressions for Translators: Quantifiers

While character classes allow us to match a variety of characters, quantifiers bring more power to regular expressions by allowing us to specify how many times those characters should be matched.


Let's add each quantifier to a digit (\d) regex and look at the different results we get in a Find operation in SDL Trados Studio.

Regex: \d

The match for this regex is a single digit, as shown below.

Regex: \d*

Adding the * (zero or more) quantifier to \d gives us a different match: a series of consecutive digits.

However, because * also allows zero digits, in SDL Trados Studio this regex will also match text that is not a digit. Clicking "Find Next" in the example above causes this regex to match the comma.

Regex: \d+

Adding the + (one or more) quantifier to \d means that the match must have at least one digit.

Clicking "Find Next" here will skip the comma and match the next number group.

Note that this simple regex will match a group of numbers even if they are part of a string containing other characters. Have a look at this example, where I've added the letters AFG to the number 879.
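To try the difference between \d, \d* and \d+ outside of Studio, here is a minimal sketch using Python's re module. The sample segment is hypothetical (it is not the one from the screenshots), and Python only stands in for Studio's own regex engine, so the way matches are highlighted in the Find dialog may differ.

```python
import re

# Hypothetical sample segment, chosen only for illustration.
segment = "1250, 879AFG"

# \d matches one digit at a time.
single = re.findall(r"\d", segment)

# \d* matches runs of digits, but also "succeeds" with zero digits,
# which shows up here as empty strings at non-digit positions.
star = re.findall(r"\d*", segment)

# \d+ requires at least one digit, so non-digits are skipped.
# Note that 879 is matched even though it is attached to AFG.
plus = re.findall(r"\d+", segment)

print(single)
print(star)
print(plus)
```

The empty strings returned for \d* are the Python counterpart of the unexpected non-digit match described above: the pattern is satisfied even where there is no digit at all.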

Regex: \d?

Adding the ? (zero or one) quantifier to \d means that the match will be either zero digits or one digit.

Clicking "Find Next" will match each individual character in the segment, including the comma and letters.
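The zero-or-one behavior can be sketched in Python as well. The segment below is a made-up example, and Python reports a zero-length match at each non-digit position instead of highlighting the character, so treat this as an approximation of what Studio shows.

```python
import re

# Hypothetical sample segment, chosen only for illustration.
segment = "12, A7"

# Collect (position, matched text) for every match of \d?.
matches = [(m.start(), m.group()) for m in re.finditer(r"\d?", segment)]

for position, text in matches:
    # An empty string means "zero digits" matched at that position.
    label = repr(text) if text else "zero digits"
    print(position, label)
```

Every match is at most one character long: each digit is matched on its own, and each non-digit position produces a zero-digit match.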

Regex: \d{4}

A single number inside curly brackets is a quantifier that indicates that the preceding regex must be matched that exact number of times.

Clicking "Find Next" matches the next four-digit sequence in the segment.

Regex: \d{4,}

A number followed by a comma inside curly brackets indicates that the preceding regex must be matched at least that many times. In this example, the match is four or more consecutive digits.

Clicking "Find Next" gives us the following result.

Regex: \d{3,5}

Two numbers separated by a comma (no space) inside curly brackets indicate the minimum and maximum number of times that the preceding regex must be matched. In this example, it will be 3, 4 or 5 times.

Running a Find operation with this regex in the active segment below matches the first 3-digit group it finds.

Clicking "Find Next" two more times gives the following results.
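The three curly-bracket quantifiers can be compared side by side with a short Python sketch. The segment is hypothetical and was chosen so that each quantifier produces a different list of matches.

```python
import re

# Hypothetical sample segment with digit groups of different lengths.
segment = "25 614 1250 30500 1234567"

# Exactly four digits. Longer groups still match, four digits at a time.
exact_four = re.findall(r"\d{4}", segment)

# Four digits or more.
four_or_more = re.findall(r"\d{4,}", segment)

# Between three and five digits (as many as possible within that range).
three_to_five = re.findall(r"\d{3,5}", segment)

print(exact_four)     # ['1250', '3050', '1234']
print(four_or_more)   # ['1250', '30500', '1234567']
print(three_to_five)  # ['614', '1250', '30500', '12345']
```

Note that \d{4} and \d{3,5} also match inside longer numbers such as 30500 and 1234567. If only standalone numbers should match, the pattern needs something extra, for example word boundaries (\b) around it.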

Greedy or Lazy?

Quantifiers are greedy by default, which means that they will match as many occurrences of the regex pattern as possible. Consider the example below.

Regex: .+-

The dot character is a wildcard for any character, so this regex will match any character one or more times, as many times as possible (greedy), followed by a dash.

In the active segment below, it looks like there are three groups of characters (numbers) followed by a dash: 614-, 597- and 7855-. In fact, the first two dashes are interpreted as "any character" because the greedy quantifier + keeps matching "any character" as many times as possible, and only the last dash in the segment is matched as the dash in the regex.

Making the quantifier lazy by adding a ? after it (.+?-) will cause the expression to recognize the first dash as a dash and not as any character, which results in a different match:

Each of the quantifiers above can be made lazy by adding a ? to it (*?, +?, ??, {4,}?, {3,5}?), which will result in the expression being matched as few times as possible.
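The greedy and lazy versions of the dash example can be compared in Python. The segment below is an assumption: it reuses the 614-, 597- and 7855- groups mentioned above, with a made-up "01" at the end so that the last group is also followed by a dash.

```python
import re

# Hypothetical segment built from the groups mentioned in the text.
segment = "614-597-7855-01"

# Greedy: .+ grabs as much as possible, then gives back characters
# only until the LAST dash can be matched.
greedy = re.search(r".+-", segment).group()

# Lazy: .+? grabs as little as possible, so it stops at the FIRST dash.
lazy = re.search(r".+?-", segment).group()

# Repeating the lazy search finds each group in turn, much like
# clicking "Find Next".
lazy_all = re.findall(r".+?-", segment)

print(greedy)    # 614-597-7855-
print(lazy)      # 614-
print(lazy_all)  # ['614-', '597-', '7855-']
```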

Finally, while most of the examples above are based on digits, quantifiers can be used with any other regular expression, such as a literal character, a character class or a group.
