Wednesday, November 25, 2015

Beyond Punctuation: Creating Custom Segmentation Rules in Studio

A freshly created TM in Studio comes with three standard segmentation rules, as shown below.


While these rules are enough in most cases, sometimes we'll open a file and wish it were segmented differently, as in this case:


A quick look at this file makes it clear that it would be a lot easier to handle if "Dry Time" and "Wait Time" were in their own separate segments, so a custom segmentation rule would come in handy.

This new rule won't be punctuation-related, but instead, it will be content-related. In other words, I need Studio to create a new segment whenever the text "Wait Time:" or "Dry Time:" is found.

Adding a New Segmentation Rule

To access the segmentation rules, follow the path shown below.


This opens the Segmentation Rules window, where we will add the new rule.


Clicking Advanced View takes us to this window:


This is where we will tell Studio where to break the text. Before proceeding, let's spell out exactly what we want the new rule to do.


As shown above, we want to add a segment break (represented by the yellow line) right before "Wait Time:" and "Dry Time:", both of which are preceded by a space. In the window above, I need to tell Studio what pattern can be found before the (segment) break and after the break. So, in this example:

Before the break there is a space

and

After the break there is either "Wait Time:" or "Dry Time:"

To tell Studio what I want to do, I will need to use regular expressions, which for this example are not too complicated.


Explanation:

Before break
\s

  • \s is the regular expression shorthand that matches a whitespace character (such as a space or a tab)


After break
(Wait|Dry) Time:


  • The | indicates alternation, so it's telling Studio to look for either "Wait" or "Dry"
  • The parentheses are used to group the two alternatives, as otherwise Studio would look for "Wait" or "Dry Time:", that is, it would not combine "Wait" and "Time:"
Note that I'm including the colon in the "After break" expression. This is to prevent unwanted segmentation in segments like "Wait Time cannot be longer than Dry Time."
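
To try these two expressions outside Studio, here is a minimal Python sketch. It rests on a few assumptions: Python's regex engine is not the one Studio uses (although \s, | and parentheses behave the same way in both), the helper name segment and the sample sentences are made up for illustration, and Studio may treat the whitespace at the break differently. The sketch joins the "Before break" and "After break" patterns into a single zero-width break point.

```python
import re

# The two patterns typed into Studio's Advanced View.
BEFORE_BREAK = r"\s"
AFTER_BREAK = r"(Wait|Dry) Time:"

# Assumption: model the break as a zero-width position that has
# BEFORE_BREAK immediately to its left and AFTER_BREAK to its right.
BREAK = re.compile(rf"(?<={BEFORE_BREAK})(?={AFTER_BREAK})")


def segment(text):
    """Split text at every position matched by BREAK (hypothetical helper)."""
    segments, start = [], 0
    for match in BREAK.finditer(text):
        segments.append(text[start:match.start()])
        start = match.start()
    segments.append(text[start:])
    return segments


# Made-up sample: breaks before "Dry Time:" and before "Wait Time:".
print(segment("Apply one coat. Dry Time: 2 hours. Wait Time: 30 minutes."))

# No colon after "Time", so the sentence stays in one piece.
print(segment("Wait Time cannot be longer than Dry Time."))

# Without parentheses, "Wait" alone satisfies the pattern.
print(re.fullmatch(r"Wait|Dry Time:", "Wait") is not None)
# With parentheses, "Wait" must be followed by " Time:".
print(re.fullmatch(r"(Wait|Dry) Time:", "Wait") is not None)
```

The first call returns three pieces, and the second returns the sentence unchanged because neither "Wait Time" nor "Dry Time" is followed by a colon there.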

After clicking OK, the rule is now included in the list of segmentation rules.



After closing all the open windows, the new rule is available to be applied during processing.

To apply it to my file, I will need to first remove the file from my project, add it again and process it as usual, as shown in this short video.



And that's all there is to it! After re-processing the file, I now have the segmentation I wanted.





Friday, November 6, 2015

Combining TMs in Studio: TM Import and TM Upgrade


When we need to combine the contents of two or more TMs, we have two operations available in Studio: Importing and Upgrading Translation Memories.


The main difference is the end result:
Upgrading TMs will produce a new TM.
So, TM1 + TM2 = TM3

Importing TMs will modify an existing TM.
So TM1 + TM2 = TM1 (including all the contents of TM2)
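
For readers who think in code, the difference can be sketched in a few lines of Python. This is only a conceptual model, not how Studio stores or processes TMs: each TM is assumed to be a plain set of (source, target) pairs, the function names upgrade and import_into are hypothetical, and the sample TUs are invented. Because a set keeps one copy of each pair, a TU that is identical in both TMs ends up in the result only once.

```python
def upgrade(*tms):
    """Model of Upgrade: build a NEW TM and leave the inputs untouched."""
    new_tm = set()
    for tm in tms:
        new_tm.update(tm)
    return new_tm


def import_into(existing_tm, other_tm):
    """Model of Import: add the TUs of other_tm to existing_tm in place."""
    existing_tm.update(other_tm)


# Invented sample TMs; both contain the identical TU ("Yellow", "Amarillo").
tm1 = {("Red", "Rojo"), ("Yellow", "Amarillo")}
tm2 = {("Yellow", "Amarillo"), ("Sun", "Sol")}

tm3 = upgrade(tm1, tm2)   # TM1 + TM2 = TM3
print(len(tm1), len(tm2), len(tm3))

import_into(tm1, tm2)     # TM1 + TM2 = TM1
print(len(tm1), tm1 == tm3)
```

After the upgrade, tm1 and tm2 still hold two TUs each and tm3 holds three. After the import, tm1 itself holds the three TUs.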


Let's have a look at the basics of each method.

For this example I have two simple TMs called Red and Yellow, shown here in the Translation Memories view. I’ve added an identical TU (Yellow-Amarillo) to both TMs so we can see what happens to that kind of content.

image

image


Method 1: Upgrading TMs

Note: This will create a new TM that will include the contents of all the selected TMs.

First, I click Upgrade Translation Memories from the Home tab in the Translation Memories view.

image

For our example, I will add the two file-based TMs I created, so I select “Add File-based TMs…”, then go to the folder where my TMs are stored.

Note: Selecting All Supported Files in the drop-down menu next to the File name field ensures that Studio TMs (sdltm) are also visible.

image

I select the files I want to merge and click Open.

image

Then I click Next.

In the Output Translation Memories window, the first option is to create one TM for each of the selected TMs. This is useful when upgrading a Legacy TM to Studio’s sdltm format. For our example I want to merge the two TMs, so I’ll use the second option: Create output translation memory for each language pair.

Note: Studio gives the new TM a default name that is made up of the source and target language codes. In this example, Studio gives me the default TM name en-US_es-MX. Double-clicking on this name allows me to edit it to give the TM a specific name; in this case, I’ve changed it to Red+Yellow.

image

Click Next.

The next window allows us to select some settings for the TM merge operation. The screenshot below shows the Settings tab. I left all the default settings intact for this example.

image

After clicking Finish, I see a window summarizing the operation. Looking at the list of steps, we can see that it’s basically an Export+Import sequence: each of the selected TMs is first exported, then a new TM is created, and finally each of the exported TMs is imported into the new TM.

image

This is what the new TM looks like. The identical TU in both TMs (Yellow – Amarillo) gets merged into a single TU in the new TM so there will be no duplicates.

image



Method 2: Importing one TM into another one


Note: With this method, the original TM is modified to include the new TUs.

If the TM to be imported is in sdltm format, it will first need to be exported to a TMX file.
For our example, I will import the contents of the Yellow TM into the Red TM.
First, to export the Yellow sdltm file to a TMX file, from the Translation Memories view, I select the TM and click Export in the ribbon.

image

After clicking Save I will have a TMX file.

Next, I select the Red TM and click Import, selecting the TMX file I just created.

image

After clicking Open and Next, the last screen before the actual Import operation allows me to choose some settings.

image

Again, I left all the default settings intact and clicked Finish. After the import operation is complete, the Red TM shows the newly added TUs. Since TU 3 (Yellow-Amarillo) is identical in both the original and the exported TM, only one copy is kept, avoiding duplicates.

image

And that’s all there is to it.

Both operations are simple and easy to access, and it’s just a matter of deciding which output best fits our intended purpose.