Using Regular Expressions To Find Common Errors – Guest Post by Russell Phillips

I have a great editor, but I understand that she is human, and therefore she makes mistakes, and misses things, just like I do. Therefore, I like to try and make my manuscript as good as I can before I hand it over to her. The trouble with editing your own work, of course, is that all too often, your brain sees what is supposed to be there, not what is actually there.

One tool I use for finding errors is regular expressions. Regular expressions are like search and replace on steroids. Instead of finding simple strings of text, regular expressions provide a way to find patterns within the text. This makes them ideal for finding certain types of error that can occur all too easily when writing a long piece of text. The use of copy & paste, deleting, etc, can mean that even simple grammatical mistakes or typos can slip in and not be noticed.

Below I have listed some regular expression searches that I currently use on my manuscripts before sending them to my editor. Note that they are formatted with a different background colour because spaces at the start or end can be important. It is possible to use regular expressions to replace text, but I haven’t included replacement expressions because I prefer to be cautious and make corrections manually. I’ve tried to order them in increasing complexity, and I’ve included some explanatory text for each one.

The expressions given below should work in LibreOffice and Scrivener version 2.4 or later (earlier versions don’t support regular expressions). Microsoft Word also supports regular expressions, although the syntax is rather unusual, so you’ll need to check the documentation for help. Whichever software you use, you will have to tell it that you’re doing a regular expression search, rather than a normal text search. In LibreOffice Writer, use the “Find and Replace” function (not “Find”). Click “Other Options” in the dialogue box, and tick the “Regular expressions” tickbox. In Scrivener project search, select “RegEx” from the operator section of the magnifying glass icon menu. In Scrivener document find, select “Regular Expressions (RegEx)” from the “Find Options” drop-down menu.

When copying and pasting from your browser into the search box, check that the quotation marks are correct, because they sometimes get mangled.

Punctuation And Quotation Marks

This is a simple expression, but there are two versions. In British English, the convention is to have commas and full stops outside quotation marks, whereas in US English, commas and full stops are placed inside the quotation marks.

Expression to find commas and full stops inside quotation marks (use this if you write in British English):

[.,]”

Expression to find commas and full stops outside quotation marks (use this if you write in US English):

”[.,]

These simple expressions match a closing quotation mark followed or preceded by a full stop or a comma. Square brackets are used to group characters, so that if any character in the square brackets is present, a match is found. In this case, the square brackets are used to match a full stop or comma, but nothing else.
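As a rough sketch, the same two checks can be tried outside the word processor with Python's re module. This assumes the manuscript uses curly closing quotation marks; the sample sentences and variable names are invented for illustration, and Python's engine is a stand-in, not what LibreOffice or Scrivener use internally.

```python
import re

# Closing curly quotation mark. Swap in '"' if the manuscript uses straight quotes.
CLOSING_QUOTE = "\u201d"

# Comma or full stop placed inside the quotation marks (US convention).
inside = re.compile("[.,]" + CLOSING_QUOTE)
# Comma or full stop placed outside the quotation marks (British convention).
outside = re.compile(CLOSING_QUOTE + "[.,]")

# Invented sample sentences, one in each style.
us_style = "\u201cI see,\u201d she said."
british_style = "He called it \u201codd\u201d, then left."

print(bool(inside.search(us_style)))        # True
print(bool(outside.search(us_style)))       # False
print(bool(outside.search(british_style)))  # True
print(bool(inside.search(british_style)))   # False
```

Each pattern flags only the style it is looking for, so a British English writer would search with the first and a US English writer with the second.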

“a” instead of “an”

This expression will find words that begin with a vowel and are immediately preceded by “a”, instead of “an”:

 a [aeiou]

The first three characters are simple: space, lower case “a”, space. Then square brackets are used to group all five vowels. Note that the “Match case” option must be selected in LibreOffice for it to work correctly.
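A minimal sketch of this check in Python, assuming the re module as a stand-in for the word processor's search. Python matches case-sensitively by default, which corresponds to having “Match case” selected. The sample sentence is invented for illustration.

```python
import re

# Space, lower case "a", space, then any one of the five vowels.
pattern = re.compile(r" a [aeiou]")

# Invented sample text with one genuine mistake ("a apple").
text = "She ate a apple and a pear under an umbrella."

matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)  # [' a a']
```

Because the pattern only looks at spelling, it will also flag correct phrases such as “a university”, which is one more reason to make the corrections by hand.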

Oxford Commas

At school, I was taught not to use Oxford commas, but I use them in my books because they can avoid ambiguity. Unfortunately, because I didn’t use them for so long, I frequently forget to add them. Consequently, one of the first regular expressions I wrote to check for errors in my writing was to spot missing Oxford commas. Note that this won’t find every sentence that is missing an Oxford comma, but that’s why you have a human editor 🙂

\w+, \w+ and 

If you have the opposite problem, and you don’t want Oxford commas, the following expression should find them:

\w+, \w+, and 

“\w” matches a word character, ie any character that can be part of a word (letters, numbers, etc). The “+” means that the preceding character must be present one or more times, so “\w+” matches a word.
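To see how the two Oxford comma expressions behave, here is a small sketch using Python's re module. The variable names and sample sentences are made up for illustration, and Python is only standing in for the word processor's search.

```python
import re

# Word, comma, space, word, then " and " with no comma before it.
missing_oxford = re.compile(r"\w+, \w+ and ")
# Same shape, but with a comma before " and ".
has_oxford = re.compile(r"\w+, \w+, and ")

# Invented sample sentences.
without_comma = "We packed bread, cheese and wine."
with_comma = "We packed bread, cheese, and wine."

print(missing_oxford.search(without_comma).group(0))  # 'bread, cheese and '
print(has_oxford.search(with_comma).group(0))         # 'bread, cheese, and '
print(missing_oxford.search(with_comma))              # None
print(has_oxford.search(without_comma))               # None
```

Both patterns expect single words between the commas, so a list such as “red wine, blue cheese and bread” is still caught, but a list of only two items before “and” is not a list at all and is rightly ignored.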

Missing Capital After Full Stop

I started using this expression after seeing this error in a book published by HarperCollins. If the big publishers can miss such basic mistakes, so can the rest of us.

Note that the “Match case” option must be selected in LibreOffice for it to work correctly. Acronyms followed by lower case letters, eg “The N.C.O. said” will not be matched.

[^.][^A-Z]\. [a-z]

This expression introduces a new twist on the use of square brackets: if the first character in the square brackets is a “^”, it matches anything NOT in the group. So, “[^.][^A-Z]” matches anything that is not a full stop, followed by anything that is not an uppercase letter. The next term is “\.”, which matches a full stop. When not in square brackets, a full stop is a wildcard, but placing a backslash before it tells the regular expression engine to treat it as a full stop, not as a wildcard. Finally, it matches a space followed by a lowercase letter.
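The behaviour described above can be checked with a short Python sketch. It assumes Python's re module, which is case-sensitive by default (the equivalent of “Match case”), and the sample sentences are invented for illustration.

```python
import re

# Not a full stop, not an uppercase letter, a literal full stop,
# a space, then a lowercase letter.
pattern = re.compile(r"[^.][^A-Z]\. [a-z]")

# Invented sample sentences.
mistake = "The tank stopped. the crew got out."
correct = "The tank stopped. The crew got out."
acronym = "The N.C.O. said nothing."

print(pattern.search(mistake).group(0))  # 'ed. t'
print(pattern.search(correct))           # None
print(pattern.search(acronym))           # None
```

The acronym is skipped because the full stop before “said” follows an uppercase letter, which “[^A-Z]” refuses to match.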

Missing Brackets

It’s far too easy to forget to close brackets, or to accidentally delete the closing bracket. This expression will find an opening bracket that doesn’t have a matching closing bracket.

\([^)]*$

Since parentheses have a special meaning in regular expressions, the opening bracket is prefixed with a backslash. This tells the regular expression engine to treat it as a simple opening bracket. The “[^)]” matches any character that is not a closing bracket, and the “*” means “match this zero or more times”. Finally, the “$” indicates the end of the line/paragraph.
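Here is a minimal sketch of the same idea in Python, assuming each paragraph is held as a separate string so that “$” lines up with the end of the paragraph. The paragraphs and names are invented for illustration.

```python
import re

# A literal opening bracket, then any run of characters that are not
# closing brackets, reaching all the way to the end of the string.
pattern = re.compile(r"\([^)]*$")

# Invented sample paragraphs: the second is missing its closing bracket.
paragraphs = [
    "The T-34 (a Soviet medium tank) was widely used.",
    "The T-34 (a Soviet medium tank was widely used.",
]

# Keep only the paragraphs where an opening bracket is never closed.
flagged = [p for p in paragraphs if pattern.search(p)]
print(flagged)  # ['The T-34 (a Soviet medium tank was widely used.']
```

Checking one paragraph at a time mirrors how the word processor treats “$”, and avoids flagging a bracket that is closed later in the same paragraph.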

Repeated Word

Repeated words crop up sometimes, and often aren’t noticed if the word happens to appear at the end of one line and the start of the next line.

\b(\w+)\b \b\1\b

This one may look rather odd, but is simple once you understand it. As above, “\w+” is used to match a word. The parentheses are used to group the characters that are matched, so that they can be referred to later in the expression. The “\1” matches the group in the parentheses. “\b” denotes a word boundary. In this case, it is used to ensure that only complete words are matched. Without the word boundaries, it would match a term like “anderson song” as the “son” would be matched in both words.

Putting all that together, this expression matches a complete word, followed by a single space, followed by the same complete word.
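The effect of the word boundaries is easiest to see by running the expression with and without them. This is a small sketch using Python's re module as a stand-in for the word processor's search; the sample text and the name “loose” are made up for illustration.

```python
import re

# Complete word, one space, then the same complete word again.
pattern = re.compile(r"\b(\w+)\b \b\1\b")
# The same idea without word boundaries, for comparison.
loose = re.compile(r"(\w+) \1")

# Invented sample text with one genuine repeated word.
text = "She sang the the anderson song."

matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)  # ['the the']

# Without the boundaries, part of one word can match part of the next.
print(loose.search("anderson song").group(0))  # 'son son'
print(pattern.search("anderson song"))         # None
```

The search is case-sensitive here, so “The the” at the start of a sentence would not be flagged unless case matching is switched off.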

Want To Learn More?

If you want to learn to write regular expressions to find the mistakes that you find yourself making, www.regular-expressions.info is an excellent learning resource, and regex101.com has a regular expression tester, which will also explain the elements of the regular expression. Finally, feel free to ask questions in the comments, and I will try to help.
