How to use regular expressions?

hawksbill

References:
Guide for basic & extended regular expressions - https://en.wikibooks.org/wiki/Regular_Expressions/POSIX-Extended_Regular_Expressions
Man page for pcre - https://www.lightnetics.com/post/3347

Note: A variety of tools use regular expressions, and different tools may give different results. Test your expressions on sample files first; when using them to modify or delete data, you want predictable results.

Basic regular expressions.

The Anchor -

An anchor ties the match to a position in the line.

The caret (not the eating variety) matches the beginning of the line. (start anchor)

^

The dollar is the end of the line. (end anchor)

$

Two sample files are used in these examples: grepfile and numbersfile.

$ cat grepfile
We're going to need a bigger boat
Go ahead make my day
The brothers in arms
The rain in spain stays mainly on the plain

$ cat numbersfile 
56789

23653

457

Search for W at beginning of line.

$ grep ^W grepfile

We're going to need a bigger boat

Search for letters ay at the end of the line.

$ grep ay$ grepfile

Go ahead make my day

Search for the word "the" in the file.

$ grep the grepfile

This prints more than we wanted, because "the" also matches inside "brothers":
The brothers in arms
The rain in spain stays mainly on the plain

To match only the word "the", put spaces around it, using single quotes so the shell preserves the spaces. Note that this misses the word at the very start or end of a line.

$ grep ' the ' grepfile

The rain in spain stays mainly on the plain

The dot matches any single character.

.

Match any character followed by a double e.

$ grep .ee grepfile

We're going to need a bigger boat

Square brackets match any one character from a set or range. Ranges work for numbers and letters and must run from low to high.

[...]

Search for any letters between u and z in the file.

$ grep [u-z] grepfile

Go ahead make my day
The rain in spain stays mainly on the plain

A caret as the first character inside square brackets is a special case: it negates the set, so the expression matches any character that is not listed.

[^...]

Exclude the characters a through z, the single quote, G and T. The single quote has to be escaped with a backslash because it is a special character to the shell. In this file the only characters left to match are W and the space, and since every line contains a space, every line is printed.

$ grep [^a-z\'GT] grepfile

We're going to need a bigger boat
Go ahead make my day
The brothers in arms
The rain in spain stays mainly on the plain
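
To see which characters are really doing the matching, here is a small sketch that adds the space to the excluded set. The scratch directory, the LC_ALL=C setting (to keep a-z limited to lowercase ASCII) and the variable name only_w are assumptions for the illustration, not part of the original example.

```shell
# Scratch copy of the sample file (assumed setup, so the example runs anywhere).
cd "$(mktemp -d)" || exit 1
printf '%s\n' \
  "We're going to need a bigger boat" \
  'Go ahead make my day' \
  'The brothers in arms' \
  'The rain in spain stays mainly on the plain' > grepfile

# Same set as above, but with a space added to the excluded characters.
# Double quotes let the single quote sit in the pattern without a backslash.
only_w=$(LC_ALL=C grep "[^a-z'GT ]" grepfile)
echo "$only_w"
```

With the space excluded as well, only the line containing W should be printed.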

To match zero or more digits. Notice that "zero or more" also matches lines with no digits at all, so the blank lines are included.

$ grep [0-9]* numbersfile

56789

23653

457

To match one or more digits, repeat the character set. Notice the blank lines are no longer shown.

$ grep [0-9][0-9]* numbersfile

56789
23653
457
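
The two patterns can be compared by counting matching lines. This is a sketch under a few assumptions: the file is recreated in a scratch directory, and the patterns are wrapped in single quotes, because an unquoted [0-9]* is also a shell filename pattern and would be expanded if a file starting with a digit existed in the current directory. The variable names are made up for the example.

```shell
# Scratch copy of the sample file (assumed setup).
cd "$(mktemp -d)" || exit 1
printf '56789\n\n23653\n\n457\n' > numbersfile

# Single quotes stop the shell from treating [0-9]* as a filename glob.
zero_or_more=$(grep -c '[0-9]*' numbersfile)      # counts blank lines too
one_or_more=$(grep -c '[0-9][0-9]*' numbersfile)  # needs at least one digit
echo "zero or more: $zero_or_more, one or more: $one_or_more"
```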

Curly brackets escaped with a backslash match the preceding item a given number of times, from a minimum of n to a maximum of m.

\{n,m\}

Match a run of 4, 5, 6, 7, or 8 characters from the set a through z.

$ grep '[a-z]\{4,8\}' grepfile

Match a line beginning with an uppercase T. The \{1\} asks for exactly one repetition, which is the same as writing ^T.

$ grep '^T\{1\}' grepfile
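
No output is shown for the two interval examples, so here is a sketch of what they do. The scratch directory, LC_ALL=C, the TTop test string and the variable names are assumptions added for illustration.

```shell
# Scratch copy of the sample file (assumed setup).
cd "$(mktemp -d)" || exit 1
printf '%s\n' \
  "We're going to need a bigger boat" \
  'Go ahead make my day' \
  'The brothers in arms' \
  'The rain in spain stays mainly on the plain' > grepfile

# Count the lines that contain a run of 4 to 8 lowercase letters.
run_count=$(LC_ALL=C grep -c '[a-z]\{4,8\}' grepfile)

# Lines that begin with an uppercase T.
t_lines=$(grep '^T\{1\}' grepfile)

# The interval only limits the T it is attached to; nothing stops
# another T from following, so a line starting with TT still matches.
tt_count=$(printf 'TTop\n' | grep -c '^T\{1\}')

echo "lines with a 4-8 letter run: $run_count"
echo "$t_lines"
echo "TTop matches: $tt_count"
```

Every line of grepfile has at least one word of four or more lowercase letters, so the first count covers the whole file.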

The left and right angle brackets escaped with a backslash match the beginning and end of a word.

\<...\>

Match every whole word "The" or "the" in the file.

$ grep '\<[tT]he\>' grepfile
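
To show what the word anchors change, this sketch counts individual matches with and without them. It assumes a grep that supports -o and the \< \> anchors (GNU grep does; they are not part of the POSIX basic syntax), and the variable names are made up for the example.

```shell
# Scratch copy of the sample file (assumed setup).
cd "$(mktemp -d)" || exit 1
printf '%s\n' \
  "We're going to need a bigger boat" \
  'Go ahead make my day' \
  'The brothers in arms' \
  'The rain in spain stays mainly on the plain' > grepfile

# -o prints every match on its own line, so wc -l counts matches.
whole_words=$(grep -o '\<[tT]he\>' grepfile | wc -l | tr -d ' ')
any_position=$(grep -o '[tT]he' grepfile | wc -l | tr -d ' ')
echo "whole words: $whole_words, anywhere: $any_position"
```

The extra match without the anchors is the "the" inside "brothers".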

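Round brackets escaped with a backslash remember the pattern they enclose, and \1, \2 and so on refer back to what was remembered.

\(...\)

Match only the word level, using remembered patterns. grep still prints the whole line; level is the part that matches.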

$ echo test level pop | grep '\([a-z]\)\([a-z]\)[a-z]\2\1'

test level pop
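
Because grep prints the whole line, it is hard to see what the remembered patterns matched. A sketch using the -o option (an addition for illustration, assuming a grep that supports it) prints only the matching part:

```shell
# \1 and \2 must repeat the first two letters in reverse order,
# so only a five-letter palindrome can match.
match=$(echo 'test level pop' | grep -o '\([a-z]\)\([a-z]\)[a-z]\2\1')
echo "$match"
```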

Extended regular expressions.

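In extended regular expressions the characters {, }, ( and ) are special on their own, without a backslash; putting a backslash in front of them makes them literal. The word anchors \< and \> and back-references (\digit) are not part of POSIX extended regular expressions, although some tools, such as GNU grep, accept them as extensions.

The following examples just demonstrate what the characters do in regular expressions.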

The dot matches any single character.

A dot on its own prints the entire file, because every line contains at least one character. With colour highlighting turned on, the whole of each line shows in red as matched.

The asterisk matches zero or more occurrences of the single character that precedes it.

Here we match the letter "a" and anything after it.
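
The original commands for the dot and asterisk are not shown, so the following is only a guess at something equivalent. The patterns '.' and 'a.*', the -o option and the variable names are assumptions.

```shell
# Scratch copy of the sample file (assumed setup).
cd "$(mktemp -d)" || exit 1
printf '%s\n' \
  "We're going to need a bigger boat" \
  'Go ahead make my day' \
  'The brothers in arms' \
  'The rain in spain stays mainly on the plain' > grepfile

# A lone dot matches on every non-empty line.
dot_count=$(grep -E -c '.' grepfile)

# "a" followed by zero or more of any character: from the first "a"
# to the end of the line. -o prints only that part.
from_a=$(grep -E -o 'a.*' grepfile)

echo "lines matched by a dot: $dot_count"
echo "$from_a"
```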

The caret matches the regular expression that follows it at the beginning of the line.

Using "^" on its own prints the entire file, because every line has a beginning.

If we add a letter to it, such as G, it matches any line beginning with the letter G.

The dollar matches the regular expression that precedes it at the end of the line.

This demonstrates that every line has an end of line.

Print any line which has a "y" at the end of it.
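
The caret and dollar descriptions above can be tried out with something like the following. The exact commands are assumptions, since the originals are not shown, and the variable names are made up.

```shell
# Scratch copy of the sample file (assumed setup).
cd "$(mktemp -d)" || exit 1
printf '%s\n' \
  "We're going to need a bigger boat" \
  'Go ahead make my day' \
  'The brothers in arms' \
  'The rain in spain stays mainly on the plain' > grepfile

# A bare anchor matches every line: each one has a start and an end.
every_start=$(grep -E -c '^' grepfile)
every_end=$(grep -E -c '$' grepfile)

# Anchors combined with a letter.
starts_g=$(grep -E '^G' grepfile)
ends_y=$(grep -E 'y$' grepfile)

echo "bare ^: $every_start lines, bare \$: $every_end lines"
echo "starts with G: $starts_g"
echo "ends with y:   $ends_y"
```

In this file the same line happens to satisfy both ^G and y$.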

The [...] matches any one character from a set or range.

Here we match the "a" and "i" characters in the file.

Here we match any characters from a through d.

Here we match any characters that are not (using the caret) a through d.
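
A sketch of the three bracket forms on single words, so the matched characters are easy to see. The words, the -o option, LC_ALL=C and the variable names are assumptions for the illustration.

```shell
# -o prints each matched character on its own line;
# tr -d '\n' joins them back together.
set_ai=$(echo 'rain' | LC_ALL=C grep -E -o '[ai]' | tr -d '\n')
range_ad=$(echo 'boat' | LC_ALL=C grep -E -o '[a-d]' | tr -d '\n')
not_ad=$(echo 'boat' | LC_ALL=C grep -E -o '[^a-d]' | tr -d '\n')

echo "[ai] in rain:    $set_ai"
echo "[a-d] in boat:   $range_ad"
echo "[^a-d] in boat:  $not_ad"
```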

? matches zero or one instance of the preceding regular expression.

Here we match an "n" optionally preceded by an "a" or "i". Because the match can start anywhere in the line, a string such as aiin or aain would also match.

+ matches one or more instances of preceding regular expression.

Notice the difference between the ? and +
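
To make the difference concrete, here is a sketch that runs both operators over the same made-up string. The patterns [ai]?n and [ai]+n are guesses at what the original examples used, and the sample string and variable names are hypothetical.

```shell
sample='on rain aain'

# ? allows zero or one "a"/"i" before the n, so a bare n matches.
optional=$(echo "$sample" | grep -E -o '[ai]?n' | xargs)

# + needs at least one "a"/"i" and takes as many as it can.
repeated=$(echo "$sample" | grep -E -o '[ai]+n' | xargs)

echo "with ?: $optional"
echo "with +: $repeated"
```

With ?, each match is at most two characters long; with +, the whole run of a and i before the n is included.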

| matches the regular expression specified before or after.

Here we match one or more "a" or "i" followed by an "n", or anything containing "go".
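
A sketch of the alternation, assuming the pattern was something like [ai]+n|go (the original command is not shown). The variable names are made up.

```shell
# Scratch copy of the sample file (assumed setup).
cd "$(mktemp -d)" || exit 1
printf '%s\n' \
  "We're going to need a bigger boat" \
  'Go ahead make my day' \
  'The brothers in arms' \
  'The rain in spain stays mainly on the plain' > grepfile

# Count lines matching either side of the |.
either=$(grep -E -c '[ai]+n|go' grepfile)

# Show which parts of the first line matched.
parts=$(sed -n 1p grepfile | grep -E -o '[ai]+n|go' | xargs)

echo "lines matched: $either"
echo "matches in line 1: $parts"
```

The line starting with "Go" is not counted: the match is case sensitive and that line has no "n".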