A world of Regex: The Basics
Recently I did a brief overview of sed and all its glory, but in the wide world of Admin Ninja Wizardry(tm), basic sed knowledge alone will not net you millions a year and thousands of adoring fangirls/boys/whatever you are into. I mean, neither will regex knowledge… But it will at least make sed, grep, and other tools that can use regex work better for you, right? That’s a plus… or at least not a negative.
- Basic Symbols
The first thing to be aware of in regex is that there are some basic symbols that have some deep meaning.
'.' # A period matches any single character.
'*' # A splat (*) matches any number, including 0, of whatever character PRECEDES it.
'^' # A caret (yeah, it's called a caret) means whatever comes after it is the FIRST thing on a line.
'$' # The dollar sign indicates that whatever came BEFORE it is the LAST thing on a line.
(We’ll use grep as a point of reference here, run against a testfile.txt with these lines.)
one
test
test
test test
teest
blue
orange
green
grep .... testfile.txt
test
test
test test
teest
blue
orange
green
As you can see, 4 dots finds everything in the file except for the word “one”. This is because the pattern is not anchored to anything: grep prints a line as soon as any part of it meets the requirements, even if the line then keeps going past those requirements. A dot also matches a space, so any line that contains at least 4 characters in a row matches! This is an important thing to realize, because thinking that matching 4 characters will only match words with exactly 4 characters is an idea that can lead you to some very wrong regex and a world of headaches.
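To see that a dot is happy to match a space too, here is a small sketch that pipes three made-up lines (they are not part of testfile.txt) into the same pattern. Only the 3-character line gets left out.

```shell
# Hypothetical sample lines, made up for this illustration.
matched=$(printf '%s\n' 'one' 'a b c' 'four' | grep '....')

# "one" has only 3 characters, so it is dropped.
# "a b c" has no 4-letter word, but it is 5 characters long, so it matches.
echo "$matched"
```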
So what if you wanted to find any lines that ONLY contained a single word that was 4 characters long?
This is where the ^ and the $ come in.
grep '^....$' testfile.txt
test
test
blue
As you can see, this time we matched the two lines that only contained the word “test”, as well as the line that contained the word “blue”.
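The two anchors also work on their own. As a rough sketch, here are a few words from the test file fed in through printf (the variable names are just for illustration):

```shell
# A few words from the test file, one per line.
words=$(printf '%s\n' 'test' 'teest' 'blue' 'orange' 'green')

# ^t : lines whose FIRST character is a t.
starts_with_t=$(printf '%s\n' "$words" | grep '^t')

# e$ : lines whose LAST character is an e.
ends_with_e=$(printf '%s\n' "$words" | grep 'e$')

echo "$starts_with_t"   # test, teest
echo "$ends_with_e"     # blue, orange
```

The single quotes keep the shell from interpreting ^, $ or * before grep ever sees them.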
Now let’s talk about using the ‘*’, and how you can start making more complicated requirements. The ‘*’ allows you to specify that the character that came BEFORE it can be repeated any number of times, from 0 to infinity. This is often a hard thing for new regex wizards to get used to, because in MANY other places (shell wildcards, for example) we use ‘*’ to mean “match anything”, but when it comes to the power of regex, it’s a different story.
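To make the difference concrete, here is a minimal sketch that tests the same made-up word, “tx”, against te* twice: once as a shell wildcard (via case) and once as a regex (via grep). The word and the variable names are assumptions for illustration only.

```shell
word='tx'   # hypothetical sample word

# Shell wildcard: te* means "te followed by anything", so "tx" does not match.
case "$word" in
  te*) glob_result='match' ;;
  *)   glob_result='no match' ;;
esac

# Regex: te* means "t followed by zero or more e's", so "tx" does match.
if printf '%s\n' "$word" | grep -q 'te*'; then
  regex_result='match'
else
  regex_result='no match'
fi

echo "glob: $glob_result, regex: $regex_result"   # glob: no match, regex: match
```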
So let’s say that we become aware that our “e” key has been sticking, and in our rush of typing we haven’t noticed. To test this, first let’s insert a couple of new words into our test file:
echo teeeeeeeeeeeeeeeest >> testfile.txt
echo teeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeest >> testfile.txt
First of all, we are not concerned with words that contain a single e; those are fine. We want words that contain 2 or more e’s in a row. Enter the splat.
grep 'eee*' testfile.txt
teest
green
teeeeeeeeeeeeeeeest
teeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeest
So with this, we also got our word “teest”, which was there before and is probably an early symptom of our sticking e. We also got “green”, which is of course spelled correctly but still has 2 e’s in a row. And we found the lines we were looking for as well.
So… why did we need 3 e’s instead of just 2 to mean “2 or more”? We have to remember that the “*” symbol matches any number of the character it’s attached to, INCLUDING 0. So if you were to use “e*”, for example, it would match anything that had 0 or more e’s, so… anything. “ee*” would match “an e followed by 0 or more further e’s”, so anything that had at least 1 e. Thus we need “eee*” to say “we want at least 2 e’s in a row, and possibly any number of additional e’s after them”.
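If you want to check that reasoning for yourself, here is a quick sketch that counts matching lines with grep -c for each of the three patterns. The four sample words are made up for the occasion and contain 0, 1, 2 and 4 e’s.

```shell
# Hypothetical sample words with 0, 1, 2 and 4 e's.
sample=$(printf '%s\n' 'tst' 'test' 'teest' 'teeeest')

# grep -c prints the number of matching lines.
zero_or_more=$(printf '%s\n' "$sample" | grep -c 'e*')    # every line
one_or_more=$(printf '%s\n' "$sample" | grep -c 'ee*')    # at least 1 e
two_or_more=$(printf '%s\n' "$sample" | grep -c 'eee*')   # at least 2 e's in a row

echo "$zero_or_more $one_or_more $two_or_more"   # prints: 4 3 2
```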
And those are the basics. Tune in for some more advanced regex in the future!