Tuesday, June 16, 2015

Regex


Metacharacters have special meanings in Regex: 


Backslash \
Caret ^
Dollar sign $
Dot .
Pipe |
Question mark ?
Asterisk *
Plus sign +
Parentheses ()
Square brackets []
Curly braces {}

Note: To use any of the above-mentioned characters as a literal in a regular expression, you have to escape it with a backslash. So to match 2+2=4, enter 2\+2=4. If you do not escape +, it keeps its special metacharacter meaning. 
The backslash escapes a special character, which means that the character gets interpreted literally, so \$ means $ rather than its Regex meaning. Likewise, \\ has the literal meaning of \.
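
As a minimal sketch of the escaping rule, the snippet below uses Python's standard re module. Python is an assumption here (the post itself only uses grep), and the sample sentence is invented for illustration.

```python
import re

# Invented sample text containing both "2+2=4" and "22=4".
text = "We know that 2+2=4 and 22=4 is nonsense."

# Unescaped: + means "one or more of the preceding 2", so this pattern
# matches "22=4" and does NOT match the literal text "2+2=4".
unescaped = re.findall(r"2+2=4", text)

# Escaped: \+ is a literal plus sign.
escaped = re.findall(r"2\+2=4", text)

# re.escape builds an escaped pattern from an arbitrary literal string.
auto_escaped = re.escape("2+2=4")

print(unescaped)
print(escaped)
print(re.fullmatch(auto_escaped, "2+2=4") is not None)
```

re.escape is handy when the literal text comes from user input and may contain any of the metacharacters listed above.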

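If you are using grep to find a literal asterisk in a file, put the pattern in single quotes. Without quotes, the shell expands * into the names of the files in the current directory before grep ever sees it, so grep ends up searching for something else entirely. Writing '\*' also works and leaves no doubt about the intent:
$ grep '*' /etc/profile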


Brackets

Square brackets enclose a set of characters, any one of which may match at that position. So to match an a or an e, use [ae]. You may use this in gr[ae]y to match either gray or grey.

A hyphen is used to show a range of characters, so [0-9] matches a single digit between 0 and 9. You may use more than one range, like [0-9a-mA-M]. You may also combine single characters and ranges, like [0-9a-mzA-MZ], which matches 0 to 9, a to m, and A to M, plus z and Z. 

Now we are ready to match common word patterns by using combined sequences of characters in square brackets:

[0-9][0-9][0-9][0-9][0-9] matches any five-digit US ZIP code, and [Bb][Ee][Hh][Nn][Aa][Mm] matches BEHNAM, Behnam, behNAm, etc. 
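
The character-class patterns above can be tried out with a short sketch. It assumes Python's standard re module, and the sample strings are invented. The \b around the ZIP pattern is an addition that is not in the pattern above; it keeps the match from landing inside a longer run of digits.

```python
import re

# One character out of a set: gr[ae]y accepts both spellings.
spellings = re.findall(r"gr[ae]y", "gray, grey, groy")

# Five digit classes in a row, fenced by word boundaries (\b).
zip_codes = re.findall(
    r"\b[0-9][0-9][0-9][0-9][0-9]\b",
    "Send it to 90210, not 1234 or 1234567.",
)

# One two-letter class per position gives a case-insensitive match.
names = re.findall(r"[Bb][Ee][Hh][Nn][Aa][Mm]", "BEHNAM Behnam behNAm Bahman")

print(spellings)  # ['gray', 'grey']
print(zip_codes)  # ['90210']
print(names)      # ['BEHNAM', 'Behnam', 'behNAm']
```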

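Example: To list all files in the current directory whose names start with a, b, c, m, or a letter in the range u to z (tip: use the -d option so that directories are listed by name instead of having their contents dumped to stdout). Here the bracket expression is interpreted by the shell's filename matching rather than by a regex engine, and the trailing * means any sequence of characters:
$ ls -ld [a-cmu-z]*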

Caret

If the pattern within the square brackets starts with ^, any character not listed will be matched (shell filename patterns also accept ! for this). In other words, inserting ^ right after the opening bracket negates the character class, so it matches any single character that is not in the class. As an example, b[^x] matches "be" in "behnam", but it does not match anything in "pub", because there is no character after the "b" in "pub": a negated class still has to match one character. Likewise, [^x-zX-Z] matches any character except those in the range x to z, in either case. 
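
A small sketch of negated classes, assuming Python's standard re module; the third sample string is made up for illustration.

```python
import re

# b followed by any single character that is not x.
in_name = re.search(r"b[^x]", "behnam")

# In "pub" nothing follows the b, so the negated class has nothing to match.
in_pub = re.search(r"b[^x]", "pub")

# Every character that is NOT in the ranges x-z or X-Z.
kept = re.findall(r"[^x-zX-Z]", "xAyBzC XYZ")

print(in_name.group())  # be
print(in_pub)           # None
print(kept)             # ['A', 'B', 'C', ' ']
```

Note that the space survives in the last result: a negated class matches anything outside the listed characters, not just letters.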

For the use of the caret as an anchor, see the Anchors section of this tutorial. Anchors do not match any character. They match a position before, after, or in between characters. 


Dot

Dot matches almost any character: it matches a single character, except line break characters. So we can say dot is the short form of [^\n]. As an example, Behn.m matches Behnam, Behnom, and Behn#m, but not Behnm or Behnaam, and B... matches Beer and Bear but not Bug.  

Example: To get all six-character words starting with b and ending in m simply enter: 
$ grep '\<b....m\>' /usr/share/dict/words
Note: If the file /usr/share/dict/words does not exist, install the words package by issuing: yum install words
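
The dot examples in this section can be checked with the sketch below. It assumes Python's standard re module, which does not support \< and \>, so \b stands in for them. The short word list is a made-up substitute for /usr/share/dict/words.

```python
import re

# re.fullmatch requires the whole string to match the pattern.
candidates = ["Behnam", "Behnom", "Behn#m", "Behnm", "Behnaam"]
matched = [w for w in candidates if re.fullmatch(r"Behn.m", w)]

# By default the dot does not match a line break.
across_newline = re.search(r"a.b", "a\nb")

# Six-character words that start with b and end in m.
sample = "bottom balsam beam bedroom bantam"
six_letter = re.findall(r"\bb....m\b", sample)

print(matched)         # ['Behnam', 'Behnom', 'Behn#m']
print(across_newline)  # None
print(six_letter)      # ['bottom', 'balsam', 'bantam']
```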


Asterisk, Plus and ?


  • Asterisk matches zero or more occurrences of the preceding character (or group). 
  • Plus sign is like asterisk but matches one or more occurrences of the preceding character. 
  • ? is also similar to asterisk but matches zero or one occurrence of the preceding character. It is generally used to make a single character optional, as in colou?r, which matches colour or color.

Example: [a-zA-Z]* matches zero or more letters and tries to match as many letters as possible, up to the end of the word. 

Example: <[A-Za-z][A-Za-z0-9]*> matches an HTML tag with no attributes. <[A-Za-z0-9]+> seems to be easier to write but it matches invalid tags such as <5>
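
To see the three quantifiers side by side, here is a minimal sketch assuming Python's standard re module. The test strings are invented, and the variable names strict and loose are only labels for the two tag patterns from the example above.

```python
import re

# * allows zero occurrences, + insists on at least one.
star = re.fullmatch(r"ab*c", "ac")
plus = re.fullmatch(r"ab+c", "ac")

# ? makes the preceding character optional (zero or one occurrence).
optional = [w for w in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", w)]

# The two tag patterns: the first demands a leading letter.
strict = r"<[A-Za-z][A-Za-z0-9]*>"
loose = r"<[A-Za-z0-9]+>"
html = "<h1> <EM> <5>"
strict_tags = re.findall(strict, html)
loose_tags = re.findall(loose, html)

print(star is not None, plus is not None)  # True False
print(optional)     # ['color', 'colour']
print(strict_tags)  # ['<h1>', '<EM>']
print(loose_tags)   # ['<h1>', '<EM>', '<5>']
```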


Anchors

Anchors do not match any characters; they match a position. 
^ matches at the start of the string and $ matches at the end of the string, so ^Behnam matches Behnam at the beginning of a line and Pournader$ matches Pournader at the end of a line. 
^Behnam$ matches lines that contain only the word Behnam. 
^B matches only the first B in BehBam.

Note: As you saw previously in the Caret section, the caret has two roles. Outside square brackets it matches the beginning of a line; as the first character inside square brackets it negates the set of characters.

Example: to display lines starting with the string "root":
$ grep ^root /etc/passwd

Example: in order to see which accounts have no shell assigned:
$ grep ':$' /etc/passwd

As we said earlier, $ at the end of a Regex matches the end of a line, so Pournader$ matches Pournader at the end of a line and ^$ matches empty lines.
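
The grep examples in this section can be mimicked line by line in a short sketch. It assumes Python's standard re module, and the passwd-style lines are invented for illustration; they are not taken from a real system.

```python
import re

# Hypothetical /etc/passwd-style content, including one empty line.
passwd = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "chroot:x:1001:1001::/home/chroot:/bin/sh\n"
    "nobody:x:99:99:Nobody:/:\n"
    "\n"
    "behnam:x:1000:1000::/home/behnam:/bin/bash\n"
)
lines = passwd.splitlines()

# ^root: only lines that START with root ("chroot" does not qualify).
starts_with_root = [l for l in lines if re.search(r"^root", l)]

# :$ selects lines ending in a colon, i.e. with an empty shell field.
no_shell = [l for l in lines if re.search(r":$", l)]

# bash$ selects lines that END with bash; keep just the account name.
bash_users = [l.split(":")[0] for l in lines if re.search(r"bash$", l)]

# ^$ matches only empty lines; record their positions.
empty_lines = [i for i, l in enumerate(lines) if re.search(r"^$", l)]

print(starts_with_root)
print(no_shell)
print(bash_users)   # ['root', 'behnam']
print(empty_lines)  # [3]
```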

Example: in order to see which accounts have bash as shell:
$ grep 'bash$' /etc/passwd


Word Boundaries 

\< and \> mark word boundaries: \< matches the beginning of a word and \> matches the end of a word. The angle brackets must be escaped; otherwise they have their literal meanings. As an example, \<the\> matches the word "the" itself but not the words "them", "there", "other", and so on. 

\b Matches the empty string at the edge of a word.
\B Matches the empty string provided it's not at the edge of a word.
\< Matches the empty string at the beginning of a word.
\> Matches the empty string at the end of a word.
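
A brief sketch of \b and \B, assuming Python's standard re module. Python's re has no \< and \> (there a backslash before an angle bracket just gives the literal bracket), so only the \b and \B forms are shown. The sample sentences are invented.

```python
import re

text = "the other day they met them there, then the end"

# \bthe\b matches "the" only as a whole word.
whole_words = [m.start() for m in re.finditer(r"\bthe\b", text)]

# \Bthe\B matches "the" only when it is buried inside a longer word.
inside = re.findall(r"\Bthe\B", "other bathe theme")

print(len(whole_words))  # 2 (the first and the last "the")
print(inside)            # ['the'], found inside "other"
```

In "bathe" and "theme" the letters "the" touch one edge of the word, so \B fails on that side and neither word contributes a match.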


Alternation

Alternation is Regex equivalent of "or". As an example Toyota|Honda matches Toyota in "I have a Toyota and a Honda". If the regular expression is applied again, it matches Honda too. We can add as many alternatives as we want: Toyota|Honda|Ford|Subaru.
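
The sketch below tries out alternation, assuming Python's standard re module. It also shows how grouping changes what each alternative covers, which is the precedence point made in this section. The variable names and the short word list are illustrative only.

```python
import re

text = "I have a Toyota and a Honda"

# Applying the regex repeatedly finds each alternative in turn.
brands = re.findall(r"Toyota|Honda", text)

# Without a group, the alternatives are "Toyota" and "Honda tire".
ungrouped = r"Toyota|Honda tire"
# With a group, " tire" applies to both brands.
grouped = r"(Toyota|Honda) tire"

ungrouped_full = re.fullmatch(ungrouped, "Toyota tire")
ungrouped_brand = re.fullmatch(ungrouped, "Toyota")
grouped_full = re.fullmatch(grouped, "Toyota tire")

# be(a|e)r accepts "bear" and "beer" only.
drinks = [w for w in ["bear", "beer", "bar", "bier"] if re.search(r"be(a|e)r", w)]

print(brands)  # ['Toyota', 'Honda']
print(ungrouped_full is None, ungrouped_brand is not None, grouped_full is not None)
print(drinks)  # ['bear', 'beer']
```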

Important Note: Alternation has the lowest precedence of all operators, so "Toyota|Honda tire" matches "Toyota" or "Honda tire". To match "Toyota tire" or "Honda tire", we have to group the alternatives: (Toyota|Honda) tire.

Example (the -E option is needed because plain grep treats unescaped ( and | as literal characters):
$ grep -E 'be(a|e)r' testfile.txt


Repeating a Pattern

To specify a specific amount of repetition, use curly braces:

  • [1-9][0-9]{3} matches a number between 1000 and 9999
  • [1-9][0-9]{2,4} matches a number between 100 and 99999
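  • [0-9]\{5\} matches exactly five digits (the braces are backslash-escaped here because that is how grep's basic syntax writes them; with grep -E and most other tools, write [0-9]{5})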
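
The three repetition patterns above can be checked against boundary values with a minimal sketch, assuming Python's standard re module. Python writes the braces without backslashes, and re.fullmatch is used so that the whole string must fit the pattern.

```python
import re

# Exactly three digits after a non-zero first digit: 1000 to 9999.
four_digit = [s for s in ["999", "1000", "9999", "10000", "0123"]
              if re.fullmatch(r"[1-9][0-9]{3}", s)]

# Two to four digits after a non-zero first digit: 100 to 99999.
three_to_five = [s for s in ["99", "100", "99999", "100000"]
                 if re.fullmatch(r"[1-9][0-9]{2,4}", s)]

# Exactly five digits; a leading zero is allowed here.
five_digits = [s for s in ["1234", "12345", "01234", "123456"]
               if re.fullmatch(r"[0-9]{5}", s)]

print(four_digit)     # ['1000', '9999']
print(three_to_five)  # ['100', '99999']
print(five_digits)    # ['12345', '01234']
```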

Laziness and Greediness

Sometimes a Regex does not behave the way you expect because quantifiers are greedy: they match as much as they can. For example, applying ^F.+: to the string "From: using the :abc" gives the largest possible match, which is "From: using the :", not "From:". The solution is to add ? after the quantifier, which means "be lazy and stop at the first opportunity", so ^F.+?: gives the smallest match, which is "From:".

As another example, the regex <.+> matches <EM>second</EM> in the HTML string "This is my <EM>second</EM> test". Again, to make it lazy, place a question mark after the quantifier: <.+?> matches <EM>.
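
Both greedy and lazy behavior can be observed directly with the strings used in this section. This is a minimal sketch assuming Python's standard re module, which supports the lazy +? form.

```python
import re

line = "From: using the :abc"
greedy = re.search(r"^F.+:", line).group()
lazy = re.search(r"^F.+?:", line).group()

html = "This is my <EM>second</EM> test"
greedy_tag = re.search(r"<.+>", html).group()
lazy_tags = re.findall(r"<.+?>", html)

print(repr(greedy))  # 'From: using the :'
print(repr(lazy))    # 'From:'
print(greedy_tag)    # <EM>second</EM>
print(lazy_tags)     # ['<EM>', '</EM>']
```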



Back Reference

You can use the back reference \1 to match the same text that was matched by the first capturing group (the part of the pattern in parentheses). 

Example: ([xyz])=\1 matches x=x, y=y, and z=z, but not x=y.
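
A short sketch of the back reference, assuming Python's standard re module. The second pattern, which looks for a doubled word, is an extra illustration that is not part of the text above.

```python
import re

# \1 must repeat exactly what the group ([xyz]) captured.
pairs = ["x=x", "y=y", "z=z", "x=y", "a=a"]
same_both_sides = [p for p in pairs if re.fullmatch(r"([xyz])=\1", p)]

# Extra illustration: a word followed by a space and the same word again.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

print(same_both_sides)   # ['x=x', 'y=y', 'z=z']
print(doubled.group(1))  # is
```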


Non-Printable Characters

Use \t to match a tab character (ASCII 0x09), and \n for line feed (0x0A). 

Note: Bear in mind that text files in Microsoft Windows use \r\n to terminate lines. UNIX text files simply use \n
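
As a sketch of how these escapes are used in practice, the snippet below splits on tabs and on either style of line ending. It assumes Python's standard re module, and the sample strings are invented.

```python
import re

# \t splits a tab-separated row into fields.
row = "name\tage\tcity"
fields = re.split(r"\t", row)

# \r?\n accepts both Windows (\r\n) and UNIX (\n) line endings.
windows_text = "first\r\nsecond\r\n"
unix_text = "first\nsecond\n"
windows_lines = re.split(r"\r?\n", windows_text)
unix_lines = re.split(r"\r?\n", unix_text)

print(fields)                       # ['name', 'age', 'city']
print(windows_lines == unix_lines)  # True
```

Making the \r optional is a common way to process files from both systems with a single pattern.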


Shorthand Character Classes

\d matches a single character that is a digit 
\w matches a "word character" (alphanumeric characters plus underscore)
\s matches a white-space character (includes tabs and line breaks)
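
The three shorthand classes can be compared on one sample string. This is a minimal sketch assuming Python's standard re module; the sample text is made up.

```python
import re

text = "Order 66 shipped to user_42\ton 2015-06-16"

# \d+ collects runs of digits.
digits = re.findall(r"\d+", text)

# \w+ collects runs of letters, digits, and underscores.
words = re.findall(r"\w+", text)

# \s+ splits on any run of white space, including the tab.
pieces = re.split(r"\s+", text)

print(digits)  # ['66', '42', '2015', '06', '16']
print(words)
print(pieces)
```

The underscore in user_42 counts as a word character, so \w+ keeps it in one piece, while the hyphens in the date split it into three.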
