2) + => it matches one or more occurrences of the preceding character.
+ -----> ab+c
ac #invalid
abc
abbc
abbbbbc
3) ? => it matches zero or one occurrence of the preceding character.
? ----> ab?c
ac
abc
abbc #invalid
perl, pearl => pea?rl
color, colour => colou?r
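The valid/invalid lists for + and ? above can be checked in Python. This is a minimal sketch that assumes re.fullmatch() is the right test, since it succeeds only when the whole string fits the pattern; the variable names are illustrative only.

```python
import re

# fullmatch() succeeds only if the entire string fits the pattern,
# which mirrors the valid / #invalid lists in the notes.
plus_pattern = re.compile(r"ab+c")      # one or more 'b'
optional_pattern = re.compile(r"ab?c")  # zero or one 'b'

plus_results = {s: bool(plus_pattern.fullmatch(s))
                for s in ["ac", "abc", "abbc", "abbbbbc"]}
optional_results = {s: bool(optional_pattern.fullmatch(s))
                    for s in ["ac", "abc", "abbc"]}

print(plus_results)
print(optional_results)
```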
4) . => it matches any single character (except a newline, unless the re.DOTALL flag is set).
. ------> a.c
agc
a5c
a$c
a c
abcd #invalid
5)[ ] => it matches any single character in the given list.
[xyz…] ----->b[aeiou]d
bad
bed
bid
bod
bud
b8d #invalid
bpd #invalid
6) [^] => it matches any single character other than those in the given list.
[^xyz…] ------> b[^aeiou]d
bad #invalid
bed #invalid
bid #invalid
bod #invalid
bud #invalid
b8d
bpd
7) [-] => it matches any single character in the given range.
x[a-e]y
xay
xby
xcy
xdy
xey
xfy #invalid
xpy #invalid
[0-9] ---> any single digit
[a-z] ---> any one lowercase letter
[A-Z] ---> any one uppercase letter
[a-zA-Z] ---> any one letter
[a-zA-Z0-9_] ---> any one alphanumeric character or underscore
[^0-9] ---> any single non-digit
[^a-z] ---> any one character that is not a lowercase letter
[^A-Z] ---> any one character that is not an uppercase letter
[^a-zA-Z] ---> any one non-letter
[^a-zA-Z0-9_] ---> any one character that is neither alphanumeric nor underscore (special characters)
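The list, negated list, and range forms can be tried against the sample words from the notes. This is a sketch only; matches_fully is a hypothetical helper introduced here, not part of the re module.

```python
import re

def matches_fully(pattern, text):
    """Hypothetical helper: True when the whole of text fits pattern."""
    return re.fullmatch(pattern, text) is not None

words = ["bad", "bed", "b8d", "bpd"]

# [aeiou] accepts one vowel between 'b' and 'd'
vowel_hits = [w for w in words if matches_fully(r"b[aeiou]d", w)]
# [^aeiou] accepts any one character that is NOT a vowel
non_vowel_hits = [w for w in words if matches_fully(r"b[^aeiou]d", w)]
# [a-e] accepts one character in the range a..e
range_hits = [w for w in ["xay", "xey", "xfy", "xpy"]
              if matches_fully(r"x[a-e]y", w)]

print(vowel_hits, non_vowel_hits, range_hits)
```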
8) ( | ) => it matches any one of the strings in the list.
(java|hadoop|python)
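As a small illustration of alternation, the pattern above can be used with findall() to pick out whichever of the listed words occur in a sentence. The sample sentence is made up for this sketch.

```python
import re

courses = re.compile(r"(java|hadoop|python)")

# Sample text invented for illustration
text = "I know python and a little java, but not scala."

# findall() returns the alternative that matched at each position
found = courses.findall(text)
print(found)
```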
9) {m} => it matches exactly m occurrences of the preceding character.
ab{3}c
abbc #invalid
abbbc
abbbbbc #invalid
10) {m,n} => it matches at least m and at most n occurrences of the preceding character.
ab{3,5}c
abbc #invalid
abbbc
abbbbc
abbbbbc
abbbbbbc #invalid
11) {m,} => it matches at least m occurrences of the preceding character, with no upper limit.
ab{3,}c
abbc #invalid
abbbc
abbbbbbbc
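The three counted forms {m}, {m,n} and {m,} can be compared side by side on the same samples. This is a minimal sketch using re.fullmatch(); the list names are illustrative.

```python
import re

# Strings with 2, 3, 4, 5 and 6 occurrences of 'b'
samples = ["abbc", "abbbc", "abbbbc", "abbbbbc", "abbbbbbc"]

exactly_three = [s for s in samples if re.fullmatch(r"ab{3}c", s)]
three_to_five = [s for s in samples if re.fullmatch(r"ab{3,5}c", s)]
three_or_more = [s for s in samples if re.fullmatch(r"ab{3,}c", s)]

print(exactly_three)
print(three_to_five)
print(three_or_more)
```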
12) ^ => start of the line
^perl
^[abc]
^[^abc]
13) $ => end of the line
perl$
[0-9]$
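A short sketch of the two anchors, using re.search() on a few made-up one-line strings. Note that without the re.M flag, ^ and $ refer to the start and end of the whole string.

```python
import re

# Sample strings invented for illustration
lines = ["perl is fun", "I like perl", "room 42", "42 rooms"]

starts_with_perl = [s for s in lines if re.search(r"^perl", s)]
ends_with_perl = [s for s in lines if re.search(r"perl$", s)]
ends_with_digit = [s for s in lines if re.search(r"[0-9]$", s)]

print(starts_with_perl, ends_with_perl, ends_with_digit)
```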
14) \d or [0-9] => any single digit.
[0-9][0-9][0-9][0-9] or [0-9]{4} or \d\d\d\d or \d{4}
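15) \D or [^0-9] => any single non-digit.
16) \w or [a-zA-Z0-9_] => any alphanumeric character or underscore.
17) \W or [^a-zA-Z0-9_] => any character that is neither alphanumeric nor underscore (special characters).
18) \s => any whitespace character: ' ', '\t', '\n', '\r', '\f', '\v'.
19) \b => word boundary.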
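The shorthand classes and the word boundary can be seen together in one small example. The sample text is an assumption made for this sketch.

```python
import re

text = "cat concat 2024 cats"

# \b keeps 'cat' from matching inside 'concat' or 'cats'
whole_word = re.findall(r"\bcat\b", text)
# \d{4} matches a run of exactly four digits
years = re.findall(r"\d{4}", text)
# \w+ matches runs of letters, digits and underscores
words = re.findall(r"\w+", text)

print(whole_word, years, words)
```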
- To avoid confusion when working with regular expressions, write patterns as raw strings, r'expression', so that backslashes reach the re module unchanged.
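To see why raw strings matter, compare the same pattern written with and without the r prefix. In an ordinary string literal, Python itself turns \b into a backspace character before the re module ever sees it. A minimal sketch:

```python
import re

plain = "\bcat\b"   # Python converts each \b to a backspace character
raw = r"\bcat\b"    # backslashes are kept, so re sees the word boundary

plain_length = len(plain)   # backspace, c, a, t, backspace
raw_length = len(raw)       # \, b, c, a, t, \, b

plain_found = re.search(plain, "a cat") is not None
raw_found = re.search(raw, "a cat") is not None

print(plain_length, raw_length, plain_found, raw_found)
```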
Functions in re module:
match( ): Match a regular expression pattern to the beginning of a string.
re.match(pattern, string, flags=0)
search( ): Search a string for the presence of a pattern.
re.search(pattern, string, flags=0)
sub( ): Substitute occurrences of a pattern found in a string.
re.sub(pattern, repl, string, count=0, flags=0)
subn( ): Same as sub, but also return the number of substitutions made.
re.subn(pattern, repl, string, count=0, flags=0)
split( ): Split a string by the occurrences of a pattern.
re.split(pattern, string, maxsplit=0, flags=0)
findall( ): Find all occurrences of a pattern in a string.
re.findall(pattern, string, flags=0)
finditer( ): Return an iterator yielding a match object for each match.
re.finditer(pattern, string, flags=0)
compile( ): Compile a pattern into a regular expression (Pattern) object.
re.compile(pattern, flags=0)
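Of the functions listed, subn( ) is the one that is easiest to overlook. It returns a tuple of the new string and the number of replacements. A short sketch, with sample dates invented for illustration:

```python
import re

text = "2023-01-15, 2024-12-31"

# subn() returns (new_string, number_of_substitutions)
new_text, n_subs = re.subn(r"\d{4}", "YYYY", text)

print(new_text)
print(n_subs)
```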
Matching Versus Searching
match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string.
Search and Replace
One of the most important re methods that use regular expressions is sub.
re.sub(pattern, repl, string, count=0)
This method replaces all occurrences of the RE pattern in string with repl, unless count is given, in which case at most count occurrences are replaced. It returns the modified string.
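The effect of limiting the number of replacements can be sketched as follows. In current Python the limit is passed through the count argument of re.sub; the sample string is made up.

```python
import re

text = "a-b-c-d"

# With no limit, every occurrence is replaced
all_replaced = re.sub(r"-", "+", text)
# With count=2, only the first two occurrences are replaced
first_two = re.sub(r"-", "+", text, count=2)

print(all_replaced)
print(first_two)
```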
greedy matching:
The notion that the "+" and "*" characters in a regular expression expand outward to match the largest possible string.
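Greedy matching is easiest to see next to its non-greedy counterpart (a quantifier followed by ?). A minimal sketch on a made-up fragment of markup:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ expands as far as possible, up to the LAST '>'
greedy = re.findall(r"<.+>", html)
# Non-greedy: .+? stops at the FIRST '>' that completes a match
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```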
wild card:
A special character that matches any character.
In regular expressions the wild card character is the period character.
character Classes:
In many cases, rather than matching one particular character we want to match any one of a set of characters.
This can be achieved by using a character class—one or more characters enclosed in square brackets.
Symbol | Meaning |
. | Matches any character except newline, any character at all with the re.DOTALL flag, or inside a character class matches a literal period. |
\d | Matches a Unicode digit, or [0-9] with the re.ASCII flag. |
\D | Matches a Unicode nondigit, or [^0-9] with the re.ASCII flag. |
\s | Matches a Unicode whitespace, or [ \t\n\r\f\v] with the re.ASCII flag. |
\S | Matches a Unicode non-whitespace, or [^ \t\n\r\f\v] with the re.ASCII flag. |
\w | Matches a Unicode "word" character, or [a-zA-Z0-9_] with the re.ASCII flag. |
\W | Matches a Unicode non-"word" character, or [^a-zA-Z0-9_] with the re.ASCII flag. |
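The difference between the Unicode and re.ASCII behaviour of \d can be shown with a string that mixes ASCII digits and Arabic-Indic digits. This is a sketch; the sample text is an assumption, written with \u escapes so it does not depend on the file encoding.

```python
import re

# "42" in ASCII digits, then the same number in Arabic-Indic digits
text = "42 \u0664\u0662"

# Default (Unicode) mode: \d matches digits from any script
unicode_digits = re.findall(r"\d+", text)
# re.ASCII restricts \d to [0-9]
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)

print(unicode_digits)
print(ascii_digits)
```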
Quantifiers
A quantifier has the form {m,n} where m and n are the minimum and maximum times the expression to which the quantifier applies must match.
Syntax | Meaning |
e? or e{0,1} | Greedily match zero occurrences or one occurrence of expression e. |
e?? or e{0,1}? | Nongreedily match zero occurrences or one occurrence of expression e. |
e+ or e{1,} | Greedily match one or more occurrences of expression e. |
e+? or e{1,}? | Nongreedily match one or more occurrences of expression e. |
e* or e{0,} | Greedily match zero or more occurrences of expression e. |
e*? or e{0,}? | Nongreedily match zero or more occurrences of expression e. |
e{m} | Match exactly m occurrences of expression e. |
e{m,} | Greedily match at least m occurrences of expression e. |
e{m,}? | Nongreedily match at least m occurrences of expression e. |
e{,n} | Greedily match at most n occurrences of expression e. |
e{,n}? | Nongreedily match at most n occurrences of expression e. |
e{m,n} | Greedily match at least m and at most n occurrences of expression e. |
e{m,n}? | Nongreedily match at least m and at most n occurrences of expression e. |
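A few rows of the quantifier table can be checked directly. The sketch below applies greedy and non-greedy counted quantifiers to the same run of five a's.

```python
import re

text = "aaaaa"

# Greedy {2,4} takes as many as allowed: four
greedy_range = re.match(r"a{2,4}", text).group()
# Non-greedy {2,4}? takes as few as allowed: two
lazy_range = re.match(r"a{2,4}?", text).group()
# {,2} means at most two (zero is also permitted)
at_most_two = re.match(r"a{,2}", text).group()
# ?? prefers zero occurrences, so it matches the empty string here
optional_lazy = re.match(r"a??", text).group()

print(greedy_range, lazy_range, at_most_two, repr(optional_lazy))
```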
Regular Expression Basics |
. | Any character except newline |
a | The character a |
ab | The string ab |
a|b | a or b |
a* | 0 or more a's |
\ | Escapes a special character |
Regular Expression Character Classes |
[ab-d] | One character of: a, b, c, d |
[^ab-d] | One character except: a, b, c, d |
[\b] | Backspace character |
\d | One digit |
\D | One non-digit |
\s | One whitespace |
\S | One non-whitespace |
\w | One word character |
\W | One non-word character |
Regular Expression Flags |
i | Ignore case |
m | ^ and $ match start and end of line |
s | . matches newline as well |
x | Allow spaces and comments |
L | Locale character classes |
u | Unicode character classes |
(?iLmsux) | Set flags within regex |
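In Python the single-letter flags above correspond to re.I, re.M, re.S and re.X. The following sketch shows each on a made-up two-line string; the phone-number pattern is purely illustrative.

```python
import re

text = "First line\nsecond LINE"

# i: ignore case
ignore_case = re.findall(r"line", text, flags=re.I)
# The same flag set inside the pattern
inline_flag = re.findall(r"(?i)line", text)
# m: ^ matches at the start of every line
line_starts = re.findall(r"^\w+", text, flags=re.M)
# s: . also matches the newline
dot_without_s = re.search(r"First.*LINE", text) is not None
dot_with_s = re.search(r"First.*LINE", text, flags=re.S) is not None

# x: whitespace and comments inside the pattern are ignored
phone = re.compile(r"""
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
""", re.X)
phone_match = phone.search("call 555-1234").group()

print(ignore_case, inline_flag, line_starts, dot_without_s, dot_with_s, phone_match)
```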
Regular Expression Quantifiers |
* | 0 or more |
+ | 1 or more |
? | 0 or 1 |
{2} | Exactly 2 |
{2,5} | Between 2 and 5 |
{2,} | 2 or more |
{,5} | Up to 5 |
Regular Expression Assertions |
^ | Start of string |
\A | Start of string, ignores m flag |
$ | End of string |
\Z | End of string, ignores m flag |
\b | Word boundary |
\B | Non-word boundary |
(?=...) | Positive lookahead |
(?!...) | Negative lookahead |
(?<=...) | Positive lookbehind |
(?<!...) | Negative lookbehind |
(?(id)yes|no) | Conditional: match yes if group id matched, otherwise no |
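Lookahead and lookbehind assertions check the surrounding text without including it in the match. A minimal sketch on a made-up price list:

```python
import re

prices = "apple $30, pear 45, plum $7"

# Positive lookbehind: digits that come right after a '$'
after_dollar = re.findall(r"(?<=\$)\d+", prices)
# Negative lookbehind: whole numbers NOT preceded by a '$'
not_after_dollar = re.findall(r"\b(?<!\$)\d+", prices)
# Positive lookahead: words followed by ' $'
priced_in_dollars = re.findall(r"\w+(?= \$)", prices)

print(after_dollar, not_after_dollar, priced_in_dollars)
```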
Regular Expression Special Characters |
\n | Newline |
\r | Carriage return |
\t | Tab |
\YYY | Octal character YYY |
\xYY | Hexadecimal character YY |
Regular Expression Groups |
(...) | Capturing group |
(?P<Y>...) | Capturing group named Y |
(?:...) | Non-capturing group |
\Y | Match the Y'th captured group |
(?P=Y) | Match the named group Y |
(?#...) | Comment |
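The group syntax above can be tried as follows. This sketch uses illustrative group names (month, day, quote) that are not part of the notes.

```python
import re

# Named capturing groups can be read by name or by number
m = re.search(r"(?P<month>[a-zA-Z]+) (?P<day>\d+)", "June 24")
month = m.group("month")
day = m.group(2)
as_dict = m.groupdict()

# \1 matches the same text that group 1 captured (here: a repeated word)
doubled = re.findall(r"\b(\w+) \1\b", "this is is a test test")

# (?P=quote) matches the same text as the named group 'quote'
quoted = re.search(r"(?P<quote>['\"]).*?(?P=quote)", "say 'hi' now").group()

print(month, day, as_dict, doubled, quoted)
```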
Regular Expression Replacement |
\g<0> | Insert entire match |
\g<Y> | Insert match Y (name or number) |
\Y | Insert group numbered Y |
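The replacement references can be sketched with re.sub(). The \g<...> form is useful when a group reference is followed directly by a digit, because \20 would be read as group 20. Group names here are illustrative.

```python
import re

pattern = re.compile(r"(?P<month>[a-zA-Z]+) (?P<day>\d+)")
text = "June 24,August 9"

# \g<name> inserts a named group
swapped = pattern.sub(r"\g<day> \g<month>", text)
# \g<0> inserts the entire match
bracketed = pattern.sub(r"[\g<0>]", text)
# \g<2>0 is group 2 followed by a literal '0'
day_times_ten = pattern.sub(r"\g<2>0", "June 24")

print(swapped)
print(bracketed)
print(day_times_ten)
```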
Splitting Strings on Any of Multiple Delimiters
The split() method of string objects is meant for very simple cases and does not allow for multiple delimiters.
When you need more flexibility, use the re.split() function:
import re
message = "hai,siva krishna;hw r u"
# Split on a comma, semicolon or whitespace, followed by any extra whitespace
print(re.split(r'[,;\s]\s*', message))
# Output: ['hai', 'siva', 'krishna', 'hw', 'r', 'u']
import re
# Without groups, findall() returns each full match as a string
regex = r"[a-zA-Z]+ \d+"
matches = re.findall(regex, "June 24,August 9,Oct 13,Dec")
for match in matches:
    print("Full match: ", match)
import re
# With one capturing group, findall() returns only the captured text (the month)
regex = r"([a-zA-Z]+) \d+"
matches = re.findall(regex, "June 24,August 9,Oct 13,Dec")
for match in matches:
    print("Group 1: ", match)
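import re
regex = r"[a-zA-Z]+ \d+"
# findall() returns a list of strings; finditer() returns an iterator of match objects
matches1 = re.findall(regex, "June 24,August 9,Oct 13,Dec")
matches2 = re.finditer(regex, "June 24,August 9,Oct 13,Dec")
print(type(matches1))
print(type(matches2))
for match in matches1:
    print(match)
for match in matches2:
    # start() and end() give the position of each match in the string
    print(match.start(), match.end())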
import re
regex=r"([a-zA-Z]+) (\d+)"
print(re.sub(regex,r"\2 of \1","June 24,August 9,Oct 13,Dec"))
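import re
regex = r"([a-zA-Z]+) (\d+)"
# Swap month and day: "June 24" becomes "24 of June"
x = re.sub(regex, r"\2 of \1", "June 24,August 9,Oct 13,Dec")
# Raw string, so that \d reaches the re module unchanged
regex1 = r"\d+ [a-zA-Z]+ ([a-zA-Z]+)"
matches = re.findall(regex1, x)
for match in matches:
    print(match)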
import re
regex = re.compile(r"(\w+) World")
# search() returns a match object, or None if nothing is found
result = regex.search("Hello World is the easiest")
print(result)
if result:
    print(result.start(), result.end())
import re
regex = re.compile(r"(\w+) World")
# findall() returns the captured word before each " World"
x = regex.findall("Hello World.Hai World")
for result in x:
    print(result)
import re
regex=re.compile(r"(\w+) World")
print(regex.sub(r"\1 Earth","Hello World"))
import re
line = "Cats are smarter than dogs"
# (.*) is greedy; (.*?) is non-greedy and stops at the first following space
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M|re.I)
if matchObj:
    print("matchObj.group( ) : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
import re
line = "Cats are smarter than dogs"
# Same pattern as above, this time with search()
searchObj = re.search(r'(.*) are (.*?) .*', line, re.M|re.I)
if searchObj:
    print("searchObj.group( ) : ", searchObj.group())
    print("searchObj.group(1) : ", searchObj.group(1))
    print("searchObj.group(2) : ", searchObj.group(2))
else:
    print("Nothing found!!")
import re
line = "Cats are smarter than dogs"
# match() only looks at the start of the string, so 'dogs' is not found
matchObj = re.match(r'dogs', line, re.M|re.I)
if matchObj:
    print("match --> matchObj.group( ) : ", matchObj.group())
else:
    print("No match!!")
# search() scans the whole string, so 'dogs' is found
searchObj = re.search(r'dogs', line, re.M|re.I)
if searchObj:
    print("search --> searchObj.group( ) : ", searchObj.group())
else: