Python Tutorial by Siva Krishna: Regular Expressions

Regular Expressions:

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.

Regular expressions are used to:

Extract the required data from the given data.

Perform data validation.

Define URL patterns in web applications.

In regular expressions we use special characters to define patterns.

After defining a pattern, we can extract the data that matches it from the given data by using the predefined functions of the re module.

re is a built-in module of Python.
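As a minimal sketch of this workflow (the sample text and variable names are made up for illustration), a pattern is defined first and then passed to a function of the re module:

```python
import re

# Hypothetical sample data: we want to pull out the 4-digit years.
text = "Python 2 was released in 2000 and Python 3 in 2008."

# [0-9]{4} means: exactly four digits in a row.
years = re.findall(r"[0-9]{4}", text)
print(years)  # ['2000', '2008']
```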

special characters:

1) * => it matches zero or more occurrences of the preceding character.

* ---->ab*c

abc

abbc

abbbbbc

2) + => it matches one or more occurrences of preceding character.

+ -----> ab+c

ac #invalid

abc

abbc

abbbbbc

3) ? => it matches zero or one occurrence of the preceding character.

? ----> ab?c

abc

abbc #invalid

perl, pearl => pea?rl

color, colour => colou?r

4) . => it matches any single character.

. ------> a.c

agc

a5c

a$c

a c

abcd #invalid

5)[ ] => it matches any single character in the given list.

[xyz…] ----->b[aeiou]d

bad

bed

bid

bod

bud

b8d #invalid

bpd #invalid

6) [^] => it matches any single character other than those in the given list.

[^xyz…] ------> b[^aeiou]d

bad #invalid

bed #invalid

bid #invalid

bod #invalid

bud #invalid

b8d

bpd

7) [-] => it matches any single character in the given range.

x[a-e]y

xay

xby

xcy

xdy

xey

xfy #invalid

xpy #invalid

[0-9] ---> any single digit

[a-z] ---> any one lowercase letter

[A-Z] ---> any one uppercase letter

[a-zA-Z] ---> any one letter

[a-zA-Z0-9_] ---> any one alphanumeric character or underscore

[^0-9] ---> any single non-digit

[^a-z] ---> any one character that is not a lowercase letter

[^A-Z] ---> any one character that is not an uppercase letter

[^a-zA-Z] ---> any one non-letter

[^a-zA-Z0-9_] ---> any one character that is not alphanumeric or underscore (special characters)

8) ( | ) => it matches any one of the strings in the list.

(java|hadoop|python)

9) {m} => it matches exactly m occurrences of the preceding character.

ab{3}c

abbc #invalid

abbbc

abbbbbc #invalid

10) {m,n} => it matches a minimum of m and a maximum of n occurrences of the preceding character.

ab{3,5}c

abbc #invalid

abbbc

abbbbc

abbbbbc

abbbbbbc #invalid

11) {m,} => it matches a minimum of m occurrences of the preceding character, with no upper limit.

ab{3,}c

abbc #invalid

abbbc

abbbbbbbc

12) ^ => start of the line

^perl

^[abc]

^[^abc]

13) $ => end of the line

perl$

[0-9]$

14) \d or [0-9] => any single digit.

[0-9][0-9][0-9][0-9] or [0-9]{4} or \d\d\d\d or \d{4}

15) \D or [^0-9] => any single non digit

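16) \w or [a-zA-Z0-9_] => any single alphanumeric character or underscore

17) \W or [^a-zA-Z0-9_] => any single character other than alphanumeric or underscore (special characters)

18) \s => any single whitespace character, such as ' ', '\t', '\n'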

19) \b =>word boundary
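The examples in the list above can be checked with a short script. This is only a sketch: it assumes that "#invalid" means the whole string does not fit the pattern, so it uses re.fullmatch, and the test strings are taken from the list or made up for illustration.

```python
import re

# (pattern, a string that should match, a string that should not)
cases = [
    (r"ab*c", "ac", "adc"),
    (r"ab+c", "abbc", "ac"),
    (r"ab?c", "abc", "abbc"),
    (r"a.c", "a5c", "abcd"),
    (r"b[aeiou]d", "bed", "b8d"),
    (r"b[^aeiou]d", "bpd", "bad"),
    (r"x[a-e]y", "xcy", "xfy"),
    (r"(java|hadoop|python)", "hadoop", "perl"),
    (r"ab{3}c", "abbbc", "abbc"),
    (r"ab{3,5}c", "abbbbc", "abbbbbbc"),
    (r"ab{3,}c", "abbbbbbbc", "abbc"),
    (r"\d{4}", "2017", "20a7"),
]

results = []
for pattern, good, bad in cases:
    # fullmatch succeeds only if the whole string fits the pattern
    accepted = re.fullmatch(pattern, good) is not None
    rejected = re.fullmatch(pattern, bad) is None
    results.append((pattern, accepted, rejected))
    print(pattern, accepted, rejected)
```

Every line should print True True.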

To avoid confusion between regular expression escapes and Python's own backslash escapes, write patterns as raw strings: r'expression'
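A small sketch of why raw strings matter, using \b as the example (the sample text is made up). In a normal string Python turns "\b" into a backspace character before re ever sees it:

```python
import re

# In a normal string, "\b" is one character: backspace.
normal_len = len("\b")
# In a raw string the backslash is kept, so re receives \b (word boundary).
raw_len = len(r"\b")
print(normal_len, raw_len)  # 1 2

text = "perl is not pearl"
with_raw = re.findall(r"\bperl\b", text)
without_raw = re.findall("\bperl", text)  # looks for backspace + "perl"
print(with_raw)     # ['perl']
print(without_raw)  # []
```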

Functions in re module:

match( ): Match a regular expression pattern at the beginning of a string.

re.match(pattern, string, flags=0)

search( ): Search a string for the presence of a pattern.

re.search(pattern, string, flags=0)

sub( ): Substitute occurrences of a pattern found in a string.

re.sub(pattern, repl, string, count=0, flags=0)

subn( ): Same as sub, but also returns the number of substitutions made.

re.subn(pattern, repl, string, count=0, flags=0)

split( ): Split a string by the occurrences of a pattern.

re.split(pattern, string)

findall( ): Find all occurrences of a pattern in a string.

re.findall(pattern, string)

finditer( ): Return an iterator yielding a match object for each match.

re.finditer(pattern, string)

compile( ): Compile a pattern into a regular expression object.

re.compile(pattern)
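A brief sketch of three of the functions listed above that are easy to overlook: subn, split and compile. The sample string is made up for illustration.

```python
import re

text = "one1two22three333"

# subn returns a tuple: (new string, number of substitutions made)
new_text, count = re.subn(r"\d+", "-", text)
print(new_text, count)  # one-two-three- 3

# A compiled pattern offers the same functions as methods
digits = re.compile(r"\d+")
found = digits.findall(text)
parts = digits.split(text)
print(found)  # ['1', '22', '333']
print(parts)  # ['one', 'two', 'three', '']
```

Note the empty string at the end of the split result: the text ends with a match, so there is nothing after the last delimiter.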

Matching Versus Searching

match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string.

Search and Replace

One of the most important functions in the re module is sub.

re.sub(pattern, repl, string, count=0, flags=0)

This function replaces occurrences of the RE pattern in string with repl. It replaces all occurrences unless count is given, in which case at most count replacements are made. It returns the modified string.
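A minimal sketch of the effect of count, using a made-up phone number:

```python
import re

phone = "2004-959-559"

# Default: every occurrence is replaced
all_removed = re.sub(r"-", "", phone)
# count=1: only the first occurrence is replaced
first_removed = re.sub(r"-", "", phone, count=1)

print(all_removed)    # 2004959559
print(first_removed)  # 2004959-559
```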

greedy matching:

The "+" and "*" characters in a regular expression are greedy: they expand outward to match the longest possible string. Adding "?" after them (as in "+?" or "*?") makes them match the shortest possible string instead.
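A short sketch of the difference, assuming a made-up piece of HTML-like text as input:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs as far as it can, up to the LAST ">"
greedy = re.findall(r"<.+>", html)
# Non-greedy: .+? stops at the FIRST ">" it can
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```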

wild card:

A special character that matches any character.

In regular expressions the wild card character is the period character.
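A small sketch of the wild card in action (sample strings are made up). By default the period does not match a newline; the re.DOTALL flag changes that, and a backslash turns the period into a literal dot:

```python
import re

text = "abc a c a\nc"

default = re.findall(r"a.c", text)
dotall = re.findall(r"a.c", text, re.DOTALL)
literal = re.findall(r"a\.c", "abc a.c")

print(default)  # ['abc', 'a c']
print(dotall)   # ['abc', 'a c', 'a\nc']
print(literal)  # ['a.c']
```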

character Classes:

In many cases, rather than matching one particular character we want to match any one of a set of characters.

This can be achieved by using a character class: one or more characters enclosed in square brackets.

Symbol	Meaning
.	Matches any character except newline, any character at all with the `re.DOTALL` flag, or inside a character class matches a literal period.
\d	Matches a Unicode digit, or [0-9] with the `re.ASCII` flag.
\D	Matches a Unicode nondigit, or [^0-9] with the `re.ASCII` flag.
\s	Matches a Unicode whitespace, or [ \t\n\r\f\v] with the `re.ASCII` flag.
\S	Matches a Unicode non-whitespace, or [^ \t\n\r\f\v] with the `re.ASCII` flag.
\w	Matches a Unicode "word" character, or [a-zA-Z0-9_] with the `re.ASCII` flag.
\W	Matches a Unicode non-"word" character, or [^a-zA-Z0-9_] with the `re.ASCII` flag.
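The effect of the re.ASCII flag mentioned in the table can be sketched as follows. The sample text is made up; it contains the Arabic-Indic digits U+0664 and U+0662, written as escapes so the code stays plain ASCII:

```python
import re

# "42" in ASCII digits, followed by "42" in Arabic-Indic digits
text = "room 42, \u0664\u0662"

# By default \d matches any Unicode digit
unicode_digits = re.findall(r"\d+", text)
# With re.ASCII, \d is limited to [0-9]
ascii_digits = re.findall(r"\d+", text, re.ASCII)

print(len(unicode_digits))  # 2
print(ascii_digits)         # ['42']
```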

Quantifiers

A quantifier has the form {m,n} where m and n are the minimum and maximum times the expression to which the quantifier applies must match.

Syntax	Meaning
e? or e{0,1}	Greedily match zero occurrences or one occurrence of expression `e`.
e?? or e{0,1}?	Nongreedily match zero occurrences or one occurrence of expression `e`.
e+ or e{1,}	Greedily match one or more occurrences of expression `e`.
e+? or e{1,}?	Nongreedily match one or more occurrences of expression `e`.
e* or e{0,}	Greedily match zero or more occurrences of expression `e`.
e*? or e{0,}?	Nongreedily match zero or more occurrences of expression `e`.
e{m}	Match exactly `m` occurrences of expression `e`.
e{m,}	Greedily match at least `m` occurrences of expression `e`.
e{m,}?	Nongreedily match at least `m` occurrences of expression `e`.
e{,n}	Greedily match at most `n` occurrences of expression `e`.
e{,n}?	Nongreedily match at most `n` occurrences of expression `e`.
e{m,n}	Greedily match at least m and at most `n` occurrences of expression `e`.
e{m,n}?	Nongreedily match at least `m` and at most `n` occurrences of expression `e`.
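A few rows of the table above, tried on a made-up string of five a's. This is only a sketch; re.match is used so that every attempt starts at the beginning of the string:

```python
import re

text = "aaaaa"

greedy_range = re.match(r"a{2,4}", text).group()   # takes as many as allowed
lazy_range = re.match(r"a{2,4}?", text).group()    # takes as few as allowed
at_most = re.match(r"a{,3}", text).group()         # zero to three
exact = re.match(r"a{3}", text).group()            # exactly three
lazy_optional = re.match(r"a??", text).group()     # prefers zero

print(greedy_range)         # aaaa
print(lazy_range)           # aa
print(at_most)              # aaa
print(exact)                # aaa
print(repr(lazy_optional))  # ''
```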

Regular Expression Basics
.	Any character except newline
a	The character a
ab	The string ab
a|b	a or b
a*	0 or more a's
\	Escapes a special character

Regular Expression Character Classes
[ab-d]	One character of: a, b, c, d
[^ab-d]	One character except: a, b, c, d
[\b]	Backspace character
\d	One digit
\D	One non-digit
\s	One whitespace
\S	One non-whitespace
\w	One word character
\W	One non-word character

Regular Expression Flags
i	Ignore case
m	^ and $ match start and end of line
s	. matches newline as well
x	Allow spaces and comments
L	Locale character classes
u	Unicode character classes
(?iLmsux)	Set flags within regex
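In Python the flags are passed as constants such as re.I, re.M and re.X, and can be combined with "|". The following sketch uses made-up sample text to show a few of them:

```python
import re

text = "Perl\npython\nPYTHON rocks"

# Without flags: ^ matches only at the very start, and case matters
plain = re.findall(r"^python", text)
# re.I ignores case, re.M lets ^ match at the start of every line
flagged = re.findall(r"^python", text, re.I | re.M)
# (?i) sets the ignore-case flag inside the pattern itself
inline = re.findall(r"(?i)perl", text)

# re.X allows spaces and comments inside the pattern
date = re.compile(r"""
    \d{2}   # day
    -       # separator
    \d{2}   # month
""", re.X)
is_date = date.fullmatch("24-04") is not None

print(plain)    # []
print(flagged)  # ['python', 'PYTHON']
print(inline)   # ['Perl']
print(is_date)  # True
```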

Regular Expression Quantifiers
*	0 or more
+	1 or more
?	0 or 1
{2}	Exactly 2
{2,5}	Between 2 and 5
{2,}	2 or more
{,5}	Up to 5

Regular Expression Assertions
^	Start of string
\A	Start of string, ignores m flag
$	End of string
\Z	End of string, ignores m flag
\b	Word boundary
\B	Non-word boundary
(?=...)	Positive lookahead
(?!...)	Negative lookahead
(?<=...)	Positive lookbehind
(?<!...)	Negative lookbehind
(?()|)	Conditional
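Assertions match a position rather than characters, so they do not become part of the result. A minimal sketch with made-up sample text:

```python
import re

text = "price: $40, discount: 15%, tax: $3"

# Positive lookbehind: digits that come right after a "$"
dollars = re.findall(r"(?<=\$)\d+", text)
# Positive lookahead: digits that are followed by "%"
percents = re.findall(r"\d+(?=%)", text)
# Negative lookahead: whole numbers NOT followed by "%"
not_percents = re.findall(r"\b\d+\b(?!%)", text)

# With the m flag, ^ matches at every line start but \A does not
line_starts = re.findall(r"^\w+", "one\ntwo", re.M)
string_start = re.findall(r"\A\w+", "one\ntwo", re.M)

print(dollars)       # ['40', '3']
print(percents)      # ['15']
print(not_percents)  # ['40', '3']
print(line_starts)   # ['one', 'two']
print(string_start)  # ['one']
```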

Regular Expression Special Characters
\n	Newline
\r	Carriage return
\t	Tab
\YYY	Octal character YYY
\xYY	Hexadecimal character YY

Regular Expression Groups
(...)	Capturing group
(?P<Y>...)	Capturing group named Y
(?:...)	Non-capturing group
\Y	Match the Y'th captured group
(?P=Y)	Match the named group Y
(?#...)	Comment
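The group syntax above can be sketched as follows. The group names month and day and the sample strings are made up for illustration:

```python
import re

# Named capturing groups
m = re.search(r"(?P<month>[a-zA-Z]+) (?P<day>\d+)", "Due on June 24")
month = m.group("month")
day = m.group("day")

# Non-capturing group: used for alternation, but not returned by findall
names = re.findall(r"(?:Mr|Ms)\. (\w+)", "Mr. Rao and Ms. Devi")

# Backreference \1: the same word twice in a row
repeated = re.search(r"\b(\w+) \1\b", "this is is a test").group()

print(month, day)  # June 24
print(names)       # ['Rao', 'Devi']
print(repeated)    # is is
```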

Regular Expression Replacement
\g<0>	Insert entire match
\g<Y>	Insert match Y (name or number)
\Y	Insert group numbered Y
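These replacement forms are used in the repl argument of re.sub. A short sketch, again with made-up group names and sample strings:

```python
import re

# \g<name> inserts a named group
swapped = re.sub(r"(?P<month>[a-zA-Z]+) (?P<day>\d+)",
                 r"\g<day> \g<month>", "June 24")

# \g<0> inserts the entire match
bracketed = re.sub(r"\d+", r"[\g<0>]", "June 24,Oct 13")

# \g<1>0 means group 1 followed by a literal 0;
# writing \10 would be read as group number 10
times_ten = re.sub(r"(\d)", r"\g<1>0", "a1b2")

print(swapped)    # 24 June
print(bracketed)  # June [24],Oct [13]
print(times_ten)  # a10b20
```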

Splitting Strings on Any of Multiple Delimiters

The split() method of string objects is really meant for very simple cases, and does not allow for multiple delimiters.

In cases when you need a bit more flexibility, use the re.split() function:

import re

message = "hai,siva krishna;hw r u"

# Split on a comma, semicolon or whitespace, plus any whitespace after it
print(re.split(r'[,;\s]\s*', message))

# Output: ['hai', 'siva', 'krishna', 'hw', 'r', 'u']

import re

regex = r"[a-zA-Z]+ \d+"
matches = re.findall(regex, "June 24,August 9,Oct 13,Dec")
for match in matches:
    # Without groups, findall returns each full match as a string
    print("Full match: ", match)

import re

regex = r"([a-zA-Z]+) \d+"
matches = re.findall(regex, "June 24,August 9,Oct 13,Dec")
for match in matches:
    # With one capturing group, findall returns only the group's text
    print("Group 1: ", match)

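import re

regex = r"[a-zA-Z]+ \d+"
matches1 = re.findall(regex, "June 24,August 9,Oct 13,Dec")
matches2 = re.finditer(regex, "June 24,August 9,Oct 13,Dec")

print(type(matches1))  # findall returns a list of strings
print(type(matches2))  # finditer returns an iterator of match objects

for match in matches1:
    print(match)

for match in matches2:
    # Each match object knows where the match starts and ends
    print(match.start(), match.end())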

import re

regex=r"([a-zA-Z]+) (\d+)"

print(re.sub(regex,r"\2 of \1","June 24,August 9,Oct 13,Dec"))

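import re

regex = r"([a-zA-Z]+) (\d+)"
# Turn "June 24" into "24 of June", and so on
x = re.sub(regex, r"\2 of \1", "June 24,August 9,Oct 13,Dec")

# Raw string, so that \d reaches the re module unchanged
regex1 = r"\d+ [a-zA-Z]+ ([a-zA-Z]+)"
matches = re.findall(regex1, x)
for match in matches:
    print(match)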

import re

regex = re.compile(r"(\w+) World")
result = regex.search("Hello World is the easiest")
print(result)
if result:
    print(result.start(), result.end())

import re

regex = re.compile(r"(\w+) World")
x = regex.findall("Hello World.Hai World")
for result in x:
    print(result)

import re

regex=re.compile(r"(\w+) World")

print(regex.sub(r"\1 Earth","Hello World"))

import re

line = "Cats are smarter than dogs"

# re.M and re.I are combined with |
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)
if matchObj:
    print("matchObj.group( ) : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")

import re

line = "Cats are smarter than dogs"

searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)
if searchObj:
    print("searchObj.group( ) : ", searchObj.group())
    print("searchObj.group(1) : ", searchObj.group(1))
    print("searchObj.group(2) : ", searchObj.group(2))
else:
    print("Nothing found!!")

import re

line = "Cats are smarter than dogs"

# match looks only at the beginning of the string, so this fails
matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group( ) : ", matchObj.group())
else:
    print("No match!!")

# search scans the whole string, so this succeeds
searchObj = re.search(r'dogs', line, re.M | re.I)
if searchObj:
    print("search --> searchObj.group( ) : ", searchObj.group())
else:
    print("Nothing found!!")


Monday, 24 April 2017
