Monday, 24 April 2017

Regular Expressions

Regular Expressions:

  • A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.

Regular expressions are used to:

  • Extract the required data from the given data.
  • Perform data validations.
  • Develop URL patterns in web applications.


  • In regular expressions we use some special characters to define patterns.
  • After defining a pattern, we can extract the matching data from the given data by using the pre-defined functions of the re module.
  • re is a built-in module of Python.
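
As a rough illustration of these steps, the sketch below defines a few patterns and hands them to functions of the re module. The sample text, the e-mail address and the /articles/ URL layout are made up for this example.

```python
import re

# Made-up sample data, used only for illustration.
text = "Order 1042 shipped on 2017-04-24 to siva@example.com"

# Extraction: pull every run of digits out of the text.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['1042', '2017', '04', '24']

# Validation: the whole string must look like YYYY-MM-DD.
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2017-04-24") is not None
print(is_date)  # True

# URL pattern: capture the year from a hypothetical URL path.
url_match = re.match(r"^/articles/(\d{4})/$", "/articles/2017/")
year = url_match.group(1)
print(year)  # 2017
```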

Special characters:

1) *  => it matches zero or more occurrences of the preceding character.


*   ---->ab*c

ac

abc

abbc

abbbbbc


2) + => it matches one or more occurrences of the preceding character.

+ -----> ab+c

ac   #invalid

abc

abbc

abbbbbc

3) ? => it matches zero or one occurrence of the preceding character.

? ---->ab?c

ac

abc

abbc #invalid

perl, pearl => pea?rl

color, colour => colou?r
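
The three quantifiers above can be checked quickly in Python. This is a minimal sketch; the helper name matching is made up for this example, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

candidates = ["ac", "abc", "abbc", "abbbbbc"]

def matching(pattern, strings):
    """Return the strings that the pattern matches in full (hypothetical helper)."""
    return [s for s in strings if re.fullmatch(pattern, s)]

star = matching(r"ab*c", candidates)      # zero or more b
plus = matching(r"ab+c", candidates)      # one or more b
optional = matching(r"ab?c", candidates)  # zero or one b

print(star)      # ['ac', 'abc', 'abbc', 'abbbbbc']
print(plus)      # ['abc', 'abbc', 'abbbbbc']
print(optional)  # ['ac', 'abc']
```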

4) . => it matches any single character except a newline.

. ------> a.c

agc

a5c

a$c

a c

abcd #invalid

5) [ ] => it matches any single character in the given list.

[xyz…] ----->b[aeiou]d

bad

bed

bid

bod

bud

b8d #invalid

bpd #invalid

6) [^] => it matches any single character other than those in the given list.

[^xyz…]------>b[^aeiou]d

bad #invalid

bed #invalid

bid #invalid

bod #invalid

bud #invalid

b8d

bpd

7) [-]  => it matches any single character in the given range.

x[a-e]y

xay

xby

xcy

xdy

xey

xfy #invalid

xpy #invalid

[0-9] ---> any single digit

[a-z] ---> any one lowercase letter

[A-Z] ---> any one uppercase letter

[a-zA-Z] ---> any one letter

[a-zA-Z0-9_] ---> any one alphanumeric character or underscore

[^0-9] ---> any single non-digit

[^a-z] ---> any one character that is not a lowercase letter

[^A-Z] ---> any one character that is not an uppercase letter

[^a-zA-Z] ---> any one character that is not a letter

[^a-zA-Z0-9_] ---> any one character that is not alphanumeric or underscore (special characters)
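
The dot, character lists, negated lists and ranges can be compared side by side. A small sketch follows; the word list and the helper name matching are made up for this example.

```python
import re

words = ["bad", "bed", "b8d", "bpd", "b d"]

def matching(pattern, strings):
    """Return the strings that the pattern matches in full (hypothetical helper)."""
    return [s for s in strings if re.fullmatch(pattern, s)]

any_char = matching(r"b.d", words)          # any single character in the middle
vowel = matching(r"b[aeiou]d", words)       # one character from the list
not_vowel = matching(r"b[^aeiou]d", words)  # one character NOT in the list
digit = matching(r"b[0-9]d", words)         # one character from the range 0-9

print(any_char)   # ['bad', 'bed', 'b8d', 'bpd', 'b d']
print(vowel)      # ['bad', 'bed']
print(not_vowel)  # ['b8d', 'bpd', 'b d']
print(digit)      # ['b8d']
```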

8) ( | )  => it matches any one string in the list.

(java|hadoop|python)
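
A short sketch of this alternation in use; the sentence being searched is made up for this example.

```python
import re

text = "I teach python and hadoop, not cobol"

# The group captures whichever alternative matched at each position.
found = re.findall(r"(java|hadoop|python)", text)
print(found)  # ['python', 'hadoop']
```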

9) {m} => it matches exactly m occurrences of the preceding character.

ab{3}c

abbc #invalid

abbbc

abbbbbc #invalid

10) {m,n}  => it matches a minimum of m and a maximum of n occurrences of the preceding character.

ab{3,5}c

abbc #invalid

abbbc

abbbbc

abbbbbc

abbbbbbc #invalid

11) {m,}  => it matches a minimum of m occurrences of the preceding character, with no upper limit.

ab{3,}c

abbc #invalid

abbbc

abbbbbbbc
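
The counted forms {m}, {m,n} and {m,} can be tried against the same strings. A minimal sketch; the helper name matching is made up for this example.

```python
import re

# Strings with 2, 3, 4, 5 and 6 b characters.
candidates = ["abbc", "abbbc", "abbbbc", "abbbbbc", "abbbbbbc"]

def matching(pattern, strings):
    """Return the strings that the pattern matches in full (hypothetical helper)."""
    return [s for s in strings if re.fullmatch(pattern, s)]

exact = matching(r"ab{3}c", candidates)      # exactly 3 b
between = matching(r"ab{3,5}c", candidates)  # 3 to 5 b
at_least = matching(r"ab{3,}c", candidates)  # 3 or more b

print(exact)     # ['abbbc']
print(between)   # ['abbbc', 'abbbbc', 'abbbbbc']
print(at_least)  # ['abbbc', 'abbbbc', 'abbbbbc', 'abbbbbbc']
```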


12) ^  => start of the line

^perl

^[abc]

^[^abc]


13) $   => end of the line

Perl$

[0-9]$
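
The anchor patterns listed above could be exercised like this. The sample lines are made up for this example; re.search is used so that only the anchors decide where a match may sit.

```python
import re

lines = ["perl is old", "I like Perl", "version 5", "cobol"]

starts_perl = [s for s in lines if re.search(r"^perl", s)]
starts_abc = [s for s in lines if re.search(r"^[abc]", s)]
starts_other = [s for s in lines if re.search(r"^[^abc]", s)]
ends_perl = [s for s in lines if re.search(r"Perl$", s)]
ends_digit = [s for s in lines if re.search(r"[0-9]$", s)]

print(starts_perl)   # ['perl is old']
print(starts_abc)    # ['cobol']
print(starts_other)  # ['perl is old', 'I like Perl', 'version 5']
print(ends_perl)     # ['I like Perl']
print(ends_digit)    # ['version 5']
```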


14) \d or [0-9]  => any single digit.

[0-9][0-9][0-9][0-9] or [0-9]{4} or \d\d\d\d or \d{4}


15) \D or [^0-9]  => any single non digit


16) \w or [a-zA-Z0-9_]  => any alphanumeric character or underscore


17) \W or [^a-zA-Z0-9_]  => any character that is not alphanumeric or underscore (special character)


18) \s  => any whitespace character, such as ' ', '\t', '\n'


19) \b  => word boundary
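
A quick sketch of the shorthand classes and the word boundary in action. The sample strings are made up for this example.

```python
import re

sample = "room 42,\tfloor_7"   # contains a space, a comma and a tab

digits = re.findall(r"\d", sample)     # single digits
words = re.findall(r"\w+", sample)     # runs of letters, digits, underscore
non_word = re.findall(r"\W", sample)   # everything \w does not match
spaces = re.findall(r"\s", sample)     # whitespace characters

print(digits)    # ['4', '2', '7']
print(words)     # ['room', '42', 'floor_7']
print(non_word)  # [' ', ',', '\t']
print(spaces)    # [' ', '\t']

# \b makes sure "cat" is a whole word, not part of "concat" or "cats".
whole = re.findall(r"\bcat\b", "cat concat cats cat.")
print(whole)     # ['cat', 'cat']
```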


  • To avoid confusion between Python's own backslash escapes and those of regular expressions, write patterns as raw strings: r'expression'
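
Why raw strings matter is easiest to see with \b, which means "backspace" in a normal Python string but "word boundary" in a regular expression. A small sketch:

```python
import re

normal = "\bcat\b"   # Python turns each \b into a backspace character
raw = r"\bcat\b"     # the backslashes reach the re module untouched

print(len(normal))   # 5
print(len(raw))      # 7

# The normal string looks for backspace + "cat" + backspace, so nothing is found.
normal_result = re.search(normal, "a cat")
raw_result = re.search(raw, "a cat")

print(normal_result)       # None
print(raw_result.group())  # cat
```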

 Functions in re module:


match( ): Match a regular expression pattern to the beginning of a string.

re.match(pattern, string, flags=0)


search( ): Search a string for the presence of a pattern.

re.search(pattern, string, flags=0)


sub( ): Substitute occurrences of a pattern found in a string.

re.sub(pattern, repl, string, count=0, flags=0)


subn( ): Same as sub, but also return the number of substitutions made.

re.subn(pattern, repl, string, count=0, flags=0)


split( ): Split a string by the occurrences of a pattern.

re.split(pattern, string, maxsplit=0, flags=0)


findall( ): Find all occurrences of a pattern in a string.

re.findall(pattern, string, flags=0)


finditer( ): Return an iterator yielding a match object for each match.

re.finditer(pattern, string, flags=0)


compile( ):  Compile a pattern into a regular expression object (re.Pattern).

re.compile(pattern, flags=0)
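
subn and a compiled pattern are not demonstrated elsewhere in this post, so here is a brief sketch. The sample text is made up for this example.

```python
import re

text = "one  two   three"

# subn returns a tuple: (new string, number of substitutions made).
new_text, count = re.subn(r"\s+", " ", text)
print(new_text)  # one two three
print(count)     # 2

# A compiled pattern offers the same functions as methods.
whitespace = re.compile(r"\s+")
parts = whitespace.split(text)
print(parts)     # ['one', 'two', 'three']
```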

       

Matching Versus Searching


match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string.

Search and Replace


One of the most important re functions is sub.
re.sub(pattern, repl, string, count=0)
This function replaces all occurrences of the RE pattern in string with repl, unless count is provided, in which case at most count occurrences are replaced. It returns the modified string.
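
A minimal sketch of limiting the number of replacements; the date string is made up for this example.

```python
import re

text = "2017-04-24"

all_replaced = re.sub(r"-", "/", text)         # every occurrence
first_only = re.sub(r"-", "/", text, count=1)  # only the first occurrence

print(all_replaced)  # 2017/04/24
print(first_only)    # 2017/04-24
```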

Greedy matching:
The notion that the "+" and "*" characters in a regular expression expand outward to match the largest possible string.

Wild card:
A special character that matches any character. In regular expressions the wild card character is the period character.
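
Greedy matching is easiest to see next to its non-greedy counterpart, which is written by adding ? after the quantifier. The HTML-like sample string is made up for this example.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as it can, so one match spans the whole string.
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? stops at the first ">" it can reach.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```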

Character classes:


In many cases, rather than matching one particular character we want to match any one of a set of characters. 

This can be achieved by using a character class—one or more characters enclosed in square brackets.

Symbol    Meaning
.         Matches any character except newline, any character at all with the re.DOTALL flag, or inside a character class matches a literal period.
\d        Matches a Unicode digit, or [0-9] with the re.ASCII flag.
\D        Matches a Unicode nondigit, or [^0-9] with the re.ASCII flag.
\s        Matches a Unicode whitespace, or [ \t\n\r\f\v] with the re.ASCII flag.
\S        Matches a Unicode non-whitespace, or [^ \t\n\r\f\v] with the re.ASCII flag.
\w        Matches a Unicode "word" character, or [a-zA-Z0-9_] with the re.ASCII flag.
\W        Matches a Unicode non-"word" character, or [^a-zA-Z0-9_] with the re.ASCII flag.
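
The effect of the re.DOTALL and re.ASCII flags mentioned in the table can be sketched as follows. The sample strings are made up; "\u0663" is the Arabic-Indic digit three, chosen here as an example of a non-ASCII digit.

```python
import re

text = "line1\nline2"

# By default the dot does not match a newline; with re.DOTALL it does.
default_dot = re.search(r"line1.line2", text)
dotall_dot = re.search(r"line1.line2", text, re.DOTALL)
print(default_dot is None)     # True
print(dotall_dot is not None)  # True

# \d matches any Unicode digit unless re.ASCII restricts it to 0-9.
unicode_digits = re.findall(r"\d", "3\u0663")
ascii_digits = re.findall(r"\d", "3\u0663", re.ASCII)
print(len(unicode_digits))  # 2
print(ascii_digits)         # ['3']

# Inside a character class the dot is a literal period.
periods = re.findall(r"[.]", "a.b")
print(periods)              # ['.']
```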

Quantifiers

A quantifier has the form {m,n} where m and n are the minimum and maximum times the expression to which the quantifier applies must match.
Syntax              Meaning
e? or e{0,1}        Greedily match zero occurrences or one occurrence of expression e.
e?? or e{0,1}?      Nongreedily match zero occurrences or one occurrence of expression e.
e+ or e{1,}         Greedily match one or more occurrences of expression e.
e+? or e{1,}?       Nongreedily match one or more occurrences of expression e.
e* or e{0,}         Greedily match zero or more occurrences of expression e.
e*? or e{0,}?       Nongreedily match zero or more occurrences of expression e.
e{m}                Match exactly m occurrences of expression e.
e{m,}               Greedily match at least m occurrences of expression e.
e{m,}?              Nongreedily match at least m occurrences of expression e.
e{,n}               Greedily match at most n occurrences of expression e.
e{,n}?              Nongreedily match at most n occurrences of expression e.
e{m,n}              Greedily match at least m and at most n occurrences of expression e.
e{m,n}?             Nongreedily match at least m and at most n occurrences of expression e.
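
A few rows of the table, tried on a made-up string of digits. This is only a sketch of how the greedy and nongreedy forms differ in what they consume.

```python
import re

digits = "123456"

greedy_range = re.search(r"\d{2,4}", digits).group()  # takes as many as allowed
lazy_range = re.search(r"\d{2,4}?", digits).group()   # takes as few as allowed
at_most = re.search(r"\d{,3}", digits).group()        # zero to three digits
lazy_plus = re.search(r"\d+?", digits).group()        # one digit is enough

print(greedy_range)  # 1234
print(lazy_range)    # 12
print(at_most)       # 123
print(lazy_plus)     # 1
```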

Regular Expression Basics
.           Any character except newline
a           The character a
ab          The string ab
a|b         a or b
a*          0 or more a's
\           Escapes a special character

Regular Expression Character Classes
[ab-d]      One character of: a, b, c, d
[^ab-d]     One character except: a, b, c, d
[\b]        Backspace character
\d          One digit
\D          One non-digit
\s          One whitespace
\S          One non-whitespace
\w          One word character
\W          One non-word character

Regular Expression Flags
i           Ignore case
m           ^ and $ match start and end of line
s           . matches newline as well
x           Allow spaces and comments
L           Locale character classes
u           Unicode character classes
(?iLmsux)   Set flags within regex
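
In Python these flags are passed as re.I, re.M, re.S and re.X, or set inline at the start of the pattern. A short sketch with made-up sample text:

```python
import re

text = "Perl\npython\nPYTHON"

ignore_case = re.findall(r"python", text, re.I)
inline_flag = re.findall(r"(?i)python", text)         # same flag, set in the pattern
string_start = re.findall(r"^p\w+", text, re.I)        # ^ only at start of string
line_starts = re.findall(r"^p\w+", text, re.I | re.M)  # ^ at start of every line

print(ignore_case)   # ['python', 'PYTHON']
print(inline_flag)   # ['python', 'PYTHON']
print(string_start)  # ['Perl']
print(line_starts)   # ['Perl', 'python', 'PYTHON']

# re.X lets the pattern contain whitespace and comments.
year_month = re.compile(r"""
    \d{4}   # year
    -       # separator
    \d{2}   # month
""", re.X)
found = year_month.search("released 2017-04").group()
print(found)         # 2017-04
```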

Regular Expression Quantifiers
*           0 or more
+           1 or more
?           0 or 1
{2}         Exactly 2
{2,5}       Between 2 and 5
{2,}        2 or more
{,5}        Up to 5

Regular Expression Assertions
^           Start of string
\A          Start of string, ignores m flag
$           End of string
\Z          End of string, ignores m flag
\b          Word boundary
\B          Non-word boundary
(?=...)     Positive lookahead
(?!...)     Negative lookahead
(?<=...)    Positive lookbehind
(?<!...)    Negative lookbehind
(?()|)      Conditional
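
The lookahead and lookbehind assertions check what surrounds a match without including it in the result. A rough sketch on a made-up string:

```python
import re

text = "cost: $40, weight: 75kg, discount: 15%"

# Positive lookahead: digits that are followed by "kg".
before_kg = re.findall(r"\d+(?=kg)", text)

# Positive lookbehind: digits that are preceded by "$".
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookahead: numbers not followed by "%" (or by another digit).
not_percent = re.findall(r"\b\d+(?![\d%])", text)

# Negative lookbehind: numbers not preceded by "$".
not_dollar = re.findall(r"(?<!\$)\b\d+", text)

print(before_kg)     # ['75']
print(after_dollar)  # ['40']
print(not_percent)   # ['40', '75']
print(not_dollar)    # ['75', '15']
```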

Regular Expression Special Characters
\n          Newline
\r          Carriage return
\t          Tab
\YYY        Octal character YYY
\xYY        Hexadecimal character YY

Regular Expression Groups
(...)       Capturing group
(?P<Y>...)  Capturing group named Y
(?:...)     Non-capturing group
\Y          Match the Y'th captured group
(?P=Y)      Match the named group Y
(?#...)     Comment
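
A brief sketch of the group syntax in Python. The group names month, day and ch, and the sample strings, are made up for this example.

```python
import re

# Named and numbered capturing groups.
m = re.search(r"(?P<month>[a-zA-Z]+) (?P<day>\d+)", "June 24")
month = m.group("month")
day = m.group(2)
print(month, day)  # June 24

# A non-capturing group only groups; findall returns the whole match.
repeats = re.findall(r"(?:ab)+", "ababab ab")
print(repeats)     # ['ababab', 'ab']

# \1 matches the same text that group 1 captured (a repeated word here).
doubled_word = re.search(r"\b(\w+) \1\b", "this is is a test").group()
print(doubled_word)    # is is

# (?P=ch) does the same for a named group (a doubled letter here).
doubled_letter = re.search(r"(?P<ch>\w)(?P=ch)", "hello").group()
print(doubled_letter)  # ll
```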

Regular Expression Replacement
\g<0>       Insert entire match
\g<Y>       Insert match Y (name or number)
\Y          Insert group numbered Y
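
The replacement syntax is used in the repl argument of re.sub. A small sketch; the group names and sample strings are made up for this example.

```python
import re

pattern = r"(?P<month>[a-zA-Z]+) (?P<day>\d+)"

# Refer to groups by name in the replacement.
swapped = re.sub(pattern, r"\g<day> \g<month>", "June 24")

# \g<0> stands for the entire match.
bracketed = re.sub(pattern, r"[\g<0>]", "June 24")

# \g<1>0 means "group 1, then a literal 0"; a bare \10 would mean group 10.
times_ten = re.sub(r"(\d)", r"\g<1>0", "a1b2")

print(swapped)    # 24 June
print(bracketed)  # [June 24]
print(times_ten)  # a10b20
```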

Splitting Strings on Any of Multiple Delimiters
The split() method of string objects is really meant for very simple cases, and does not allow for multiple delimiters.
In cases when you need a bit more flexibility, use the re.split() function:
import re
message="hai,siva krishna;hw r u"
re.split(r'[,;\s]\s*',message)
['hai', 'siva', 'krishna', 'hw', 'r', 'u']



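import re

# One or more letters, a single space, then one or more digits.
regex=r"[a-zA-Z]+ \d+"
matches=re.findall(regex,"June 24,August 9,Oct 13,Dec")
# With no groups in the pattern, findall returns the full text of each
# match: 'June 24', 'August 9', 'Oct 13'.
for match in matches:
    print("Full match: ",match)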



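import re

regex=r"([a-zA-Z]+) \d+"
matches=re.findall(regex,"June 24,August 9,Oct 13,Dec")
# With one group in the pattern, findall returns only the text of that
# group: 'June', 'August', 'Oct'.
for match in matches:
    print("Group 1: ",match)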



import re

regex=r"[a-zA-Z]+ \d+"
matches1=re.findall(regex,"June 24,August 9,Oct 13,Dec")
matches2=re.finditer(regex,"June 24,August 9,Oct 13,Dec")
print(type(matches1))   # findall returns a list of strings
print(type(matches2))   # finditer returns an iterator of match objects
for match in matches1:
    print(match)
for match in matches2:
    # Start and end index of each match: 0 7, 8 16, 17 23
    print(match.start(),match.end())



import re

regex=r"([a-zA-Z]+) (\d+)"
# \1 is the month, \2 is the day; prints 24 of June,9 of August,13 of Oct,Dec
print(re.sub(regex,r"\2 of \1","June 24,August 9,Oct 13,Dec"))



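import re

regex=r"([a-zA-Z]+) (\d+)"
x=re.sub(regex,r"\2 of \1","June 24,August 9,Oct 13,Dec")
# Raw string, so that \d reaches the re module unchanged.
regex1=r"\d+ [a-zA-Z]+ ([a-zA-Z]+)"
matches=re.findall(regex1,x)
# Prints the captured month names: June, August, Oct
for match in matches:
    print(match)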



import re

regex=re.compile(r"(\w+) World")
result=regex.search("Hello World is the easiest")
print(result)
# search returns None when nothing matches, so test the result first.
if result:
    print(result.start(),result.end())   # 0 11



import re

regex=re.compile(r"(\w+) World")
x=regex.findall("Hello World.Hai World")
# Prints the captured words: Hello, Hai
for result in x:
    print(result)


import re

regex=re.compile(r"(\w+) World")
# \1 re-inserts the captured word; prints Hello Earth
print(regex.sub(r"\1 Earth","Hello World"))


import re

line = "Cats are smarter than dogs"
# re.M and re.I switch on multi-line and case-insensitive matching.
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M|re.I)
if matchObj:
    print("matchObj.group() : ", matchObj.group())     # the whole line
    print("matchObj.group(1) : ", matchObj.group(1))   # Cats
    print("matchObj.group(2) : ", matchObj.group(2))   # smarter
else:
    print("No match!!")


import re

line = "Cats are smarter than dogs"
searchObj = re.search(r'(.*) are (.*?) .*', line, re.M|re.I)
if searchObj:
    print("searchObj.group() : ", searchObj.group())     # the whole line
    print("searchObj.group(1) : ", searchObj.group(1))   # Cats
    print("searchObj.group(2) : ", searchObj.group(2))   # smarter
else:
    print("Nothing found!!")



import re

line = "Cats are smarter than dogs"

# match only looks at the beginning of the string, so it finds nothing here.
matchObj = re.match(r'dogs', line, re.M|re.I)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

# search scans the whole string, so it finds "dogs" at the end.
searchObj = re.search(r'dogs', line, re.M|re.I)
if searchObj:
    print("search --> searchObj.group() : ", searchObj.group())
else:
    print("Nothing found!!")





