Yesterday I watched Corey Schafer’s video on regex in Python and took a ton of notes. I’ve cleaned them up, and here’s what I learned!
regex notes
Here’s some sample code showing how to search a string using regular expressions. Below the code is an explanation of how to build more complicated regex patterns.
import re

text_to_search = """
feel free to type some big multi line text here
"""
pattern = re.compile(r'abc')
# this compiles the literal string 'abc' into a reusable pattern object.
# the 'r' makes it a raw string, so python leaves backslashes alone,
# ie it lets escape chars like \d through to regex.
matches = pattern.finditer(text_to_search)
# this applies the pattern to text_to_search and returns an iterator of match objects.
# finditer will iterate through the matches without filling up your memory
for match in matches:
    print(match)
all_matches = pattern.findall(text_to_search)
# findall returns a list of all the matched strings. it is iterable, but it is not a generator
start_match = pattern.match(text_to_search)
# match only looks for the pattern at the beginning of the string.
# it returns a single match object or None. not iterable.
first_match = pattern.search(text_to_search)
# search returns a match object for the first occurrence anywhere in the string, or None.
# also not iterable.
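To make the difference between the four methods concrete, here is a small sketch. The sample strings are made up for illustration, and the variable names are my own.

```python
import re

text = "abc xyz abc"  # made-up sample text
pattern = re.compile(r'abc')

# finditer yields match objects; span() gives (start, end) of each match
spans = [m.span() for m in pattern.finditer(text)]
# findall gives back the matched strings themselves, as a list
found = pattern.findall(text)
# match only succeeds if the pattern sits at the very start of the string
at_start = pattern.match("xyz abc")
# search scans forward and stops at the first hit
first = pattern.search("xyz abc")

print(spans)         # [(0, 3), (8, 11)]
print(found)         # ['abc', 'abc']
print(at_start)      # None
print(first.span())  # (4, 7)
```

Since match and search can return None, it is worth checking the result before calling methods like span() or group() on it.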
REGEX search TERMS
. – any character except newline
\d – any digit
\D – not a digit
\w – word character (a-z, A-Z, 0-9, _)
\s – whitespace (space, tab, newline)
\S – not whitespace
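Here is a quick sketch of these search terms applied to a made-up string (the string and variable names are only for illustration):

```python
import re

# made-up sample text; \t is a real tab character
sample = "Call 555-1234 now\tor later"

digits = re.findall(r'\d', sample)      # every single digit
chunks = re.findall(r'\S+', sample)     # runs of non-whitespace
words = re.findall(r'\w+', sample)      # runs of word characters
wildcard = re.findall(r'5.5', sample)   # '.' stands in for any one character

print(digits)    # ['5', '5', '5', '1', '2', '3', '4']
print(chunks)    # ['Call', '555-1234', 'now', 'or', 'later']
print(words)     # ['Call', '555', '1234', 'now', 'or', 'later']
print(wildcard)  # ['555']
```

Notice that \w+ splits 555-1234 in two because the dash is not a word character, while \S+ keeps it together.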
ANCHORS – let you specify placement in a string / word
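\b – word boundary (the position between a word character and a non-word character, or the start / end of the string).
this will change a search from matching anywhere inside a word to matching only at the start or end of a word
\B – not a word boundary (a position inside a word, or between two non-word characters)
^ – beginning of a string (the beginning of the whole string, not of each word or line; the re.MULTILINE flag makes it match at the start of every line)
$ – end of a string (the end of the whole string, not of each word or line; the re.MULTILINE flag makes it match at the end of every line)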
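A short sketch of how anchors change what gets matched, using a made-up string:

```python
import re

sample = "cat concat cat"  # made-up sample text

# without anchors, 'cat' is also found inside 'concat'
anywhere = re.findall(r'cat', sample)
# \b on both sides keeps only the standalone words
whole_words = re.findall(r'\bcat\b', sample)
# \B demands that 'cat' does NOT start at a word boundary
inside_word = re.findall(r'\Bcat', sample)

# ^ and $ pin the pattern to the start / end of the whole string
starts_with_cat = re.search(r'^cat', sample) is not None
ends_with_concat = re.search(r'concat$', sample) is not None

print(len(anywhere), len(whole_words), len(inside_word))  # 3 2 1
print(starts_with_cat, ends_with_concat)                  # True False
```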
CHARACTER SET
a set of characters to accept, either a range [a-z] or a list [abcdefg]. the whole set matches exactly one character
you can put punctuation in a set without escaping it, ie [-.] will match one dash or dot
a dash at the beginning or end of a character set means a literal dash; between two characters it means a range between them.
a ^ at the beginning of a character set negates the set. ie [^a-z] will match any one character that is not in a-z
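As a sketch, here is a character set used to accept either separator in a phone number. The phone numbers are invented for the example.

```python
import re

# invented sample data: the third number uses '*' as a separator
sample = "321-555-4321 and 123.555.1234 and 800*555*0000"

# [-.] accepts one dash or one dot between the digit groups
phone = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')
numbers = phone.findall(sample)

# [a-c] is a range; [^a-c] is any one character NOT in that range
in_range = re.findall(r'[a-c]', "abcdef")
not_in_range = re.findall(r'[^a-c]', "abcdef")

print(numbers)       # ['321-555-4321', '123.555.1234']
print(in_range)      # ['a', 'b', 'c']
print(not_in_range)  # ['d', 'e', 'f']
```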
QUANTIFIERS – let you select the number you want to match
* – 0 or more
+ – 1 or more
? – 0 or one
{3} – 3 exactly
{3,4} – a range of repetitions (min, max), here 3 or 4
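A small sketch of each quantifier. The test strings are made up, and re.fullmatch is used so the whole string has to fit the pattern.

```python
import re

# ? : the 'u' may appear 0 or 1 times, so 'colouur' is rejected
spellings = re.findall(r'colou?r', "color colour colouur")

# + needs at least one 'a'; * is happy with none
plus_needs_one = re.fullmatch(r'ba+', "b") is None
star_allows_zero = re.fullmatch(r'ba*', "b") is not None

# {3} is exactly three digits; {3,4} is three or four
exactly_three = re.fullmatch(r'\d{3}', "1234") is None
three_or_four = re.fullmatch(r'\d{3,4}', "1234") is not None

print(spellings)  # ['color', 'colour']
print(plus_needs_one, star_allows_zero, exactly_three, three_or_four)
# True True True True
```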
GROUPS – let you do logical ‘or’ searches across multiple words or characters
groups use parens to wrap the acceptable elements and pipes to separate them. (mr|ms|mrs) means match an mr or an ms or an mrs. You can access what each group matched by calling group() on a match object:
for match in matches:
    print(match.group(0))
# group(0) is the whole match
# group(1), group(2), etc. are the text matched by the 1st, 2nd, etc. set of parens in the search term
re.compile(r'M[a-z]+\.?\s[a-zA-Z]*')
this searches for a capital M, then 1 or more lowercase letters, then 0 or 1 periods, then one whitespace character, and finally 0 or more upper or lowercase letters.
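Putting groups and group() together, here is a minimal sketch. The list of names and the exact pattern are my own assumptions for illustration, not something taken from the notes above.

```python
import re

# made-up sample names
names = """
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
"""

# group 1 is the title; the rest is an optional period, a space, and a name
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*')

full_matches = []
titles = []
for match in pattern.finditer(names):
    full_matches.append(match.group(0))  # the whole match
    titles.append(match.group(1))        # only what the parens captured

print(full_matches)
# ['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']
print(titles)
# ['Mr', 'Mr', 'Ms', 'Mrs', 'Mr']
```

Even though Mr is listed before Mrs, 'Mrs. Robinson' still matches in full: when the Mr branch fails further along, regex backtracks and tries the other branches.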
BACK REFERENCING
this is when you refer back to what a group captured, for example to keep only certain parts of each match. It is used in conjunction with groups.
# more setup code above in order to make the pattern
subbed_urls = pattern.sub(r'\2\3', in_string)
This works slightly strangely. sub replaces every match of the pattern in in_string with the text of group 2 followed by the text of group 3, and returns the resulting string. Anything the pattern did not match is left untouched.
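Since the setup code is not shown, here is a self-contained sketch of the same idea. The URL list and the three-group pattern are assumptions of mine, chosen so that group 2 is the domain name and group 3 is the top level domain.

```python
import re

# made-up sample input
urls = """
https://www.google.com
http://example.org
https://www.nasa.gov
"""

# assumed pattern: group 1 = optional 'www.', group 2 = domain, group 3 = '.tld'
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# \2 and \3 in the replacement refer back to groups 2 and 3 of each match
subbed_urls = pattern.sub(r'\2\3', urls)
print(subbed_urls)
# google.com
# example.org
# nasa.gov
```

The newlines between the URLs survive because they are not part of any match.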
FLAGS
Flags adjust how a pattern is matched. You pass them as the second argument to re.compile
pattern = re.compile(r'start', re.IGNORECASE)
this tells re to ignore the case, accepting any combination of lowercase and uppercase letters
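A quick sketch comparing the same pattern with and without the flag, on a made-up string:

```python
import re

sample = "Start here, then START again"  # made-up sample text

case_sensitive = re.compile(r'start')
case_insensitive = re.compile(r'start', re.IGNORECASE)

strict_hits = case_sensitive.findall(sample)
relaxed_hits = case_insensitive.findall(sample)

print(strict_hits)   # []
print(relaxed_hits)  # ['Start', 'START']
```

re.I is a shorter alias for re.IGNORECASE if you get tired of typing it.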