regex notes from Corey Schafer’s video

Yesterday I watched Corey Schafer’s video on regex in Python and took a ton of notes. I’ve cleaned them up, and here’s what I learned!

regex notes


Here’s some sample code showing how to search a string using regular expressions. Below the code is an explanation of how to build more complicated regex patterns.

import re

text_to_search = """
feel free to type some big multi line text here
"""

pattern = re.compile(r'abc')
# this compiles the literal string 'abc' into a reusable pattern object.
# the 'r' makes it a raw string, so python leaves the backslashes alone,
# ie escape chars like \d get through to regex untouched.

matches = pattern.finditer(text_to_search)
# this applies the pattern to text_to_search and returns an iterator of match objects.
# finditer yields the matches one at a time, without filling up your memory

for match in matches:
    print(match)

matches = pattern.findall(text_to_search)
# findall returns a list of all the matched strings. it is iterable, but it is not a generator

matches = pattern.match(text_to_search)
# match only matches at the very beginning of a string.
# it returns one match object (or None), so it is not iterable.

matches = pattern.search(text_to_search)
# search returns a match object for the first occurrence anywhere in the string (or None). not iterable either.
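
To see the difference between the four methods side by side, here is a small sketch on a made-up string (the text and variable names are my own, not from the video):

```python
import re

text = "abc xyz abc"
pattern = re.compile(r'abc')

# finditer: iterator of match objects; span() gives (start, end) indexes
spans = [m.span() for m in pattern.finditer(text)]
print(spans)            # [(0, 3), (8, 11)]

# findall: a plain list of the matched strings
found = pattern.findall(text)
print(found)            # ['abc', 'abc']

# match: only succeeds at position 0, otherwise returns None
at_start = pattern.match(text)
not_at_start = pattern.match("xyz abc")
print(at_start.group(0), not_at_start)   # abc None

# search: first match anywhere in the string
first = pattern.search("xyz abc")
print(first.span())     # (4, 7)
```

Since match and search can return None, it is worth checking the result before calling group() or span() on it.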


REGEX search TERMS
. – any character except newline
\d – any digit
\D – not a digit
\w – word character (a-z, A-Z, 0-9, _)
\W – not a word character
\s – whitespace (space, tab, newline)
\S – not whitespace
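
Here is a quick sketch of the terms above in action. The sample string is made up for illustration:

```python
import re

# made-up sample string; the "\t" here is a real tab character
sample = "Call 555-1234\tby 5pm"

digits = re.findall(r'\d', sample)     # every single digit
words = re.findall(r'\w+', sample)     # runs of word characters
spaces = re.findall(r'\s', sample)     # each whitespace character
chunks = re.findall(r'\S+', sample)    # runs of non-whitespace
dots = re.findall(r'.', "a\nb")        # . does not match the newline
snake = re.findall(r'\w+', "snake_case")  # underscore is a word character

print(digits)  # ['5', '5', '5', '1', '2', '3', '4', '5']
print(words)   # ['Call', '555', '1234', 'by', '5pm']
print(spaces)  # [' ', '\t', ' ']
print(chunks)  # ['Call', '555-1234', 'by', '5pm']
print(dots)    # ['a', 'b']
print(snake)   # ['snake_case']
```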

ANCHORS – let you specify placement in a string / word
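\b – word boundary (ie the zero-width spot between a word character and a non-word character like a space, tab, or newline, or the start/end of the string.
this will change a search from searching within a word to searching solely at the edge of a word)
\B – not a word boundary
^ – beginning of a string (actually the beginning of the whole string, not the beginning of each line, unless you pass the re.MULTILINE flag)
$ – end of a string (actually the end of the whole string, not the end of each line, unless you pass the re.MULTILINE flag)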
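
A small sketch of how the anchors change what gets matched. The strings are my own examples:

```python
import re

text = "cat concat cat"

# no anchor: 'cat' is also found inside 'concat'
anywhere = [m.start() for m in re.finditer(r'cat', text)]
# \b: only where 'cat' starts a word
word_start = [m.start() for m in re.finditer(r'\bcat', text)]
# \B: only where 'cat' does NOT start a word
mid_word = [m.start() for m in re.finditer(r'\Bcat', text)]

print(anywhere)    # [0, 7, 11]
print(word_start)  # [0, 11]
print(mid_word)    # [7]

# ^ and $ pin the match to the start / end of the whole string
starts_with_cat = re.search(r'^cat', text) is not None
starts_with_concat = re.search(r'^concat', text) is not None
last_cat = re.search(r'cat$', text).start()
print(starts_with_cat, starts_with_concat, last_cat)  # True False 11

# with re.MULTILINE, ^ also matches right after each newline
one_line = re.findall(r'^\w+', "one\ntwo")
per_line = re.findall(r'^\w+', "one\ntwo", re.MULTILINE)
print(one_line, per_line)  # ['one'] ['one', 'two']
```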

CHARACTER SET
a set of characters to accept, either a range [a-z] or a list [abcdefg]. A set matches exactly one character.
You can also put punctuation in the brackets, ie [-.] will match one dash or one dot
a dash at the beginning or end of a character set means a literal dash, any other place it means a range between the chars on either side.
a ^ at the beginning of a character set negates the set. ie [^a-z] will match any one char that is not in a-z
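
Here is a rough sketch of each character set rule, using made-up phone numbers and words:

```python
import re

# made-up phone numbers with three different separators
phones = "321-555-4321 123.555.1234 800*555*0000"

# [-.] matches exactly one character: a dash or a dot, so '*' is rejected
dash_or_dot = re.findall(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d', phones)
print(dash_or_dot)  # ['321-555-4321', '123.555.1234']

# a dash between two chars is a range
in_range = re.findall(r'[a-c]', "abcdef")
print(in_range)     # ['a', 'b', 'c']

# a dash at the end of the set is a literal dash
with_dash = re.findall(r'[a-c-]', "a-d")
print(with_dash)    # ['a', '-']

# a leading ^ negates the set: anything but 'b', followed by 'at'
not_bat = re.findall(r'[^b]at', "cat mat bat pat")
print(not_bat)      # ['cat', 'mat', 'pat']
```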

QUANTIFIERS – let you select the number you want to match
* – 0 or more
+ – 1 or more
? – 0 or 1
{3} – exactly 3
{3,4} – a range of counts (min, max), here 3 or 4
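
A short sketch of each quantifier on made-up strings. A quantifier applies to whatever comes right before it:

```python
import re

star = re.findall(r'ab*', "a ab abbb")         # 'a' then 0 or more 'b'
plus = re.findall(r'ab+', "a ab abbb")         # 'a' then 1 or more 'b'
optional = re.findall(r'colou?r', "color colour")   # the 'u' is optional
exactly = re.findall(r'\b\d{3}\b', "12 123 1234")   # exactly 3 digits
ranged = re.findall(r'\b\d{3,4}\b', "12 123 1234")  # 3 or 4 digits

print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['color', 'colour']
print(exactly)   # ['123']
print(ranged)    # ['123', '1234']
```

The \b anchors are there so that \d{3} does not grab the first three digits of 1234.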

GROUPS – let you do logical ‘or’ searches across multiple words or characters
groups use parens to wrap the acceptable elements and pipes to separate them. (mr|ms|mrs) means match an mr or an ms or an mrs. Each pair of parens also captures the text it matched. You can access the captures by calling group() on each match:
for match in matches:
    print(match.group(0))
# group(0) is the whole match
# group(1), group(2), etc is the text captured by the 1st, 2nd, etc pair of parens in the search term

re.compile(r'M[a-z]+\.?\s[a-zA-Z]*')
this searches for a capital M, then 1 or more lowercase letters, then 0 or 1 periods, then one whitespace character, and finally 0 or more upper or lowercase letters.
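
Here is a sketch of groups on some made-up names. The pattern below is my own variation that uses a (Mr|Ms|Mrs) group instead of M[a-z]+, so treat it as an illustration rather than the exact pattern from the video:

```python
import re

# made-up sample names
names = """
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
"""

# group 1: the title, group 2: a capitalized name
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s([A-Z]\w*)')
matches = list(pattern.finditer(names))

whole = [m.group(0) for m in matches]      # the entire match
titles = [m.group(1) for m in matches]     # first pair of parens
surnames = [m.group(2) for m in matches]   # second pair of parens

print(whole)     # ['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']
print(titles)    # ['Mr', 'Mr', 'Ms', 'Mrs', 'Mr']
print(surnames)  # ['Schafer', 'Smith', 'Davis', 'Robinson', 'T']
```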

BACK REFERENCING
this is when you reuse the text captured by a group, for example in a replacement string. It is used in conjunction with groups.

# more setup code above in order to make the pattern
subbed_urls = pattern.sub(r'\2\3', in_string)

This works slightly strangely. sub returns a new copy of in_string where every match of the pattern has been replaced by the text of group 2 followed by the text of group 3. Anything the pattern didn't match is left as it was.
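
Since the setup code isn't shown here, this is a sketch with a hypothetical pattern and made-up URLs, just to show what \2\3 does. I'm assuming group 1 is an optional www., group 2 is the domain name, and group 3 is the top level domain:

```python
import re

# made-up input, one URL per line
in_string = """
https://www.google.com
http://example.com
https://youtube.com
https://www.nasa.gov
"""

# hypothetical pattern (an assumption, not the original setup code):
# group 1 = optional 'www.', group 2 = domain name, group 3 = '.com', '.gov', etc
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# replace each matched URL with group 2 followed by group 3
subbed_urls = pattern.sub(r'\2\3', in_string)
print(subbed_urls)
# google.com
# example.com
# youtube.com
# nasa.gov
```

The replacement string is raw too, so that \2 and \3 reach re as back references instead of being read by python as escape chars.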

FLAGS
Flags are a way to adjust how a search behaves
pattern = re.compile(r'start', re.IGNORECASE)
this tells re to ignore case, so the pattern matches any combination of lowercase and uppercase letters (Start, START, sTaRt)
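
A quick sketch of the flag on a made-up string. re.I is the short alias for re.IGNORECASE:

```python
import re

text = "Start START sTaRt stop"

case_sensitive = re.compile(r'start').findall(text)
ignore_case = re.compile(r'start', re.IGNORECASE).findall(text)
short_alias = re.compile(r'start', re.I).findall(text)

print(case_sensitive)  # []
print(ignore_case)     # ['Start', 'START', 'sTaRt']
print(short_alias)     # ['Start', 'START', 'sTaRt']
```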

Tell me what you think.
