Regular Expressions

Advanced ~30 min read

Regular expressions (regex) are powerful pattern matching tools that let you search, extract, and manipulate text using complex patterns. Python's re module provides functions for working with regex patterns. Master regex and you can validate input, extract data, find/replace text, and process structured information efficiently!

Basic Pattern Matching

The re module provides several functions for pattern matching. The most common are re.search() (finds first match), re.findall() (finds all matches), and re.match() (matches at string start).

# Basic Pattern Matching with re Module

import re

text = "The quick brown fox jumps over the lazy dog"

# re.search(): Finds first match anywhere in string
result = re.search(r"fox", text)
if result:
    print("Found 'fox' at position:", result.start())
    print("Match:", result.group())

# re.match(): Matches only at string start
result = re.match(r"The", text)
if result:
    print("\nMatch at start: 'The'")

result = re.match(r"quick", text)  # Won't match (not at start)
if not result:
    print("No match for 'quick' at start")

# re.findall(): Finds all matches
text2 = "cat bat hat mat"
matches = re.findall(r"at", text2)
print("\nAll matches of 'at':", matches)

# re.finditer(): Returns iterator of Match objects
print("\nUsing finditer:")
for match in re.finditer(r"\w+", text):
    print(f"Word: {match.group()}, Position: {match.start()}-{match.end()}")

# Finding patterns
email_text = "Contact us at support@example.com or sales@test.org"
email_pattern = r"\S+@\S+"
emails = re.findall(email_pattern, email_text)
print("\nFound emails:", emails)

# Matching phone numbers
phone_text = "Call (555) 123-4567 or 555-987-6543"
phone_pattern = r"\(?\d{3}\)?[- ]?\d{3}-\d{4}"
phones = re.findall(phone_pattern, phone_text)
print("Found phones:", phones)


Key Functions:
- re.search(pattern, string): Finds first match anywhere in string, returns Match object or None
- re.match(pattern, string): Matches only at string start, returns Match object or None
- re.findall(pattern, string): Finds all non-overlapping matches, returns a list of strings (or a list of tuples when the pattern contains more than one capturing group)
- re.finditer(pattern, string): Finds all matches, returns iterator of Match objects
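
The difference in return types is easiest to see by running all four functions with the same pattern. The following is a small sketch; the sample string and variable names are made up for illustration.

```python
import re

sample = "id=42, id=7"  # hypothetical sample text

first = re.search(r"\d+", sample)      # Match object for the first hit, or None
at_start = re.match(r"\d+", sample)    # None: the string starts with "id", not a digit
all_hits = re.findall(r"\d+", sample)  # plain list of strings
spans = [(m.group(), m.span()) for m in re.finditer(r"\d+", sample)]

print(first.group())  # 42
print(at_start)       # None
print(all_hits)       # ['42', '7']
print(spans)          # [('42', (3, 5)), ('7', (10, 11))]
```

Use finditer() rather than findall() when you need positions as well as the matched text.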

Common Regex Patterns

Regex patterns use special characters and sequences to match text. Here are the most commonly used patterns:

# Common Regex Patterns

import re

text = "Contact: email@example.com, Phone: (555) 123-4567, Date: 2024-01-15"

# Raw strings prevent backslash issues
print("Using raw strings (r'pattern'):")
pattern = r"\d{4}-\d{2}-\d{2}"  # Date pattern
dates = re.findall(pattern, text)
print("Dates:", dates)

# Common patterns
print("\nCommon regex patterns:")

# \d - digits
print("Digits:", re.findall(r"\d+", text))

# \w - word characters (letters, digits, underscore)
print("Words:", re.findall(r"\w+", text))

# \s - whitespace
spaces = re.findall(r"\s", text)
print(f"Whitespace count: {len(spaces)}")

# . - any character (except newline)
print("Any 3 chars:", re.findall(r"...", "abc123"))  # ['abc', '123']

# ^ - start of string
print("Start match:", re.findall(r"^Contact", text))

# $ - end of string
print("End match:", re.findall(r"15$", text))

# * - zero or more
print("Zero or more digits:", re.findall(r"\d*", "abc123"))

# + - one or more
print("One or more digits:", re.findall(r"\d+", "abc123"))

# ? - zero or one
print("Optional digits:", re.findall(r"\d?", "abc123"))

# {n,m} - between n and m times
print("2-3 digits:", re.findall(r"\d{2,3}", "12 345 6789"))

# Character classes
print("\nCharacter classes:")
print("Digits:", re.findall(r"[0-9]+", text))
print("Letters:", re.findall(r"[a-zA-Z]+", text))
print("Vowels:", re.findall(r"[aeiou]", text))

# Negation [^...]
print("Non-digit, non-space runs:", re.findall(r"[^0-9\s]+", text)[:5])  # First 5

# Escaping special characters
special_text = "Price: $100.50"
print("\nEscaping special chars:")
print("Dollar amounts:", re.findall(r"\$\d+\.\d+", special_text))


Pro Tip: Use raw strings (r"pattern") for regex patterns to avoid escaping backslashes. In raw strings, \d stays as \d instead of being interpreted as an escape sequence!
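
To see what the r prefix changes, it helps to compare the actual characters in each string. This is a minimal sketch using \b, which Python treats as a backspace in a normal string; the variable names are illustrative only.

```python
import re

plain = "\bcat\b"  # \b becomes the backspace character (ASCII 8)
raw = r"\bcat\b"   # backslash and 'b' are kept as two characters

print(len(plain))  # 5
print(len(raw))    # 7

# Doubling the backslash in a normal string gives the same pattern as a raw string
print("\\d+" == r"\d+")  # True

print(re.search(plain, "a cat here"))        # None
print(re.search(raw, "a cat here").group())  # cat
```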

Groups and Capturing

Parentheses () create groups that capture parts of the match. You can access captured groups using the group() method on Match objects, or use them in substitutions.

# Groups and Capturing

import re

# Simple groups
text = "Contact: John Doe, Phone: (555) 123-4567"

# Capturing groups with parentheses
pattern = r"(\w+) (\w+)"
match = re.search(pattern, text)
if match:
    print("Full match:", match.group(0))  # "John Doe"
    print("Group 1 (first name):", match.group(1))  # "John"
    print("Group 2 (last name):", match.group(2))  # "Doe"
    print("All groups:", match.groups())

# Phone number with groups
phone_pattern = r"\((\d{3})\) (\d{3})-(\d{4})"
match = re.search(phone_pattern, text)
if match:
    area_code, first, last = match.groups()
    print(f"\nPhone number breakdown:")
    print(f"Area code: {area_code}")
    print(f"First part: {first}")
    print(f"Last part: {last}")

# Email parsing with groups
email_text = "Contact support@example.com or sales@test.org"
email_pattern = r"(\w+)@(\w+\.\w+)"
matches = re.findall(email_pattern, email_text)
print("\nEmail parsing:")
for username, domain in matches:
    print(f"Username: {username}, Domain: {domain}")

# Named groups (more readable)
pattern = r"(?P<area>\d{3})-(?P<first>\d{3})-(?P<last>\d{4})"
match = re.search(pattern, "555-123-4567")
if match:
    print("\nNamed groups:")
    print("Area:", match.group('area'))
    print("First:", match.group('first'))
    print("Last:", match.group('last'))
    print("Dict:", match.groupdict())

# Date parsing
date_text = "Today is 2024-01-15 and tomorrow is 2024-01-16"
date_pattern = r"(\d{4})-(\d{2})-(\d{2})"
dates = re.findall(date_pattern, date_text)
print("\nDate parsing:")
for year, month, day in dates:
    print(f"Year: {year}, Month: {month}, Day: {day}")

# Non-capturing groups (?:...)
text2 = "color colour"
# Match both spellings; (?:u)? groups the optional 'u' without capturing it,
# so findall() still returns the whole match instead of just the group
pattern = r"colo(?:u)?r"
matches = re.findall(pattern, text2)
print("\nNon-capturing (both spellings):", matches)
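
One detail that often surprises people is that re.findall() changes what it returns depending on how many capturing groups the pattern has. A short sketch, with a made-up sample string:

```python
import re

log = "x=1, y=22, z=333"  # hypothetical sample text

no_groups = re.findall(r"\w=\d+", log)          # whole matches
one_group = re.findall(r"\w=(\d+)", log)        # only the captured part
two_groups = re.findall(r"(\w)=(\d+)", log)     # one tuple per match
non_capturing = re.findall(r"(?:\w)=\d+", log)  # (?:...) does not change the result

print(no_groups)      # ['x=1', 'y=22', 'z=333']
print(one_group)      # ['1', '22', '333']
print(two_groups)     # [('x', '1'), ('y', '22'), ('z', '333')]
print(non_capturing)  # ['x=1', 'y=22', 'z=333']
```

If you need grouping for a quantifier but still want whole matches back from findall(), a non-capturing group is usually the simplest choice.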


Substitution with re.sub

The re.sub() function replaces matches with replacement text. You can use captured groups in the replacement string using \1, \2, etc.

# Substitution with re.sub

import re

text = "Contact: email@example.com or sales@test.org"

# Basic substitution
new_text = re.sub(r"@", "[at]", text)
print("Basic substitution:")
print(new_text)

# Using groups in replacement
phone_text = "Call (555) 123-4567"
# Reorder: (area) first-last -> area-first-last
new_phone = re.sub(r"\((\d{3})\) (\d{3})-(\d{4})", r"\1-\2-\3", phone_text)
print("\nReordering with groups:")
print(new_phone)

# Using named groups
email_text = "Contact support@example.com"
new_email = re.sub(r"(?P<user>\w+)@(?P<domain>\S+)", r"\g<user>[at]\g<domain>", email_text)
print("\nNamed groups in replacement:")
print(new_email)

# Date format conversion
dates_text = "2024-01-15 and 2024-12-25"
# Convert YYYY-MM-DD to MM/DD/YYYY
new_dates = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\2/\3/\1", dates_text)
print("\nDate format conversion:")
print(new_dates)

# Function as replacement
def mask_email(match):
    """Mask email addresses."""
    email = match.group(0)
    parts = email.split('@')
    masked = parts[0][0] + '***@' + parts[1]
    return masked

emails_text = "Contact alice@example.com or bob@test.org"
masked = re.sub(r"\S+@\S+", mask_email, emails_text)
print("\nFunction replacement (masking):")
print(masked)

# Count parameter (replace first N occurrences)
text3 = "cat bat cat mat cat"
replaced = re.sub(r"cat", "dog", text3, count=2)  # Replace first 2
print("\nLimited replacements:")
print(replaced)

# Practical: Cleaning phone numbers
phone_numbers = "(555) 123-4567, 555-987-6543, (888) 555-0000"
# Normalize to: 555-123-4567
normalized = re.sub(r"\(?(\d{3})\)?[- ]?(\d{3})-(\d{4})", r"\1-\2-\3", phone_numbers)
print("\nPhone number normalization:")
print(normalized)


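Performance Tip: If you use the same pattern many times, compile it once with re.compile() and reuse the resulting pattern object. The re module already caches recently used patterns, so the speedup is usually modest, but a compiled pattern skips the cache lookup on every call and gives the pattern a readable name.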
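
A minimal sketch of the compile-once approach follows. DATE_RE and the sample lines are names invented for this illustration; the pattern is the same date pattern used earlier.

```python
import re

# Compile once, reuse many times
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

lines = [
    "Invoice dated 2024-01-15",
    "No date on this line",
    "Shipped 2024-02-03, delivered 2024-02-07",
]

for line in lines:
    # A compiled pattern has the same methods as the module-level functions
    print(DATE_RE.findall(line))

first = DATE_RE.search(lines[0])
print(first.group(1))                      # 2024
print(DATE_RE.sub(r"\2/\3/\1", lines[0]))  # Invoice dated 01/15/2024
```

Note that the methods on a compiled pattern take the string as their first argument, since the pattern is already part of the object.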

Common Mistakes

1. Using match() instead of search()

# Wrong - match only works at start
import re

text = "Contact: support@example.com"
result = re.match(r"\S+@\S+", text)  # Returns None!
print(result)  # None - no match at start

# Correct - use search for anywhere in string
result = re.search(r"\S+@\S+", text)  # Finds email
print(result.group())  # 'support@example.com'

2. Not using raw strings for patterns

# Wrong - backslashes need escaping
import re

# This tries to match a backspace character, not a word boundary!
pattern = "\bword\b"  # \b is interpreted as backspace character
text = "a word here"
result = re.search(pattern, text)  # Won't work as expected

# Correct - use raw string
pattern = r"\bword\b"  # \b is word boundary
result = re.search(pattern, text)  # Works correctly
print(result.group())  # 'word'

3. Forgetting that search/match return Match objects or None

# Wrong - calling .group() on None crashes
import re

text = "No email here"
match = re.search(r"\S+@\S+", text)
email = match.group()  # AttributeError: 'NoneType' has no attribute 'group'

# Correct - check for None first
match = re.search(r"\S+@\S+", text)
if match:
    email = match.group()
    print(email)
else:
    print("No email found")
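
On Python 3.8 and later, an assignment expression (:=) can combine the search and the None check in a single condition. This is only a sketch; first_email is a hypothetical helper, not part of the re module.

```python
import re

def first_email(text):
    """Return the first email-like token in text, or None if there is none."""
    if (match := re.search(r"\S+@\S+", text)) is not None:
        return match.group()
    return None

print(first_email("Write to alice@example.com today"))  # alice@example.com
print(first_email("No email here"))                     # None
```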

4. Greedy vs non-greedy matching

# Wrong - greedy matching takes too much
import re

text = "<div>First</div> and <div>Second</div>"
# Greedy - matches from first < to last >
match = re.search(r"<div>.*</div>", text)
print(match.group())  # '<div>First</div> and <div>Second</div>' - too much!

# Correct - use non-greedy *?
match = re.search(r"<div>.*?</div>", text)
print(match.group())  # '<div>First</div>' - just first match
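
The same greedy versus non-greedy difference appears with any pair of delimiters. The sketch below uses quoted strings (the sample text is made up) and also shows a negated character class, a common alternative to *? that gives the same result here.

```python
import re

text = 'say "hello" and "goodbye"'  # hypothetical sample text

greedy = re.findall(r'".*"', text)      # .* runs to the last quote
lazy = re.findall(r'".*?"', text)       # .*? stops at the nearest closing quote
negated = re.findall(r'"[^"]*"', text)  # same result without relying on laziness

print(greedy)   # ['"hello" and "goodbye"']
print(lazy)     # ['"hello"', '"goodbye"']
print(negated)  # ['"hello"', '"goodbye"']
```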

Exercise: Extract Phone Numbers

Task: Create a function that extracts phone numbers from text. Handle both formats: (123) 456-7890 and 123-456-7890.

Requirements:

Import the re module
Create a function extract_phones(text) that finds all phone numbers
Use re.findall() to extract matches
Support formats: (123) 456-7890 and 123-456-7890
Return a list of all found phone numbers
Test with text containing multiple phone numbers


Solution

import re

def extract_phones(text):
    """Extract phone numbers in formats (123) 456-7890 or 123-456-7890."""
    # Pattern matches: 3-digit area code (parentheses optional), then 3 digits, a hyphen, and 4 digits
    pattern = r"\(?\d{3}\)?\s?-?\s?\d{3}-\d{4}"
    phones = re.findall(pattern, text)
    return phones


# Test the function
text = """
Contact us at (555) 123-4567 or 555-987-6543.
Our office is at 123-456-7890.
Call (888) 555-0000 for support.
"""

phones = extract_phones(text)
print("Found phone numbers:")
for phone in phones:
    print(f"  - {phone}")
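
The solution pattern is deliberately permissive: because each parenthesis is optional on its own, it also accepts unbalanced input such as "(555 123-4567". If that matters, one possible stricter variant uses alternation (|) inside a non-capturing group so that only the two exact formats match. STRICT_PHONE and extract_phones_strict are hypothetical names for this sketch, not part of the exercise.

```python
import re

# Either "(123) " or "123-" must come first, then "456-7890"
STRICT_PHONE = re.compile(r"(?:\(\d{3}\) |\d{3}-)\d{3}-\d{4}")

def extract_phones_strict(text):
    """Extract phone numbers written exactly as (123) 456-7890 or 123-456-7890."""
    return STRICT_PHONE.findall(text)

sample = "Good: (555) 123-4567, 555-987-6543. Bad: (555 123-4567, 555) 123-4567."
print(extract_phones_strict(sample))  # ['(555) 123-4567', '555-987-6543']
```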

Summary

re Module: Python's module for regular expression pattern matching
re.search(): Finds first match anywhere in string, returns Match object or None
re.match(): Matches only at string start, returns Match object or None
re.findall(): Finds all matches, returns list of strings or tuples (if groups)
re.sub(): Replaces matches with replacement text, supports group references like \1
Raw Strings: Use r"pattern" to avoid escaping backslashes
Groups: Use parentheses () to capture parts of matches, access with group()
Common Patterns: \d (digit), \w (word character), \s (whitespace), + (1+), * (0+), ? (0 or 1), {n,m} (range)
Compiled Regex: Use re.compile() for better performance when reusing patterns

What's Next?

Regular expressions are powerful tools for text processing! Next, we'll explore type hinting, which allows you to add type annotations to your Python code. While Python is dynamically typed, type hints improve code readability, enable better IDE support, and can catch errors with type checkers like mypy!
