Regular Expressions

Advanced ~30 min read

Regular expressions (regex) are powerful pattern matching tools that let you search, extract, and manipulate text using complex patterns. Python's re module provides functions for working with regex patterns. Master regex and you can validate input, extract data, find/replace text, and process structured information efficiently!

Basic Pattern Matching

The re module provides several functions for pattern matching. The most common are re.search() (finds first match), re.findall() (finds all matches), and re.match() (matches at string start).

Key Functions:
- re.search(pattern, string): Finds first match anywhere in string, returns Match object or None
- re.match(pattern, string): Matches only at string start, returns Match object or None
- re.findall(pattern, string): Finds all non-overlapping matches, returns list of strings (or of tuples when the pattern contains multiple groups)
- re.finditer(pattern, string): Finds all matches, returns iterator of Match objects
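
The four functions above can be compared side by side on one string. This is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "Order 66 shipped; order 77 pending"

# re.search: first match anywhere in the string
first = re.search(r"\d+", text)
print(first.group())  # 66

# re.match: succeeds only if the pattern matches at position 0
at_start = re.match(r"\d+", text)
print(at_start)  # None, because the text starts with "Order"

# re.findall: list of all non-overlapping matches as strings
numbers = re.findall(r"\d+", text)
print(numbers)  # ['66', '77']

# re.finditer: iterator of Match objects, which also carry positions
spans = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]
print(spans)  # [('66', 6), ('77', 24)]
```

Use finditer() rather than findall() when you need positions or groups for each match, not just the matched text.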

Common Regex Patterns

Regex patterns use special characters and sequences to match text. Here are the most commonly used patterns:
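
As a rough illustration, the sketch below exercises \d, \w, \s and the quantifiers +, *, ? and {n}. The sample sentence and variable names are invented for this example.

```python
import re

sample = "Call 555-0199 or 5550123 before 9 pm"

# \d+ : one or more digits
digits = re.findall(r"\d+", sample)
print(digits)  # ['555', '0199', '5550123', '9']

# \d{4} with word boundaries: exactly four digits standing alone
four_digits = re.findall(r"\b\d{4}\b", sample)
print(four_digits)  # ['0199']

# -? : the dash is optional (zero or one)
local_numbers = re.findall(r"\d{3}-?\d{4}", sample)
print(local_numbers)  # ['555-0199', '5550123']

# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", sample)
print(words)

# \s* : zero or more whitespace characters between the digit and "pm"
times = re.findall(r"\d\s*pm", sample)
print(times)  # ['9 pm']
```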

Pro Tip: Use raw strings (r"pattern") for regex patterns so that backslashes reach the regex engine unchanged. In a normal string, "\b" is turned into a backspace character; in a raw string, r"\b" stays as the two characters \b, which regex reads as a word boundary!

Groups and Capturing

Parentheses () create groups that capture parts of the match. You can access captured groups using the group() method on Match objects, or use them in substitutions.
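
Here is a small sketch of capturing groups in practice. The log line is invented, and the named-group syntax (?P<name>...) goes slightly beyond what the text above describes; it is included as an optional extra.

```python
import re

log_line = "2024-03-15 ERROR disk full"

# Three groups for the date parts, one for the log level
m = re.search(r"(\d{4})-(\d{2})-(\d{2}) (\w+)", log_line)
if m:
    print(m.group(0))  # whole match: 2024-03-15 ERROR
    print(m.group(1))  # first group: 2024
    print(m.groups())  # all groups: ('2024', '03', '15', 'ERROR')

# Named groups make the intent clearer than numeric indexes
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", log_line)
print(named.group("year"), named.group("month"))  # 2024 03

# With several groups, findall returns a list of tuples
pairs = re.findall(r"(\w+)=(\d+)", "width=80 height=24")
print(pairs)  # [('width', '80'), ('height', '24')]
```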


Substitution with re.sub

The re.sub() function replaces matches with replacement text. You can use captured groups in the replacement string using \1, \2, etc.
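
A short sketch of re.sub() with group references follows. The sample strings and the helper function double are hypothetical, chosen only to show the mechanics.

```python
import re

dates = "Due 2024-03-15, shipped 2024-04-01"

# \1, \2, \3 refer to the captured year, month, and day
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", dates)
print(reordered)  # Due 15/03/2024, shipped 01/04/2024

# count limits how many matches are replaced
masked = re.sub(r"\d", "#", "PIN 1234", count=2)
print(masked)  # PIN ##34

# The replacement can also be a function that receives each Match object
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")
print(doubled)  # 6 apples and 20 pears
```

Note that the replacement string is written as a raw string too, so that \1 is not read as an escape sequence by Python.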

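Performance Tip: If you use the same pattern many times, for example inside a loop, compile it once with re.compile() and reuse the resulting pattern object. The re module caches recently used patterns, so the speedup is often modest, but a compiled pattern skips that lookup on every call and gives the pattern a readable name!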
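
A minimal sketch of compiling once and reusing the pattern object. DATE_RE and the sample lines are hypothetical names and data for illustration.

```python
import re

# Compile once, outside the loop
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

lines = [
    "2024-03-15 backup started",
    "no date on this line",
    "finished 2024-03-16",
]

found = []
for line in lines:
    # Pattern objects have the same methods as the module: search, match, findall, sub
    m = DATE_RE.search(line)
    if m:
        found.append(m.group())

print(found)  # ['2024-03-15', '2024-03-16']

both = DATE_RE.findall("2024-01-01 to 2024-12-31")
print(both)  # ['2024-01-01', '2024-12-31']
```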

Common Mistakes

1. Using match() instead of search()

# Wrong - match only works at start
import re

text = "Contact: user@example.com"  # placeholder address
result = re.match(r"\S+@\S+", text)  # Returns None!
print(result)  # None - no match at start

# Correct - use search for anywhere in string
result = re.search(r"\S+@\S+", text)  # Finds email
print(result.group())  # 'user@example.com'

2. Not using raw strings for patterns

# Wrong - backslashes need escaping
import re

# This tries to match a backspace character, not a word boundary!
pattern = "\bword\b"  # \b is interpreted as backspace character
text = "a word here"
result = re.search(pattern, text)  # Returns None - no backspace in text

# Correct - use raw string
pattern = r"\bword\b"  # \b is word boundary
result = re.search(pattern, text)  # Works correctly
print(result.group())  # 'word'

3. Forgetting that search/match return Match objects or None

# Wrong - calling .group() on None crashes
import re

text = "No email here"
match = re.search(r"\S+@\S+", text)
email = match.group()  # AttributeError: 'NoneType' object has no attribute 'group'

# Correct - check for None first
match = re.search(r"\S+@\S+", text)
if match:
    email = match.group()
    print(email)
else:
    print("No email found")

4. Greedy vs non-greedy matching

# Wrong - greedy matching takes too much
import re

text = "<div>First</div> and <div>Second</div>"

# Greedy - matches from first < to last >
match = re.search(r"<div>.*</div>", text)
print(match.group())  # '<div>First</div> and <div>Second</div>' - too much!

# Correct - use non-greedy *?
match = re.search(r"<div>.*?</div>", text)
print(match.group())  # '<div>First</div>' - just first match

Exercise: Extract Phone Numbers

Task: Create a function that extracts phone numbers from text. Handle both formats: (123) 456-7890 and 123-456-7890.

Requirements:

  • Import the re module
  • Create a function extract_phones(text) that finds all phone numbers
  • Use re.findall() to extract matches
  • Support formats: (123) 456-7890 and 123-456-7890
  • Return a list of all found phone numbers
  • Test with text containing multiple phone numbers

Solution

import re

def extract_phones(text):
    """Extract phone numbers in formats (123) 456-7890 or 123-456-7890."""
    # Area code is either "(123)" plus an optional space, or "123-"; then 456-7890
    pattern = r"(?:\(\d{3}\)\s?|\d{3}-)\d{3}-\d{4}"
    phones = re.findall(pattern, text)
    return phones


# Test the function
text = """
Contact us at (555) 123-4567 or 555-987-6543.
Our office is at 123-456-7890.
Call (888) 555-0000 for support.
"""

phones = extract_phones(text)
print("Found phone numbers:")
for phone in phones:
    print(f"  - {phone}")

Summary

  • re Module: Python's module for regular expression pattern matching
  • re.search(): Finds first match anywhere in string, returns Match object or None
  • re.match(): Matches only at string start, returns Match object or None
  • re.findall(): Finds all matches, returns list of strings or tuples (if groups)
  • re.sub(): Replaces matches with replacement text, supports group references like \1
  • Raw Strings: Use r"pattern" to avoid escaping backslashes
  • Groups: Use parentheses () to capture parts of matches, access with group()
  • Common Patterns: \d (digit), \w (word character), \s (whitespace), + (1 or more), * (0 or more), ? (0 or 1), {n,m} (between n and m repetitions)
  • Compiled Regex: Use re.compile() to build a pattern once and reuse it, which can help performance in loops

What's Next?

Regular expressions are powerful tools for text processing! Next, we'll explore type hinting, which allows you to add type annotations to your Python code. While Python is dynamically typed, type hints improve code readability, enable better IDE support, and can catch errors with type checkers like mypy!