REGEXP

Table

Symbol	Description	Example
`.`	Matches any single character except newline	`a.c` matches "abc", "axc"
`^`	Matches the start of a string	`^abc` matches "abc" in "abcxyz"
`$`	Matches the end of a string	`abc$` matches "abc" in "xyzabc"
`[]`	Matches any one of the characters inside the brackets	`[aeiou]` matches "a", "e", "i"
`|`	OR operator, matches either the left or right expression	`cat|dog` matches "cat" or "dog"
`()`	Groups expressions, captures the match	`(abc)+` matches "abc" or "abcabc"
`*`	Matches 0 or more of the preceding element	`a*b` matches "b", "ab", "aaab"
`+`	Matches 1 or more of the preceding element	`a+b` matches "ab", "aaab", but not "b"
`?`	Matches 0 or 1 of the preceding element	`a?b` matches "b" or "ab"
`{n}`	Matches exactly `n` occurrences of the preceding element	`a{3}` matches "aaa"
`{n,}`	Matches `n` or more occurrences of the preceding element	`a{2,}` matches "aa", "aaa", "aaaa"
`{n,m}`	Matches between `n` and `m` occurrences of the preceding element	`a{2,4}` matches "aa", "aaa", or "aaaa"
`\d`	Matches any digit (0-9)	`\d{3}` matches "123"
`\D`	Matches any non-digit	`\D` matches "a", "b", etc.
`\w`	Matches any word character (alphanumeric + underscore)	`\w+` matches "hello", "word_123"
`\W`	Matches any non-word character	`\W` matches "!", "#", etc.
`\s`	Matches any whitespace character (space, tab, newline)	`\s+` matches one or more spaces or tabs
`\S`	Matches any non-whitespace character	`\S` matches "a", "1", etc.
`\b`	Matches a word boundary	`\bword\b` matches "word", but not "sword"
`\B`	Matches a position that is not a word boundary	`\Bword\B` matches "word" in "swordsman", but not in "sword" or "wording"
`\\`	Matches a literal backslash	`a\\b` matches "a\b"
`\`	Escape character for metacharacters	`\.` matches a literal period
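
A few rows of the table can be checked locally with Python's built-in `re` module. This is only a sketch: `first_match` and the `checks` list are hypothetical names chosen for illustration, and Spark evaluates patterns with Java's regex engine, which should agree with Python on these basic symbols but can differ on more advanced syntax.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# (pattern, input text) pairs taken from the table above
checks = [
    (r"a.c", "axc"),             # . matches any single character
    (r"^abc", "abcxyz"),         # ^ anchors at the start
    (r"abc$", "xyzabc"),         # $ anchors at the end
    (r"a{2,4}", "aaaaa"),        # greedy: takes at most four a's
    (r"\bword\b", "sword"),      # no boundary before "w", so no match
    (r"\Bword\B", "swordsman"),  # "word" sits inside a longer word
    (r"\.", "a.b"),              # escaped dot matches a literal period
]

for pattern, text in checks:
    print(f"{pattern!r:12} on {text!r:12} -> {first_match(pattern, text)!r}")
```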

1. `rlike()`

Purpose: Checks whether a string column matches a given regular expression pattern.
Usage: Returns a Boolean column that is true where the pattern matches any part of the string, and false otherwise. Anchor the pattern with `^` and `$` to require a match on the whole string.
Syntax: -

from pyspark.sql.functions import col

df = df.filter(col("value").rlike(r"^\d{3}"))  # Filters rows where 'value' starts with 3 digits
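
The effect of anchoring can be previewed without a Spark session. The sketch below uses a hypothetical helper, `rlike_sketch`, built on Python's `re.search`. It assumes that `rlike()` looks for the pattern anywhere in the string and yields null for null input; the sample values are made up for illustration.

```python
import re

def rlike_sketch(value, pattern):
    """Hypothetical stand-in for rlike(): True if pattern occurs anywhere in value."""
    if value is None:
        return None  # assumption: a null input gives a null result, which filter() drops
    return re.search(pattern, value) is not None

values = ["123abc", "abc123", "12", None]

# Anchored at the start: only strings beginning with three digits
starts_with = [rlike_sketch(v, r"^\d{3}") for v in values]
# Unanchored: three digits anywhere in the string
anywhere = [rlike_sketch(v, r"\d{3}") for v in values]
# Anchored at both ends: the whole string must be exactly three digits
whole = [rlike_sketch(v, r"^\d{3}$") for v in values]

print(starts_with)
print(anywhere)
print(whole)
```

Dropping the `^` is an easy mistake to miss, because the filter still runs and simply keeps more rows than intended.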

2. `regexp_replace()`

Purpose: Replaces every substring of a string column that matches a given regular expression with a specified replacement string.
Usage: Returns a new column with the replaced values.
Syntax: -

from pyspark.sql.functions import regexp_replace, col

df = df.withColumn("cleaned_value", regexp_replace(col("value"), r"\d", "*"))  # Replace digits with '*'
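
A pattern can be tried on sample strings before it is run on a full DataFrame. The sketch below uses a hypothetical helper, `regexp_replace_sketch`, built on Python's `re.sub`. It assumes that every match is replaced and that null input stays null. It treats the replacement as plain text, so it does not model group references such as `$1`, which Java's regex engine supports.

```python
import re

def regexp_replace_sketch(value, pattern, replacement):
    """Hypothetical stand-in for regexp_replace(): replace every match of pattern."""
    if value is None:
        return None  # assumption: a null input stays null
    # The lambda inserts the replacement literally, with no group expansion
    return re.sub(pattern, lambda m: replacement, value)

per_digit = regexp_replace_sketch("order 42", r"\d", "*")    # each digit replaced separately
per_run = regexp_replace_sketch("order 42", r"\d+", "*")     # a run of digits replaced once
collapsed = regexp_replace_sketch("a  b\t c", r"\s+", " ")   # whitespace runs become one space

print(per_digit)
print(per_run)
print(collapsed)
```

The difference between `\d` and `\d+` is worth noting: the first masks digits one by one and keeps the original length, while the second collapses each run of digits into a single replacement.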
