# Transformation

Table

| Symbol | Description | Example |
| --- | --- | --- |
| `.` | Matches any single character except newline | `a.c` matches "abc", "axc" |
| `^` | Matches the start of a string | `^abc` matches "abc" in "abcxyz" |
| `$` | Matches the end of a string | `abc$` matches "abc" in "xyzabc" |
| `[]` | Matches any one of the characters inside the brackets | `[aeiou]` matches "a", "e", "i" |
| `\|` | OR operator, matches either the left or right expression | `cat\|dog` matches "cat" or "dog" |
| `()` | Groups expressions, captures the match | `(abc)+` matches "abc" or "abcabc" |
| `*` | Matches 0 or more of the preceding element | `a*b` matches "b", "ab", "aaab" |
| `+` | Matches 1 or more of the preceding element | `a+b` matches "ab", "aaab", but not "b" |
| `?` | Matches 0 or 1 of the preceding element | `a?b` matches "b" or "ab" |
| `{n}` | Matches exactly n occurrences of the preceding element | `a{3}` matches "aaa" |
| `{n,}` | Matches n or more occurrences of the preceding element | `a{2,}` matches "aa", "aaa", "aaaa" |
| `{n,m}` | Matches between n and m occurrences of the preceding element | `a{2,4}` matches "aa", "aaa", or "aaaa" |
| `\d` | Matches any digit (0-9) | `\d{3}` matches "123" |
| `\D` | Matches any non-digit | `\D` matches "a", "b", etc. |
| `\w` | Matches any word character (alphanumeric + underscore) | `\w+` matches "hello", "word_123" |
| `\W` | Matches any non-word character | `\W` matches "!", "#", etc. |
| `\s` | Matches any whitespace character (space, tab, newline) | `\s+` matches one or more spaces or tabs |
| `\S` | Matches any non-whitespace character | `\S` matches "a", "1", etc. |
| `\b` | Matches a word boundary | `\bword\b` matches "word", but not "sword" |
| `\B` | Matches a position that is not a word boundary | `\Bword\B` matches "word" in "swordfish", but not in "sword" or "wording" |
| `\\` | Matches a literal backslash. In a non-raw string literal the backslash itself must be doubled, so `"\\."` is the regex `\.` | `C:\\temp` matches "C:\temp" |
| `\` | Escape character for metacharacters | `\.` matches a literal period |
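
A quick way to sanity-check entries like these is to run them through a regex engine. The sketch below uses Python's standard `re` module as a local stand-in; Spark evaluates patterns with Java's regex engine, which should agree with `re` on these basic constructs, but that equivalence is an assumption here rather than something verified against Spark. The `cases` list and its sample strings are illustrative.

```python
import re

# (pattern, text, expected match) triples based on the symbols above.
cases = [
    (r"a.c", "axc", "axc"),
    (r"^abc", "abcxyz", "abc"),
    (r"abc$", "xyzabc", "abc"),
    (r"(abc)+", "abcabc", "abcabc"),
    (r"a{2,4}", "aaaaa", "aaaa"),      # greedy: takes at most 4 of the 5 a's
    (r"\bword\b", "a word here", "word"),
    (r"\Bword\B", "swordfish", "word"),
    (r"\.", "a.b", "."),
]

for pattern, text, expected in cases:
    m = re.search(pattern, text)
    assert m is not None and m.group(0) == expected, (pattern, text)

# Negative cases: none of these should match.
no_match = [
    re.search(r"\bword\b", "sword"),   # no boundary between "s" and "w"
    re.search(r"\Bword\B", "sword"),   # end of string is a boundary
    re.search(r"a+b", "b"),            # + needs at least one "a"
]
assert all(m is None for m in no_match)
```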

1. rlike()

from pyspark.sql.functions import col

df = df.filter(col("value").rlike(r"^\d{3}"))  # Filters rows where 'value' starts with 3 digits
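
`rlike()` keeps a row when the pattern is found anywhere in the value, so the `^` anchor is what restricts the filter to values that start with three digits. The sketch below imitates that behaviour locally with `re.search` from Python's standard library instead of running Spark; the `values` list is made up for illustration, and treating `re.search` as equivalent to `rlike()` is an assumption that holds for simple patterns like this one.

```python
import re

# Hypothetical sample values standing in for the 'value' column.
values = ["123-abc", "12-abc", "abc123", "9876"]

# Anchored: only values that begin with three digits pass the filter.
starts_with_3_digits = [v for v in values if re.search(r"^\d{3}", v)]

# Unanchored: three consecutive digits anywhere in the value are enough.
contains_3_digits = [v for v in values if re.search(r"\d{3}", v)]

print(starts_with_3_digits)  # ['123-abc', '9876']
print(contains_3_digits)     # ['123-abc', 'abc123', '9876']
```

To require that the whole value matches, anchor both ends, for example `r"^\d{3}$"`.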

2. regexp_replace()

from pyspark.sql.functions import regexp_replace, col

df = df.withColumn("cleaned_value", regexp_replace(col("value"), r"\d", "*"))  # Replace digits with '*'
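
`regexp_replace()` substitutes every match in the value, not just the first one. The sketch below shows the same effect locally with `re.sub` from Python's standard library; the `values` list is made up for illustration, and using `re.sub` as a stand-in for Spark's function is an assumption that holds for a simple pattern like `\d`.

```python
import re

# Hypothetical sample values standing in for the 'value' column.
values = ["order 42", "no digits", "a1b22c333"]

# Each digit is replaced individually, so runs of digits become runs of '*'.
cleaned = [re.sub(r"\d", "*", v) for v in values]

print(cleaned)  # ['order **', 'no digits', 'a*b**c***']
```

One difference to keep in mind when porting patterns: Spark follows Java's replacement syntax, where a captured group is referenced as `$1`, while Python's `re.sub` uses `\1`.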