App Version Implemented
v1.5
Details
This is the PII module. The goal is to redact SSN, email, and date-of-birth (DoB) strings found in files.
Notes
Priority
Medium
Status
Done
Social Security Numbers (SSN)
- Regex Pattern:
r'\b\d{3}-\d{2}-\d{4}\b'
- How it works:
\b: This is a "word boundary." It ensures that we are matching a whole word or number, preventing it from finding an SSN inside a longer string of digits.
\d{3}: Looks for exactly three digits (\d means any digit, {3} means three of them).
-: Matches the literal hyphen character.
\d{2}: Looks for exactly two digits.
-: Matches the second hyphen.
\d{4}: Looks for exactly four digits.
\b: Another word boundary to mark the end.
This pattern specifically targets the classic XXX-XX-XXXX format for Social Security Numbers.
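As a minimal sketch of how this pattern could drive redaction, the snippet below compiles it and substitutes a placeholder for each match. The function name redact_ssn and the placeholder text are assumptions for illustration and may not match the module's actual code.

```python
import re

# Pattern copied verbatim from this section.
SSN_PATTERN = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')


def redact_ssn(text, placeholder='[REDACTED-SSN]'):
    """Replace every XXX-XX-XXXX match in text with placeholder.

    Hypothetical helper: the name and placeholder are illustrative only.
    """
    return SSN_PATTERN.sub(placeholder, text)


if __name__ == '__main__':
    # The second number is embedded in longer digit runs, so the word
    # boundaries keep the pattern from matching it.
    sample = 'SSN 123-45-6789 on file; order 9123-45-67890 is not an SSN.'
    print(redact_ssn(sample))
```

The pattern checks format only, so any digits arranged as XXX-XX-XXXX are redacted whether or not they form a valid SSN.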
Dates of Birth (DOB)
- Regex Pattern:
r'\b(0[1-9]|1[0-2])[-/](0[1-9]|[12]\d|3[01])[-/](\d{4}|\d{2})\b'
- How it works:
(0[1-9]|1[0-2]): This part looks for the month. It matches a 0 followed by a digit from 1 to 9 (for 01 to 09) OR a 1 followed by a digit from 0 to 2 (for 10, 11, 12).
[-/]: Matches either a hyphen or a forward slash as the separator.
(0[1-9]|[12]\d|3[01]): This handles the day. It looks for a 0 followed by 1-9 OR a 1 or 2 followed by any digit OR a 3 followed by a 0 or 1. This covers days from 01 to 31.
[-/]: Matches the second separator.
(\d{4}|\d{2}): This finds the year, matching either a four-digit year or a two-digit year.
This regex is flexible enough to find dates like MM-DD-YYYY, MM/DD/YYYY, MM-DD-YY, and MM/DD/YY.
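The sketch below applies the date pattern in the same way. The helper name redact_dob and the placeholder are assumptions, not the module's confirmed API. The pattern has two side effects: it does not check that the day exists in the given month, and the two separators are matched independently, so a mixed form such as 03-15/1990 is also redacted.

```python
import re

# Pattern copied verbatim from this section.
DOB_PATTERN = re.compile(
    r'\b(0[1-9]|1[0-2])[-/](0[1-9]|[12]\d|3[01])[-/](\d{4}|\d{2})\b'
)


def redact_dob(text, placeholder='[REDACTED-DOB]'):
    """Replace every MM-DD-YYYY style match in text with placeholder.

    Hypothetical helper: the name and placeholder are illustrative only.
    """
    return DOB_PATTERN.sub(placeholder, text)


if __name__ == '__main__':
    samples = [
        'Born 03/15/1990',      # four-digit year, slashes
        'DOB: 12-01-85',        # two-digit year, hyphens
        'Mixed 03-15/1990',     # separators may differ
        'Month 13/15/1990',     # month 13 is rejected, left unchanged
        'Short 3/15/1990',      # single-digit month is not matched
    ]
    for line in samples:
        print(redact_dob(line))
```

Because the pattern cannot tell a birth date from any other date in these formats, every matching date in a file would be redacted.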
Emails
- Regex Pattern:
r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
- How it works:
[A-Za-z0-9._%+-]+: This matches the username part of the email. It allows one or more (+) uppercase letters, lowercase letters, numbers, and the special characters ._%+-.
@: Matches the literal "@" symbol.
[A-Za-z0-9.-]+: This matches the domain name (like "gmail" or "yahoo"). It allows one or more letters, numbers, dots, and hyphens.
\.: Matches the literal dot before the top-level domain (like ".com").
[A-Z|a-z]{2,}: This looks for the top-level domain (TLD). It requires at least two ({2,}) uppercase or lowercase letters. Note that | inside a character class is a literal pipe rather than "or", so this class also accepts the | character; [A-Za-z]{2,} would match letters only.
This pattern effectively finds most standard email address formats.
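A minimal sketch of email redaction with this pattern follows. The helper name redact_email and the placeholder are assumptions for illustration rather than the module's confirmed interface.

```python
import re

# Pattern copied verbatim from this section.
EMAIL_PATTERN = re.compile(
    r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
)


def redact_email(text, placeholder='[REDACTED-EMAIL]'):
    """Replace every email-shaped match in text with placeholder.

    Hypothetical helper: the name and placeholder are illustrative only.
    """
    return EMAIL_PATTERN.sub(placeholder, text)


if __name__ == '__main__':
    samples = [
        'Contact jane.doe+news@mail.example.com today',  # subdomains are fine
        'Mail a@b.co.',          # the trailing period is left in place
        'Host user@localhost',   # no dot before a TLD, so no match
    ]
    for line in samples:
        print(redact_email(line))
```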