Scan for SSNs and other kinds of PII by content #42

@simsong

I learned about this program recently at a data science conference.

From examining your source code, it seems that you mostly detect possible PII by column name rather than by examining the content. With some work, you could also scan for PII by content. For example, you could use regular expressions that scan for SSNs, phone numbers, email addresses, and the like.
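To make the idea concrete, here is a minimal sketch of content-based scanning, written in Python only for illustration; the same patterns could be ported to R. The patterns are simplified assumptions of mine (US-style SSNs and phone numbers, a loose email match), not the expressions from bulk_extractor, and the names `PII_PATTERNS`, `scan_values`, `flag_columns`, and the 0.5 threshold are hypothetical.

```python
import re

# Illustrative patterns only: simplified assumptions, NOT the regular
# expressions shipped with bulk_extractor.
PII_PATTERNS = {
    "ssn": re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)"),
    "phone": re.compile(r"(?<!\d)(?:\(\d{3}\)\s*|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def scan_values(values):
    """Return, per PII type, the fraction of non-empty values containing a match."""
    texts = [str(v) for v in values if v is not None and str(v).strip()]
    if not texts:
        return {name: 0.0 for name in PII_PATTERNS}
    return {
        name: sum(1 for t in texts if pattern.search(t)) / len(texts)
        for name, pattern in PII_PATTERNS.items()
    }


def flag_columns(table, threshold=0.5):
    """Flag columns whose values match a PII pattern often enough.

    table: dict mapping column name -> list of values.
    threshold: hypothetical cutoff on the matching fraction.
    Returns a dict mapping each flagged column to its matching PII types.
    """
    flagged = {}
    for column, values in table.items():
        fractions = scan_values(values)
        hits = [name for name, frac in fractions.items() if frac >= threshold]
        if hits:
            flagged[column] = hits
    return flagged
```

Scanning a fraction of the values rather than requiring every value to match lets a column be flagged even when it contains some blanks or malformed entries, and it works regardless of what the column is named.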

You can find many such regular expressions in the bulk_extractor open source project, which is a named entity recognizer used for processing digital evidence. The bulk_extractor program uses flex as a high-speed regular expression parser. Without much work, you could take the bulk_extractor shared library and call it from R directly. Alternatively, you could manually extract the regular expressions from its files and add them directly here. That second approach would be slower but easier to maintain.

Here are the files of interest:
