ruby - Extract and replace emails and phone numbers from a string



I need to hide emails and phone numbers in a string. Replacing well-formatted emails and numbers is easy with a regex, but what about other formats? Here is an example:


Email addresses like email@example.com or email AT example DOT com should be replaced. Phone numbers like 347 323 4567 or three four seven, three two three four five six seven should also be replaced.


Email addresses like (email hidden) or (email hidden) should be replaced. Phone numbers like (phone hidden) or (phone hidden) should also be replaced.

Airbnb's messaging system is really good at doing that. Apparently this is how they used to do it:

It looks for @ symbols, spellings of “this is me AT whatever DOT com” and series of numbers with at least 7 digits (telephone number) with some sensitivity to separators.

What would be the best way to do the same thing? Writing complex regexes? Using a natural language processing library?

1 Answer: 

This isn't going to be easy to do in code, and it can have unpleasant consequences for your users and, in turn, for your customer support people.

Phone numbers can be entered in a large number of formats if you allow for international numbers.

123-456-7890 could be a phone number, or it could be a simple subtraction like x=123-456-7890. Imagine how irritated your user will be when they get x=(phone hidden).
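To make that concrete, here is a minimal sketch of the "at least 7 digits with some sensitivity to separators" heuristic. The constant `NAIVE_PHONE` and the method `hide_phones` are names I made up for illustration, and the pattern is only an assumption about what such a filter might look like, not what any real site uses.

```ruby
# Hypothetical heuristic: a run of at least 7 digits, allowing one
# space, dot, or hyphen between consecutive digits.
NAIVE_PHONE = /\d(?:[ .-]?\d){6,}/

# Replace anything that looks like a phone number with a placeholder.
def hide_phones(text)
  text.gsub(NAIVE_PHONE, "(phone hidden)")
end

puts hide_phones("Call me at 347 323 4567")  # Call me at (phone hidden)
puts hide_phones("x=123-456-7890")           # x=(phone hidden)
```

The second call is the false positive described above: the pattern has no way to tell a subtraction from a phone number. It also does nothing for numbers spelled out in words.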

Email addresses are an even harder problem because they can vary in all sorts of ways. You can get the specification for email addresses by reading RFC 2822, and there's always the regex in Perl's Mail::RFC822::Address module. Most people only try to validate a single address with a pattern; merely locating addresses inside free text can be ugly.

In either case, there are regex patterns that attempt to do it but they all fail when pushed hard.
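As a rough illustration of how quickly such patterns break, here is a sketch covering plain addresses and the "AT ... DOT" spelling. `PLAIN_EMAIL`, `SPELLED_EMAIL`, and `hide_emails` are hypothetical names, and both patterns are deliberately simple assumptions that come nowhere near RFC 2822 coverage.

```ruby
# Hypothetical patterns for illustration only.
# something@host.tld, with one or more dotted parts after the host.
PLAIN_EMAIL   = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/
# "name AT host DOT tld", case-insensitive, whitespace-separated.
SPELLED_EMAIL = /[\w.+-]+\s+at\s+[\w-]+(?:\s+dot\s+[\w-]+)+/i

# Replace anything that looks like an email address with a placeholder.
def hide_emails(text)
  text.gsub(PLAIN_EMAIL, "(email hidden)")
      .gsub(SPELLED_EMAIL, "(email hidden)")
end

puts hide_emails("Write to email@example.com")   # Write to (email hidden)
puts hide_emails("email AT example DOT com")     # (email hidden)
puts hide_emails("email [at] example [dot] com") # unchanged: trivially evaded
puts hide_emails("Look at polka dot dresses")    # (email hidden): false positive
```

The last two calls show both failure modes: a pair of brackets is enough to slip past the filter, while an innocent sentence gets mangled.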

To me this sounds like an ill-conceived idea made by an unknowing executive, similar to the request

Write a filter that removes all dirty words.

that I once received. (Yeah, right. From all written and spoken languages on earth, or merely man's desire to use such words?) A filter like this is easy to work around, and for a lot of people defeating it will be a challenge worth taking on for its own sake.