How To Remove Everything Except Words And Emoji From Text?
As part of a text classification problem, I am trying to clean a text dataset. So far I have been removing everything except text: punctuation, numbers, emoji, everything was removed. No
Solution 1:
You can merge the two steps into one by using a single regex with a lambda expression inside re.sub, like this:
import re

# Emoji ranges to keep: pictographs, emoticons, transport symbols, misc symbols, dingbats
emoji_pat = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
shrink_whitespace_reg = re.compile(r'\s{2,}')

def clean_text(raw_text):
    reg = re.compile(r'({})|[^a-zA-Z]'.format(emoji_pat)) # line a
    # Keep a captured emoji (padded with spaces); replace any other non-letter with a space
    result = reg.sub(lambda x: ' {} '.format(x.group(1)) if x.group(1) else ' ', raw_text)
    # Collapse runs of whitespace into a single space
    return shrink_whitespace_reg.sub(' ', result)

text = 'I am very #happy man! but😘😘 my wife😞 is not 😊😘. 99/33'
print('Cleaned text: ' + clean_text(text))
# => Cleaned text: I am very happy man but 😘 😘 my wife 😞 is not 😊 😘
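One detail that may matter for downstream tokenization: with the sample input, the trailing ". 99/33" is turned into spaces and then collapsed, so the cleaned string ends with a single space. Below is a minimal sketch of a variant that compiles both patterns once at import time and strips the ends. The names clean_text_stripped, EMOJI_CLASS, TOKEN_RE and SPACES_RE are hypothetical and not part of the original answer.

```python
import re

# Same emoji ranges as in the answer; the constant names here are illustrative.
EMOJI_CLASS = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
TOKEN_RE = re.compile(r'({})|[^a-zA-Z]'.format(EMOJI_CLASS))
SPACES_RE = re.compile(r'\s{2,}')


def clean_text_stripped(raw_text):
    """Keep ASCII letters and emoji, then trim leading/trailing whitespace."""
    # Emoji are kept and padded with spaces; every other non-letter becomes a space.
    spaced = TOKEN_RE.sub(
        lambda m: ' {} '.format(m.group(1)) if m.group(1) else ' ',
        raw_text,
    )
    # Collapse runs of whitespace, then drop the leftover space at either end.
    return SPACES_RE.sub(' ', spaced).strip()


sample = 'I am very #happy man! but😘😘 my wife😞 is not 😊😘. 99/33'
print(repr(clean_text_stripped(sample)))
# => 'I am very happy man but 😘 😘 my wife 😞 is not 😊 😘'
```

Compiling at module level is only a small tidy-up, since re caches recently compiled patterns anyway; the .strip() call is the behavioral difference.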
Explanation:
- The first regex will look like ([\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF])|[^a-zA-Z] and will either match and capture an emoji into Group 1, or just match any char other than an ASCII letter. If an emoji was captured (see if x.group(1) inside the lambda), the emoji is returned enclosed with spaces on both sides; otherwise, a space replaces the non-letter.
- The \s{2,} pattern matches 2 or more whitespace chars, and shrink_whitespace_reg.sub(' ', result) replaces each such chunk with a single space.
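To see which branch of the alternation fires for each character, the matches can be inspected directly. This is only an illustrative sketch: describe_matches, EMOJI_CLASS and PATTERN are hypothetical names, and the pattern is the same one built in the answer.

```python
import re

# Same emoji ranges as in the answer; the names below are illustrative only.
EMOJI_CLASS = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
PATTERN = re.compile(r'({})|[^a-zA-Z]'.format(EMOJI_CLASS))


def describe_matches(text):
    """Return (matched_text, branch) pairs for every match in text."""
    pairs = []
    for m in PATTERN.finditer(text):
        # Group 1 is set only when the emoji alternative matched.
        branch = 'emoji' if m.group(1) else 'non-letter'
        pairs.append((m.group(0), branch))
    return pairs


for matched, branch in describe_matches('ok!😊 7'):
    print(repr(matched), '->', branch)
# => '!' -> non-letter
# => '😊' -> emoji
# => ' ' -> non-letter
# => '7' -> non-letter
```

The letters o and k never match, which is why they pass through re.sub untouched. The emoji alternative is listed first, so an emoji is captured into Group 1 before the [^a-zA-Z] branch is tried.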