
How To Remove Everything Except Words And Emoji From Text?

As part of a text classification problem, I am trying to clean a text dataset. So far I was removing everything except text: punctuation, numbers, and emoji were all removed. No

Solution 1:

You may join the two steps into one using a single regex and a lambda expression inside a re.sub like this:

import re

emoji_pat = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
shrink_whitespace_reg = re.compile(r'\s{2,}')

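def clean_text(raw_text):
    # Group 1 captures an emoji; the alternative matches any other non-letter.
    reg = re.compile(r'({})|[^a-zA-Z]'.format(emoji_pat)) # line a
    # Keep a captured emoji (padded with spaces); replace anything else with a space.
    result = reg.sub(lambda x: ' {} '.format(x.group(1)) if x.group(1) else ' ', raw_text)
    # Collapse the runs of spaces produced by the substitution.
    return shrink_whitespace_reg.sub(' ', result)

text = 'I am very #happy man! but😘😘 my wife😞 is not 😊😘. 99/33'
print('Cleaned text: ' + clean_text(text))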
# => Cleaned text: I am very happy man but 😘 😘 my wife 😞 is not 😊 😘


Explanation:

  • The first regex will look like ([\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF])|[^a-zA-Z]. It either matches an emoji and captures it into Group 1, or matches any char other than an ASCII letter without capturing anything. If an emoji was captured (see if x.group(1) inside the lambda), it is returned enclosed with spaces on both sides; otherwise, a single space replaces the non-letter.
  • The \s{2,} pattern matches two or more consecutive whitespace chars, and shrink_whitespace_reg.sub(' ', result) replaces each such run with a single space.
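
To see the two branches of the alternation in action, here is a small sketch. It reuses the same character ranges and patterns as the answer; the names EMOJI_PAT, TOKEN_RE, SPACES_RE, describe_matches and clean_text_stripped are hypothetical and introduced only for illustration. The second helper is an assumed variant that also calls strip(), since the original function leaves a trailing space when the input ends with non-letters.

```python
import re

# Same character ranges as in the answer above.
EMOJI_PAT = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
TOKEN_RE = re.compile(r'({})|[^a-zA-Z]'.format(EMOJI_PAT))
SPACES_RE = re.compile(r'\s{2,}')


def describe_matches(raw_text):
    """Hypothetical helper: list each match and which branch produced it."""
    return [
        # group(1) is set only when the emoji alternative matched.
        (m.group(0), 'emoji' if m.group(1) else 'other')
        for m in TOKEN_RE.finditer(raw_text)
    ]


def clean_text_stripped(raw_text):
    """Assumed variant of clean_text that also trims the ends."""
    spaced = TOKEN_RE.sub(
        lambda m: ' {} '.format(m.group(1)) if m.group(1) else ' ',
        raw_text,
    )
    return SPACES_RE.sub(' ', spaced).strip()


print(describe_matches('ok😊!'))
# => [('😊', 'emoji'), ('!', 'other')]
print(clean_text_stripped('wife😞 is not 😊😘. 99/33'))
# => wife 😞 is not 😊 😘
```

Letters never match the pattern at all, so they pass through re.sub untouched; only emoji and other non-letters reach the lambda.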
