scikit-learn CountVectorizer: Keeping Emojis as Words
I am using scikit-learn's CountVectorizer on strings, but it discards all the emojis in the text. For instance, 👋 Welcome should give us: ['\xf0\x9f\x91\x8b', 'welcome']
Solution 1:
Yes, you are right: token_pattern has to be changed. Instead of matching only alphanumeric characters, it can match any run of characters other than whitespace.
Try this!
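from sklearn.feature_extraction.text import TfidfVectorizer

s = ['👋 Welcome', '👋 Welcome']
# Treat every run of non-whitespace characters as one token
v = TfidfVectorizer(token_pattern=r'[^\s]+')
v.fit(s)
v.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
# array(['welcome', '👋'], dtype=object)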
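To see why the default setting drops the emoji, the two patterns can be compared directly with Python's re module. This is only a sketch: it assumes the default token_pattern is r"(?u)\b\w\w+\b" and that the vectorizer lowercases the text before tokenizing (both are the documented defaults), and it applies the regexes by hand instead of calling scikit-learn. The constant names and the sample string with a trailing "!" are made up for illustration.

```python
import re

# Assumed default token_pattern of CountVectorizer / TfidfVectorizer
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Pattern from the answer above: any run of non-whitespace characters
NON_WHITESPACE_PATTERN = r"[^\s]+"

# The vectorizers lowercase by default, so do the same here
text = "👋 Welcome!".lower()

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, text)
non_ws_tokens = re.findall(NON_WHITESPACE_PATTERN, text)

print(default_tokens)  # ['welcome']
print(non_ws_tokens)   # ['👋', 'welcome!']
```

Note the trade-off: with [^\s]+ the emoji is kept, but punctuation also stays attached to words ('welcome!'), so some extra cleaning may be needed depending on the data.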
Solution 2:
There are also a couple of packages that can convert emojis and emoticons into words directly, e.g. emot and emoji:
>>> import emot
>>> text = "I love python 👨 :-)"
>>> emot.emoji(text)
[{'value': '👨', 'mean': ':man:', 'location': [14, 14], 'flag': True}]
>>> import emoji
>>> print(emoji.demojize('Python is 👍'))
Python is :thumbs_up:
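One reason this approach works well with CountVectorizer is that the demojized name survives the default tokenizer, because the underscore counts as a word character. The sketch below is a rough illustration only: it hard-codes the demojize output shown above so it does not need the emoji package, and it applies the assumed default token_pattern r"(?u)\b\w\w+\b" by hand with re.

```python
import re

# Assumed default token_pattern of CountVectorizer
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hard-coded copy of the emoji.demojize() output shown above
demojized = "Python is :thumbs_up:"

# Lowercase first, as the vectorizer does by default
tokens = re.findall(DEFAULT_TOKEN_PATTERN, demojized.lower())
print(tokens)  # ['python', 'is', 'thumbs_up']
```

The surrounding colons are dropped, but 'thumbs_up' is kept as a single token, so no custom token_pattern is needed after demojizing.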
Solution 3:
Try CountVectorizer(analyzer='char', binary=True).
The docs say: "token_pattern: Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'". With analyzer='char' the token pattern is bypassed, so every character, emojis included, becomes a feature. See https://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html.
Also see this notebook: https://www.kaggle.com/kmader/toxic-emojis