
scikit-learn CountVectorizer: Keeping Emojis as Words

I am using scikit-learn's CountVectorizer on strings, but it discards all the emojis in the text. For instance, 👋 Welcome should give us: ['\xf0\x9f\x91\x8b', 'welcome']

Solution 1:

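Yes, you are right: token_pattern has to be changed. The default pattern, r"(?u)\b\w\w+\b", only matches runs of two or more alphanumeric characters, so a lone emoji is dropped. Instead, we can make it match any run of characters other than whitespace.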

Try this!

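from sklearn.feature_extraction.text import TfidfVectorizer

s = ['👋 Welcome', '👋 Welcome']

# Treat every run of non-whitespace characters as a token.
# CountVectorizer accepts the same token_pattern argument.
v = TfidfVectorizer(token_pattern=r'[^\s]+')
v.fit(s)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
list(v.get_feature_names_out())

# ['welcome', '👋']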
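
To see why the pattern change matters, the sketch below applies both regular expressions directly with Python's re module, which is roughly what the word analyzer does with token_pattern after lowercasing. The variable names are illustrative only, and the last case shows a side effect of [^\s]+ worth knowing about: punctuation stays attached to the neighbouring word.

```python
import re

text = "👋 Welcome".lower()  # the vectorizers lowercase by default

# Default scikit-learn token_pattern: runs of two or more word characters.
default_tokens = re.findall(r"(?u)\b\w\w+\b", text)

# Pattern from the answer above: any run of non-whitespace characters.
emoji_tokens = re.findall(r"[^\s]+", text)

# Caveat: with [^\s]+ punctuation is kept as part of the token.
glued_tokens = re.findall(r"[^\s]+", "welcome! 👋")

print(default_tokens)  # ['welcome']
print(emoji_tokens)    # ['👋', 'welcome']
print(glued_tokens)    # ['welcome!', '👋']
```

If "welcome!" and "welcome" should count as the same feature, the text needs punctuation stripped beforehand or a more selective pattern.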

Solution 2:

There are also a couple of packages that can transform emojis/emoticons into words directly, e.g. emot and emoji:

>>> import emot
>>> text = "I love python 👨 :-)"
>>> emot.emoji(text)
[{'value': '👨', 'mean': ':man:', 'location': [14, 14], 'flag': True}]

>>> import emoji
>>> print(emoji.demojize('Python is 👍'))
Python is :thumbs_up:
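
If installing an extra package is not an option, a rough stand-in can be built from the standard library. The sketch below is not how emot or emoji work: it is a hypothetical helper (emoji_to_words is a made-up name) that replaces every character in Unicode category "So" (Symbol, other) with its official Unicode name. That category also covers non-emoji symbols such as ©, and multi-codepoint emoji (skin tones, ZWJ sequences) are not handled, so treat it as an approximation.

```python
import unicodedata


def emoji_to_words(text):
    """Replace 'Symbol, other' characters with :unicode_name: placeholders."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            if name:
                # e.g. 'THUMBS UP SIGN' -> ':thumbs_up_sign:'
                pieces.append(":" + name.lower().replace(" ", "_") + ":")
                continue
        pieces.append(ch)
    return "".join(pieces)


print(emoji_to_words("Python is 👍"))  # Python is :thumbs_up_sign:
```

The names differ from the ones the emoji package produces (:thumbs_up_sign: rather than :thumbs_up:). Because underscores count as word characters, a placeholder like this survives the default token_pattern as the single token thumbs_up_sign.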

Solution 3:

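Try CountVectorizer(analyzer='char', binary=True). With the character analyzer every character, emoji included, becomes its own feature, and binary=True records presence instead of counts. Note that ordinary words are split into single characters as well.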

token_pattern has no effect in this mode. The docs say: "token_pattern: Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'". See https://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html.
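
As a rough picture of what character-level binary features look like, here is a plain-Python approximation. It assumes the defaults (lowercasing, single characters rather than n-grams) and is a sketch of the idea, not scikit-learn's implementation; char_features is a hypothetical helper name.

```python
def char_features(docs):
    """Approximate analyzer='char', binary=True with default settings."""
    lowered = [doc.lower() for doc in docs]
    # Vocabulary: every distinct character, sorted by code point.
    vocab = sorted({ch for doc in lowered for ch in doc})
    # Binary matrix: 1 if the character occurs in the document, else 0.
    rows = [[1 if ch in doc else 0 for ch in vocab] for doc in lowered]
    return vocab, rows


vocab, rows = char_features(["👋 Welcome", "Welcome"])
print(vocab)  # [' ', 'c', 'e', 'l', 'm', 'o', 'w', '👋']
print(rows)   # [[1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 1, 1, 1, 1, 1, 0]]
```

The emoji is kept, but "welcome" no longer exists as a single feature, so this approach fits best when character-level signals are what the model needs.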

Also see this notebook: https://www.kaggle.com/kmader/toxic-emojis
