How To Treat Numbers With Decimals Or Commas As One Word In CountVectorizer

I am cleaning text and then passing it to the CountVectorizer function to give me a count of how many times each word appears in the text. The problem is that it is treating 10,000

Solution 1:

The default regular expression that CountVectorizer uses for its token_pattern parameter is:

token_pattern='(?u)\\b\\w\\w+\\b'

So a token is delimited by a \b word boundary at each end, with \w\w+ in between: one word character (letter, digit, or underscore) followed by one or more further word characters, i.e. at least two characters in total. In a regular (non-raw) Python string literal, each backslash has to be escaped as \\.
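
To see what the default pattern does to the example sentence, it can be applied directly with the re module. This is only a sketch: it assumes CountVectorizer's tokenizer is roughly equivalent to lowercasing the text and calling re.findall with the token pattern, and the variable names are made up for illustration.

```python
import re

# Default token_pattern of CountVectorizer, written as a raw string
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

text = "I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"

# CountVectorizer lowercases by default, so the same is done here
default_tokens = re.findall(DEFAULT_PATTERN, text.lower())
print(default_tokens)
# "2.5" disappears (each half is a single character)
# and "10,000x" is split into "10" and "000x"
```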

So you could change the token pattern to:

token_pattern='\\b(\\w+[\\.,]?\\w+)\\b'

Explanation: [\\.,]? allows for the optional appearance of a . or , inside the token. The first \w has to be extended to \w+ so that numbers with more than one digit before the punctuation are matched.
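
The adjusted pattern can be checked the same way without scikit-learn. Again this is a sketch under the assumption that the tokenizer behaves like re.findall on the lowercased text; the variable names are illustrative.

```python
import re

# Adjusted pattern: one optional "." or "," is allowed inside a token
ADJUSTED_PATTERN = r"\b(\w+[\.,]?\w+)\b"

text = "I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"

adjusted_tokens = re.findall(ADJUSTED_PATTERN, text.lower())
print(adjusted_tokens)

# Only ONE separator per token is allowed, so a number with
# several separators is still split
multi_separator_tokens = re.findall(ADJUSTED_PATTERN, "1,000,000")
print(multi_separator_tokens)
```

Because the pattern permits a single . or , per token, a number such as 1,000,000 would still come out as two tokens.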

For your slightly adjusted example:

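import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]
# Allow one optional "." or "," inside a token so 2.5 and 10,000x stay whole
vectorizer = CountVectorizer(token_pattern='\\b(\\w+[\\.,]?\\w+)\\b')
result = vectorizer.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cols = vectorizer.get_feature_names_out()
print(pd.DataFrame(result, columns=cols))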


10,000x  2.5  am  bet  in  lightning  many  na  re  release  spins  strike  there  userna  

Alternatively, you could modify the input text, e.g. by replacing the decimal point . with an underscore _ and removing commas that stand between digits.

import re

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]
for i in range(len(corpus)):
    # 2.5 -> 2_5 (the underscore counts as a word character)
    corpus[i] = re.sub(r"(\d+)\.(\d+)", r"\1_\2", corpus[i])
    # 10,000 -> 10000
    corpus[i] = re.sub(r"(\d+),(\d+)", r"\1\2", corpus[i])
vectorizer = CountVectorizer()
result = vectorizer.fit_transform(corpus).toarray()
cols = vectorizer.get_feature_names_out()
print(pd.DataFrame(result, columns=cols))


10000x  2_5  am  bet  in  lightning  many  na  re  release  spins  strike  there  userna
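
One caveat with the comma substitution above: the pattern consumes the digits on both sides of the comma, so in a number like 1,000,000 only the first comma is removed. A possible variant, sketched here with lookarounds so that the digits are not consumed, handles any number of separators. The helper name normalize_numbers is hypothetical, and the sketch assumes that every comma or point between two digits belongs to a number (so a list like 1,2,3 would also be merged).

```python
import re


def normalize_numbers(text):
    """Join digit groups so the default tokenizer keeps numbers whole."""
    # Decimal point between two digits -> underscore (a word character)
    text = re.sub(r"(?<=\d)\.(?=\d)", "_", text)
    # Comma between two digits -> removed
    text = re.sub(r"(?<=\d),(?=\d)", "", text)
    return text


print(normalize_numbers("2.5 release, 10,000x bet, 1,000,000 total"))
```

The cleaned strings can then be passed to a CountVectorizer with its default token pattern, as in the example above.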

