I am trying to create a sentiment analysis model and I have a question.
After preprocessing my tweets and building my vocabulary, I noticed that many words appear fewer than 5 times in my dataset (and many of those appear only once). Many of them are real words, not gibberish. My thinking is that if I keep those words, the model will learn unreliable sentiment weights for them from so few examples, and that will make the model worse.
Is my thinking right or am I missing something?
My vocabulary has around 40,000 words, and about 10,000 of them are "rare". Should I "sacrifice" them?
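
To make the question concrete, here is a minimal sketch of the kind of cutoff I have in mind. The function names `split_vocab` and `replace_rare`, the `<unk>` placeholder token, and the toy tweets are all made up for illustration; the threshold of 5 just mirrors the number above. Instead of deleting rare words outright, it maps them to one shared token, which is only one possible way to handle them:

```python
from collections import Counter

def split_vocab(tokenized_tweets, min_count=5):
    """Count every token and split the vocabulary into kept and rare words."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    kept = {word for word, c in counts.items() if c >= min_count}
    rare = set(counts) - kept
    return kept, rare, counts

def replace_rare(tokenized_tweets, kept, unk="<unk>"):
    """Map every word outside the kept vocabulary to one placeholder token."""
    return [[tok if tok in kept else unk for tok in tweet]
            for tweet in tokenized_tweets]

# Toy data (hypothetical); min_count=2 so the effect is visible on 3 tweets.
tweets = [["good", "movie"], ["good", "plot"], ["bad", "movie"]]
kept, rare, counts = split_vocab(tweets, min_count=2)
filtered = replace_rare(tweets, kept)

# Share of all token occurrences that the rare words account for.
rare_share = sum(counts[w] for w in rare) / sum(counts.values())
```

Checking `rare_share` on the real data would show how much of the actual text the roughly 10k rare words cover, which seems more relevant than their share of the vocabulary.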
Thanks in advance.