Something I often end up copy/pasting between different Twitter bots I make is my “badwords” file, which contains a running list of words that I do not want my bots to say. As I’ve explained in detail in the past, the danger of running a Twitter bot that uses found text or random dictionary words is that it might say something really super inappropriate. While you can never have a perfect filtering system, my basic rule of thumb is that I don’t want my bots to say anything that I wouldn’t say myself. As such, I took the time to compile an (ever-growing) list of words that I just plain do not say, and neither will my bots. These tend to be what I call “words of oppression”, AKA racist/sexist/ableist words that I would never say.
So I’ve created wordfilter, an npm package that you can use in your own projects. Using it is easy. Install with:
$ npm install wordfilter
Use it in Node.js like so:
var wordfilter = require('wordfilter');
wordfilter.blacklisted('does this string have a bad word in it?'); // false
The list of banned words is not all-inclusive, and I’m always adding words to it. If you’d like to file an issue or a pull request to add more words, please do so, but understand that this is primarily for use in my own projects, and I may not agree to add certain words. (For example, I have no problem with scatological words, so “shit” and “fuck” will never be on this list.)
Also note that due to the complexities of the English language, I am considering any string that contains a bad word as a substring to be blacklisted. For example, even though “homogenous” is not a bad word, it contains the substring “homo”, so it gets filtered. The reason for this is that new slang pops up all the time using compound words and I can’t possibly keep up with it. I’m willing to lose a few words like “homogenous” and “Pakistan” in order to avoid false negatives.
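To make the substring behavior concrete, here is a minimal sketch of that kind of check. It is not wordfilter's actual source: the names `bannedSubstrings` and `containsBanned` are made up for illustration, the list is a stand-in (only “homo” comes from the example above), and lowercasing the input is an assumption on my part.

```javascript
// Hypothetical sketch of substring-based filtering, NOT wordfilter's real code.
// "homo" is taken from the example in the post; "examplebadword" is a placeholder.
const bannedSubstrings = ['homo', 'examplebadword'];

// Returns true if any banned entry appears anywhere inside the text.
// Lowercasing both sides is an assumption made for this sketch.
function containsBanned(text, banned = bannedSubstrings) {
  const haystack = text.toLowerCase();
  return banned.some((word) => haystack.includes(word.toLowerCase()));
}

// "homogenous" is caught because it contains "homo" (a false positive).
console.log(containsBanned('a homogenous mixture'));
// No banned substring here, so this one passes.
console.log(containsBanned('does this string have a bad word in it?'));
```

The trade-off is the one described above: a plain substring match will flag some innocent words, but it also catches compound slang that an exact-word match would miss.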