I am looking for a tool that counts words and, more importantly, phrases in large amounts of open-ended text responses. I also need the ability to exclude certain words (a, the, and, etc.).
I am aware of a few tools that do this:
- http://www.mywritertools.com/default.asp
- http://www.hermetic.ch/wfca/wfca.htm
As well as some lists of available text mining software:
- http://en.wikipedia.org/wiki/List_of_text_mining_software
- http://academic.csuohio.edu/kneuendorf/content/cpuca/qtap.htm
- http://www.predictiveanalyticstoday.com/top-30-software-for-text-analysis-text-mining-text-analytics/
Most of these either a) cost money, or b) provide much more/different functionality than I need. I am not opposed to paying a moderate amount (< $100) for a decent tool, but am hoping to get some input first to avoid buying something that doesn't meet my needs.
The data
1) currently resides in a SQL database, but can be transformed into whatever format is needed (text file, Excel, whatever)
2) contains an open-ended response and a category id relating to a specific product or type of product (e.g. either "soda" or "pepsi")
Must have
1) Ability to count common words and phrases
2) Ability to exclude a list of words (a, the, and, etc.) such that "wash car" and "wash the car" would count as the same phrase
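To make the two requirements above concrete, here is a rough sketch of the behavior I mean, using only the Python standard library. The stop list, the sample responses, and the function names are made up for illustration; it is not meant as a finished tool.

```python
import re
from collections import Counter

# Hypothetical stop list; a real one would be longer and configurable.
STOP_WORDS = {"a", "an", "the", "and", "i", "my", "to"}

def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in stop_words]

def count_phrases(responses, n=2, stop_words=STOP_WORDS):
    """Count n-word phrases after stop-word removal (n=1 counts single words)."""
    counts = Counter()
    for response in responses:
        tokens = tokenize(response, stop_words)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Made-up sample responses.
responses = ["I wash the car", "Wash car and dry", "I wash my car"]
phrase_counts = count_phrases(responses, n=2)
word_counts = count_phrases(responses, n=1)
print(phrase_counts.most_common(5))
```

Because stop words are removed before the phrases are built, "wash the car" and "wash car" end up under the same key.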
Would be nice to have
1) Ability to match based on root word so that "wash the car", "washed the car" and "washes the car" all match
2) Ability to see what words appear near each other so that I can get a count of the times that "wash car", "wash the car" and "car wash" appear as a single count.
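The two nice-to-haves above could look roughly like the sketch below. The suffix stripper is a deliberately crude stand-in that I made up for this example (it is not a real stemming algorithm, and it is exactly the piece I would rather get from a library); the stop list and sample responses are also hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration.
STOP_WORDS = {"a", "an", "the", "and", "i", "my"}

def crude_stem(word):
    """Strip a few common English suffixes. A naive placeholder, not a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def normalized_tokens(text):
    """Lowercase, drop stop words, then reduce each remaining word to its crude root."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

def count_unordered_pairs(responses):
    """Count adjacent word pairs, ignoring order, so 'car wash' matches 'wash car'."""
    counts = Counter()
    for response in responses:
        tokens = normalized_tokens(response)
        for first, second in zip(tokens, tokens[1:]):
            counts[tuple(sorted((first, second)))] += 1
    return counts

# Made-up sample responses.
responses = ["wash car", "wash the car", "washed the car", "washes the car", "car wash"]
pair_counts = count_unordered_pairs(responses)
print(pair_counts)
```

Sorting each pair before counting is what folds "car wash" and "wash car" into a single count; this only looks at words that are adjacent once stop words are removed, not at a wider window.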
Icing on the cake
1) Ability to do counts based on categories. Not a big deal as the number of categories is relatively low and I can run each individually, but this may change in the future.
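For the per-category counts, something along these lines is what I have in mind. It assumes the rows have already been pulled out of the database as (category id, response) pairs; the rows, the stop list, and the function names are all invented for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop list for illustration.
STOP_WORDS = {"a", "an", "the", "and", "i", "my"}

def bigrams(text):
    """Return the two-word phrases in a response after stop-word removal."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = [w for w in words if w not in STOP_WORDS]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def count_by_category(rows):
    """Keep one phrase counter per category id."""
    counts = defaultdict(Counter)
    for category, response in rows:
        counts[category].update(bigrams(response))
    return counts

# Made-up rows in the shape (category id, open-ended response).
rows = [
    ("soda", "I wash the car"),
    ("soda", "wash car"),
    ("pepsi", "wash the car and dry"),
]
category_counts = count_by_category(rows)
for category, counter in category_counts.items():
    print(category, counter.most_common(3))
```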
Please share any advice/experience/suggestions! Also, I am not opposed to writing my own tool, but I don't want to reinvent the wheel. In the absence of a specific tool, pointers to any libraries that may assist in doing this (especially for root word matching) would also be appreciated.