My environment: 8 GB RAM notebook with Ubuntu 14.04, Solr 4.3.1, Carrot2 Workbench 3.10.0
My Solr index: 15980 documents
My problem: cluster all documents with the k-means algorithm
When I submit the query in the Carrot2 Workbench (query: *:*), I always get a Java heap size error when requesting more than ~1000 results. I started Solr with -Xms256m -Xmx6g, but the error still occurs.
Is it really a heap size problem, or could the cause be somewhere else?
Best How To :
Your suspicion is correct: it is a heap size problem or, more precisely, a scalability constraint. Straight from the Carrot2 FAQ: http://project.carrot2.org/faq.html#scalability
"How does Carrot2 clustering scale with respect to the number and length of documents? The most important characteristic of Carrot2 algorithms to keep in mind is that they perform in-memory clustering. For this reason, as a rule of thumb, Carrot2 should successfully deal with up to a thousand of documents, a few paragraphs each. For algorithms designed to process millions of documents, you may want to check out the Mahout project."
A developer also posted about this here: http://stackoverflow.com/a/28991477
The developers recommend Mahout, and that is probably the way to go, since it is not bound by the in-memory clustering constraint of Carrot2. There are other possibilities, though:
If you really like Carrot2 but do not necessarily need k-means, take a look at the commercial Lingo3G. Judging by the "Time of clustering 100000 snippets [s]" field and the (***) remark on http://carrotsearch.com/lingo3g-comparison, it should be able to handle more documents. See also the FAQ entry "What is the maximum number of documents Lingo3G can cluster?" at http://carrotsearch.com/lingo3g-faq
Try to minimize the amount of text on which k-means performs the clustering. Instead of clustering over the full content of every document, cluster on the abstract/summary, or extract important keywords and cluster on those.
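As a rough illustration of the last suggestion, here is a minimal Python sketch that asks Solr for only a few short fields and trims any remaining text before it is handed to a clustering step. The q, fl, rows and wt parameters are standard Solr query parameters; the base URL, the core name collection1, the field names id, title and abstract, and the helper names build_solr_query and shorten are assumptions made for this example, so adjust them to your own schema.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, fields, rows):
    """Build a Solr select URL that returns only the given stored fields."""
    params = {
        "q": "*:*",              # match all documents
        "fl": ",".join(fields),  # field list: fetch short fields only (hypothetical names)
        "rows": rows,            # cap the number of documents returned
        "wt": "json",            # ask for a JSON response
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def shorten(text, max_words=50):
    """Keep only the first max_words whitespace-separated words of text."""
    words = text.split()
    return " ".join(words[:max_words])


# Assumed local Solr instance and core name; replace with your own.
url = build_solr_query(
    "http://localhost:8983/solr/collection1",
    ["id", "title", "abstract"],
    1000,
)
print(url)
```

Fetching a short summary field instead of the full body reduces how much text has to be held in memory per document, and shorten can act as a fallback when no summary field exists. This does not remove the in-memory limit; it only stretches how many documents fit.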