Sharepoint Forum

How does auto categorization work?

  Asked By: Theresa    Date: Oct 01    Category: Sharepoint    Views: 10927

A quick question on categorization.

Does anyone have realistic figures for the number of documents that
need to be categorized manually?
Has anyone tested whether the auto categorization really works?
What are the benchmark figures for manual categorization?
Does the number of documents to categorize manually depend on the total
number of documents? i.e. is there a formula saying that if 'X'
documents need to be added, then some 'N' of them need to be
manually categorized?

I think this is really important as it increases the accuracy and
proper categorization of the subsequent documents that are added to



1 Answer Found

Answer #1    Answered By: Irene Moss     Answered On: Oct 01

I've done a fair amount of work around classification in general within
organisations, and have carefully tested an auto categorisation tool called
Infosort, which uses an identical approach to that of SPS's auto cat engine
(i.e. a positive training set which is then used to build a rule base when
measured against a negative set of documents).

In answer to your questions:

1. "Give realistic figures on numbers of docs to be classified manually".
Not sure exactly what you mean by this, but if you mean training sets for
the auto cat engine then you need at least 12-15 docs per category, and
ideally more (20+). All of these docs should be very carefully selected, i.e.
they should contain the core terms of the sort of documents you want in that
particular category. Thus if you want a category for "Functional
Specification", choose its training docs carefully, if necessary adding some
core keywords to the text of each doc to reinforce the training terms
selected for the rule-based auto cat engine.
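
To make the idea of "core terms" concrete, here is a minimal sketch of how a
weighted rulebase could be derived from a positive and a negative training
set. It is only an illustration under my own assumptions, not the algorithm
used by Infosort or SPS: the function names (term_weights, score), the
weighting scheme (difference in document frequency) and the sample documents
are all hypothetical.

```python
from collections import Counter


def term_weights(positive_docs, negative_docs):
    """Toy rulebase: weight = share of positive docs containing the term
    minus share of negative docs containing it (kept only if positive)."""
    pos = Counter(t for d in positive_docs for t in set(d.lower().split()))
    neg = Counter(t for d in negative_docs for t in set(d.lower().split()))
    weights = {}
    for term, count in pos.items():
        pos_rate = count / len(positive_docs)
        neg_rate = neg.get(term, 0) / len(negative_docs)
        if pos_rate > neg_rate:
            weights[term] = pos_rate - neg_rate
    return weights


def score(doc, weights):
    """Sum the weights of the rulebase terms that occur in the document."""
    terms = set(doc.lower().split())
    return sum(w for t, w in weights.items() if t in terms)


# Hypothetical training sets for a "Functional Specification" category.
positive = [
    "functional specification for the billing module",
    "functional specification of login requirements",
    "draft functional specification reporting requirements",
]
negative = [
    "minutes of the billing team meeting",
    "holiday rota for the support team",
    "meeting agenda for the reporting project",
]

weights = term_weights(positive, negative)
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.2f}")
```

In this toy data the terms that occur in every positive doc and in no
negative doc ("functional", "specification") get the highest weight, while
common words such as "the" drop out. That is the effect carefully selected
training docs are meant to have.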

2. "Does auto-cat work". In a word, yes. If you select your training
documents carefully, and then set the precision as high as possible (i.e.
low recall), you should achieve at least 60-80% accuracy. For
obvious reasons it will never be as good as manual classification, but it's a
whole lot quicker. Ideally, you need the two together, which is why SPS is
good in this area.

Also, you might like to supplement the auto-categorisation engine with the
custom thesaurus feature available to MS Search (but that's another story).
I personally think this feature is at least as powerful as the auto cat
stuff for the categories (although it focuses on "Search" rather than

3. "What are the benchmark figures for manual classification?" Studies have
demonstrated only 28% accuracy in consistent manual classification.

4. "Is there any formula..." No, because you are not measuring numbers of
documents per se, but numbers/types of categories. An auto cat engine
typically works using positive and negative sets of documents to produce a
definitive weighted "rulebase" for classifying docs into a category. The
more quality training documents you supply to construct that rulebase
initially, the better it will handle (any number of) unclassified documents.
The idea of precision and recall is also important in this respect. In
information retrieval the two concepts are trade-offs: higher precision
leads to lower recall and vice versa. In auto cat terms, if you set precision
high then the rulebase restricts the selection of auto cat terms to those
with a higher weighting. This obviously leads to fewer documents being
classified into each category. Personally, in an intranet/extranet
environment, I'd always go for higher precision.
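
The trade-off can be shown with a few lines of Python. This is a sketch
under my own assumptions, not how SPS exposes its precision setting: I model
"precision set high" as a higher score threshold, and the scores and
true/false labels below are made-up illustration data.

```python
def precision_recall(scored_docs, threshold):
    """scored_docs: list of (score, truly_in_category) pairs.

    Returns (precision, recall, number_of_docs_assigned) when every doc
    scoring at or above the threshold is put into the category."""
    assigned = [in_cat for s, in_cat in scored_docs if s >= threshold]
    true_pos = sum(assigned)
    total_relevant = sum(in_cat for _, in_cat in scored_docs)
    precision = true_pos / len(assigned) if assigned else 1.0
    recall = true_pos / total_relevant if total_relevant else 1.0
    return precision, recall, len(assigned)


# Hypothetical rulebase scores, with whether each doc truly belongs.
scored = [
    (0.9, True), (0.8, True), (0.7, True), (0.6, False),
    (0.5, True), (0.4, False), (0.3, True), (0.2, False),
]

for threshold in (0.75, 0.25):
    p, r, n = precision_recall(scored, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} assigned={n}")
```

With the strict threshold only two docs are assigned and both are correct
(high precision, low recall). With the loose threshold seven docs are
assigned, all five relevant ones are caught, but two wrong ones come along
too.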

Finally, I've not yet had time to test SPS's auto cat engine. It's one
of the things I want to do (although I have done some work on customising the
thesaurus), but I'm convinced that the above will hold true.
