Friday, June 16, 2006

about text filtering

There are several approaches to text filtering:
- searching for specific words in text files - this is very simple to implement and cheap for the processor;
- searching with regexps - also simple when an external library is used, but slower than plain text search;
- another way to determine a text's category is to use a combination of weighted words. With positive and negative weight values, we can select the texts that match a given category. This method is harder to implement, but it is still simple. There is one problem, though: how do we calculate the weights for the given words? Updating the words and weights by hand is hard, so this method is not practical to use;
- yet another way to determine a text's category is something like Bayesian statistical text classification. Here we can use automatic text analysis tools, which compare texts from several categories and extract the needed information (words, weights, etc.). This method is slower than the others, but it gives good results and extracts the information automatically.
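
To make the four approaches more concrete, here is a minimal Python sketch. It is only an illustration, not the code from our products: all function names (tokenize, contains_word, matches_pattern, weighted_score, train_weights) and the sample texts are made up for this example. The last function assumes a simple naive-Bayes-style scheme with two classes, equal priors and add-one smoothing, where the weight of a word is its log-likelihood ratio between the two sets of example texts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def contains_word(text, words):
    """Approach 1: plain word search (True if any listed word occurs)."""
    tokens = set(tokenize(text))
    return any(w in tokens for w in words)


def matches_pattern(text, pattern):
    """Approach 2: regexp search, case-insensitive."""
    return re.search(pattern, text, re.IGNORECASE) is not None


def weighted_score(text, weights):
    """Approach 3: sum the weights of all known words in the text.

    A positive total suggests the text belongs to the category,
    a negative total suggests it does not. Unknown words count as 0.
    """
    return sum(weights.get(tok, 0.0) for tok in tokenize(text))


def train_weights(in_category, out_category):
    """Approach 4 (assumed scheme): learn word weights from examples.

    Each weight is log P(word | in category) - log P(word | not in category),
    with add-one smoothing so unseen words do not produce log(0).
    """
    pos = Counter(t for doc in in_category for t in tokenize(doc))
    neg = Counter(t for doc in out_category for t in tokenize(doc))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


# Hypothetical training texts, invented for this example.
unwanted = ["cheap pills buy now", "buy cheap watches"]
wanted = ["meeting notes for monday", "project status notes"]

weights = train_weights(unwanted, wanted)
score = weighted_score("buy cheap pills today", weights)
print("unwanted" if score > 0 else "wanted")
```

Note that the learned weights plug straight into weighted_score, so the Bayesian-style approach can be seen as an automatic way to fill in the word and weight table that is so hard to maintain by hand.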

In our products we use some of these methods to classify texts. Users can select which method to use, depending on the task they want to perform.
