I'd like to offer this jar containing 19 multi-class (1-of-n) text datasets whose word-count feature vectors have already been extracted. I thought it would be good to post them at the UCI repository and the WEKA datasets site, if you are interested. The jar is 14MB compressed.

The problems come from LA Times, TREC, OHSUMED, etc., and the data were originally converted to word counts by:

Han, E. and Karypis, G. Centroid-Based Document Classification: Analysis & Experimental Results. In Proc. of the 4th European Conf. on the Principles of Data Mining and Knowledge Discovery (PKDD): 424-431, 2000.

I have found them quite useful for studies, e.g.:

G. Forman and I. Cohen. Learning from Little: Comparison of Classifiers Given Little Training. ECML'04. Hewlett-Packard Labs TR HPL-2004-19R1.

G. Forman. A Pitfall and Solution in Multi-Class Feature Selection for Text Classification. ICML'04. HPL-2004-86.

G. Forman. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Special Issue on Variable and Feature Selection, Journal of Machine Learning Research, 3(Mar):1289-1305, 2003. HPL-2002-147R1.

(Han and Karypis's web site has a subset of these datasets, but it includes only binary features: whether a word occurs one or more times.)

George Forman