StringToWordVector

Package

weka.filters.unsupervised.attribute

Synopsis

Converts String attributes into a set of attributes representing word occurrence information from the text contained in the strings. How the text is split into words depends on the tokenizer. The set of words (attributes) is determined by the first batch filtered (typically training data).
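
The role of the first batch can be illustrated with a minimal sketch in plain Java. It is not Weka's implementation: the class name FirstBatchDictionarySketch, its methods, and the whitespace tokenizer are assumptions made for illustration only. The dictionary is built once from the first batch, and later documents are mapped onto that fixed set of words, so words that did not occur in the first batch produce no attribute.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not Weka code.
class FirstBatchDictionarySketch {

    // Maps each word seen in the first batch to its attribute index.
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    // Assumed tokenizer: split on runs of whitespace.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Builds the dictionary from the first batch (typically training data).
    void buildDictionary(List<String> firstBatch) {
        for (String doc : firstBatch) {
            for (String token : tokenize(doc)) {
                // The next free index is the current dictionary size.
                dictionary.putIfAbsent(token, dictionary.size());
            }
        }
    }

    // Converts a document into a 0/1 presence vector over the fixed dictionary.
    // Words that are not in the dictionary are ignored.
    int[] toPresenceVector(String doc) {
        int[] vector = new int[dictionary.size()];
        for (String token : tokenize(doc)) {
            Integer index = dictionary.get(token);
            if (index != null) {
                vector[index] = 1;
            }
        }
        return vector;
    }

    // Returns the dictionary words in attribute order.
    List<String> words() {
        return new ArrayList<>(dictionary.keySet());
    }

    public static void main(String[] args) {
        FirstBatchDictionarySketch sketch = new FirstBatchDictionarySketch();
        sketch.buildDictionary(List.of("the cat sat", "the dog"));
        System.out.println(sketch.words());
        // "bites" was not in the first batch, so it has no attribute.
        System.out.println(java.util.Arrays.toString(sketch.toPresenceVector("dog bites cat")));
    }
}
```

The sketch outputs presence values (0 or 1), which corresponds to the behaviour described for outputWordCounts when that option is not set.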

Options

The table below describes the options available for StringToWordVector.

Option	Description
IDFTransform	Sets whether the word frequencies in a document should be transformed into: fij*log(num of Docs/num of Docs with word i) where fij is the frequency of word i in document (instance) j.
TFTransform	Sets whether the word frequencies should be transformed into: log(1+fij) where fij is the frequency of word i in document (instance) j.
attributeIndices	Specify the range of attributes to act on. This is a comma-separated list of attribute indices, where "first" and "last" are valid values. Specify an inclusive range with "-". E.g.: "first-3,5,6-10,last".
attributeNamePrefix	Prefix for the created attribute names. (default: "")
doNotOperateOnPerClassBasis	If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but are based on the documents in all the classes (even if a class attribute is set).
invertSelection	Set attribute selection mode. If false, only selected attributes in the range will be worked on; if true, only non-selected attributes will be processed.
lowerCaseTokens	If set then all the word tokens are converted to lower case before being added to the dictionary.
minTermFreq	Sets the minimum term frequency. This is enforced on a per-class basis.
normalizeDocLength	Sets whether the word frequencies for a document (instance) should be normalized.
outputWordCounts	Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word).
periodicPruning	Specify the rate (x% of the input dataset) at which to periodically prune the dictionary. By contrast, wordsToKeep prunes only after the full dictionary has been created, and there may not be enough memory for that approach.
stemmer	The stemming algorithm to use on the words.
stopwords	The file containing the stopwords (if this is a directory then the default ones are used).
tokenizer	The tokenizing algorithm to use on the strings.
useStoplist	Ignores all the words that are on the stoplist, if set to true.
wordsToKeep	The number of words (per class if there is a class attribute assigned) to attempt to keep.
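
The two formulas given for TFTransform and IDFTransform can be written out as a minimal sketch in plain Java. The class name WordWeightSketch and its method names are hypothetical, and the use of the natural logarithm is an assumption, since the table only says "log". This is not Weka's implementation and it does not show how the filter behaves when both options are enabled.

```java
// Hypothetical illustration of the formulas in the options table; not Weka code.
class WordWeightSketch {

    // TFTransform formula: log(1 + fij), where fij is the frequency
    // of word i in document j. Natural logarithm is assumed.
    static double tfTransform(double fij) {
        return Math.log(1.0 + fij);
    }

    // IDFTransform formula: fij * log(numDocs / numDocsWithWord).
    // The cast avoids integer division. Natural logarithm is assumed.
    static double idfTransform(double fij, int numDocs, int numDocsWithWord) {
        return fij * Math.log((double) numDocs / numDocsWithWord);
    }

    public static void main(String[] args) {
        // A word that occurs 3 times in a document.
        System.out.println(tfTransform(3));
        // The same word, appearing in 2 of 100 documents.
        System.out.println(idfTransform(3, 100, 2));
        // A word that appears in every document gets weight 0, because log(1) = 0.
        System.out.println(idfTransform(3, 100, 100));
    }
}
```

Under the IDF formula a word that occurs in every document contributes nothing, while rarer words are weighted up in proportion to their frequency in the document.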

Capabilities

The table below describes the capabilities of StringToWordVector.

Capability	Supported
Class	No class, Relational class, Unary class, Binary class, Numeric class, Empty nominal class, Date class, Missing class values, Nominal class, String class
Attributes	Relational attributes, Empty nominal attributes, Date attributes, Binary attributes, String attributes, Missing values, Nominal attributes, Unary attributes, Numeric attributes
Min # of instances	0