
This function transforms texts into words, calculates word frequencies, and suppresses stop words in a given language.

Usage

textTokenizer(
  text,
  exclude = NULL,
  lang = NULL,
  min_word_freq = 5,
  min_word_len = 2,
  keep_spaces = FALSE,
  lowercase = TRUE,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_lettt = TRUE,
  laughs = TRUE,
  utf = TRUE,
  df = FALSE,
  h2o = FALSE,
  quiet = FALSE
)

Arguments

text

Character vector. Sentences or texts you wish to tokenize.

exclude

Character vector. Which words do you wish to exclude?

lang

Character. Language of the text (used for stop words). Example: "spanish" or "english". Set to NA to ignore.

min_word_freq

Integer. This will discard words that appear less than <int> times. Defaults to 5. Set to NA to ignore.

min_word_len

Integer. This will discard words that have less than <int> characters. Defaults to 2. Set to NA to ignore.

keep_spaces

Boolean. Set to TRUE to keep the spaces within each line so that compound words (words separated by spaces) are preserved as single tokens. For example, 'one two' will become 'one_two' and be treated as a single word.

lowercase, remove_numbers, remove_punct

Boolean. Convert text to lowercase, remove numbers, and remove punctuation, respectively.

remove_lettt

Boolean. Remove letters repeated more than 3 times consecutively?

laughs

Boolean. Try to unify all laugh expressions into a single token?

utf

Boolean. Transform all characters to plain UTF/ASCII (removing accents and unusual symbols)?

df

Boolean. Return a data.frame with one-hot-encoding-style results? Each word becomes a column indicating whether it is contained in each text.

h2o

Boolean. Return an H2OFrame?

quiet

Boolean. Keep quiet? If not, print messages.

Value

data.frame. Tokenized words with their frequency counts.
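
Examples

A minimal usage sketch, assuming textTokenizer is available from its package and behaves as documented above; the sample texts and argument values are illustrative.

# Sample texts to tokenize (illustrative only)
texts <- c(
  "The quick brown fox jumps over the lazy dog",
  "The dog sleeps and the dog dreams",
  "Foxes and dogs, dogs and foxes"
)

# Default behaviour: a data.frame of tokenized words with their counts,
# with English stop words suppressed
tokens <- textTokenizer(
  text = texts,
  lang = "english",
  min_word_freq = 1,  # keep every word, regardless of frequency
  min_word_len = 3    # discard words shorter than 3 characters
)
head(tokens)

# One-hot-style output: each word becomes a column flagging whether
# it appears in each input text
onehot <- textTokenizer(texts, lang = "english", min_word_freq = 1, df = TRUE)
str(onehot)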