
This function transforms texts into words, calculates word frequencies, and suppresses stop words in a given language.

Usage

textTokenizer(
  text,
  exclude = NULL,
  lang = NULL,
  min_word_freq = 5,
  min_word_len = 2,
  keep_spaces = FALSE,
  lowercase = TRUE,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_lettt = TRUE,
  laughs = TRUE,
  utf = TRUE,
  df = FALSE,
  h2o = FALSE,
  quiet = FALSE
)

Arguments

text

Character vector. Sentences or texts you wish to tokenize.

exclude

Character vector. Which words do you wish to exclude?

lang

Character. Language of the text (used for stop words). Example: "spanish" or "english". Set to NA to ignore.

min_word_freq

Integer. This will discard words that appear less than <int> times. Defaults to 5. Set to NA to ignore.

min_word_len

Integer. This will discard words that have less than <int> characters. Defaults to 2. Set to NA to ignore.

keep_spaces

Boolean. Set to TRUE to keep the spaces within each line so that compound words (words separated by spaces) are preserved as single tokens. For example, 'one two' will become 'one_two' and be treated as a single word.

lowercase, remove_numbers, remove_punct

Boolean. Convert text to lowercase, remove numbers, and remove punctuation, respectively.

remove_lettt

Boolean. Remove letters repeated more than 3 times consecutively?

laughs

Boolean. Try to unify all laugh expressions into a single token?

utf

Boolean. Transform all characters to plain UTF/ASCII (removing accents and unusual symbols)?

df

Boolean. Return a data.frame with one-hot-encoding-style results? Each word becomes a column indicating whether it is contained in each text.

h2o

Boolean. Return an H2OFrame?

quiet

Boolean. Keep quiet? If not, print messages.

Value

data.frame. Tokenized words with their frequency counts.
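
Examples

A minimal usage sketch, assuming textTokenizer is available from its package and behaves as documented above; the sample texts and argument values are illustrative.

# Sample texts to tokenize (illustrative only)
texts <- c(
  "The quick brown fox jumps over the lazy dog",
  "The dog sleeps and the dog dreams",
  "Foxes and dogs, dogs and foxes"
)

# Default behaviour: a data.frame of tokenized words with their counts,
# with English stop words suppressed
tokens <- textTokenizer(
  text = texts,
  lang = "english",
  min_word_freq = 1,  # keep every word, regardless of frequency
  min_word_len = 3    # discard words shorter than 3 characters
)
head(tokens)

# One-hot-style output: each word becomes a column flagging whether
# it appears in each input text
onehot <- textTokenizer(texts, lang = "english", min_word_freq = 1, df = TRUE)
str(onehot)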