This function transforms texts into words, calculates word frequencies, and suppresses stop words in a given language.
Usage
textTokenizer(
  text,
  exclude = NULL,
  lang = NULL,
  min_word_freq = 5,
  min_word_len = 2,
  keep_spaces = FALSE,
  lowercase = TRUE,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_lettt = TRUE,
  laughs = TRUE,
  utf = TRUE,
  df = FALSE,
  h2o = FALSE,
  quiet = FALSE
)
Arguments
- text
Character vector. Sentences or texts you wish to tokenize.
- exclude
Character vector. Which words do you wish to exclude?
- lang
Character. Language of the text (used for stop words), e.g. "spanish" or "english". Set to NA to ignore.
- min_word_freq
Integer. Discard words that appear fewer than <int> times. Defaults to 5. Set to NA to ignore.
- min_word_len
Integer. Discard words with fewer than <int> characters. Defaults to 2. Set to NA to ignore.
- keep_spaces
Boolean. Set to TRUE to keep the spaces within each line so that unique compound words, separated by spaces, are preserved. For example, 'one two' will become 'one_two' and be treated as a single word.
- lowercase, remove_numbers, remove_punct
Boolean. Convert text to lowercase, remove numbers, and remove punctuation, respectively.
- remove_lettt
Boolean. Remove repeated letters (more than 3 consecutive occurrences).
- laughs
Boolean. Try to unify all laugh-like expressions in the text.
- utf
Boolean. Transform all characters to UTF-8, removing accents and unusual symbols.
- df
Boolean. Return a data.frame with one-hot-encoding-like results? Each word becomes a column indicating whether it is contained in each text.
- h2o
Boolean. Return an H2OFrame?
- quiet
Boolean. Keep quiet? If not, print messages.
See also
Other Data Wrangling:
balance_data(), categ_reducer(), cleanText(), date_cuts(), date_feats(), file_name(), formatHTML(), holidays(), impute(), left(), normalize(), num_abbr(), ohe_commas(), ohse(), quants(), removenacols(), replaceall(), replacefactor(), textFeats(), vector2text(), year_month(), zerovar()
Other Text Mining:
cleanText(), ngrams(), remove_stopwords(), replaceall(), sentimentBreakdown(), textCloud(), textFeats(), topics_rake()
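Examples
A minimal usage sketch based on the Usage and Arguments above; the sample sentences and the choice to relax the frequency and length filters with NA are illustrative, not taken from the package itself.

# Two short texts to tokenize
txt <- c(
  "The quick brown fox jumps over the lazy dog",
  "The dog sleeps, the fox runs, the dog barks"
)

# Tokenize and drop English stop words; keep rare and short words
# by setting min_word_freq and min_word_len to NA (ignored)
tokens <- textTokenizer(
  text = txt,
  lang = "english",
  min_word_freq = NA,
  min_word_len = NA
)

# One-hot style output: each word becomes a column flagging whether
# it is contained in each text (df = TRUE, as documented above)
wide <- textTokenizer(txt, lang = "english", min_word_freq = NA, df = TRUE)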