library(tidyverse) # basic R usability
library(hunspell) # spell checker (and other stuff)
library(textcat) # language prediction
Gibberish Detector
Introduction
Surveys may contain responses that are gibberish: unintelligible or meaningless text. It would be useful to identify these entries automatically.
Basic Gibberish Detector
Our basic gibberish detector calculates the ratio of correctly spelled words to total words in a string.
Setup
df <- tibble(
  id = c('english', 'ipsum', 'gibberish', 'spanish', 'quadratic'),
  text = c(
    'Contrary to popular belief, that is not simply random text.',
    'Et harum quidem rerum facilis est et expedita distinctio',
    'All mimsy were the borogoves, And the mome raths outgrabe',
    'es un hermoso dia para ir a caminar en el parque',
    'Negative b plus or minus the square root of B squared minus four ac over two a'
  )
)
# display
df |> knitr::kable()
id | text |
---|---|
english | Contrary to popular belief, that is not simply random text. |
ipsum | Et harum quidem rerum facilis est et expedita distinctio |
gibberish | All mimsy were the borogoves, And the mome raths outgrabe |
spanish | es un hermoso dia para ir a caminar en el parque |
quadratic | Negative b plus or minus the square root of B squared minus four ac over two a |
Let’s create a simple function that returns the ratio of correct words in a given string.
word_ratio <- function(input_text, show_output = FALSE) {
  if (show_output) { glue::glue('input text: {input_text}') |> print() }

  # replace every '-' with a space, then remove all remaining punctuation
  temp <- input_text |>
    str_replace_all("-", " ") |>
    str_replace_all("[[:punct:]]", "")

  if (show_output) { glue::glue('predicted language: {textcat(temp)}') |> print() }

  # split the string into words on single spaces
  temp <- str_split(temp, ' ', simplify = TRUE)

  # spell check: TRUE for each word found in the hunspell dictionary
  temp <- hunspell_check(temp)

  # ratio of correctly spelled words to total words
  ratio_correct <- sum(temp) / length(temp)

  if (show_output) { glue::glue('correct/total words: {sum(temp)}/{length(temp)}') |> print() }

  return(ratio_correct)
}
Let’s test it out.
ratio <- word_ratio(input_text = df$text[1], show_output = TRUE)
input text: Contrary to popular belief, that is not simply random text.
predicted language: english
correct/total words: 10/10
print(ratio)
[1] 1
Let’s apply our function to our example tibble.
df |>
  rowwise() |>
  mutate(
    ratio = word_ratio(text),
    language = textcat(text),
    # flag as gibberish when 75% or fewer of the words pass the spell check
    gibberish = ratio <= 0.75
  ) |>
  knitr::kable()
id | text | ratio | language | gibberish |
---|---|---|---|---|
english | Contrary to popular belief, that is not simply random text. | 1.0000000 | english | FALSE |
ipsum | Et harum quidem rerum facilis est et expedita distinctio | 0.1111111 | latin | TRUE |
gibberish | All mimsy were the borogoves, And the mome raths outgrabe | 0.5000000 | english | TRUE |
spanish | es un hermoso dia para ir a caminar en el parque | 0.3636364 | spanish | TRUE |
quadratic | Negative b plus or minus the square root of B squared minus four ac over two a | 1.0000000 | english | FALSE |
Basic method advantages
- Lightweight solution with minimal computational effort
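- Adaptable to other languages by passing a different dictionary to hunspell_check() through its dict argument (the example above uses the default English dictionary, which is why the Spanish sentence is flagged)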
Basic method shortcomings
- Dependent on spelling. If a customer types ‘computr’ instead of ‘computer’, the basic algorithm counts the word as gibberish. This can be compensated for, for example by accepting words that are within a small string-to-string edit distance of a dictionary word.
- Will not detect gibberish made of real words arranged nonsensically. For example, ‘car shiny computer running tree’ doesn’t make any sense but will pass our test.
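As a rough illustration of the edit-distance idea from the first shortcoming, the sketch below accepts a word when it is within one edit of a known word. It uses base R's adist(); the vocabulary vector and the is_near_word() helper are hypothetical names made up for this example, and a real setup would more likely compare against a full dictionary (hunspell's own hunspell_suggest() is another option).

```r
# Hypothetical mini-vocabulary; a real setup would use a full word list
vocabulary <- c("computer", "keyboard", "screen", "mouse")

# TRUE when `word` is within `max_dist` edits (insert, delete, substitute)
# of at least one vocabulary entry
is_near_word <- function(word, vocabulary, max_dist = 1) {
  distances <- utils::adist(tolower(word), vocabulary)
  min(distances) <= max_dist
}

is_near_word("computr", vocabulary) # one missing letter away from "computer"
is_near_word("xqzvkj", vocabulary)  # far from every vocabulary entry
```

A word that fails the spell check but passes a test like this could be counted as correct when computing the ratio. The max_dist cutoff is a judgment call: larger values forgive more typos but also let more gibberish through.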
Advanced Gibberish Detectors
In progress
To address the above shortcomings, we could use a more advanced approach, such as a two-character Markov chain that models how often characters (or words) occur next to one another.
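A minimal sketch of the character-level version of this idea, in base R, might look like the following. It counts how often each character follows another in some training text, converts the counts to log-probabilities, and scores a string by its average transition log-probability. The function names, the add-one smoothing, and the tiny training text are all assumptions for illustration; a usable model would need a much larger corpus and a tuned cutoff.

```r
# Build a table of log-probabilities for character-to-character transitions
train_bigram_model <- function(training_text, alphabet = c(letters, " "), smoothing = 1) {
  chars <- strsplit(tolower(paste(training_text, collapse = " ")), "")[[1]]
  chars <- chars[chars %in% alphabet]
  # start every count at `smoothing` so unseen pairs keep a nonzero probability
  counts <- matrix(smoothing, nrow = length(alphabet), ncol = length(alphabet),
                   dimnames = list(alphabet, alphabet))
  for (i in seq_len(max(length(chars) - 1, 0))) {
    counts[chars[i], chars[i + 1]] <- counts[chars[i], chars[i + 1]] + 1
  }
  # normalise each row into probabilities, then take logs
  log(counts / rowSums(counts))
}

# Average log-probability of the transitions in `text`;
# lower scores mean the character sequence looks less like the training text
score_text <- function(text, model) {
  chars <- strsplit(tolower(text), "")[[1]]
  chars <- chars[chars %in% rownames(model)]
  if (length(chars) < 2) return(NA_real_)
  from <- chars[-length(chars)]
  to <- chars[-1]
  mean(model[cbind(from, to)])
}

# Placeholder training text (hypothetical); far too small for real use
training_text <- c(
  "the quick brown fox jumps over the lazy dog",
  "surveys may contain responses that are hard to read",
  "it would be useful to identify these entries automatically"
)
model <- train_bigram_model(training_text)

score_text("the responses are useful", model)
score_text("zxqj kvwq pfgh", model)
```

With enough training data, ordinary text should tend to score higher than keyboard mashing, and a threshold on the score could replace or complement the spell-check ratio. The same structure would work for word-level transitions by splitting on spaces instead of characters.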
References
- https://github.com/domanchi/gibberish-detector
- https://github.com/rrenaud/Gibberish-Detector
- https://github.com/glender/gibber