Gibberish Detector

Author

Ryan Keeney

Introduction

Survey responses sometimes contain gibberish: unintelligible or meaningless text. It would be useful to flag these entries automatically.

Basic Gibberish Detector

Our basic gibberish detector calculates the ratio of correctly spelled words in a string.

Setup

library(tidyverse) # basic R usability
library(hunspell) # spell checker (and other stuff)
library(textcat) # language prediction

df <- tibble(
  id = c('english', 'ipsum', 'gibberish', 'spanish', 'quadratic'),
  text = c(
    'Contrary to popular belief, that is not simply random text.',
    'Et harum quidem rerum facilis est et expedita distinctio',
    'All mimsy were the borogoves, And the mome raths outgrabe',
    'es un hermoso dia para ir a caminar en el parque',
    'Negative b plus or minus the square root of B squared minus four ac over two a'
  )
)

# display
df |> knitr::kable()

id	text
english	Contrary to popular belief, that is not simply random text.
ipsum	Et harum quidem rerum facilis est et expedita distinctio
gibberish	All mimsy were the borogoves, And the mome raths outgrabe
spanish	es un hermoso dia para ir a caminar en el parque
quadratic	Negative b plus or minus the square root of B squared minus four ac over two a

Let’s create a simple function that returns the ratio of correct words in a given string.

word_ratio <- function(input_text, show_output=FALSE){
  
  if (show_output==TRUE){ glue::glue('input text: {input_text}') |> print()}

  # replace every '-' with a space and remove all punctuation
  # (str_replace() would only replace the first '-')
  temp <- input_text |> 
    str_replace_all("-", " ") |> 
    str_replace_all("[[:punct:]]", "")
  
  if (show_output==TRUE){glue::glue('predicted language: {textcat(temp)}') |> print()}
  
  # split string
  temp <- str_split(temp,' ',simplify = TRUE)
  
  # spell check
  temp <- hunspell_check(temp)
  
  # calc ratio
  ratio_correct = length(temp[temp])/length(temp)
  
  if (show_output==TRUE){  glue::glue('correct/total words: {length(temp[temp])}/{length(temp)}') |> print()}

  return(ratio_correct)
}

Let’s test it out.

ratio = word_ratio(input_text = df$text[1], show_output = TRUE)

input text: Contrary to popular belief, that is not simply random text.
predicted language: english
correct/total words: 10/10

print(ratio)

[1] 1

Let’s apply our function to our example tibble.

df |> 
  rowwise() |> 
  mutate(
    ratio = word_ratio(text),
    language = textcat(text),
    gibberish = ratio <= 0.75  # flag text where at most 75% of words pass the spell check
    ) |> 
  knitr::kable()

id	text	ratio	language	gibberish
english	Contrary to popular belief, that is not simply random text.	1.0000000	english	FALSE
ipsum	Et harum quidem rerum facilis est et expedita distinctio	0.1111111	latin	TRUE
gibberish	All mimsy were the borogoves, And the mome raths outgrabe	0.5000000	english	TRUE
spanish	es un hermoso dia para ir a caminar en el parque	0.3636364	spanish	TRUE
quadratic	Negative b plus or minus the square root of B squared minus four ac over two a	1.0000000	english	FALSE

Basic method advantages

Lightweight solution with minimal computational effort
Adaptable to different languages by swapping the spell-check dictionary
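
To make the language point concrete, here is a minimal sketch of a dictionary-aware variant of our ratio function. The name word_ratio_lang and its lang argument are our own, for illustration only. It assumes a hunspell dictionary for the requested language is installed where hunspell can find it; the package bundles English, and others (for example a Spanish "es_ES" dictionary) typically have to be installed separately.

```r
# Hypothetical helper: ratio of words accepted by the dictionary for `lang`.
# Assumption: hunspell::dictionary(lang) can locate that dictionary on this system.
word_ratio_lang <- function(input_text, lang = "en_US") {
  # replace hyphens with spaces, then strip punctuation (base R only)
  cleaned <- gsub("[[:punct:]]", "", gsub("-", " ", input_text, fixed = TRUE))

  # split on runs of whitespace so repeated spaces do not create empty "words"
  words <- strsplit(trimws(cleaned), "\\s+")[[1]]
  if (length(words) == 0) return(NA_real_)

  # hunspell_check() returns one TRUE/FALSE per word; the mean is the ratio
  mean(hunspell::hunspell_check(words, dict = hunspell::dictionary(lang)))
}
```

With a Spanish dictionary installed, word_ratio_lang(df$text[4], lang = "es_ES") would score the Spanish sentence against Spanish words instead of English ones. hunspell::list_dictionaries() shows which dictionaries are available.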

Basic method shortcomings

Dependent on spelling. If a customer inputs ‘computr’ instead of ‘computer’, then the basic algorithm will treat it as gibberish. There are methods for compensating for this, such as calculating suggested words through string-to-string edit distance.
Will not detect gibberish made up of real words arranged nonsensically. For example, ‘car shiny computer running tree’ doesn’t make any sense but will pass our test.
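
The edit-distance idea from the first shortcoming can be sketched with base R's adist(), which computes the Levenshtein distance between strings. The vocabulary vector, the is_near_word helper, and the max_dist cutoff below are assumptions made for illustration. In practice the candidate words could come from a spell checker such as hunspell_suggest().

```r
# Hypothetical reference vocabulary; a real one would come from a dictionary.
vocab <- c("computer", "keyboard", "monitor", "printer")

# TRUE when `word` is within `max_dist` single-character edits of any vocabulary entry.
is_near_word <- function(word, vocab, max_dist = 1) {
  distances <- utils::adist(tolower(word), vocab)  # 1 x length(vocab) distance matrix
  min(distances) <= max_dist
}

is_near_word("computr", vocab)  # TRUE: one insertion away from "computer"
is_near_word("xqzvkp", vocab)   # FALSE: no vocabulary word is that close
```

Counting near misses as real words would keep a typo like ‘computr’ from lowering the ratio, at the cost of occasionally accepting gibberish that happens to sit close to a real word.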

Advanced Gibberish Detectors

In progress

To address the above shortcomings, we could use a more advanced approach, such as a two-character Markov chain that models how often characters (or words) occur next to one another.
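
As a rough illustration of the character-level idea, the sketch below trains a bigram transition table on a toy corpus and scores text by its average log transition probability. Everything here is an assumption for demonstration: the function names, the three-sentence corpus, and the small pseudo-count alpha are ours, and a usable model would need far more training text.

```r
# Learn log transition probabilities between characters (a-z and space).
# `alpha` is a small pseudo-count so unseen pairs keep a nonzero probability.
train_bigram_model <- function(corpus, alpha = 0.01) {
  chars <- c(letters, " ")
  counts <- matrix(alpha, nrow = 27, ncol = 27, dimnames = list(chars, chars))
  for (line in corpus) {
    x <- strsplit(gsub("[^a-z ]", "", tolower(line)), "")[[1]]
    if (length(x) < 2) next
    for (i in seq_len(length(x) - 1)) {
      counts[x[i], x[i + 1]] <- counts[x[i], x[i + 1]] + 1
    }
  }
  log(counts / rowSums(counts))  # normalise each row, then take logs
}

# Average log probability per character transition; higher means more corpus-like.
score_text <- function(text, model) {
  x <- strsplit(gsub("[^a-z ]", "", tolower(text)), "")[[1]]
  if (length(x) < 2) return(NA_real_)
  pairs <- cbind(x[-length(x)], x[-1])  # (current, next) character pairs
  mean(model[pairs])
}

# Toy training corpus (assumption: stands in for a large body of real text).
corpus <- c(
  "the quick brown fox jumps over the lazy dog",
  "contrary to popular belief that is not simply random text",
  "negative b plus or minus the square root of b squared"
)
model <- train_bigram_model(corpus)

score_real <- score_text("the square root", model)
score_gibberish <- score_text("zxqj vkqz", model)
score_real > score_gibberish  # TRUE
```

A cutoff between the two kinds of score would have to be tuned on examples of known-good and known-bad text. Because the model looks at character sequences rather than dictionary lookups, it is less sensitive to small typos, but a character-level model still will not catch nonsensical sentences built from real words; that would need the word-level variant.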

References

https://github.com/domanchi/gibberish-detector
https://github.com/rrenaud/Gibberish-Detector
https://github.com/glender/gibber