library(readr)
library(quanteda)
library(quanteda.textplots)
ufo_sightings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-06-20/ufo_sightings.csv')
places <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-06-20/places.csv')
day_parts_map <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-06-20/day_parts_map.csv')
This week’s Tidy Tuesday dataset contains data on UFO sightings across the world (though the USA has by far the largest number of UFO sightings of any country).
# top 10 countries by UFO sightings
summary(as.factor(places$country))[1:10]
## USA United Kingdom Canada Australia India
## 12063 552 445 185 95
## Netherlands Germany Brazil Mexico New Zealand
## 64 60 47 46 38
In addition to the time and place the UFO was seen, the data also contains the duration of the sighting and a description or account of the sighting. It is these accounts that I find most interesting for today’s project.
head(ufo_sightings$summary)
## [1] "Saw multi color object above horizon."
## [2] "An object in the shape of a straight line about an inch from our viewing area moving slowly across the sky"
## [3] "Tone in the air."
## [4] "Black tic-tac shaped ufo. Moved with insane speed"
## [5] "Two alien were scanning me"
## [6] "Long cigar solid shaped craft with light beam"
To analyze the text data contained in the UFO sighting summaries, I will rely on the quanteda package, which has a clearly defined workflow: raw text to corpus to tokens to document feature matrix (dfm). Along the way I will remove any punctuation, symbols, numbers, URLs, and separators from the text. I will also make all the text lowercase and remove English stopwords as well as very short words. (Stopwords are words that contain no useful semantic content, i.e. they are meaningless. For example, “the,” “you,” be,” “on,” etc.) I will then stem the words, i.e. each word gets decomposed into its word root. This helps ensure that words like “alien” (singular) and “aliens” (plural) get treated the same way during analysis and visualization. Finally, I will weight the words based on their TF-IDF score (Term Frequency-Inverse Document Frequency). This score essentially penalizes words that are very common across all the UFO sighting accounts so that these very common words don’t “hog” all the space in subsequent analysis and visualization.
# quanteda workflow
ufo_corp <- corpus(ufo_sightings$summary)
ufo_toks <- tokens(ufo_corp, remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE,
remove_separators = TRUE)
ufo_dfm <- dfm(ufo_toks, tolower = TRUE)
ufo_dfm <- dfm_remove(ufo_dfm, pattern = stopwords('en'))
ufo_dfm <- dfm_select(ufo_dfm, min_nchar = 3)
ufo_dfm <- dfm_wordstem(ufo_dfm)
ufo_tfidf_dfm <- dfm_tfidf(ufo_dfm)
Before proceeding, let’s visualize the most common words in these UFO sighting accounts using a word cloud.
#topfeatures(ufo_tfidf_dfm, 20)
textplot_wordcloud(ufo_tfidf_dfm, min_count=1500, color=RColorBrewer::brewer.pal(8, "Paired"))
One aspect of these texts that immediately catches my eye is the frequency of color descriptors within these accounts. It makes sense, when observing a UFO, that color would be one of the clearest and most available ways to describe what one is seeing. “Red,” “orang,” and “white” appear most frequently within the texts, as evidenced by their large size within the word cloud. As a next step, I’d like to further investigate how colors are used to describe UFO sightings.
I’ll spare you the details on this one. I examined the top 500 words in the text using the topfeatures() function and identified any color terms that appeared there. The most used color terms are (in order of frequency): white, orange, red, green, blue, black, yellow, silver, amber, reddish, grey, yellowish, gray.
I’d like to know what exactly people are describing as “amber” or “yellowish,” so I need to explore the context around these color terms. I’m especially interest in the word that comes right after the color, since the color likely modifies that word, e.g. “a red light.” To do this I can use quanteda’s kwic() (key word in context) function. In a few cases, such as “yellow” and “yellowish,” I’ll combine the two outputs of the kwic() function since these two colors are more-or-less the same.
white_words <- kwic(ufo_toks, "white", window=1)
orange_words <- kwic(ufo_toks, "orange", window=1)
red_words <- kwic(ufo_toks, "red", window=1)
red_words <- rbind(red_words, kwic(ufo_toks, "reddish", window=1))
green_words <- kwic(ufo_toks, "green", window=1)
blue_words <- kwic(ufo_toks, "blue", window=1)
black_words <- kwic(ufo_toks, "black", window=1)
yellow_words <- kwic(ufo_toks, "yellow", window=1)
yellow_words <- rbind(yellow_words, kwic(ufo_toks, "yellowish", window=1))
silver_words <- kwic(ufo_toks, "silver", window=1)
amber_words <- kwic(ufo_toks, "amber", window=1)
gray_words <- kwic(ufo_toks, "gray", window=1)
gray_words <- rbind(gray_words, kwic(ufo_toks, "grey", window=1))
head(gray_words)
## Keyword-in-context with 6 matches.
## [text183, 20] other | gray | white
## [text683, 1] | gray | beings
## [text1021, 9] a | gray | being
## [text1200, 7] silver | gray | parasail
## [text1280, 5] witness | gray | disc
## [text1328, 17] feet | Gray | top
To show a brief example, the kwic() function produces a data frame that identifies which texts contain the word “gray,” as well as one word before and one word after “gray.” I could have added additional words before/after using the window parameter, but in this case I’m really only interested in the one word that comes after the color term. Above I can see that the word “gray” often describes “beings,” i.e. aliens…?
With the work done so far, it’s quite easy to generate word clouds for each color, so that I can visualize what each color term most frequently describes. I will simply repeat the quanteda workflow from before for each color.
# create custom function to repeat quanteda workflow
prep_text <- function(df, my_color=NULL){
text_corp <- corpus(df$post)
text_toks <- tokens(text_corp)
text_dfm <- dfm(text_toks, tolower = TRUE)
text_dfm <- dfm_remove(text_dfm, pattern = stopwords('en'))
text_dfm <- dfm_select(text_dfm, min_nchar = 3)
text_dfm <- dfm_wordstem(text_dfm)
text_dfm <- dfm_tfidf(text_dfm)
if(is.null(my_color)){
my_color <- strsplit(deparse(substitute(df)),"_")[[1]][1]
}
textplot_wordcloud(text_dfm, color=my_color, min_size=1, max_size=3, min_count=10)
mtext(paste("Words described as", strsplit(deparse(substitute(df)),"_")[[1]][1], "in UFO sightings"),
side=3, line=3, at=-0.07, adj=0, cex=1)
}
prep_text(df=white_words, my_color="antiquewhite3")
prep_text(df=orange_words)
prep_text(df=red_words)
prep_text(df=green_words, my_color="green4")
prep_text(df=blue_words)
prep_text(df=black_words)
prep_text(df=yellow_words, my_color="gold")
prep_text(df=silver_words, my_color="snow4")
prep_text(df=amber_words, my_color="goldenrod")
prep_text(df=gray_words, my_color="gray40")
The word clouds show some common trends. For instance, almost all the colors are most frequently used to describe light. The exceptions to this are black, gray, and silver. These colors are more often used to describe objects (esp. disks and spheres), and black, interestingly, is frequently used to describe triangles. After some Googling, I’ve discovered that “black triangle” UFOs appear to be a common enough phenomenon that there is a Wikipedia article dedicated to them.
The colors people use when describing UFO sightings are intriguing. It appears that many UFO sightings involve colored lights, with the occasional gray/silver object or black triangle witnessed.