Manipulating Data

Once a scientist has their data, they need to manipulate it. Manipulation does NOT mean that a scientist makes the data say what they want it to say. Rather, data manipulation focuses on cleaning up and organizing the data so the researcher can gain insights, see patterns, and get the data ready for further analysis. Data scientists often refer to data manipulation as "cleaning." The annual CrowdFlower report, which surveys data scientists, has reported since 2015 that cleaning data takes at least half of most data scientists' time. In other words, data manipulation is an extremely important part of scientific inquiry!

 

When cleaning data, a scientist might look for outliers, sort the data, filter the data, and standardize the data (e.g., put every value into the same units). For example, consider the following raw dataset from our gum preference study. Are there any outliers? How would you sort it? What insights can you gain from filtering the data (e.g., by gender or age)?

AGE GENDER PEPPERMINT CINNAMON WATERMELON BUBBLEGUM SOUR PATCH candy
68 male easy to breathe good makes mouth water meh (plain) yuck
over 80 male good okay doesn't like not that bad good
over 75 female like yuck like too sweet like/ yum
over 70 male good/like good not bad & not good meh doesn't like, least favorite
31 male interesting makes feel warm it's good boring really like it
48 female feels like shiny teeth/okay too spicy delicious and favorite so good too sour!
10 female yum and good eww (cough, cough) amazing so good good
47 male good likes it excelent and good so good good
65 female meh don't like like like absolutely love
19 female burns burns mild tingly sour
over 18 female cool pepperminty? hot cool, mild plesant sour?
22 male cool tingly tingly juicy peachy
over 20 male fresh minty nothing, hot, burns tart fruity sour tamerom
19 female minty tart, gross, sour fruity tingle good sour (eye-roll)
over 20 male minty tingles the jaw slow spurning flavor mild not tasty
16 male bitter somee what spicy fruity, tarp smooth tart
over 20 female minty and clean minty fruity bland sweet and sour
over 20 female surprising spice spunky typical bubblgum very tart
15 male don't like don't like meh don't like don't like
40 male refreshed likes it yummy don't like yummy!
14 male disappointed disappointed good disappointed good & bad
16 male it's okay good like it don't like yummy
9 male really likes it good likes it meh love it
54 male plesant, spicey really like the spicey not my favorite not my favorite disappointed
33 male fresh/cool spice lively and so sweet sweet not sweet 'tastes like sour grape'
39 male makes breathe easier hot/fresh sweet to the bite wants to eat (so sweet) tart flavor
61 male fresh minty tingly really fruity fruity very sour
60 female refreshing not that hot/ mild cinnamon sour fruit fruitier the more you chew it pucker extremly sour doesn't like
25 male minty...sharp delightful too sweet (tangy) reminds of childhood suprised look
22 female sweet minty cool spicy and burns tongue a little tart/sweet reminds of childhood/sweet suprised look/delicious
11 female runny nose, breath easier with the throat burns tongue, and tastes good sweet, makes her feel pop meh sour, don't like
11 female breathes easier, clear nose burns, spicy, love super sweet, love it! plain meh sour, tropical, likes

First off, most respondents gave their age in years. While years can be treated as a continuous value, what if we wanted to compare groups of individuals? Say, adults vs. kids? Or even kids, young adults, middle-aged adults, and older adults? In that case, we need to place each response into one of these categories.

Also, notice that this data was collected in a way that allowed individuals to give whatever response they wanted. Are all responses usable, given our research question (i.e., what are people's responses)? What would the data look like if we converted people's open-ended responses to our aforementioned Likert scale? After cleaning up the data, it might look something like this (sorted from youngest to oldest). Note that we had to create a "can't tell" option because some of the open-ended responses only described the flavor instead of the individual's reaction (an issue that a better data creation technique could solve).

AGE      | GENDER | PEPPERMINT  | CINNAMON       | WATERMELON  | BUBBLEGUM   | SOURPATCH
under 18 | female | really like | really dislike | really like | really like | like
under 18 | male   | dislike     | can't tell     | can't tell  | can't tell  | can't tell
under 18 | male   | dislike     | dislike        | meh         | dislike     | dislike
under 18 | male   | dislike     | dislike        | like        | dislike     | meh
under 18 | male   | meh         | like           | like        | dislike     | like
under 18 | male   | really like | like           | like        | meh         | really like
under 18 | female | like        | like           | like        | meh         | dislike
under 18 | female | like        | really like    | really like | meh         | like
18-30    | female | can't tell  | can't tell     | can't tell  | can't tell  | can't tell
18-30    | female | can't tell  | can't tell     | can't tell  | like        | can't tell
18-30    | male   | can't tell  | can't tell     | can't tell  | can't tell  | can't tell
18-30    | male   | can't tell  | can't tell     | can't tell  | can't tell  | can't tell
18-30    | female | can't tell  | really dislike | can't tell  | like        | can't tell
18-30    | male   | can't tell  | can't tell     | can't tell  | can't tell  | dislike
18-30    | female | can't tell  | can't tell     | can't tell  | meh         | can't tell
18-30    | female | can't tell  | can't tell     | can't tell  | can't tell  | can't tell
18-30    | male   | can't tell  | really like    | dislike     | like        | can't tell
18-30    | female | can't tell  | can't tell     | can't tell  | like        | really like
31-50    | male   | can't tell  | can't tell     | like        | meh         | really like
31-50    | female | meh         | dislike        | really like | really like | really dislike
31-50    | male   | like        | like           | really like | really like | like
31-50    | male   | like        | like           | really like | dislike     | really like
31-50    | male   | can't tell  | can't tell     | can't tell  | can't tell  | can't tell
31-50    | male   | like        | can't tell     | can't tell  | like        | can't tell
over 50  | male   | can't tell  | like           | can't tell  | meh         | really dislike
over 50  | male   | like        | meh            | dislike     | meh         | like
over 50  | female | like        | really dislike | like        | dislike     | really like
over 50  | male   | like        | like           | meh         | meh         | really dislike
over 50  | female | meh         | dislike        | like        | like        | really like
over 50  | male   | like        | really like    | dislike     | meh         | really like
over 50  | male   | can't tell  | can't tell     | can't tell  | can't tell  | can't tell
over 50  | female | like        | can't tell     | can't tell  | can't tell  | really dislike
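
For readers who want to try this kind of cleaning on a computer, here is a minimal sketch in Python of the two steps described above: grouping ages and converting open-ended responses to Likert categories. It is only an illustration under stated assumptions, not the method used to build the table. The function names (age_group, code_response) and the LIKERT lookup are hypothetical, the lookup covers just a handful of responses, and a response such as "over 20" is assumed to count as age 20.

```python
import re


def age_group(raw_age):
    """Place a raw age response such as '68' or 'over 20' into an age group."""
    match = re.search(r"\d+", raw_age)
    if match is None:
        raise ValueError(f"no number found in age response {raw_age!r}")
    # Assumption: a response like 'over 20' is treated as age 20.
    age = int(match.group())
    if age < 18:
        return "under 18"
    if age <= 30:
        return "18-30"
    if age <= 50:
        return "31-50"
    return "over 50"


# Hypothetical lookup built by hand. A person still has to read each
# response and decide which category it belongs to.
LIKERT = {
    "absolutely love": "really like",
    "good": "like",
    "like": "like",
    "meh": "meh",
    "don't like": "dislike",
    "yuck": "really dislike",
}


def code_response(raw_response):
    """Convert an open-ended response to a Likert category."""
    key = raw_response.strip().lower()
    # Responses that only describe the flavor fall through to "can't tell".
    return LIKERT.get(key, "can't tell")


ages = ["68", "over 20", "10", "47"]
responses = ["Good", "yuck", "tingly", "absolutely love"]

print([age_group(a) for a in ages])
print([code_response(r) for r in responses])
```

Notice that "tingly" becomes "can't tell": it describes the flavor, not whether the person liked it. A lookup like this only saves typing. The judgment about what each response means still comes from the researcher.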

 

This data is now starting to reveal some insights. We can begin to compare how different groups responded to each flavor.
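
As one way to picture filtering and sorting, the short Python sketch below works with six rows copied from the cleaned table. It keeps only the respondents over 50, tallies their responses to the sour patch flavor, and sorts all six rows by gender. The variable names are hypothetical, and six rows are far too few to draw real conclusions. The point is only to show what the steps look like.

```python
from collections import Counter

COLUMNS = ["AGE", "GENDER", "PEPPERMINT", "CINNAMON",
           "WATERMELON", "BUBBLEGUM", "SOURPATCH"]

# Six rows copied from the cleaned table, to keep the example short.
raw_rows = [
    ("under 18", "female", "really like", "really dislike", "really like", "really like", "like"),
    ("under 18", "male", "dislike", "dislike", "meh", "dislike", "dislike"),
    ("31-50", "male", "like", "like", "really like", "really like", "like"),
    ("over 50", "female", "like", "really dislike", "like", "dislike", "really like"),
    ("over 50", "male", "like", "like", "meh", "meh", "really dislike"),
    ("over 50", "female", "meh", "dislike", "like", "like", "really like"),
]

# Turn each row into a dictionary so values can be looked up by column name.
rows = [dict(zip(COLUMNS, row)) for row in raw_rows]

# Filter: keep only the respondents over 50.
over_50 = [row for row in rows if row["AGE"] == "over 50"]

# Tally how that group responded to the sour patch flavor.
sour_counts = Counter(row["SOURPATCH"] for row in over_50)

# Sort: order all rows by gender, then by peppermint response (alphabetically).
by_gender = sorted(rows, key=lambda row: (row["GENDER"], row["PEPPERMINT"]))

print(sour_counts)
for row in by_gender:
    print(row["GENDER"], row["AGE"], row["PEPPERMINT"])
```

Changing the filter (for example, to "under 18") and running the tally again gives a second set of counts to compare against the first.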

Think About It...

Here is an actual dataset of COVID-19 cases in India during the first half of 2020. What are some things you could do to clean up this dataset? Do you notice any outliers? Are there columns of data that you don't need? How would you sort or filter the data?