Manipulating Data
Once a scientist has their data, they need to manipulate it. Manipulation does NOT mean that a scientist makes the data say what they want it to say. Rather, data manipulation focuses on cleaning up and organizing the data so the researcher can gain insights, see patterns, and prepare the data for further analysis. Data scientists often refer to data manipulation as "cleaning." The annual CrowdFlower report surveys data scientists and, since 2015, has reported that cleaning data takes at least half of most data scientists' time. In other words, data manipulation is an extremely important part of scientific inquiry!
When cleaning data, a scientist might look for outliers, sort the data, filter the data, and standardize the data (i.e., get it into the same units). For example, consider the following raw dataset from our gum preference study. Are there any outliers? How would you sort it? What insights can you gain from filtering the data (e.g., by gender or age)?
AGE | GENDER | PEPPERMINT | CINNAMON | WATERMELON | BUBBLEGUM | SOUR PATCH candy |
---|---|---|---|---|---|---|
68 | male | easy to breathe | good | makes mouth water | meh (plain) | yuck |
over 80 | male | good | okay | doesn't like | not that bad | good |
over 75 | female | like | yuck | like | too sweet | like/ yum |
over 70 | male | good/like | good | not bad & not good | meh | doesn't like, least favorite |
31 | male | interesting | makes feel warm | it's good | boring | really like it |
48 | female | feels like shiny teeth/okay | too spicy | delicious and favorite | so good | too sour! |
10 | female | yum and good | eww (cough, cough) | amazing | so good | good |
47 | male | good | likes it | excelent and good | so good | good |
65 | female | meh | don't like | like | like | absolutely love |
19 | female | burns | burns | mild | tingly | sour |
over 18 | female | cool pepperminty? | hot | cool, mild | plesant | sour? |
22 | male | cool | tingly | tingly | juicy | peachy |
over 20 | male | fresh minty | nothing, hot, burns | tart | fruity | sour tamerom |
19 | female | minty | tart, gross, sour | fruity tingle | good | sour (eye-roll) |
over 20 | male | minty | tingles the jaw | slow spurning flavor | mild | not tasty |
16 | male | bitter | somee what spicy | fruity, tarp | smooth | tart |
over 20 | female | minty and clean | minty | fruity | bland | sweet and sour |
over 20 | female | surprising | spice | spunky | typical bubblgum | very tart |
15 | male | don't like | don't like | meh | don't like | don't like |
40 | male | refreshed | likes it | yummy | don't like | yummy! |
14 | male | disappointed | disappointed | good | disappointed | good & bad |
16 | male | it's okay | good | like it | don't like | yummy |
9 | male | really likes it | good | likes it | meh | love it |
54 | male | plesant, spicey | really like the spicey | not my favorite | not my favorite | disappointed |
33 | male | fresh/cool | spice | lively and so sweet | sweet | not sweet 'tastes like sour grape' |
39 | male | makes breathe easier | hot/fresh | sweet to the bite | wants to eat (so sweet) | tart flavor |
61 | male | fresh minty | tingly | really fruity | fruity | very sour |
60 | female | refreshing | not that hot/ mild cinnamon | sour fruit | fruitier the more you chew it | pucker extremly sour doesn't like |
25 | male | minty...sharp | delightful | too sweet (tangy) | reminds of childhood | suprised look |
22 | female | sweet minty cool | spicy and burns tongue | a little tart/sweet | reminds of childhood/sweet | suprised look/delicious |
11 | female | runny nose, breath easier with the throat | burns tongue, and tastes good | sweet, makes her feel pop | meh | sour, don't like |
11 | female | breathes easier, clear nose | burns, spicy, love | super sweet, love it! | plain meh | sour, tropical, likes |
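For readers who want to try sorting, filtering, and checking for outliers on a computer, here is a minimal Python sketch using a handful of age and gender entries copied from the raw table. It is only one possible approach. The helper `parse_age`, the choice to record "over 80" as 80, and the "plausible age" range are all assumptions made for illustration, not part of the study.

```python
import re

# A few (age, gender) entries copied from the raw table above.
raw_rows = [
    ("68", "male"),
    ("over 80", "male"),
    ("31", "male"),
    ("10", "female"),
    ("over 18", "female"),
    ("9", "male"),
]


def parse_age(text):
    """Return the first whole number found in an age entry, or None."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


# Standardize: store every age as an integer.
# Assumption: an entry like "over 80" is recorded as its lower bound, 80,
# and a flag remembers that the true age may be higher.
cleaned = [
    {
        "age": parse_age(age),
        "age_is_lower_bound": age.startswith("over"),
        "gender": gender,
    }
    for age, gender in raw_rows
]

# Sort: youngest to oldest.
by_age = sorted(cleaned, key=lambda row: row["age"])

# Filter: keep only one group of respondents.
females = [row for row in cleaned if row["gender"] == "female"]

# Outlier check (hypothetical rule): flag missing or implausible ages.
suspicious = [
    row for row in cleaned
    if row["age"] is None or not 0 < row["age"] < 120
]

for row in by_age:
    print(row)
```

Keeping the `age_is_lower_bound` flag is a design choice: it lets you sort by a number without forgetting that some respondents did not give an exact age.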
First off, most respondents gave their age in years. While years can be treated as a continuous value, what if we wanted to compare groups of individuals? Say, adults vs. kids? Or even kids, young adults, middle-aged adults, and older adults? In that case, we need to place each response into one of these categories. Also, notice that this data was created in a way that allowed individuals to give whatever response they wanted. Are all responses usable for our research question (i.e., what are people's responses)? What would the data look like if we were to convert people's open-ended responses to our aforementioned Likert scale? After cleaning up the data, it might look something like this (sorted from youngest to oldest). Note that we had to create a "can't tell" option because some of the open-ended responses only described the flavor instead of the individual's reaction (an issue that could be solved by a better data creation technique).
AGE | GENDER | PEPPERMINT | CINNAMON | WATERMELON | BUBBLEGUM | SOURPATCH |
---|---|---|---|---|---|---|
under 18 | female | really like | really dislike | really like | really like | like |
under 18 | male | dislike | can't tell | can't tell | can't tell | can't tell |
under 18 | male | dislike | dislike | meh | dislike | dislike |
under 18 | male | dislike | dislike | like | dislike | meh |
under 18 | male | meh | like | like | dislike | like |
under 18 | male | really like | like | like | meh | really like |
under 18 | female | like | like | like | meh | dislike |
under 18 | female | like | really like | really like | meh | like |
18-30 | female | can't tell | can't tell | can't tell | can't tell | can't tell |
18-30 | female | can't tell | can't tell | can't tell | like | can't tell |
18-30 | male | can't tell | can't tell | can't tell | can't tell | can't tell |
18-30 | male | can't tell | can't tell | can't tell | can't tell | can't tell |
18-30 | female | can't tell | really dislike | can't tell | like | can't tell |
18-30 | male | can't tell | can't tell | can't tell | can't tell | dislike |
18-30 | female | can't tell | can't tell | can't tell | meh | can't tell |
18-30 | female | can't tell | can't tell | can't tell | can't tell | can't tell |
18-30 | male | can't tell | really like | dislike | like | can't tell |
18-30 | female | can't tell | can't tell | can't tell | like | really like |
31-50 | male | can't tell | can't tell | like | meh | really like |
31-50 | female | meh | dislike | really like | really like | really dislike |
31-50 | male | like | like | really like | really like | like |
31-50 | male | like | like | really like | dislike | really like |
31-50 | male | can't tell | can't tell | can't tell | can't tell | can't tell |
31-50 | male | like | can't tell | can't tell | like | can't tell |
over 50 | male | can't tell | like | can't tell | meh | really dislike |
over 50 | male | like | meh | dislike | meh | like |
over 50 | female | like | really dislike | like | dislike | really like |
over 50 | male | like | like | meh | meh | really dislike |
over 50 | female | meh | dislike | like | like | really like |
over 50 | male | like | really like | dislike | meh | really like |
over 50 | male | can't tell | can't tell | can't tell | can't tell | can't tell |
over 50 | female | like | can't tell | can't tell | can't tell | really dislike |
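The two recoding steps described above (grouping ages and converting open-ended answers to a Likert scale) could be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used for the table: the age boundaries are read off the table's group labels, and the `LIKERT_LOOKUP` dictionary is a small hypothetical coding scheme. In practice, a researcher would need to read and judge each response.

```python
def age_group(age):
    """Place an age in years into one of the four groups used in the table."""
    if age < 18:
        return "under 18"
    if age <= 30:
        return "18-30"
    if age <= 50:
        return "31-50"
    return "over 50"


# Hypothetical coding scheme: only a few example phrases are listed here.
LIKERT_LOOKUP = {
    "really like it": "really like",
    "absolutely love": "really like",
    "good": "like",
    "like": "like",
    "meh": "meh",
    "don't like": "dislike",
    "yuck": "really dislike",
}


def to_likert(response):
    """Map an open-ended response to a Likert category.

    Anything not in the lookup (for example, a description of the flavor
    such as "tingly") becomes "can't tell".
    """
    text = response.strip().lower()
    return LIKERT_LOOKUP.get(text, "can't tell")


# One respondent from the raw table: age 15, answers for the five flavors.
answers = ["don't like", "don't like", "meh", "don't like", "don't like"]
print(age_group(15), [to_likert(answer) for answer in answers])
```

A lookup table like this makes the coding decisions visible and repeatable, but it also shows why "can't tell" is needed: many responses describe the gum rather than the person's reaction to it.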
This data is now starting to reveal some insights. We can start to compare responses of different groups to the different flavors.
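As one possible way to make such a comparison, the sketch below tallies the peppermint ratings for two of the age groups. The ratings are copied from the cleaned table; the variable names are chosen for illustration only.

```python
from collections import Counter, defaultdict

# (age group, peppermint rating) pairs copied from the cleaned table
# for the youngest and oldest groups.
peppermint = [
    ("under 18", "really like"), ("under 18", "dislike"),
    ("under 18", "dislike"), ("under 18", "dislike"),
    ("under 18", "meh"), ("under 18", "really like"),
    ("under 18", "like"), ("under 18", "like"),
    ("over 50", "can't tell"), ("over 50", "like"),
    ("over 50", "like"), ("over 50", "like"),
    ("over 50", "meh"), ("over 50", "like"),
    ("over 50", "can't tell"), ("over 50", "like"),
]

# Count how often each rating appears within each age group.
tallies = defaultdict(Counter)
for group, rating in peppermint:
    tallies[group][rating] += 1

for group, counts in tallies.items():
    print(group, dict(counts))
```

With only eight respondents per group, counts like these can suggest a pattern worth investigating, but they are too small to support firm conclusions.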
Think About It...
Download this actual dataset of COVID-19 cases in India during the first half of 2020. What are some things you could do to clean up this dataset? Do you notice any outliers? Are there columns of data that you don't need? How would you sort or filter the data?