Application 2: Clean Data

Due May 9, 2017 by 5pm
Points 100
Submitting a file upload
Available until May 30, 2017 at 11:59pm


Document on data cleaning: 14021001 Cleaning data with R.docx

The data cleaning assignment has four components, submitted in a total of two files. One file, in Excel format, should contain the data, codebook, and research notes (items 1-3). The second file, in Word format, should contain a copy of the data summary from R (item 4). Sample submission files are listed at the end of these instructions. The four components are:

1. A copy of your complete and clean dataset in an Excel document (titled and versioned) in a sheet entitled "data"

A “clean” dataset has the following characteristics:

The codebook associated with the data is complete and accurate
The maximum and minimum values for all variables make sense
The maximum and minimum values for all variables match your codebook
Any missing data for which you can enter a known value or a reasonable and conservative estimate has been recovered. For example, suppose you ask respondents whether they are employed, and skip logic sends only those who are employed to the question about personal income. The personal income variable will then have many missing values, but anyone with a missing value who indicated that they were unemployed probably has a personal income of 0.
All observations with missing data that could not be recovered have been deleted, leaving an equal number of observations for every variable (unless you have a clear justification for keeping observations with missing data; for example, they are a special subset of the larger population).
All ordinal data are ordered so that higher values match higher levels of the variable, and no non-ordinal values are included.
Any data that is no longer useful (variables that have been recoded, observations with missing data, etc.) has been deleted.
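
The checks above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data frame `survey`, its variable names, and its values are all made up, and your own recovery rules should follow your codebook and be recorded in your research notes.

```r
# Hypothetical survey data; every name and value here is invented for illustration.
survey <- data.frame(
  id       = 1:6,
  employed = c(1, 0, 1, 0, 1, 1),                  # 1 = yes, 0 = no
  income   = c(42000, NA, 55000, NA, NA, 61000),   # asked only if employed
  educ     = c("hs", "ba", "ma", "hs", "ba", NA),  # ordinal, stored as text
  stringsAsFactors = FALSE
)

# Inspect the minimum and maximum of each numeric variable and compare
# them by eye against the codebook.
sapply(survey[, c("employed", "income")], range, na.rm = TRUE)

# Recover missing income for respondents who skipped the question because
# they were not employed (assumed here to mean an income of 0).
not_employed <- !is.na(survey$employed) & survey$employed == 0
survey$income[not_employed & is.na(survey$income)] <- 0

# Recode the ordinal variable so higher numbers mean higher levels.
# The level order (hs < ba < ma) is an assumption of this example.
survey$educ <- as.integer(factor(survey$educ, levels = c("hs", "ba", "ma")))

# Delete observations whose missing data could not be recovered.
clean <- survey[complete.cases(survey), ]

# Every variable should now have the same number of non-missing observations.
colSums(!is.na(clean))
```

Working on a copy (`clean`) rather than overwriting `survey` keeps the original available in case a recode needs to be checked or undone.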

2. A fully complete and current vertical-format (transposed) codebook in a sheet entitled "codebook"

A complete codebook includes the following elements for each variable:

Variable content (what information is presented here?)
Variable source (data source, question number, etc.)
Level of measurement (categorical, binary, ordinal, interval, text)
Coding or units (what do the numbers mean? What units of measurement are being used?)
Notes (for example, is this variable a recode of another variable? Was there anything unusual about the measurement?)
Variable name (alphanumeric, no spaces, preferably 8 characters or fewer)

A complete codebook has the following characteristics:

Complete (no cells missing or blank)
Accurate (matches the data)
Descriptive (contains all necessary information)
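
Some of these codebook properties can be checked mechanically once both sheets are in R. The sketch below is an assumption-laden example, not a required step: `data_sheet`, `codebook`, and all column names are hypothetical, and whether a codebook is descriptive enough still needs a human read.

```r
# Hypothetical "data" sheet and vertical-format "codebook" sheet.
data_sheet <- data.frame(
  id       = 1:3,
  employed = c(1, 0, 1),
  income   = c(42000, 0, 55000)
)

codebook <- data.frame(
  name    = c("id", "employed", "income"),
  content = c("Respondent identifier", "Currently employed", "Personal annual income"),
  source  = c("Assigned", "Survey Q1", "Survey Q2"),
  level   = c("text", "binary", "interval"),
  coding  = c("Sequential number", "1 = yes, 0 = no", "US dollars"),
  notes   = c("None", "None", "Set to 0 when not employed"),
  stringsAsFactors = FALSE
)

# Complete: count cells that are missing or contain only whitespace.
cells   <- unlist(lapply(codebook, as.character))
n_blank <- sum(is.na(cells) | trimws(cells) == "")

# Accurate: every variable in the data has a codebook row, and vice versa.
missing_from_codebook <- setdiff(names(data_sheet), codebook$name)
missing_from_data     <- setdiff(codebook$name, names(data_sheet))

# Variable names: alphanumeric, no spaces, 8 characters or fewer.
# (The length limit is a preference in these instructions, not a hard rule.)
bad_names <- codebook$name[!grepl("^[A-Za-z0-9]{1,8}$", codebook$name)]

n_blank
missing_from_codebook
missing_from_data
bad_names
```

If all four results are zero or empty, the codebook is at least complete and consistent with the variable names in the data.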

3. A detailed record of the changes made to the database and a corresponding record of database versions in a sheet entitled "research notes"

Complete research notes include notes on the following:

Variable and observation recodes
Variable and observation imputation
Find and replace functions
Calculation of new variables
Variable and observation deletions
Importing of data from new sources
Other notes you find pertinent

4. A summary of your data from R (in a Word document)

Import your data into R.
In R Commander, under the MPA Statistics tab, select Descriptive statistics/Summarize dataset.
The output should appear in the RStudio console window. If possible, widen the window so that all output appears on one set of rows.
Copy the output and paste it into a Word document.
Select all of the text and format it using a fixed-width font (such as Courier or Courier New).
Save the file and upload the document to Canvas.
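
If the menu is unavailable, a similar summary can be produced directly in the console. The sketch below uses base R's `summary()`; whether the MPA Statistics menu item runs exactly this command is an assumption, and the data frame `practice` is invented. The example writes and reads a temporary CSV file only so that it runs on its own; with your real workbook you would import the "data" sheet instead.

```r
# Hypothetical practice data, invented for illustration.
practice <- data.frame(
  id       = 1:4,
  employed = c(1, 0, 1, 0),
  income   = c(42000, 0, 55000, 0)
)

# Round-trip through a temporary CSV file to mimic importing data into R.
path <- tempfile(fileext = ".csv")
write.csv(practice, path, row.names = FALSE)
mydata <- read.csv(path)

# Widen the console output so the summary is less likely to wrap
# onto several sets of rows.
options(width = 200)

# Print the summary to the console.
summary(mydata)

# Capture the same output as plain text lines, ready to paste into Word.
out <- capture.output(summary(mydata))
writeLines(out)
```

The columns of the pasted output line up only in a fixed-width font, which is why the formatting step above matters.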

HERE IS AN EXAMPLE OF WHAT YOUR ASSIGNMENT SHOULD LOOK LIKE (in two files; see all sheets):

classpracticedata20170504V1.xlsx

summary stats for practice dataset.docx