Group Project: Data cleaning assignment
- Due Oct 13, 2017 by 11:59pm
- Points 0
- Submitting a file upload
The data cleaning assignment has four components:
1. A detailed record of the changes made to the database and a corresponding record of database versions (this may be a sheet in the same Excel file as your codebook)
2. A complete and current vertical-format (transposed) codebook in either Word or Excel (this should be both printable and understandable to all parties)
Note: A complete codebook includes the following elements for each variable:
- Variable content (what information is presented here?)
- Variable source (data source, question number, etc.)
- Level of measurement (categorical, binary, ordinal, interval, text)
- Coding or units (what do the numbers mean? What units of measurement are being used?)
- Notes (for example, is this variable a recode of another variable? Was there anything unusual about the measurement?)
- Variable name (alphanumeric, no spaces, preferably 8 characters or fewer)
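A vertical codebook can also be drafted in R and then pasted into Excel. The sketch below is only an illustration: it assumes "vertical" means one row per variable, and the two variables, their sources, and their codings are hypothetical examples, not part of any assigned dataset.

```r
# Hypothetical two-variable codebook; every name and value here is illustrative.
codebook <- data.frame(
  name    = c("employed", "income"),
  content = c("Whether the respondent is currently employed",
              "Personal annual income"),
  source  = c("Survey, Q1", "Survey, Q2"),
  level   = c("binary", "interval"),
  coding  = c("0 = no, 1 = yes", "US dollars"),
  notes   = c("", "Asked only of employed respondents (skip logic)"),
  stringsAsFactors = FALSE
)

# One row per variable gives the vertical (transposed) layout.
print(codebook)

# Flag any variable names that are longer than 8 characters or contain
# characters other than letters and digits.
bad_names <- codebook$name[nchar(codebook$name) > 8 |
                           grepl("[^A-Za-z0-9]", codebook$name)]
print(bad_names)
```

The name check at the end mirrors the naming guideline above; an empty result means every name passes.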
3. The final version of your clean data in .csv format
To do this, in R Commander, select Data/Active data set/Export active data set. Use commas as the field separator.
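If you prefer a script to the menu, base R's `write.csv()` also writes a comma-separated file. This is a minimal sketch, not necessarily the command R Commander generates; the data frame `clean` and the file name are hypothetical stand-ins for your own active data set.

```r
# Hypothetical cleaned data frame standing in for the active data set.
clean <- data.frame(
  id       = 1:3,
  employed = factor(c("no", "yes", "yes")),
  income   = c(0, 42000, 38000)
)

# Write to a temporary folder here; in practice, use your project folder.
out_file <- file.path(tempdir(), "clean_data_final.csv")

# write.csv() always uses commas; row.names = FALSE avoids an extra
# unnamed column of row numbers in the file.
write.csv(clean, out_file, row.names = FALSE)

# Read the file back to confirm it contains what you expect.
check <- read.csv(out_file)
print(check)
```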
Note: A “clean” dataset has the following characteristics:
- The codebook associated with the data is complete and accurate
- The maximum and minimum values for all variables make sense
- The maximum and minimum values for all variables match your codebook
- Any missing data for which you can enter a known value or a reasonable, conservative estimate has been recovered. For example, suppose you ask respondents whether they are employed, and skip logic sends only those who answer yes to a question about personal income. The personal income variable will then have many missing values, but you know that anyone with missing income who answered no to the employment question probably has a personal income of 0.
- All observations with missing data that could not be recovered have been deleted, leaving an equal number of observations for every variable (unless you have a clear justification for keeping observations with missing data; for example, they are a special subset of the larger population).
- All factors (categorical, binary, and ordinal data) have been identified as factors.
- All ordinal data has been ordered, with higher values matching higher levels of the variable.
- Any data that is no longer useful (variables that have been recoded, observations with missing data, etc.) has been deleted. (Just remember to version!)
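Several of the characteristics listed above can be checked or applied with a few lines of base R. The following is a sketch under stated assumptions: the data frame `raw`, its variables, and the category labels are invented for illustration, and your own recoding rules should come from your codebook.

```r
# Hypothetical raw data; variable names and values are illustrative only.
raw <- data.frame(
  employed  = c("no", "yes", "yes", "no", "yes"),
  income    = c(NA, 42000, NA, NA, 38000),
  education = c("high school", "college", "graduate", "college", "high school"),
  stringsAsFactors = FALSE
)

# Recover missing data: respondents who are not employed skipped the
# income question, so their missing income is set to 0.
recover <- is.na(raw$income) & raw$employed == "no"
raw$income[recover] <- 0

# Delete observations that still have missing data.
clean <- raw[complete.cases(raw), ]

# Identify factors, and order the ordinal variable from lowest to highest.
clean$employed  <- factor(clean$employed, levels = c("no", "yes"))
clean$education <- factor(clean$education,
                          levels = c("high school", "college", "graduate"),
                          ordered = TRUE)

# Check that minimum and maximum values make sense and match the codebook.
print(range(clean$income))
print(summary(clean))
```

In this example the third respondent is employed but has no recorded income, so that observation cannot be recovered and is dropped. Save a new version of the file before running steps like these, since they delete data.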
4. A summary of your data from R
- Import your data into R.
- In R Commander, under the MPA Statistics tab, select Descriptive statistics/Summarize dataset
- The output should appear in the RStudio console window.
- Copy the output and paste it into a Word document.
- Upload the document to Canvas.
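The import and summary steps can also be run as a short script. This sketch assumes that base R's `summary()` gives output broadly similar to the menu item; the exact command behind the MPA Statistics menu may differ. The file contents below are hypothetical and are written to a temporary folder only so that the example runs on its own.

```r
# Create a small hypothetical CSV so the example is self-contained.
csv_file <- file.path(tempdir(), "clean_data.csv")
writeLines(c("id,employed,income",
             "1,no,0",
             "2,yes,42000",
             "3,yes,38000"), csv_file)

# Import the data; stringsAsFactors = TRUE reads text columns as factors.
dat <- read.csv(csv_file, stringsAsFactors = TRUE)

# Capture the summary as text so it can be pasted into a Word document.
out <- capture.output(summary(dat))
writeLines(out)
```

A .csv file stores only text, so factor ordering is not saved in it. After importing, check that ordinal variables still have their levels in the intended order, and reorder them if needed.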