(3) Experimental Data Quality and Error Checks – Practical
Step-by-step QC workflow: you must complete each step before moving to the next one.
Step 1: Create messy dataset
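Build a small data frame that contains typical data issues: a duplicated plant ID, a misspelled treatment label, a negative height, and missing values.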
df <- data.frame(
  plant_id  = c(1, 2, 3, 3, 5, 6),
  treatment = c("Control", "Treatment", "Treatmnt", "Treatment", "Control", NA),
  height_cm = c(12.0, 13.2, -4.0, 13.1, NA, 12.4)
)
Step 2: Diagnose issues
Identify missing values, duplicates, and impossible values.
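colSums(is.na(df))           # missing values per column
sum(duplicated(df$plant_id)) # number of repeated plant IDs
unique(df$treatment)         # distinct treatment labels, including typos
summary(df$height_cm)        # range check for impossible heights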
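The checks above report that a problem exists but not which rows carry it. A minimal sketch, assuming the data frame from Step 1 (rebuilt here so the block runs on its own), that pulls out the offending rows. The names `dup_ids`, `dup_rows`, and `bad_height` are illustrative and not part of the practical.

```r
# Step 1 data, repeated so this block is self-contained
df <- data.frame(
  plant_id  = c(1, 2, 3, 3, 5, 6),
  treatment = c("Control", "Treatment", "Treatmnt", "Treatment", "Control", NA),
  height_cm = c(12.0, 13.2, -4.0, 13.1, NA, 12.4)
)

# IDs that occur more than once
dup_ids <- unique(df$plant_id[duplicated(df$plant_id)])

# Every row sharing a duplicated ID, not only the repeat
dup_rows <- df[df$plant_id %in% dup_ids, ]

# Rows with a negative height; which() skips the NA comparison
bad_height <- df[which(df$height_cm < 0), ]

# Label counts, with missing labels shown as their own category
table(df$treatment, useNA = "ifany")
```

Seeing both rows for a duplicated ID side by side helps decide which one to keep in Step 4.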
Step 3: Fix category typos
Standardize treatment labels.
df$treatment[df$treatment == "Treatmnt"] <- "Treatment"
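To confirm that no unexpected labels remain after the fix, the labels can be compared against a list of valid ones. A minimal sketch, assuming the valid labels are exactly "Control" and "Treatment"; the vector `allowed` is an assumption made for illustration, and `which()` is used here only to make the handling of the NA label explicit.

```r
# Treatment column from Step 1
treatment <- c("Control", "Treatment", "Treatmnt", "Treatment", "Control", NA)

# Assumed set of valid labels (hypothetical, adjust to the real design)
allowed <- c("Control", "Treatment")

# Labels present in the data that are not in the allowed set (NA ignored)
unexpected <- setdiff(unique(treatment[!is.na(treatment)]), allowed)

# Apply the same fix as in Step 3
treatment[which(treatment == "Treatmnt")] <- "Treatment"

# After the fix, nothing unexpected should remain
remaining <- setdiff(unique(treatment[!is.na(treatment)]), allowed)
```

Missing labels are left as NA here; this check only targets misspellings.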
Step 4: Remove duplicates
Keep first occurrence of duplicated IDs.
df <- df[!duplicated(df$plant_id), ]
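It is worth checking which of the two rows for plant 3 survives. `duplicated()` flags the later occurrence, so the row kept is the earlier one, which in this dataset is the row with the negative height. Step 5 then removes that row, and plant 3 drops out of the data entirely. The sketch below, assuming the Step 1 data, shows that check and one possible alternative (not the practical's method) that sorts valid measurements first before removing duplicates. The names `valid`, `df_sorted`, and `df_alt` are illustrative.

```r
# Step 1 data, repeated so this block is self-contained
df <- data.frame(
  plant_id  = c(1, 2, 3, 3, 5, 6),
  treatment = c("Control", "Treatment", "Treatmnt", "Treatment", "Control", NA),
  height_cm = c(12.0, 13.2, -4.0, 13.1, NA, 12.4)
)

# Step 4 as written: the first occurrence of plant 3 is the -4.0 row
kept <- df[!duplicated(df$plant_id), ]
kept[kept$plant_id == 3, ]

# Alternative sketch: within each ID, sort valid heights first, then de-duplicate
valid     <- !is.na(df$height_cm) & df$height_cm >= 0
df_sorted <- df[order(df$plant_id, !valid), ]
df_alt    <- df_sorted[!duplicated(df_sorted$plant_id), ]
df_alt[df_alt$plant_id == 3, ]
```

Which row is the correct one is a question for the lab notebook, not for the code; the alternative only makes the choice explicit.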
Step 5: Remove impossible values
Filter out invalid negatives while keeping NAs.
df <- subset(df, height_cm >= 0 | is.na(height_cm))
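The `| is.na(height_cm)` part matters because a comparison with NA returns NA, and `subset()` treats NA conditions as false. A small sketch with a toy data frame (`tmp`, an illustrative name) shows the difference:

```r
# Toy data: one negative value and one missing value
tmp <- data.frame(height_cm = c(12.0, 13.2, -4.0, NA, 12.4))

# The comparison alone yields NA for the missing height
tmp$height_cm >= 0   # TRUE TRUE FALSE NA TRUE

# Without the is.na() clause, the NA row is dropped along with the negative one
n_strict <- nrow(subset(tmp, height_cm >= 0))

# With the clause, only the negative row is dropped
n_keep_na <- nrow(subset(tmp, height_cm >= 0 | is.na(height_cm)))
```

Here `n_strict` is 3 and `n_keep_na` is 4, so the missing measurement stays available for a later decision on how to handle it.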
Step 6: Re-check and log
Re-run checks and document fixes.
colSums(is.na(df))
summary(df$height_cm)
# QC log: typo fixed, duplicates removed, invalid values removed
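A comment is the simplest log, but recording how many rows each fix touched makes the log checkable. A minimal sketch that reruns Steps 3 to 5 on the Step 1 data and counts the affected rows; the `qc_log` data frame and the counter names are illustrative, not part of the practical.

```r
# Step 1 data, kept as an untouched raw copy
df_raw <- data.frame(
  plant_id  = c(1, 2, 3, 3, 5, 6),
  treatment = c("Control", "Treatment", "Treatmnt", "Treatment", "Control", NA),
  height_cm = c(12.0, 13.2, -4.0, 13.1, NA, 12.4)
)
df <- df_raw

# Step 3: count and fix the misspelled label
n_typos <- sum(df$treatment == "Treatmnt", na.rm = TRUE)
df$treatment[which(df$treatment == "Treatmnt")] <- "Treatment"

# Step 4: remove duplicated IDs and count the rows dropped
n_before <- nrow(df)
df <- df[!duplicated(df$plant_id), ]
n_dups <- n_before - nrow(df)

# Step 5: remove negative heights (NAs kept) and count the rows dropped
n_before <- nrow(df)
df <- subset(df, height_cm >= 0 | is.na(height_cm))
n_invalid <- n_before - nrow(df)

# Assemble the log
qc_log <- data.frame(
  fix             = c("typo fixed", "duplicates removed", "invalid values removed"),
  n_rows_affected = c(n_typos, n_dups, n_invalid)
)
qc_log
```

Keeping `df_raw` unchanged means every entry in the log can be verified against the original data later.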
Great work. You completed all steps in this practical.
