(3) Experimental Data Quality and Error Checks – Practical

Step-by-step QC workflow: work through the steps in order, because each step operates on the data frame left by the previous one.

Step 1: Create messy dataset

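Build a small data frame with typical data-entry problems: a duplicated plant ID, a misspelled treatment label, missing values, and a negative height.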

df <- data.frame(
  plant_id = c(1,2,3,3,5,6),
  treatment = c("Control","Treatment","Treatmnt","Treatment","Control", NA),
  height_cm = c(12.0,13.2,-4.0,13.1,NA,12.4)
)

Step 2: Diagnose issues

Identify missing values, duplicated IDs, inconsistent treatment labels, and impossible values.

colSums(is.na(df))
sum(duplicated(df$plant_id))
unique(df$treatment)
summary(df$height_cm)
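
The four checks above can also be gathered into a single set of counts, which makes it easier to compare the data before and after cleaning. The following is a minimal sketch, not part of the original practical: `qc_counts()` is a hypothetical helper, and the set of allowed labels ("Control", "Treatment") is an assumption inferred from the example data.

```r
# Same messy data as in Step 1, repeated so the sketch runs on its own.
df <- data.frame(
  plant_id = c(1, 2, 3, 3, 5, 6),
  treatment = c("Control", "Treatment", "Treatmnt", "Treatment", "Control", NA),
  height_cm = c(12.0, 13.2, -4.0, 13.1, NA, 12.4)
)

# qc_counts() is a hypothetical helper, not a base R function.
# `allowed` is an assumed list of valid treatment labels.
qc_counts <- function(d, allowed = c("Control", "Treatment")) {
  c(
    missing_treatment = sum(is.na(d$treatment)),
    missing_height    = sum(is.na(d$height_cm)),
    duplicated_ids    = sum(duplicated(d$plant_id)),
    # Labels that are present but not in the allowed set; NA is counted separately.
    unknown_labels    = sum(!is.na(d$treatment) & !(d$treatment %in% allowed)),
    negative_heights  = sum(d$height_cm < 0, na.rm = TRUE)
  )
}

before <- qc_counts(df)
print(before)
```

On this data every count is 1, so each of the planted problems is detected exactly once.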

Step 3: Fix category typos

Standardize treatment labels.

# Replace the misspelled label; the row with a missing treatment stays NA.
df$treatment[df$treatment == "Treatmnt"] <- "Treatment"
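
A single hard-coded replacement only catches the typo you already know about. One possible way to confirm that no other unexpected labels remain is to compare the observed labels with a list of valid ones. This sketch works on a standalone vector, and `allowed` is an assumed list of valid labels rather than something defined in the practical.

```r
treatment <- c("Control", "Treatment", "Treatmnt", "Treatment", "Control", NA)
allowed <- c("Control", "Treatment")  # assumed set of valid labels

# Labels that occur in the data but are not in the allowed set (NA is skipped).
unexpected <- setdiff(unique(treatment[!is.na(treatment)]), allowed)
print(unexpected)

# %in% returns FALSE instead of NA for missing values,
# so the logical index contains no NAs.
treatment[treatment %in% "Treatmnt"] <- "Treatment"

# After the fix, no unexpected labels should be left.
remaining <- setdiff(unique(treatment[!is.na(treatment)]), allowed)
print(length(remaining))
```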

Step 4: Remove duplicates

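Keep the first occurrence of each duplicated ID. Here plant 3 appears twice, and the first copy is the row with the impossible height of -4.0, so Step 5 then removes plant 3 from the data entirely.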

df <- df[!duplicated(df$plant_id), ]
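
Which copy counts as "first" depends only on row order, so it can help to look at every row that shares an ID before dropping any. The sketch below is one way to do that with base R. The data frame is repeated from Step 1 so the example runs on its own, and `dup_rows` is a name introduced here for illustration.

```r
df <- data.frame(
  plant_id = c(1, 2, 3, 3, 5, 6),
  treatment = c("Control", "Treatment", "Treatmnt", "Treatment", "Control", NA),
  height_cm = c(12.0, 13.2, -4.0, 13.1, NA, 12.4)
)

# duplicated() flags only the second and later copies of a value.
# Combining it with fromLast = TRUE flags every row that shares its ID
# with another row, including the first copy.
dup_rows <- duplicated(df$plant_id) | duplicated(df$plant_id, fromLast = TRUE)

# Show all copies side by side before deciding which one to keep.
print(df[dup_rows, ])
```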

Step 5: Remove impossible values

Drop rows with negative heights, which are physically impossible, while keeping rows where the height is missing.

df <- subset(df, height_cm >= 0 | is.na(height_cm))
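
The `is.na()` term matters because comparing NA with 0 gives NA rather than TRUE or FALSE. The sketch below illustrates the difference on a small data frame made up for this example (it is not the Step 1 data).

```r
# Illustrative data: one negative height and one missing height.
df <- data.frame(
  plant_id = c(1, 2, 3, 5),
  height_cm = c(12.0, -4.0, 13.1, NA)
)

# subset() treats an NA condition as FALSE, so the missing height is dropped too.
strict <- subset(df, height_cm >= 0)

# Adding is.na() turns that NA into TRUE, which keeps the row.
lenient <- subset(df, height_cm >= 0 | is.na(height_cm))

# Bracket indexing with an NA condition returns a row of all NAs instead,
# so even plant_id is lost for that row.
bracket <- df[df$height_cm >= 0, ]

print(nrow(strict))
print(nrow(lenient))
print(bracket)
```

Here `strict` has 2 rows and `lenient` has 3, while `bracket` also has 3 rows but its last row is entirely NA.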

Step 6: Re-check and log

Re-run checks and document fixes.

colSums(is.na(df))
sum(duplicated(df$plant_id))
unique(df$treatment)
summary(df$height_cm)
# QC log: typo fixed, duplicates removed, invalid values removed
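
A comment records what was done but not how many rows each fix touched. One possible alternative is to build the log as a small data frame while cleaning. The sketch below reruns Steps 3 to 5 on a fresh copy of the Step 1 data; the names `raw`, `clean`, and `qc_log` and the log's column names are assumptions introduced for this example.

```r
raw <- data.frame(
  plant_id = c(1, 2, 3, 3, 5, 6),
  treatment = c("Control", "Treatment", "Treatmnt", "Treatment", "Control", NA),
  height_cm = c(12.0, 13.2, -4.0, 13.1, NA, 12.4)
)
clean <- raw

# Step 3: count the typos, then fix them.
n_typo <- sum(clean$treatment %in% "Treatmnt")
clean$treatment[clean$treatment %in% "Treatmnt"] <- "Treatment"

# Step 4: count the duplicated IDs, then keep the first copy of each.
n_dup <- sum(duplicated(clean$plant_id))
clean <- clean[!duplicated(clean$plant_id), ]

# Step 5: count the negative heights, then drop them but keep NAs.
n_neg <- sum(clean$height_cm < 0, na.rm = TRUE)
clean <- subset(clean, height_cm >= 0 | is.na(height_cm))

# One log row per fix, with the number of rows left after that fix.
qc_log <- data.frame(
  step = c("typo fixed", "duplicates removed", "invalid values removed"),
  rows_affected = c(n_typo, n_dup, n_neg),
  rows_remaining = c(nrow(raw), nrow(raw) - n_dup, nrow(clean))
)
print(qc_log)

# Fail loudly if the cleaned data still breaks a rule.
stopifnot(
  !anyDuplicated(clean$plant_id),
  all(clean$height_cm >= 0, na.rm = TRUE)
)
```

Each fix affects one row here, and the data shrinks from 6 rows to 4.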
