Tidy
DATA CLEANING — *preparation-with-integrity posture* (every cleaning choice changes meaning; document the choices). The data-pipeline primitive of *recognizing that cleaning is not neutral and must be documented.*
Chapter 2 — Tidy and the Cleaning-Log
Tidy is a small raccoon-tween with chunky-cartoon black-and-white markings (NOT spooky — warm-coded) and a small bound cleaning-log notebook strapped to her side.
She is small, warm-grey-and-cream-and-soft-black, quick-handed, and meticulous. Her face-markings are the chunky-cartoon raccoon mask — softly rounded, friendly, never sinister — and her hands are gently nimble. The cleaning-log notebook is bound in pale-grey cloth, labeled CLEANING LOG in tidy block letters, and kept open on her workbench whenever she’s working a dataset.
This is her craft. Tidy documents every cleaning choice she makes. When she encounters a dataset, she first reads Catch’s collection notes — who-what-why-when — and only then opens her cleaning-log to record: what cleaning choices she made, why, what the data looked like before, what it looks like after. The cleaning-log is the conscience of the pipeline. Without it, the analysis is built on invisible decisions; with it, every decision can be inspected and questioned.
This is load-bearing. Tidy embodies the data-cleaning primitive: the data-pipeline skill of preparing data without pretending the preparation is neutral. Real datasets are messy: missing values, duplicate rows, typos, outliers, mismatched formats, inconsistent units. Cleaning is required to make most analyses possible. But every cleaning choice changes the dataset’s meaning. Dropping rows with missing values can bias the data toward complete records only. Imputing missing values with the average smooths over real variation. Removing outliers can remove the most important data points. Renaming categories to standard labels sometimes loses information. The skill is not avoiding cleaning, which is impossible, but making every cleaning choice visible in the cleaning-log.
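To make the point concrete, here is a minimal sketch of two of the choices named above applied to the same column. The readings are invented for illustration, and the names `raw`, `dropped`, and `imputed` are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical readings (illustrative only); None marks a missing value.
raw = [4.0, None, 10.0, None, 16.0]

# Choice A: drop every row with a missing value.
dropped = [x for x in raw if x is not None]

# Choice B: impute each missing value with the mean of the observed values.
fill = mean(dropped)
imputed = [fill if x is None else x for x in raw]

# Same source data, two different datasets.
for label, values in (("dropped", dropped), ("imputed", imputed)):
    print(label, "rows:", len(values),
          "mean:", mean(values),
          "spread:", round(pstdev(values), 2))
```

Both versions report the same mean, but the imputed version has more rows and a smaller spread, because the filled-in values sit exactly at the centre. Neither choice is wrong in itself. They describe different datasets, which is why the choice belongs in the log.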
Critical: Tidy NEVER frames cleaning as housekeeping. She is explicit: “Cleaning is not neutral. Every cleaning choice changes the meaning. Document the choices. The next analyst — or future-you — needs to know what you did, why, and what the data looked like before. Without the cleaning-log, the analysis is unreproducible.” This matters because the popular framing of data cleaning as preparation-overhead hides the substantive choices embedded in cleaning. Tidy reframes cleaning as first-class analytical work, not preparation-for-analysis.
Tidy grew up in a small village where her family had been the village’s grain-sorters — the raccoons who sorted the village’s annual grain-harvest into kitchen-grade, mill-grade, and seed-grade. The work had required careful, transparent choices about what counted as which grade — and the sorter who could not explain her grading was the sorter the millers stopped trusting. Tidy had learned by age six that sorting was a choice — and that the choices had to be visible to be trusted.
She walked to the DataForge academy at twenty-two. Datum had asked her: “What is data cleaning?” Tidy had said: “It is preparation-with-integrity. Every cleaning choice changes the meaning. Document the choices. The cleaning-log is the conscience of the pipeline. Without it, the analysis is built on invisible decisions.” Datum had said: “You are appointed.”
In her workshop, Tidy begins every first-day lesson the same way. She opens her cleaning-log on the workbench. She writes the dataset name at the top of a fresh page. She says: “I am Tidy. The data-pipeline primitive I teach is cleaning. The move is document every choice. Every cleaning step has alternatives. The choice between alternatives shapes the analysis. Make the choices visible.”
She teaches the cleaning scaffolds:
- Read Catch’s collection notes first. (Cleaning depends on knowing how the data was collected.)
- Inspect the data before cleaning. (Look at the first 20 rows. Look at the summary statistics. Look at the distribution of each variable. Know what you’re starting with.)
- Identify the cleaning issues. (Missing values? Duplicates? Typos? Outliers? Mismatched formats?)
- For each issue, list the alternatives. (For missing values: drop the row, impute with average, impute with median, impute with prediction, leave as missing. Each alternative has trade-offs.)
- Choose deliberately. (Don’t choose by default. Choose with awareness.)
- Document the choice in the cleaning-log. (Date, dataset, what you did, why you did it, what the data looked like before, what it looks like after.)
- Preserve the original. (Never overwrite the raw data. Always work on a copy.)
- Make the cleaning-log available to downstream analysts. (Including future-you. The log is part of the dataset.)
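The scaffolds above can be sketched as a small cleaning-log in code. This is one possible shape under stated assumptions, not a prescribed format: the `LogEntry` fields, the `summarize` helper, the dataset name `harvest_sample`, and the `weight_kg` column are all hypothetical.

```python
import copy
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass
class LogEntry:
    """One documented cleaning choice (hypothetical field layout)."""
    day: str
    dataset: str
    issue: str
    alternatives: list
    action: str
    reason: str
    before: str
    after: str


def summarize(rows):
    """Short before/after description: row count and missing weights."""
    missing = sum(1 for r in rows if r["weight_kg"] is None)
    return f"{len(rows)} rows, {missing} missing weight_kg"


# Raw data (invented for illustration). It is never modified.
raw_rows = [
    {"id": 1, "weight_kg": 2.0},
    {"id": 2, "weight_kg": None},
    {"id": 3, "weight_kg": 4.0},
]

working = copy.deepcopy(raw_rows)  # always work on a copy
cleaning_log = []

# Inspect first, then make one deliberate choice and record it.
before = summarize(working)
observed = [r["weight_kg"] for r in working if r["weight_kg"] is not None]
fill = median(observed)
for r in working:
    if r["weight_kg"] is None:
        r["weight_kg"] = fill

cleaning_log.append(LogEntry(
    day=date.today().isoformat(),
    dataset="harvest_sample",
    issue="missing weight_kg",
    alternatives=["drop row", "impute mean", "impute median", "leave missing"],
    action="imputed missing weight_kg with the median of observed values",
    reason="keep all rows; the median resists outliers better than the mean",
    before=before,
    after=summarize(working),
))

for entry in cleaning_log:
    print(entry)
```

Because `raw_rows` is left untouched and each entry records the alternatives considered, a later analyst can reopen the decision, pick a different alternative, and append a new entry instead of guessing what happened.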
She is explicit: “I sometimes make a cleaning choice that I later realize was wrong. That’s not failure. That’s why the log exists. I can revisit, change the choice, and update the log. The transparency is the practice.”
When students ask Tidy whether data cleaning is hard, Tidy always says the same thing:
“It is not hard. It is deliberate choosing and careful documenting. Every cleaning choice changes the meaning. Document the choices.”
She closes her cleaning-log gently. The next dataset waits to be cleaned.
Voice register
Guidance: Quick-handed, meticulous, fond of bound cleaning-log notebooks + the discipline of document-every-choice. Raccoon-tween with chunky-cartoon warm-coded face-markings (NOT spooky). NEVER frames cleaning as housekeeping; ALWAYS as first-class analytical work with documented choices. Friends with Catch (cleaning depends on collection); Graph (cleaned data feeds visualization); all DataForge cast.
Sample lines:
- “Every cleaning choice changes the meaning.”
- “Document the choices. The next analyst — or future-you — needs to know.”
- “Cleaning is not neutral. Make the choices visible.”
- “Never overwrite the raw data. Always work on a copy.”
Arc across kits
- Kit 1 — Cameo.
- Kit 2 — Anchor character. Full chapter feature (data-cleaning primitive + document-the-choices scaffolds).
- Kit 3-5 — Recurring (cleaning surfaces across missing-values / duplicates / outliers / typo chambers).
- Kit 6+ — Recurring (Guard now structurally present alongside; cleaning has ethics).
- Kit 8-12 — Recurring (multi-primitive synthesis: cleaning + visualization + interpretation).
- Kit 13-16 — Recurring ensemble member.
Relationships
- Alliance: Catch (cleaning depends on collection); Graph (cleaned data feeds visualization); Guard (cleaning has ethics); all DataForge cast.
- Tension: None.
Cultural-sensitivity gate
LOAD-BEARING data-ethics gate enforced throughout. Tidy explicitly counters the cleaning-as-housekeeping framing — cleaning IS analysis. Anti-credentialism: documenting-choices-as-practiced-discipline NOT real-data-scientist-only content.
Cultural-context note
The village-grain-sorter family framing deliberately draws on a generic European-village tradition. The cleaning-is-not-neutral framing is load-bearing per critical-data-literacy and reproducible-research pedagogy. The cleaning-log-as-conscience-of-the-pipeline metaphor connects bookkeeping discipline to data-pipeline integrity. The raccoon-as-warm-coded-NOT-spooky design choice is deliberate: raccoons in many cultures carry sinister coding, and the chapter explicitly subverts that.
The DataForge ensemble
Tidy is part of DataForge's distributed-narrative cast. Each character embodies a different curricular primitive; together they teach the full subject.
- Catch. Data collection: who-what-why-when posture (every dataset has a collector + purpose + omissions)
- Graph. Data visualization: shape-of-the-story posture (which chart tells the truth, not the loudest one)
- Tell. Interpretation: correlation-not-causation posture (data shows patterns; humans interpret; confidence not certainty)
- Guard. Data ethics: bias-privacy-harm-consent posture (who benefits, who's harmed, who decided; structurally present in every kit from kit 6)