Tidy

DATA CLEANING — *preparation-with-integrity posture* (every cleaning choice changes meaning; document the choices). The data-pipeline primitive of *recognizing that cleaning is not neutral and must be documented.*

Chapter 2 — Tidy and the Cleaning-Log

Tidy is a small raccoon-tween with chunky-cartoon black-and-white markings (NOT spooky — warm-coded) and a small bound cleaning-log notebook strapped to her side.

She is small, warm-grey-and-cream-and-soft-black, quick-handed, and meticulous. Her face-markings are the chunky-cartoon raccoon mask (softly rounded, friendly, never sinister), and her hands are gently nimble. The cleaning-log notebook is bound in pale-grey cloth, labeled CLEANING LOG in tidy block letters, and kept open on her workbench whenever she’s working a dataset.

This is her craft. Tidy documents every cleaning choice she makes. When she encounters a dataset, she first reads Catch’s collection notes (who-what-why-when) and only then opens her cleaning-log to record: what cleaning choices she made, why, what the data looked like before, what it looks like after. The cleaning-log is the conscience of the pipeline. Without it, the analysis is built on invisible decisions; with it, every decision can be inspected and questioned.

This is load-bearing. Tidy embodies the data-cleaning primitive — the data-pipeline skill of preparing data without pretending the preparation is neutral. Real datasets are messy: missing values, duplicate rows, typos, outliers, mismatched formats, inconsistent units. Cleaning is required to make most analyses possible. But every cleaning choice changes the dataset’s meaning: dropping rows with missing values (might bias the data toward complete-records-only); imputing missing values with the average (smooths over real variation); removing outliers (might remove the most-important data points); renaming categories to standard labels (sometimes loses information). The skill is not avoiding cleaning — that’s impossible — but making every cleaning choice visible in the cleaning-log.
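A minimal sketch of how those choices shift the numbers, using only Python's standard library. The readings are made up for illustration, and the three "choices" are simplified stand-ins for the alternatives named above, not a recommended procedure.

```python
import statistics

# Hypothetical example data: seven readings, two of them missing (None).
raw = [4.0, 5.0, None, 6.0, 30.0, None, 5.0]

observed = [x for x in raw if x is not None]

# Choice A: drop the rows with missing values.
dropped = observed

# Choice B: impute each missing value with the mean of the observed values.
mean_fill = statistics.mean(observed)
imputed = [mean_fill if x is None else x for x in raw]

# Choice C: drop missing values, then also remove the extreme reading (30.0)
# as an "outlier". The cutoff is an arbitrary assumption for this sketch.
trimmed = [x for x in observed if x < 30.0]

# Print row count, mean, and sample standard deviation for each choice.
for label, data in [("dropped", dropped), ("imputed", imputed), ("trimmed", trimmed)]:
    print(label, len(data), round(statistics.mean(data), 2), round(statistics.stdev(data), 2))
```

With these made-up numbers, mean imputation leaves the average unchanged but narrows the spread, and trimming the extreme reading changes both. Same raw data, three different datasets.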

Critical: Tidy NEVER frames cleaning as housekeeping. She is explicit: “Cleaning is not neutral. Every cleaning choice changes the meaning. Document the choices. The next analyst — or future-you — needs to know what you did, why, and what the data looked like before. Without the cleaning-log, the analysis is unreproducible.” This matters because the popular framing of data cleaning as preparation-overhead hides the substantive choices embedded in cleaning. Tidy reframes cleaning as first-class analytical work, not preparation-for-analysis.

Tidy grew up in a small village where her family had been the village’s grain-sorters: the raccoons who sorted the village’s annual grain-harvest into kitchen-grade, mill-grade, and seed-grade. The work had required careful, transparent choices about what counted as which grade, and the sorter who could not explain her grading was the sorter the millers stopped trusting. Tidy had learned by age six that sorting was a choice, and that the choices had to be visible to be trusted.

She walked to the DataForge academy at twenty-two. Datum had asked her: “What is data cleaning?” Tidy had said: “It is preparation-with-integrity. Every cleaning choice changes the meaning. Document the choices. The cleaning-log is the conscience of the pipeline. Without it, the analysis is built on invisible decisions.” Datum had said: “You are appointed.”

In her workshop, Tidy begins every first-day lesson the same way. She opens her cleaning-log on the workbench. She writes the dataset name at the top of a fresh page. She says: “I am Tidy. The data-pipeline primitive I teach is cleaning. The move is document every choice. Every cleaning step has alternatives. The choice between alternatives shapes the analysis. Make the choices visible.”

She teaches the cleaning scaffolds:

  • Read Catch’s collection notes first. (Cleaning depends on knowing how the data was collected.)
  • Inspect the data before cleaning. (Look at the first 20 rows. Look at the summary statistics. Look at the distribution of each variable. Know what you’re starting with.)
  • Identify the cleaning issues. (Missing values? Duplicates? Typos? Outliers? Mismatched formats?)
  • For each issue, list the alternatives. (For missing values: drop the row, impute with average, impute with median, impute with prediction, leave as missing. Each alternative has trade-offs.)
  • Choose deliberately. (Don’t choose by default. Choose with awareness.)
  • Document the choice in the cleaning-log. (Date, dataset, what you did, why you did it, what the data looked like before, what it looks like after.)
  • Preserve the original. (Never overwrite the raw data. Always work on a copy.)
  • Make the cleaning-log available to downstream analysts. (Including future-you. The log is part of the dataset.)
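The scaffolds above can be sketched as a single logged cleaning step. This is an illustrative sketch only: the `LogEntry` fields, the `summarize` helper, the `height_cm` column, and the dataset name are all hypothetical, chosen to mirror the items Tidy records (date, dataset, what, why, before, after).

```python
import copy
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass
class LogEntry:
    """One cleaning-log record (hypothetical field names)."""
    when: str
    dataset: str
    issue: str
    alternatives: list
    choice: str
    reason: str
    before: dict
    after: dict


def summarize(rows):
    """Small before/after snapshot: row count and number of missing heights."""
    missing = sum(1 for r in rows if r["height_cm"] is None)
    return {"rows": len(rows), "missing_height": missing}


# Hypothetical raw data. It is never modified below.
raw_rows = [
    {"id": 1, "height_cm": 150.0},
    {"id": 2, "height_cm": None},
    {"id": 3, "height_cm": 162.0},
    {"id": 4, "height_cm": 158.0},
]

cleaning_log = []

# Preserve the original: all cleaning happens on a deep copy.
working = copy.deepcopy(raw_rows)
before = summarize(working)

# Chosen alternative for this sketch: impute the median of the observed heights.
fill = median(r["height_cm"] for r in working if r["height_cm"] is not None)
for r in working:
    if r["height_cm"] is None:
        r["height_cm"] = fill

# Document the choice, the alternatives considered, and the before/after state.
cleaning_log.append(LogEntry(
    when=date.today().isoformat(),
    dataset="heights_example",
    issue="missing height_cm",
    alternatives=["drop row", "impute mean", "impute median",
                  "impute by prediction", "leave as missing"],
    choice="impute median",
    reason="placeholder; a real reason would cite the collection notes",
    before=before,
    after=summarize(working),
))
```

The raw rows still contain the missing value after the step runs, so a downstream analyst can compare `raw_rows` with `working` and read the log entry to see which alternative was taken.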

She is explicit: “I sometimes make a cleaning choice that I later realize was wrong. That’s not failure. That’s why the log exists. I can revisit, change the choice, and update the log. The transparency is the practice.”
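One way to revisit a choice without hiding the earlier one is to append a new entry that points back at the entry it replaces. This is an assumption about how such a log might be kept, not something the chapter prescribes; the dictionary keys and the `revise` and `current_choices` helpers are hypothetical.

```python
# Hypothetical log: a list of dicts, oldest first.
log = [
    {"id": 1, "choice": "impute mean", "reason": "first attempt", "supersedes": None},
]


def revise(log, old_id, new_choice, reason):
    """Append an entry that supersedes an earlier one; the old entry stays in place."""
    if all(e["id"] != old_id for e in log):
        raise ValueError(f"no log entry with id {old_id}")
    entry = {
        "id": max(e["id"] for e in log) + 1,
        "choice": new_choice,
        "reason": reason,
        "supersedes": old_id,
    }
    log.append(entry)
    return entry


def current_choices(log):
    """Return the entries that no later entry has superseded."""
    superseded = {e["supersedes"] for e in log if e["supersedes"] is not None}
    return [e for e in log if e["id"] not in superseded]


revise(log, 1, "impute median", "mean was pulled upward by one extreme value")
```

Both the original choice and its replacement remain readable in `log`, while `current_choices(log)` returns only the choice that is still in force.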

When students ask Tidy whether data cleaning is hard, Tidy always says the same thing:

“It is not hard. It is deliberate choosing + careful documenting. Every cleaning choice changes the meaning. Document the choices.”

She closes her cleaning-log gently. The next dataset waits to be cleaned.


Voice register

Guidance: Quick-handed, meticulous, fond of bound cleaning-log notebooks + the discipline of document-every-choice. Raccoon-tween with chunky-cartoon warm-coded face-markings (NOT spooky). NEVER frames cleaning as housekeeping; ALWAYS as first-class analytical work with documented choices. Friends with Catch (cleaning depends on collection); Graph (cleaned data feeds visualization); all DataForge cast.

Sample lines:

  • “Every cleaning choice changes the meaning.”
  • “Document the choices. The next analyst — or future-you — needs to know.”
  • “Cleaning is not neutral. Make the choices visible.”
  • “Never overwrite the raw data. Always work on a copy.”

Arc across kits

  • Kit 1 — Cameo.
  • Kit 2 — Anchor character. Full chapter feature (data-cleaning primitive + document-the-choices scaffolds).
  • Kit 3-5 — Recurring (cleaning surfaces across missing-values / duplicates / outliers / typo chambers).
  • Kit 6+ — Recurring (Guard now structurally present alongside; cleaning has ethics).
  • Kit 8-12 — Recurring (multi-primitive synthesis: cleaning + visualization + interpretation).
  • Kit 13-16 — Recurring ensemble member.

Relationships

  • Alliance: Catch (cleaning depends on collection); Graph (cleaned data feeds visualization); Guard (cleaning has ethics); all DataForge cast.
  • Tension: None.

Cultural-sensitivity gate

LOAD-BEARING data-ethics gate enforced throughout. Tidy explicitly counters the cleaning-as-housekeeping framing — cleaning IS analysis. Anti-credentialism: documenting-choices-as-practiced-discipline NOT real-data-scientist-only content.

Cultural-context note

The village-grain-sorter family framing deliberately draws on a generic European-village tradition. The cleaning-is-not-neutral framing is load-bearing per critical-data-literacy + reproducible-research pedagogy. The cleaning-log-as-conscience-of-the-pipeline metaphor connects bookkeeping discipline to data-pipeline integrity. The raccoon-as-warm-coded-NOT-spooky design choice is deliberate: raccoons in many cultures carry sinister coding, and the chapter explicitly subverts that.

The DataForge ensemble

Tidy is part of DataForge's distributed-narrative cast. Each character embodies a different curricular primitive; together they teach the full subject.