Tidy

DATA CLEANING — *preparation-with-integrity posture* (every cleaning choice changes meaning; document the choices). The data-pipeline primitive of *recognizing that cleaning is not neutral and must be documented.*

Chapter 2 — Tidy and the Cleaning-Log

Tidy is a small raccoon. She’s a tween, not quite grown up. Her fur is warm grey, cream, and soft black. She has chunky black-and-white markings on her face. They look like a friendly mask, not spooky at all. Her hands are quick and gentle. She’s very careful with everything she does.

Tidy always has a small notebook with her. It’s her cleaning-log. The notebook is bound in pale grey cloth. “CLEANING LOG” is written on it in neat block letters. She keeps it open on her workbench. This happens whenever she works with data. It’s always right there, ready to go.

This is Tidy’s special job. She writes down every cleaning choice she makes. When new data arrives, she does two things. First, she reads Catch’s notes. Catch’s notes say who collected the data, what it is, why it was collected, and when. Only after that does Tidy open her cleaning-log. She writes down her choices. She explains why she made them. She records what the data looked like before. Then she writes what it looks like after.

This cleaning-log is super important. It’s like the data’s memory. Without it, nobody knows why things changed. People can’t check her work. But with the log, every choice is clear. Anyone can look at it and ask questions.

Tidy shows us the skill of data-cleaning. It means getting data ready for analysis. But she knows it’s not just a simple fix. Real data is often a big mess. It has missing numbers. There are repeated lines. Typos show up everywhere. Some numbers are far bigger or smaller than the rest. These are called outliers. Formats don’t match. Units are all mixed up.
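Here is a small sketch of how someone might count the mess before touching it. The rows, the column meaning (sack number and weight in kilograms), and the plausible range are all made up for illustration. A real range would have to come from notes about how the data was gathered.

```python
from collections import Counter

# Hypothetical rows: (sack number, weight in kg). None marks a missing number.
rows = [(1, 10.0), (2, None), (3, 11.5), (3, 11.5), (4, 950.0)]

# Assumed plausible range for this made-up column (an illustration, not a rule).
LOW, HIGH = 0.0, 100.0

# Count rows where the weight is missing.
missing = sum(1 for _, weight in rows if weight is None)

# Count extra copies of rows that appear more than once.
repeated = sum(count - 1 for count in Counter(rows).values() if count > 1)

# Collect weights that fall outside the assumed range (possible outliers or unit mix-ups).
out_of_range = [w for _, w in rows if w is not None and not (LOW <= w <= HIGH)]

# The report only describes the data. Nothing has been changed yet.
report = {
    "rows": len(rows),
    "missing": missing,
    "repeated": repeated,
    "out_of_range": out_of_range,
}
print(report)
```

The report does not decide anything. A weight of 950 might be a typo, a different unit, or a real and very large sack, and choosing between those is a cleaning choice that belongs in the log.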

Cleaning is needed to make sense of it. Most data analysis can’t happen without it. But every cleaning choice changes what the data means. If you drop lines with missing numbers, you might only see part of the story. If you fill in missing numbers with an average, you hide real differences. Removing outliers might get rid of the most important facts. Changing names to standard labels can make you lose details. The skill is not to avoid cleaning. That’s impossible. The real skill is to make every cleaning choice visible. You write it all down in the cleaning-log.
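To see how two reasonable choices give two different pictures, here is a minimal sketch with made-up readings. It compares dropping the missing numbers with filling them with the average. The data and the names are assumptions for illustration only.

```python
from statistics import mean, pstdev

# Hypothetical readings; None marks a missing number.
readings = [2.0, 4.0, None, 6.0, None, 12.0]

# Choice A: drop the rows with missing numbers.
dropped = [x for x in readings if x is not None]

# Choice B: fill each missing number with the average of the known ones.
fill_value = mean(dropped)
filled = [fill_value if x is None else x for x in readings]

# Compare row count, average, and spread (population standard deviation).
summary = {
    "drop": {"rows": len(dropped), "mean": mean(dropped), "spread": pstdev(dropped)},
    "fill": {"rows": len(filled), "mean": mean(filled), "spread": pstdev(filled)},
}
for choice, stats in summary.items():
    print(choice, stats)
```

In this example both choices give the same average, but filling makes the spread smaller, because the filled-in numbers all sit exactly at the average. Dropping keeps the spread but leaves fewer rows. Neither result is the "clean truth"; each one reflects a choice.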

Tidy is very clear about one thing. She never calls cleaning “housekeeping.” She says it’s not like tidying your room. “Cleaning is not neutral,” she tells her students. “Every cleaning choice changes the meaning.” She pauses, looking at each student. “Write down your choices. The next person who looks at this data needs to know. Even future-you needs to know. What did you do? Why did you do it? What did the data look like before?”

She explains that without the cleaning-log, no one can repeat the analysis. It’s like a recipe kept secret, so nobody else can cook the same dish. Most people think data cleaning is just extra work. They think it’s not important. But Tidy says that’s wrong. Cleaning has big choices hidden inside it. She wants everyone to see cleaning as real thinking work. It’s not just getting ready for the real work. It is the real work.

Tidy grew up in a small village. Her family had a special job there. They were the village’s grain-sorters. They sorted the yearly grain harvest. Some grain was for cooking. Some was for the mill to make flour. Some was for planting new seeds. This job needed very careful choices. Everyone had to know what counted as what. If a sorter couldn’t explain her choices, people stopped trusting her. The millers would take their grain elsewhere. By age six, Tidy understood this. She learned that sorting was always a choice. And those choices had to be clear. People needed to see them to trust them.

When Tidy was twenty-two, she walked to the DataForge academy. Datum, the head of the academy, asked her a question. “What is data cleaning?” Datum asked. Tidy answered right away. “It’s getting data ready, but doing it right,” she said. “Every cleaning choice changes the meaning. Write down the choices. The cleaning-log is like the data’s memory. Without it, the analysis is built on secret choices.” Datum listened closely. Then Datum smiled. “You are hired,” Datum said.

In her workshop, Tidy starts every first lesson the same way. She opens her cleaning-log. It lies flat on her workbench. She writes the name of the data at the top of a new page. Then she looks at her students. “I am Tidy,” she says. “The skill I teach is cleaning. The main rule is: write down every choice. Every cleaning step has other ways to do it. The way you choose changes the whole analysis. Make your choices visible.”

She teaches her students the cleaning rules:

Read Catch’s notes first. You need to know how the data was gathered.
Look closely at the data before cleaning. Check the first 20 rows. Look at the summary numbers, like the count, the average, the smallest, and the largest. How are the numbers spread out? Know what you start with.
Find the cleaning problems. Are there missing numbers? Are there duplicates? Any typos? Outliers? Do formats match?
For each problem, list the different ways to fix it. If numbers are missing, you could drop the row. You could fill it with the average. Or the middle number, called the median. Or even a smart guess. You could also leave it missing. Each way has good and bad sides.
Choose on purpose. Don’t just pick the first thing. Know why you are choosing it.
Write down your choice in the cleaning-log. Note the date and the data. What did you do? Why? What did it look like before? What does it look like now?
Keep the original data safe. Never write over the raw data. Always work on a copy.
Share the cleaning-log. Other people who use the data will need it. Future-you will need it too. The log is part of the data.
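The rules above could be sketched in code like this. It is only one possible shape for a cleaning-log entry, not the way Tidy's notebook is actually laid out. The function names, the field names, the `weight_kg` column, and the reason given for the choice are all hypothetical.

```python
import copy
from datetime import date
from statistics import median


def summarize(rows):
    """Count rows and missing weights so the log can show before and after."""
    weights = [row["weight_kg"] for row in rows]
    return {"rows": len(rows), "missing": sum(w is None for w in weights)}


def fill_missing_weights(raw_rows, dataset_name, log):
    """Fill missing weights with the median and record the choice in the log."""
    rows = copy.deepcopy(raw_rows)  # work on a copy; the raw data stays untouched
    before = summarize(rows)
    known = [row["weight_kg"] for row in rows if row["weight_kg"] is not None]
    fill_value = median(known)
    for row in rows:
        if row["weight_kg"] is None:
            row["weight_kg"] = fill_value
    # One log entry: date, data, what was done, why, before, and after.
    log.append({
        "date": date.today().isoformat(),
        "dataset": dataset_name,
        "problem": "missing weight_kg",
        "alternatives": ["drop row", "fill with average", "fill with median", "leave missing"],
        "choice": "fill with median",
        "why": "hypothetical reason: keep every row; median is less pulled by extreme values",
        "before": before,
        "after": summarize(rows),
    })
    return rows


# Made-up raw data for illustration.
raw = [
    {"sack": 1, "weight_kg": 10.0},
    {"sack": 2, "weight_kg": None},
    {"sack": 3, "weight_kg": 14.0},
]
cleaning_log = []
cleaned = fill_missing_weights(raw, "grain_sacks", cleaning_log)
print(cleaning_log[0])
```

Because the function works on a copy, the raw rows still show the missing weight afterward. The log entry travels with the cleaned data, so the next reader can see what was done and disagree with it if they need to.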

Tidy is very clear about mistakes. “I sometimes make a cleaning choice that’s wrong,” she says. “I find out later. That’s not failing. That’s why the log is here. I can go back. I can change my choice. Then I update the log.” She smiles. “Being clear about it is the main thing.”

Students often ask Tidy if data cleaning is hard. Tidy always gives the same answer. “It is not hard,” she says. “It’s about choosing on purpose. And writing things down carefully. Every cleaning choice changes the meaning. So, write down the choices.”

She closes her cleaning-log gently. The next set of data is waiting. It’s ready for Tidy’s careful hands.

The DataForge ensemble

Tidy is part of DataForge's distributed-narrative cast. Each character embodies a different curricular primitive; together they teach the full subject.

Tidy

Catch

Graph

Tell

Guard
