Episode summary

In Episode 31 of '100 Days of Data,' titled 'Data Preparation for AI,' Jonas and Amy explore why up to 80% of AI work revolves around data cleaning, preprocessing, and feature engineering. They break down these critical steps in transforming raw, messy data into structured formats ready for modeling. From handling missing values and duplicates to encoding categories and scaling features, the hosts blend theory with vivid real-world examples—including use cases in automotive, retail, logistics, and healthcare. Listeners also learn how effective feature engineering combines creativity and domain knowledge to boost model performance. Whether you're building AI applications or advising businesses, this episode provides a clear roadmap to data preparation as the foundation of trustworthy, high-impact AI systems.

Episode transcript

JONAS: Welcome to Episode 31 of 100 Days of Data. I'm Jonas, an AI professor here to explore the foundations of data in AI with you.
AMY: And I, Amy, an AI consultant, excited to bring these concepts to life with stories and practical insights. Glad you're joining us.
JONAS: Here’s a powerful fact to start us off — about 80% of AI work is cleaning data.
AMY: Which is crazy if you think about it! The flashy AI models get all the spotlight, but without clean data, those models are basically flying blind.
JONAS: That’s exactly why today’s episode is all about data preparation for AI. We’ll dive into the theory behind why data cleaning and preprocessing are foundational, and Amy will share how this looks in real businesses.
AMY: Perfect. So Jonas, let’s start simple — when we say data preparation, what do we actually mean?
JONAS: Great place to start. At its core, data preparation is the process of transforming raw data into a form that’s suitable for training AI models. Raw data—whether it’s spreadsheets, sensor readings, or customer info—is often messy. It can have missing values, errors, inconsistencies, or irrelevant parts.
JONAS: We call the process of fixing these issues 'cleaning.' Then there’s 'preprocessing,' which often involves normalizing or scaling data, converting it into formats the AI algorithms can understand. And finally, 'feature engineering,' which is creating new input variables that better represent the problem we're trying to solve.
AMY: That sounds like a lot of pieces, but it makes sense. In the trenches, I've seen that skipping or rushing these steps comes back to bite companies hard. One example was with an automotive client using sensor data from engine tests. The raw sensor streams were noisy and had missing chunks. Before we even touched any model, we had to smooth out those measurements and fill in gaps—otherwise, the AI would misinterpret the engine’s performance.
JONAS: Exactly. To put it simply, if your data is like ingredients, data preparation is the cooking step that makes them usable. Raw ingredients might be spoiled, chopped the wrong way, or missing altogether — and no recipe, no matter how good, will save a dish made from bad ingredients.
AMY: I love that analogy! And to add, sometimes you need to invent new ingredients, like spices. That’s feature engineering. For example, in healthcare, combining heart rate and activity level into a single ‘cardio stress’ feature often helps models predict risk better than raw numbers alone.
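A combined feature like the one Amy describes can be sketched in pandas. The column names and the combination rule here are purely illustrative, not an actual clinical formula:

```python
import pandas as pd

# Hypothetical wearable readings for three patients
df = pd.DataFrame({
    "heart_rate": [70, 110, 150],        # beats per minute
    "activity_level": [3.0, 2.0, 1.0],   # arbitrary activity score
})

# Engineered feature: a high heart rate at low activity suggests strain,
# so the ratio captures something neither raw column does alone
df["cardio_stress"] = df["heart_rate"] / df["activity_level"]
```

The point is not the exact formula but that the new column encodes domain knowledge the model would otherwise have to discover on its own.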
JONAS: Nice example. Historically, feature engineering was one of the most important parts of AI work. Before deep learning took over, models almost entirely depended on the quality of those features. Even now, smart features can significantly boost performance and reduce training time.
AMY: From the business side, it’s a game-changer. I remember helping a retail chain improve their sales forecasts. They engineered features incorporating holidays, weather patterns, and local events—which sounds obvious now but was new to the team. Those features lifted forecast accuracy by more than 15%.
JONAS: So let’s unpack data cleaning first since it comes at the start of the preparation pipeline. What kinds of problems typically require cleaning?
AMY: Well, missing data is huge. Sometimes there are simply blank fields, sometimes entire rows are missing critical info. Then there are errors—typos, duplicated entries, or out-of-range values, like a recorded age of 200 years. There are also inconsistent formats—dates that switch between MM/DD/YYYY and DD/MM/YYYY, or phone numbers with different country codes.
JONAS: Yes. Handling missing data is a key technique. You can drop rows or columns with too many missing values, but that risks losing important information. More often, you fill in—or impute—missing values using statistics like the mean or median, or more complex algorithms that estimate them based on other data.
AMY: I’ve seen simple imputations work surprisingly well in practice. At one financial client, filling missing credit score values with median scores allowed the AI to start producing solid risk assessments quickly. Later they developed more refined imputations.
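A median imputation like the one Amy describes takes only a few lines of pandas. The column names and values here are hypothetical:

```python
import pandas as pd

# Hypothetical loan applications with missing credit scores
df = pd.DataFrame({
    "applicant_id": [1, 2, 3, 4, 5],
    "credit_score": [720.0, None, 680.0, None, 700.0],
})

# Fill missing scores with the column median (700.0 for this data);
# the median is more robust to outliers than the mean
median_score = df["credit_score"].median()
df["credit_score"] = df["credit_score"].fillna(median_score)
```

As Jonas notes, dropping the rows instead would discard the other information those applicants carry, which is why imputation is often preferred.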
JONAS: Another big topic in cleaning is duplicate entries. Duplicate data can distort analyses and bias models. Deduplication is the process of finding and removing those copies. It sounds trivial but can be tricky, especially when entries are nearly identical, just formatted differently.
AMY: Totally. One challenge was a healthcare provider merging patient records from multiple clinics. Simple ID matching wasn’t enough, because the same person’s name might be spelled slightly differently. We had to use fuzzy matching algorithms combined with manual review to ensure accuracy.
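A minimal version of the fuzzy matching Amy mentions can be built on Python's standard library, which scores how alike two strings are. The names and the 0.85 threshold are illustrative; real record linkage typically layers more signals (birth date, address) on top:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; higher means more alike."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical patient names from two clinic databases
pairs = [
    ("Jonathan Smith", "Jonathon Smith"),   # likely the same person
    ("Jonathan Smith", "Maria Gonzalez"),   # clearly different people
]

# Flag pairs above a chosen threshold for manual review
flags = [similarity(a, b) > 0.85 for a, b in pairs]
```

Matches above the threshold would then go to the human reviewers Amy describes rather than being merged automatically.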
JONAS: That’s a great example of the human-in-the-loop aspect. While many cleaning parts can be automated, critical judgment calls often require domain expertise.
AMY: Moving onto preprocessing — this is where we put data into the right shape for the model. For example, many AI algorithms expect numerical inputs normalized in a certain range, like 0 to 1. So we apply scaling or standardization to features.
JONAS: Yes. Imagine you have two features: one measuring income in dollars, with values in the tens of thousands, and another measuring age in years, usually below 100. Without scaling, the larger numbers dominate calculations, biasing models. Normalization equalizes this.
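The income-and-age situation Jonas describes can be handled with min-max normalization, sketched here on made-up values:

```python
import pandas as pd

# Hypothetical features on very different scales
df = pd.DataFrame({
    "income": [30000, 60000, 90000],   # dollars
    "age": [25, 40, 55],               # years
})

# Min-max normalization: rescale every column to the [0, 1] range
normalized = (df - df.min()) / (df.max() - df.min())
```

After this step both columns span the same range, so neither dominates distance or gradient calculations purely because of its units.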
AMY: Another preprocessing step is encoding categorical variables. AI models don’t speak English or category names — they speak numbers. So categories like “Red,” “Blue,” “Green” for a product color must be converted, typically into one-hot vectors or ordinal values.
JONAS: Good point. Choosing the encoding method affects model performance and interpretability. One-hot encoding creates binary flags for each category, which is useful when categories have no inherent order. Ordinal encoding assigns numeric ranks where order matters.
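Both encodings Jonas contrasts are one-liners in pandas. The product colors and sizes here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Red"]})

# One-hot encoding: one binary flag column per category,
# appropriate because colors have no inherent order
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: numeric ranks, appropriate only when order matters
sizes = pd.Series(["small", "large", "medium"])
size_rank = sizes.map({"small": 0, "medium": 1, "large": 2})
```

Using ordinal codes for unordered categories like color would falsely tell the model that, say, Green is "greater than" Blue, which is exactly the interpretability trap Jonas warns about.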
AMY: Also, sometimes preprocessing includes text cleaning for NLP models — removing punctuation, stop words, or converting words to their root forms, like “running” to “run.”
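The text-cleaning steps Amy lists can be sketched with the standard library alone. The stop-word list here is a tiny illustrative sample, and real pipelines would use a proper stemmer or lemmatizer (e.g. from NLTK or spaCy) for the "running" to "run" step:

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "and", "to"}  # tiny illustrative list

def clean_text(text: str) -> list[str]:
    # Lowercase, replace punctuation with spaces, then drop stop words
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return [word for word in text.split() if word not in STOP_WORDS]

tokens = clean_text("The engine is running, and the pump failed!")
```

The output keeps only the content-bearing words, which is the raw material most NLP models actually consume.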
JONAS: That feeds into the idea that data preparation varies a lot depending on the data type — structured tables, images, sensor time series, text, and more. The principles remain but implementations differ.
AMY: Yes, and that’s one thing I emphasize with clients. Don’t think of data prep as a one-time chore — it’s deeply tied to the problem, data source, and model. It’s iterative and requires constant refinement.
JONAS: Before we wrap, let’s revisit feature engineering briefly. It’s about creativity as much as science. For example, combining existing data columns to create new ones, extracting trends from time series, or aggregating information over groups.
AMY: I love sharing this story. For a logistics company, raw GPS data gave locations and speeds. But after creating new features like “average speed over last 10 minutes” and “time spent idling,” the AI could better predict maintenance needs — reducing downtime and costs significantly.
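The two features Amy describes map naturally onto pandas window and cumulative operations. The speed readings below are hypothetical one-minute samples for a single vehicle:

```python
import pandas as pd

# Hypothetical one-minute GPS speed readings (km/h) for one truck
speeds = pd.Series([55.0, 60.0, 0.0, 0.0, 58.0, 62.0])

# Rolling average speed over the last 3 readings
# (min_periods=1 so the first rows still get a value)
avg_speed_3 = speeds.rolling(window=3, min_periods=1).mean()

# Cumulative minutes spent idling: count of near-zero readings so far
idle_minutes = (speeds < 1.0).cumsum()
```

Each derived column turns a raw stream into a signal the model can use directly, which is the essence of the maintenance-prediction story.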
JONAS: That story underlines how feature engineering connects domain knowledge with data science. It often requires people who understand the business deeply alongside technical experts.
AMY: Absolutely. And sometimes, the data preparation effort dwarfs the time spent on modeling. But when done right, it pays off with better accuracy, more trust in results, and smoother deployment.
JONAS: To sum up, clean, well-prepared data is the foundation of any successful AI project. It’s where theory meets the messy realities of the real world.
AMY: And from the business angle, smart data preparation drives better decisions, more reliable AI outputs, and ultimately, competitive advantage.
JONAS: Key takeaway — investing time and effort upfront into data cleaning, preprocessing, and feature engineering can make or break your AI initiatives.
AMY: Yep, and don’t underestimate the power of thoughtful feature engineering to unlock insights hidden in your data.
JONAS: Next time on 100 Days of Data, we’ll dive into model training and testing — how we turn prepared data into AI that learns and generalizes.
AMY: Can’t wait! If you're enjoying this, please like or rate us five stars in your podcast app. And if you have questions or topics you'd like us to cover, send them our way — we might feature them in future episodes.
AMY: Until tomorrow — stay curious, stay data-driven.

Next up

Next time, Jonas and Amy dive into model training and testing — where AI learns from clean, prepared data.