Episode summary

In Episode 2 of '100 Days of Data,' Jonas and Amy dive into the foundational concept that not all data is created equal. They explore the three primary types of data (structured, unstructured, and semi-structured), offering real-world examples and insights into how each impacts AI workflows. From neatly organized spreadsheets to free-text notes and flexible JSON files, understanding these categories is crucial for effective data management and successful AI outcomes. Listeners learn how data formats influence everything from storage and processing to the performance of models and tools. With practical stories from finance, healthcare, and retail, this episode demystifies complex data types and emphasizes the value of identifying them early in any AI project.

Episode transcript

JONAS: Welcome to Episode 2 of 100 Days of Data. I'm Jonas, an AI professor here to explore the foundations of data in AI with you.
AMY: And I'm Amy, an AI consultant, excited to bring these concepts to life with stories and practical insights. Glad you're joining us.
JONAS: Not all data is created equal. That’s our hook for today—understanding the different types of data can unlock how businesses really leverage AI.
AMY: Absolutely, Jonas. I see so many companies jump straight into AI projects without really stopping to think about the kind of data they have. And trust me, that can slow things down fast.
JONAS: Let’s start by setting the stage. At a very high level, data is often categorized into three types: structured, unstructured, and semi-structured. Each comes with a distinct format and use case.
AMY: So, structured data is the stuff most people are familiar with, right? Like spreadsheets or databases?
JONAS: Exactly. Structured data lives in neat rows and columns. Think of a spreadsheet with names, dates, sales numbers—all clearly defined fields. This makes it straightforward to store, search, and analyze.
AMY: And businesses love this kind of data because it’s easy to plug into analytics tools or AI models. For instance, retail companies track sales transactions in structured databases to predict demand or segment customers.
JONAS: Right. The history of structured data actually dates back decades, tied closely to relational databases developed in the 1970s. These databases enforce a schema, a blueprint that defines what kind of data belongs where.
AMY: That schema is a lifesaver in many projects. I remember working with a financial firm where every transaction had an exact place in their tables—date, amount, merchant code. Because of that structure, they quickly built fraud detection models.
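To make the schema idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are hypothetical, loosely modeled on the date, amount, and merchant code fields Amy mentions; they are not taken from any real system. It shows a schema accepting a well-formed row and rejecting one with a missing value.

```python
import sqlite3

# In-memory database; the table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE transactions (
        txn_date      TEXT NOT NULL,
        amount        REAL NOT NULL CHECK (amount >= 0),
        merchant_code TEXT NOT NULL
    )
    """
)

# A row that matches the schema is accepted.
conn.execute(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    ("2024-01-15", 42.50, "5411"),
)

# A row with a missing amount violates NOT NULL and is rejected.
try:
    conn.execute(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        ("2024-01-16", None, "5812"),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
print(rejected, row_count)
```

Because the database refuses malformed rows at write time, anything downstream, such as a fraud model, can rely on every record having the same fields.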
JONAS: But not all data fits so neatly into rows and columns. That brings us to unstructured data. This includes text documents, images, audio files, videos—data without a predefined model.
AMY: Unstructured data is massive in the real world. Social media posts, customer reviews, emails—you name it. For example, I helped a healthcare company analyze doctors’ notes and patient feedback, which were all unstructured text. Extracting insights from that wasn’t easy without the right AI tools.
JONAS: Definitely. Unstructured data is raw and more difficult to analyze directly. It usually requires preprocessing steps to convert it into something machines can understand, like vectors or tokens.
AMY: And that’s where AI techniques like Natural Language Processing come in, right? I’ve seen chatbots trained on mountains of unstructured customer emails to provide automated support.
JONAS: Exactly. It’s fascinating because unstructured data has grown explosively with the digital age. Early databases couldn’t handle this type of data well, but advances in machine learning have opened up tremendous possibilities.
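The preprocessing step Jonas describes, turning raw text into tokens and then into vectors, can be sketched in a few lines. This is a simplified bag-of-words example using only the standard library, with made-up sample notes; production NLP systems typically use more sophisticated tokenizers and learned embeddings.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Made-up free-text notes standing in for unstructured data.
notes = [
    "Patient reports mild headache.",
    "Headache resolved; patient reports no pain.",
]

tokenized = [tokenize(note) for note in notes]

# The vocabulary is every distinct token seen across all notes, sorted.
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# Each note becomes a fixed-length numeric vector a model can consume.
vectors = [bag_of_words(toks, vocabulary) for toks in tokenized]

print(vocabulary)
print(vectors)
```

The point of the sketch is only that free text has no columns until a preprocessing step creates them.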
AMY: But wait, what about semi-structured data? That’s the middle ground?
JONAS: Correct. Semi-structured data doesn’t fit the strict schema of structured data, but it contains tags or markers to separate elements. Examples include JSON files, XML documents, or even emails with headers.
AMY: I work with semi-structured data all the time. Take JSON—it’s everywhere in APIs and web services. For example, automotive companies aggregate sensor data from cars in JSON format to monitor vehicle health in real time.
JONAS: Semi-structured data is flexible. It allows records to have varying fields, so you can represent complex information without a rigid schema. This is often a good fit for databases designed for flexible records, such as NoSQL document stores.
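As a small illustration of records with varying fields, the sketch below parses a hypothetical JSON payload of vehicle sensor readings. The field names and values are invented for this example. Each record carries its own keys, and a fixed set of columns only appears once we decide to flatten the data.

```python
import json

# Hypothetical sensor payload: every record has its own set of fields.
raw = """
[
  {"vehicle_id": "A1", "speed_kmh": 88, "engine_temp_c": 92.5},
  {"vehicle_id": "B2", "speed_kmh": 61, "tire_pressure_kpa": [220, 221, 219, 222]},
  {"vehicle_id": "C3", "battery": {"charge_pct": 74}}
]
"""

records = json.loads(raw)

# The union of keys across records is the "schema" discovered after the fact.
all_fields = sorted({key for rec in records for key in rec})

# Flatten to fixed columns, using None wherever a record lacks a field.
rows = [[rec.get(field) for field in all_fields] for rec in records]

print(all_fields)
for row in rows:
    print(row)
```

The keys act as the tags or markers Jonas mentions: they label each value, but nothing forces two records to share the same set.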
AMY: One challenge is that semi-structured data can be messy. A client once gave me logs from their manufacturing equipment, full of inconsistent tags. Cleaning and standardizing this data was a big part of the project before we could apply predictive maintenance models.
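The kind of cleanup Amy describes might look roughly like the sketch below. The log records, the tag spellings, and the alias table are all hypothetical, and real equipment logs would need a mapping built from the actual data.

```python
# Hypothetical alias table mapping inconsistent tags to one canonical name.
CANONICAL = {
    "temp": "temperature_c",
    "temperature": "temperature_c",
    "tmp_c": "temperature_c",
    "vib": "vibration_mm_s",
    "vibration": "vibration_mm_s",
}


def normalize_key(key):
    """Trim, lowercase, and unify separators, then apply the alias table."""
    cleaned = key.strip().lower().replace("-", "_").replace(" ", "_")
    return CANONICAL.get(cleaned, cleaned)


def normalize_record(record):
    """Return a copy of the record with every key normalized."""
    return {normalize_key(key): value for key, value in record.items()}


# Made-up equipment logs with inconsistent tag names.
logs = [
    {"Temp": 71.2, "VIB": 0.8},
    {"temperature ": 69.9, "vibration": 1.1},
    {"tmp-c": 70.4, "machine id": "M7"},
]

cleaned = [normalize_record(rec) for rec in logs]
for rec in cleaned:
    print(rec)
```

Keys that are not in the alias table pass through with only the basic cleanup applied, which makes unexpected tags easy to spot and add to the mapping later.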
JONAS: That’s a great example of how understanding the type of data shapes the entire AI pipeline—from storage and cleaning to analysis and modeling.
AMY: To sum it up, knowing whether your data is structured, unstructured, or semi-structured isn’t just academic; it impacts the tools you choose, the time you spend preparing data, and ultimately your AI success.
JONAS: And as data grows in variety and volume, this classification helps us think critically about the right frameworks and architectures to use.
AMY: Jonas, sometimes I wonder if this classification feels a bit too rigid given how hybrid our data can be. Like, a single customer record might have a table row of purchases but also free-text notes and images. How do we tackle that?
JONAS: That’s an insightful point. Many modern systems are designed to be polyglot—handling multiple data types simultaneously. The field is moving toward integrating these to form a 360-degree view of data.
AMY: I see that especially in retail, where combining structured sales data with customer sentiment from social media can drive better recommendations. It’s the hybrid approach powering smarter AI.
JONAS: So, understanding these three types is foundational, but so is recognizing that real-world data pipelines often intersect between them.
AMY: Agreed. Also, the challenges differ. Structured data usually has high quality and consistency, unstructured data is large and noisy, and semi-structured data can be unpredictable.
JONAS: For the curious, diving deeper into data schemas, data lakes, and data warehouses can illuminate how organizations architect their information depending on data types.
AMY: And from the business side, knowing your data types early prevents costly surprises. I’ve seen projects stall because teams underestimated the complexity of unstructured data.
JONAS: That’s why educating teams on these basics is so important. It lays the groundwork for effective AI adoption.
AMY: Before we wrap, here’s a quick story: A healthcare startup I worked with tried to build AI models on medical records they assumed were all unstructured, without realizing that half the data was stored differently across sites, in a mix of unstructured notes and semi-structured XML. Once we identified these types, we restructured their data pipelines, improving model accuracy dramatically.
JONAS: That’s a perfect case showing theory meeting practice. Recognizing data formats leads directly to better AI outcomes.
AMY: Definitely. So, key takeaway time?
JONAS: Sure. Understanding the fundamental types of data—structured, unstructured, and semi-structured—provides the necessary lens to design AI systems effectively and choose appropriate tools.
AMY: And from my angle: knowing your data types upfront saves time, reduces risk, and helps you create solutions that really drive business impact.
JONAS: Next time, we’ll look at the sources of data—where this data comes from and how organizations collect it.
AMY: Excited about that one! The origin story of data is often overlooked but so critical.
JONAS: If you're enjoying this, please like or rate us five stars in your podcast app. We’d love to hear your comments or questions—some might just make it into a future episode.
AMY: Thanks for spending time with us. Until tomorrow — stay curious, stay data-driven.

Next up

Tomorrow, discover where data really comes from and how organizations gather it in Episode 3.