Episode summary

In Episode 8 of '100 Days of Data,' Jonas and Amy dive into the world of data formats—those essential structures like TXT, CSV, XML, and JSON that shape how information is stored, shared, and understood in AI systems. They break down the strengths and tradeoffs of each format, from the simplicity of plain text to the hierarchical power of XML and the lightweight flexibility of JSON. The conversation also explores how binary formats like JPEG, MP4, and others serve non-textual data like images and video. Real-world examples from healthcare, retail, and automotive industries illustrate how choosing the right format impacts everything from data integration to AI model success. This episode is a must for understanding the invisible backbone of data pipelines.

Episode transcript

JONAS: Welcome to Episode 8 of 100 Days of Data. I'm Jonas, an AI professor here to explore the foundations of data in AI with you.
AMY: And I, Amy, an AI consultant, excited to bring these concepts to life with stories and practical insights. Glad you're joining us.
JONAS: Ever wondered why TXT, CSV, XML and JSON rule the world of data? Today, we’re going to unravel that mystery.
AMY: Yeah, these formats are behind so much of what we do with data every day — but how often do we actually think about what they are or why they matter? Let’s dive in.
JONAS: To start, it’s helpful to think of data formats as different languages for storing and exchanging information. Each has its own strengths and quirks, just like spoken languages.
AMY: That’s a great analogy. And just like you wouldn’t use Shakespearean English to order coffee, you pick the best format for the job. So what’s the simplest format to start with?
JONAS: Plain text, or TXT files, are probably the simplest. They store data as raw characters, readable to humans and machines alike. It’s just unformatted text, no bells or whistles.
AMY: I run into TXT files all the time—logs from systems, rough notes, or even simple data dumps. But while TXT is super simple, it doesn’t organize data well, right? There’s no structure, so it’s hard to parse or analyze large datasets stored this way.
JONAS: Exactly. That’s why more structured formats emerged. Enter CSV, which stands for Comma-Separated Values. It’s a way to put tabular data—think spreadsheets or database rows—into a plain text file, using commas to separate values.
AMY: Oh, CSV is everywhere in business. I’ve seen automotive companies export vehicle sensor data in CSV to quickly share with engineers. It’s lightweight and easy to open anywhere—from Excel to Python scripts.
JONAS: Plus, CSV is remarkably straightforward: rows represent records, and commas split each field. But, as simple as it is, CSV has limitations. It doesn’t handle nested or hierarchical data well. For example, if you want to store a customer order that contains multiple items, CSV struggles.
AMY: Right, because CSV flattens everything into rows and columns. In retail, where orders often have multiple line items, CSV can get messy or require multiple linked files. That’s where newer formats like XML and JSON come in.
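To illustrate the flattening Amy describes, here is a minimal Python sketch. The column names and order values are hypothetical, chosen only for illustration. It shows how one order with two line items repeats its order-level fields on every CSV row, and how code has to regroup the rows to recover the hierarchy.

```python
import csv
import io

# Hypothetical data: one order with two line items, flattened into CSV.
# The order-level fields (order_id, customer) repeat on every row.
CSV_TEXT = """order_id,customer,sku,quantity
1001,Dana,SKU-1,2
1001,Dana,SKU-2,1
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def group_by_order(rows):
    """Rebuild the order -> items hierarchy that the flat file cannot express."""
    orders = {}
    for row in rows:
        order = orders.setdefault(
            row["order_id"], {"customer": row["customer"], "items": []}
        )
        # CSV carries no type information, so quantity arrives as a string.
        order["items"].append({"sku": row["sku"], "quantity": int(row["quantity"])})
    return orders


rows = read_rows(CSV_TEXT)
orders = group_by_order(rows)
print(orders)
```

The regrouping step is the extra work a flat format pushes onto every consumer of the file.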
JONAS: XML, or Extensible Markup Language, appeared in the late 1990s, designed to store and transport data with a flexible, hierarchical structure. It uses paired opening and closing tags to define elements and nest information.
AMY: I remember working with XML in healthcare projects. It was the standard for exchanging clinical documents. Its strict tags made it easy to validate data, which is huge when you’re dealing with sensitive info like patient records.
JONAS: Indeed. XML’s strength is in its ability to represent complex data hierarchies and schemas. However, it’s quite verbose — making XML files larger and sometimes slower to parse.
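As a rough sketch of the hierarchy Jonas mentions, the snippet below parses a small XML document with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`order`, `customer`, `items`, `item`, `sku`, `quantity`) are made up for illustration and do not come from any particular standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document; tag and attribute names are illustrative only.
XML_TEXT = """<order id="1001">
  <customer>Dana</customer>
  <items>
    <item sku="SKU-1" quantity="2"/>
    <item sku="SKU-2" quantity="1"/>
  </items>
</order>"""

root = ET.fromstring(XML_TEXT)

# Child elements are reached by tag name; attributes come back as strings.
customer = root.findtext("customer")
items = [(item.get("sku"), int(item.get("quantity"))) for item in root.iter("item")]

print(root.tag, root.get("id"), customer, items)
```

Even in this tiny document, every element name appears in markup around the data, which is where the verbosity comes from.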
AMY: That verbosity can be a pain. I’ve seen companies waste storage and bandwidth just because XML files are bloated. Plus, XML’s syntax can be intimidating for non-technical folks who need to audit data.
JONAS: To address some of those concerns, JSON (JavaScript Object Notation) gained popularity in the early 2000s. JSON is concise, readable, and easy to parse. It represents data as key-value pairs and arrays, making it ideal for structured and hierarchical data.
AMY: JSON has been a game changer in tech. Most APIs now send data in JSON format. When I built an AI-powered recommendation engine for a retailer, we used JSON to move product and user data around. Its simplicity helped speed up development and integration.
JONAS: JSON files look similar to how you might describe data in natural language. For example, a customer order in JSON might look like this: an object with keys for "orderId", "customer", and an array listing the "items" purchased. Each item is itself an object.
AMY: And that nesting is huge. It matches how we mentally think about data relationships. It also maps easily to programming languages’ data structures—so no complicated parsing logic is needed.
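The order described above can be sketched in Python as follows. The keys "orderId", "customer", and "items" come from the conversation. The values and the per-item fields (`sku`, `quantity`) are assumptions added for illustration.

```python
import json

# Order object with the keys named in the episode; values are hypothetical.
JSON_TEXT = """
{
  "orderId": 1001,
  "customer": "Dana",
  "items": [
    {"sku": "SKU-1", "quantity": 2},
    {"sku": "SKU-2", "quantity": 1}
  ]
}
"""

# json.loads maps JSON objects to dicts and JSON arrays to lists.
order = json.loads(JSON_TEXT)
total_units = sum(item["quantity"] for item in order["items"])

# Serializing and parsing again returns an equal structure.
round_tripped = json.loads(json.dumps(order))

print(order["customer"], total_units)
```

Because the parsed result is already a dict containing a list of dicts, the nesting needs no regrouping step before the data can be used.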
JONAS: Before we move on, it’s important to remember these formats mostly focus on textual or tabular data. But what about non-text data like images and video?
AMY: Great point. In the AI world, images and videos are massive data sources, especially for computer vision tasks. They’re stored in binary formats — like JPEG or PNG for images, and MP4 or AVI for videos — which encode data very differently.
JONAS: Binary formats are optimized for efficient storage and processing of large multimedia files. Unlike text formats, they don’t aim to be human-readable. They compress data and organize it in ways that software and hardware can quickly decode.
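One way to see that binary formats are organized for software, not for human readers, is to look at the fixed byte signatures at the start of a file. The sketch below checks for the well-known PNG and JPEG signatures. The function name `sniff_image_format` is hypothetical, and real projects would usually rely on an imaging library instead.

```python
# Every PNG file begins with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# JPEG files begin with the start-of-image marker FF D8, followed by FF.
JPEG_SIGNATURE = b"\xff\xd8\xff"


def sniff_image_format(header: bytes) -> str:
    """Guess the image format from the first bytes of a file."""
    if header.startswith(PNG_SIGNATURE):
        return "png"
    if header.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return "unknown"


# Usage with in-memory bytes; with a real file, read the first few bytes
# in binary mode: open(path, "rb").read(8)
print(sniff_image_format(b"\x89PNG\r\n\x1a\n" + b"\x00" * 4))
print(sniff_image_format(b"\xff\xd8\xff\xe0" + b"\x00" * 4))
print(sniff_image_format(b"order_id,customer\n"))
```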
AMY: In practical terms, this means a healthcare AI system analyzing X-rays needs the images in a high-quality binary format. Similarly, autonomous cars process continuous video streams to detect obstacles. These files are huge and require specialized storage.
JONAS: Exactly. And while the image or video data itself is in a binary format, the associated metadata about those files—like timestamp, resolution, or patient info—is often stored separately in text-based formats like XML or JSON.
AMY: I actually saw a project where an automotive company used JSON files to accompany video files from dash cams. The JSON carried sensor data and event labels, syncing them with video timestamps. This combined approach delivered much richer AI insights.
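A hedged sketch of how such a JSON sidecar might be used: the file name, field names (`video_file`, `events`, `t`, `label`, `speed_kmh`), and values below are all hypothetical, not details from the project Amy mentions. The lookup assumes the events are sorted by timestamp in seconds.

```python
import json
from bisect import bisect_right

# Hypothetical sidecar that accompanies a video file; all fields are assumed.
SIDECAR_JSON = """
{
  "video_file": "dashcam_clip.mp4",
  "events": [
    {"t": 0.0, "label": "clear_road", "speed_kmh": 48.0},
    {"t": 2.5, "label": "pedestrian_ahead", "speed_kmh": 41.5},
    {"t": 4.0, "label": "braking", "speed_kmh": 22.0}
  ]
}
"""


def event_at(events, video_time):
    """Return the latest event at or before video_time (seconds), or None.

    Assumes events are sorted by their "t" timestamp.
    """
    times = [event["t"] for event in events]
    index = bisect_right(times, video_time) - 1
    return events[index] if index >= 0 else None


sidecar = json.loads(SIDECAR_JSON)
current = event_at(sidecar["events"], 3.1)
print(current["label"])
```

Keeping the labels in a text sidecar means they can be inspected or corrected without touching the video file itself.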
JONAS: That’s a great example of combining formats thoughtfully. Understanding their strengths and limitations helps us design better data pipelines.
AMY: Speaking of pipelines, I find that many companies struggle with the “format mismatch” problem. They get data in CSV from one system, JSON from another, images and videos from sensors—and suddenly integration gets complicated.
JONAS: That’s true. Converting between formats is often more than a technical headache; it impacts data quality and timely access to insights.
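To make the data-quality point concrete, here is a small sketch of a CSV-to-JSON conversion. The sensor columns and values are hypothetical. CSV stores every value as text, so the converter has to decide how to treat types and empty cells, and those decisions affect the data downstream.

```python
import csv
import io
import json

# Hypothetical sensor export; the second row has an empty reading.
CSV_TEXT = "sensor_id,reading,unit\nA1,21.5,C\nA2,,C\n"


def csv_to_records(text):
    """Convert CSV text to a list of dicts with explicit type decisions."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        reading = row["reading"]
        records.append(
            {
                "sensor_id": row["sensor_id"],
                # Design choice: an empty cell becomes None (JSON null),
                # not 0.0, so missing data stays visibly missing.
                "reading": float(reading) if reading else None,
                "unit": row["unit"],
            }
        )
    return records


records = csv_to_records(CSV_TEXT)
json_text = json.dumps(records, indent=2)
print(json_text)
```

Mapping the empty cell to 0.0 instead would silently turn a missing measurement into a real-looking one, which is the kind of quality issue format conversion can introduce.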
AMY: From my consulting experience, investing early in choosing the right data formats — or using flexible ones like JSON — can save time down the road. It also future-proofs the system as AI models evolve.
JONAS: So, to summarize: TXT is plain text and great for simplicity, CSV adds structure for tables, XML is detailed and hierarchical but verbose, JSON is lightweight and flexible, perfect for most modern use cases, and binary formats handle images and video.
AMY: And in the real world, it’s usually a blend. Think of formats like tools in a toolbox—each designed for specific types of data and tasks. Knowing when and why to use each makes all the difference in effective AI projects.
JONAS: Key takeaway for our listeners: Data formats are the unsung backbone of AI and data workflows. They shape how we store, share, and understand data—and choosing the right one is critical.
AMY: Absolutely. From my side, I’d say don’t overlook data formats as a purely technical detail. They influence speed, scalability, and even the success of your AI initiatives.
JONAS: Next episode, we’ll explore metadata—the data about data—which ties closely to what we discussed today and is vital for organizing and finding information efficiently.
AMY: If you're enjoying this, please like or rate us five stars in your podcast app. We’d love to hear your questions or comments about data formats, which might even feature in future episodes.
AMY: Until tomorrow — stay curious, stay data-driven.

Next up

In the next episode, Jonas and Amy explore metadata—how data about data powers organization, search, and smarter AI systems.