Episode summary

In Episode 6 of '100 Days of Data,' Jonas and Amy delve into the essential process of data collection—the starting point of any AI journey. They explore four major methods: surveys for gathering subjective insights; logs that capture automated digital interactions; sensors that provide real-time physical measurements; and web scraping, a technical way to extract online information. Through real-world examples and practical advice, the hosts highlight how different methods suit different goals—and how combining multiple sources can lead to richer, more accurate data for AI models. They also stress the importance of data quality, privacy, and ethical practices in collection. Whether it’s understanding customer behavior or tracking machine performance, this episode lays the groundwork for smarter, more responsible AI development.

Episode video

Episode transcript

JONAS: Welcome to Episode 6 of 100 Days of Data. I'm Jonas, an AI professor here to explore the foundations of data in AI with you.
AMY: And I, Amy, an AI consultant, excited to bring these concepts to life with stories and practical insights. Glad you're joining us.
JONAS: Let’s kick things off with this: Surveys and sensors — the art of gathering information. That’s what today’s episode is all about.
AMY: I love that. Data collection is where everything begins, right? Without good data, even the smartest AI is just guessing in the dark.
JONAS: Exactly, Amy. So, to get started, let’s talk about what data collection actually means. At the simplest level, it’s the process of gathering information to answer questions or solve problems. But the methods we use vary widely depending on what we want to know and how we want to use that data.
AMY: And in business, you see all sorts of approaches—surveys, logs, sensors, web scraping—you name it. Each comes with its own strengths and challenges.
JONAS: Right. Let’s break those down one at a time. The first one many people are familiar with is surveys. Essentially, surveys are structured ways to ask people questions and record their answers.
AMY: That’s a classic. I remember working with a retail chain that did customer satisfaction surveys after every purchase. They’d ask about the shopping experience, product availability, even staff friendliness. The trick was keeping the surveys short enough so customers actually completed them.
JONAS: Surveys have a long history in research and statistics. They’re designed carefully to reduce bias and ensure reliability. For example, questions must be clear and neutral to avoid steering answers in one direction.
AMY: And then there’s the practical side, like how you deliver the survey. Online forms, phone calls, in-person interviews—each affects response rates and data quality. In the retail story I mentioned, shifting from paper surveys to quick text-based ones boosted responses significantly.
JONAS: Another key point about surveys is that the data you get is mostly subjective. It reflects opinions, feelings, or intentions—things that aren’t directly measurable otherwise.
AMY: That’s why combining survey data with other types, like transaction logs, can give a fuller picture. For example, in finance, a bank might survey a customer about satisfaction but then analyze transaction logs to see actual spending behavior.
JONAS: Good segue! Logs are another essential data collection method. They are records automatically generated by systems—things like website clicks, app usage, or server activity.
AMY: Logs are goldmines for businesses. When a streaming service sees how long you watch a show, what you skip, or what day you binge-watch, that data helps tailor recommendations and plan content.
JONAS: The beauty of logs is that they’re passive; they don’t rely on anyone’s willingness to respond. They capture what’s happening in real-time, often at large scale.
AMY: But there are challenges too. Logs can be messy or incomplete. Think about a factory floor: if a sensor malfunctions and stops logging data for a while, you get gaps.
JONAS: That’s true. Missing or noisy data is a common issue. And we have to be careful with privacy too—log data can reveal sensitive user behavior.
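The kind of logging gap Amy describes is easy to detect programmatically. Here's a minimal sketch (the one-minute interval and the timestamps are illustrative assumptions, not from the episode):

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected_interval):
    """Return (start, end) pairs where consecutive log entries
    are spaced further apart than the expected logging interval."""
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > expected_interval:
            gaps.append((prev, curr))
    return gaps

# Example: a sensor expected to log once per minute
logs = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 1),
    datetime(2024, 1, 1, 12, 7),   # six minutes of silence: a gap
    datetime(2024, 1, 1, 12, 8),
]
print(find_gaps(logs, timedelta(minutes=1)))
```

Flagging gaps like this before analysis lets you decide deliberately whether to interpolate, discard, or investigate the missing stretch.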
AMY: Privacy has been a big topic in consulting lately. We advise companies to anonymize logs and get proper consent, especially with regulations like GDPR in place.
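One common first step toward the anonymization Amy mentions is pseudonymization: replacing direct identifiers with salted hashes so records stay linkable for analysis without exposing who they belong to. A minimal sketch (the field names and salt are hypothetical; note that GDPR treats pseudonymized data as still personal, so this is a mitigation, not full anonymization):

```python
import hashlib

def pseudonymize(record, salt, fields=("user_id",)):
    """Replace direct identifiers with salted SHA-256 tokens so the
    same user maps to the same token, but the identity is hidden."""
    out = dict(record)
    for field in fields:
        if field in out:
            digest = hashlib.sha256((salt + str(out[field])).encode()).hexdigest()
            out[field] = digest[:16]  # shortened token for readability
    return out

event = {"user_id": "alice@example.com", "page": "/checkout", "ms": 512}
print(pseudonymize(event, salt="episode6-demo"))
```

Because the salted hash is deterministic, sessions from the same user can still be grouped; keeping the salt secret and rotating it periodically further limits re-identification risk.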
JONAS: Now, moving on—web scraping is an interesting method you don’t hear about as much outside technical circles.
AMY: But it’s huge in practice! Web scraping means automatically extracting data from websites. For example, a travel company might scrape airline prices across many sites to compare deals in real-time.
JONAS: Technically, web scraping involves writing code or using tools that pull information like text, prices, or images from web pages without manually copying and pasting.
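To make that concrete, here is a toy scraper using only Python's standard-library HTML parser. The page snippet and the `class="price"` markup are invented for illustration; a real scraper would fetch live pages over HTTP and must respect the site's terms of service and robots.txt, as the hosts note next:

```python
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Collect the text content of elements tagged class="price"."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

# In practice this HTML would come from an HTTP request; a static snippet here.
page = '<ul><li class="price">$199</li><li class="price">$249</li></ul>'
scraper = PriceScraper()
scraper.feed(page)
print(scraper.prices)
```

The core idea is the same at any scale: locate the elements that carry the data you need, extract their text, and store it in a structured form instead of copying by hand.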
AMY: There are legal and ethical considerations too—some websites forbid scraping in their terms of service, so businesses have to be cautious.
JONAS: Indeed. From a theoretical perspective, web scraping can massively increase the scope of data available, especially for market intelligence or competitive analysis.
AMY: I worked with an automotive client who used scraping to monitor competitor vehicle specs and pricing across different markets. It helped them adjust their strategies quickly.
JONAS: Sensors are another fundamental way to collect data, especially in the physical world. Sensors convert physical phenomena—like temperature, motion, or light—into data we can analyze.
AMY: This is where the Internet of Things, or IoT, really shines. Think about smart cars with sensors tracking engine performance, tire pressure, or driver behavior. That data allows predictive maintenance and safety improvements.
JONAS: Exactly, Amy. Sensors generate continuous streams of data, often called time series data, that can reveal patterns over time.
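A simple way to surface the patterns Jonas mentions in a time series is a moving average, which smooths momentary noise so slower drifts stand out. A minimal sketch with made-up engine-temperature readings:

```python
def rolling_mean(readings, window=3):
    """Smooth a stream of sensor readings with a simple moving
    average over a fixed-size window."""
    means = []
    for i in range(len(readings) - window + 1):
        chunk = readings[i:i + window]
        means.append(round(sum(chunk) / window, 2))
    return means

# Hypothetical engine-temperature samples (degrees C), one per minute
temps = [90, 91, 90, 92, 95, 99, 104]
print(rolling_mean(temps))
```

In the smoothed output the steady upward drift is obvious, whereas in the raw samples it could be mistaken for jitter; that kind of trend is exactly what predictive-maintenance systems watch for.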
AMY: Healthcare is another big area for sensors. Wearable devices track heart rate, sleep, and activity, feeding that data into AI models that can detect early signs of health issues.
JONAS: It’s worth noting the difference in data types here: surveys are often categorical or ordinal—answers like yes/no or ratings from 1 to 5. Logs and sensor data tend to be more granular and numerical.
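Before a model can use ordinal survey answers, they typically need to be mapped to numbers that preserve their order. A minimal sketch assuming a standard 5-point Likert scale (the labels and inputs are illustrative):

```python
# Ordinal scale: the integers preserve the ranking of the labels
LIKERT = {"strongly disagree": 1, "disagree": 2, "neutral": 3,
          "agree": 4, "strongly agree": 5}

def encode_responses(answers):
    """Map ordinal survey labels to integers; unrecognized
    answers become None rather than a silent guess."""
    return [LIKERT.get(a.strip().lower()) for a in answers]

print(encode_responses(["Agree", "Neutral", "strongly agree", "maybe?"]))
```

Numeric log and sensor values need no such step, which is one practical reason the two kinds of data call for different handling downstream.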
AMY: Those differences matter when it’s time to analyze. For example, analyzing logs from an e-commerce site to improve user experience calls for different techniques than interpreting survey results about customer feelings.
JONAS: One last point on data collection methods is the importance of quality. Garbage in, garbage out is a classic saying in data science.
AMY: Absolutely. In one project, a logistics company collected GPS data from trucks but found some sensors gave wildly inaccurate positions. Fixing or filtering that data was crucial before feeding it to their routing AI.
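The filtering Amy describes can start very simply: drop any GPS fix that implies an impossibly large jump from the last accepted position. A crude plausibility filter, sketched with arbitrary units and invented coordinates (a production version would use real geodesic distance and the time between fixes):

```python
def filter_jumps(points, max_step=1.0):
    """Drop GPS fixes that imply an implausibly large jump
    from the last accepted point. Points are (x, y) pairs."""
    if not points:
        return []
    kept = [points[0]]
    for x, y in points[1:]:
        px, py = kept[-1]
        if ((x - px) ** 2 + (y - py) ** 2) ** 0.5 <= max_step:
            kept.append((x, y))
    return kept

track = [(0.0, 0.0), (0.3, 0.4), (50.0, 50.0), (0.6, 0.8)]  # third fix is garbage
print(filter_jumps(track))
```

Even a filter this crude prevents a single bad fix from teleporting a truck across the map and corrupting the routing model's training data.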
JONAS: So, to summarize the main methods we've discussed: surveys gather subjective human input; logs record system events automatically; web scraping extracts publicly available web data; and sensors capture real-world physical measurements.
AMY: And each method plays a unique role, depending on the business context and the problem they're trying to solve. Often, the best approach is a blend—combining surveys with logs or sensor data for a more complete story.
JONAS: That combination is indeed powerful. Hybrid datasets can improve AI models by providing multiple perspectives.
AMY: In the retail world, combining foot traffic logs with customer satisfaction surveys gave better insights into store performance than either alone.
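Mechanically, blending two sources like that is often just a join on a shared key. A minimal sketch with invented store names and numbers, keeping only stores present in both sources:

```python
def blend_by_store(traffic_logs, survey_scores):
    """Join automated foot-traffic counts with survey satisfaction
    ratings per store, keeping stores present in both sources."""
    blended = {}
    for store, visits in traffic_logs.items():
        if store in survey_scores:
            blended[store] = {"visits": visits,
                              "satisfaction": survey_scores[store]}
    return blended

traffic = {"downtown": 1200, "airport": 450, "mall": 800}
surveys = {"downtown": 4.2, "mall": 3.1}  # no survey ran at the airport
print(blend_by_store(traffic, surveys))
```

The inner join here is a deliberate choice: a store with traffic but no survey (or vice versa) would otherwise enter the model with a misleading blank, which is one of the quality pitfalls the hosts warn about.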
JONAS: As we think about data collection, we must also keep ethics and privacy at the forefront. Collecting data responsibly isn’t just good practice—it’s a necessity, especially with growing regulations.
AMY: Right. In my consulting work, one of the first questions we ask is: Are we respecting user consent? Can users opt out? Transparency builds trust, which is essential for long-term success.
JONAS: On that note, it’s also important to consider the context in which data is collected. Cultural differences, survey wording, and data source reliability all affect results.
AMY: So many companies underestimate how much groundwork is needed before the AI even sees the data. Data collection is not just a technical step—it’s strategic.
JONAS: Couldn’t agree more. Now, as we wrap up, let’s give our listeners a key takeaway.
AMY: I'll start: Data collection is the foundation of any AI project, and choosing the right methods—surveys, logs, sensors, or web scraping—can make or break your success. Make sure you think about quality, ethics, and business goals from the start.
JONAS: Well said, Amy. To add, understanding the strengths and limitations of each data collection method empowers you to design smarter AI solutions that truly solve problems—not just chase metrics.
AMY: Next episode, we’ll dive into Data Storage—the place where all this collected information lives and how it’s managed for AI to use.
JONAS: If you're enjoying this, please like or rate us five stars in your podcast app. We’d also love to hear your questions or comments—you might hear them featured in future episodes.
AMY: Until tomorrow — stay curious, stay data-driven.

Next up

Tomorrow’s episode explores how data is stored and managed—turning raw inputs into accessible, organized resources for AI.