Collecting Reliable Data and Evidence
It’s the moment when the team sits down after a failure—equipment down, customer complaint logged, process delayed. The urgency is real. But the real work begins not with solving the problem, but with answering one question: What do we *know*, and how do we know it?
Too many teams rush into cause identification without locking down the facts. They rely on memory, assumptions, or vague recollections. That’s where the first gap in RCA becomes evident. You can’t analyze what you haven’t measured, and you can’t validate a cause without reliable data.
I’ve led dozens of RCA sessions across manufacturing, IT, and healthcare. One truth sticks: the most accurate root cause findings emerge from evidence—documented, timestamped, and verifiable. This chapter is about mastering the art and science of collecting reliable data and evidence before a single line is drawn on a Fishbone.
Here, you’ll learn how to gather the right data efficiently—observations, logs, metrics, and interviews—while avoiding common traps that lead to flawed conclusions. I’ll share field-tested practices for data validation in RCA and RCA documentation best practices that stand up under scrutiny.
Why Data Quality Defines RCA Success
Root cause is not a guess. It’s a conclusion drawn from evidence. If your data is weak, the entire investigation collapses.
The most dangerous assumption in RCA is that “everyone remembers what happened.” Memory is unreliable. Emotions color recall. People forget details, or misattribute sequences.
My rule: if it’s not documented, it doesn’t exist for analysis. That includes incident reports, system logs, equipment checklists, and even visual observations made during the event.
Where Most Teams Fail: The Evidence Gap
Here’s a common scenario: a production line stops. The shift leader says, “The machine just locked up.” The maintenance team arrives and fixes it. The operator says, “It happened right after the new batch came in.” That’s not evidence—those are anecdotes.
Real data would include:
- Timestamps from the SCADA system when the alarm triggered
- Machine temperature readings from the last 30 minutes
- Batch ID, weight, and material composition
- Shift change logs and who was on duty
- Photos of the machine post-event
Without these, you’re diagnosing from memory, not from reality. That’s why collecting evidence for root cause analysis must be systematic, not reactive.
Four Sources of Reliable Evidence
Effective RCA doesn’t depend on a single source. It requires triangulation—cross-verifying data from multiple channels. Prioritize these four:
1. Process Metrics and System Logs
Automated systems generate data that’s often the most objective. Think production throughput, error rates, temperature cycles, or network latency spikes.
Example: A software deployment fails. Instead of asking “Why?” immediately, pull the CI/CD pipeline logs. Look for:
- Build timestamps and duration anomalies
- Test failure patterns
- Deployment rollback triggers
- Log entries with ERROR or WARNING level
These provide timestamps, sequence, and causality markers—exactly what you need.
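The log-mining step above can be sketched in a few lines. This is a minimal illustration, not a real CI/CD integration: the log line format and the sample entries are assumed for the example.

```python
import re
from datetime import datetime

# Assumed log line format: "2024-04-05T14:30:12 LEVEL message"
LINE = re.compile(r"^(\S+)\s+(ERROR|WARNING)\s+(.*)$")

def extract_events(log_lines):
    """Pull timestamped ERROR/WARNING entries so the failure sequence is explicit."""
    events = []
    for line in log_lines:
        m = LINE.match(line)
        if m:
            ts, level, message = m.groups()
            events.append((datetime.fromisoformat(ts), level, message))
    return sorted(events)  # chronological order exposes sequence and causality markers

# Hypothetical pipeline log extract
log = [
    "2024-04-05T14:29:58 INFO build: started",
    "2024-04-05T14:30:12 WARNING test: flaky timeout in suite A",
    "2024-04-05T14:30:41 ERROR deploy: rollback triggered",
]
for ts, level, msg in extract_events(log):
    print(ts.isoformat(), level, msg)
```

The point is not the code but the discipline: filter to leveled, timestamped entries first, then read them in order before asking “why.”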
2. Direct Observation
Go to the scene. See what the machine looks like. Note unusual wear, debris, or positioning. Take photos or videos with timestamps.
Observation is powerful because it bypasses language and interpretation. A cracked seal, a misaligned part, a missing label—these are facts.
Do not rely on secondhand observation. If you didn’t see it, you can’t be sure it happened.
3. Workforce Interviews (Conducted Like a Journalist)
Interviews are not about blame. They’re about gathering firsthand perspectives. Approach them with curiosity, not accusation.
Ask open-ended, time-bound questions:
- “What were you doing when the system failed?”
- “What did you see, hear, or feel at that moment?”
- “What was the last thing you checked before the issue occurred?”
Record responses verbatim. Avoid leading or suggestive language. Don’t assume “operator error”—document the actions taken, then validate them.
4. Audit and Historical Records
Look beyond the incident itself. When was this machine last serviced? Were there prior warnings? Have similar issues occurred in the past 90 days?
Check maintenance logs, incident databases, and past RCA reports. A pattern is often the first sign of a systemic root cause.
Validating Data: The Critical Step Before Analysis
Collecting data isn’t enough. You must validate it.
Every piece of evidence must answer three questions:
- Who collected it?
- When and where was it gathered?
- How was it verified?
Here’s a simple validation checklist:
| Data Type | Source | Validation Method |
|---|---|---|
| System log | SCADA server | Match timestamp to PLC event; cross-check with shift log |
| Photo | On-site technician | Include time, location, and identifier; verify with event timeline |
| Interview statement | Operator | Re-state in own words; confirm with colleague who was present |
Never settle for “good enough” evidence. If you can’t verify it, set it aside; treat unverified data as noise.
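The three validation questions can be encoded as a minimal record check. The field names below are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class EvidenceItem:
    """One collected data point for an RCA; fields mirror the three questions."""
    description: str
    collected_by: str = ""   # Who collected it?
    collected_at: str = ""   # When and where was it gathered?
    verified_by: str = ""    # How was it verified?

    def is_valid(self) -> bool:
        # An item is usable only if all three questions have answers.
        return all([self.collected_by, self.collected_at, self.verified_by])

photo = EvidenceItem(
    description="Alarm panel photo",
    collected_by="On-site technician",
    collected_at="2024-04-05 14:32, Line 3",
)
print(photo.is_valid())  # prints False: no verifier yet, so set it aside
```

An item that fails the check is not discarded; it is parked until someone can answer the missing question.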
RCA Documentation Best Practices
Documentation isn’t bureaucracy. It’s evidence preservation. It ensures transparency, accountability, and repeatability.
Follow these best practices:
- Log all sources: Name, date, role, and contact info for each data point.
- Attach evidence: Include screenshots, photos, or log extracts in the report.
- Use traceable references: Number every data point (e.g., “Ref: Log-2024-04-05-14:30”) so it can be traced back.
- Clarify assumptions: If a data point is inferred, label it clearly as such.
- Separate facts from interpretation: Use a two-column format: “What was observed” vs. “What it might mean”.
When you hand over the RCA report, you must be able to defend every claim. If you can’t trace it back, it’s not valid.
One mistake I’ve seen: teams use vague phrases like “the system was slow” or “something went wrong.” That’s not documentation—it’s speculation.
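Two of the practices above, traceable references and separating facts from interpretation, can be sketched together. The entry contents are invented for illustration:

```python
from datetime import datetime

def make_ref(data_type: str, when: datetime) -> str:
    """Build a traceable reference ID in the 'Ref: Log-2024-04-05-14:30' style."""
    return f"Ref: {data_type}-{when:%Y-%m-%d-%H:%M}"

# Two-column record: what was observed vs. what it might mean.
entry = {
    "ref": make_ref("Log", datetime(2024, 4, 5, 14, 30)),
    "observed": "Temperature reached 180 C at 14:28",          # fact
    "might_mean": "Possible cooling-loop fault (inferred, unconfirmed)",  # labeled inference
}
print(entry["ref"])  # prints: Ref: Log-2024-04-05-14:30
```

Keeping the inference in its own field, explicitly labeled, is what lets you defend every claim when the report is handed over.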
Common Pitfalls in Data Collection
Even experienced teams stumble. Be aware of these traps:
- Cherry-picking data: Only using evidence that supports a preferred cause. This invalidates objectivity.
- Relying on authority: “The manager said it was a software glitch.” But if logs don’t show it, the statement is not evidence.
- Overloading with irrelevant data: Too many metrics can hide the real signal. Focus on what’s relevant to the effect.
- Missing timestamps: Without time context, sequences become meaningless.
Ask: “Could this data be misinterpreted? Could it be false? Is it independent of other sources?” If yes, re-verify.
Final Checklist: Is Your Data Ready?
Before moving to Fishbone or 5 Whys, confirm:
- Every evidence item has a source, timestamp, and owner.
- Data is cross-referenced across multiple sources.
- Interviews are recorded verbatim and verified.
- Logs and metrics are from the actual system, not summaries.
- Unverified data is flagged and excluded from analysis.
If your data doesn’t pass this checklist, pause. Go back. Do not proceed with analysis.
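The checklist can act as a literal gate before analysis begins. A minimal sketch, with the check states invented for the example:

```python
# Each entry mirrors one item from the final checklist above.
checks = {
    "source, timestamp, and owner on every item": True,
    "data cross-referenced across multiple sources": True,
    "interviews recorded verbatim and verified": False,
    "logs and metrics from the actual system": True,
    "unverified data flagged and excluded": True,
}

def ready_for_analysis(checklist: dict) -> bool:
    """Gate the move to Fishbone/5 Whys: any failed check means pause and go back."""
    failed = [item for item, ok in checklist.items() if not ok]
    for item in failed:
        print("Re-verify:", item)
    return not failed

print(ready_for_analysis(checks))  # prints False: one item still open
```

Only when every item passes does the team move on to cause identification.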
Frequently Asked Questions
How do I collect evidence for root cause analysis when no logs exist?
Start with observation and interviews. Ask: “What happened step by step?” “What did the operator see or feel?” “Was anything changed recently?” Use a timeline diagram to reconstruct events. Even without logs, documented observations and verified statements form valid evidence.
What if the data contradicts the team’s initial belief?
That’s expected—and good. Data should challenge assumptions. If the logs show a sensor failure, but the team thinks it was operator error, do not dismiss the data. Investigate the discrepancy. It may reveal a deeper process flaw in how alarms are reported.
Is it acceptable to use screenshots from a control panel as evidence?
Yes—but only if the screenshot includes a timestamp, location, and context. A raw image without metadata is not traceable. Always annotate with: “Screenshot taken during incident on 2024-04-05 at 14:30, showing alarm panel.”
How do I ensure data validation in RCA when working under pressure?
Build validation into your workflow. Assign a data verifier—someone not involved in the incident—to review all collected data. Use a checklist. For time-sensitive events, prioritize verifiable sources: logs, timestamps, photos. Never skip verification just for speed.
Can I use email or chat logs as evidence in RCA?
Yes, if they’re accurate and complete. Email chains, Slack messages, or ticketing systems can show decision points, approvals, and communication gaps. But verify the sender, time, and context. Avoid quoting messages out of context.
What’s the difference between data and evidence in RCA?
Data is raw information: numbers, timestamps, logs. Evidence is data that has been analyzed and contextualized. A temperature reading of 180°C is data. Saying “the overheating caused the shutdown” is evidence—because it links data to a cause. Always collect data first, then build evidence through analysis.