Study PMI-CPMAI Required Data, Labels, and Evidence: key concepts, common traps, and exam decision cues.
Data requirements in AI projects must describe more than the raw fields a model might ingest. The project also needs to define the labels, contextual evidence, metadata, and business validation records that will later support training, testing, review, and operational monitoring. PMI-CPMAI usually favors the team that defines those requirements early enough to drive planning, rather than assuming the data team can “figure it out later.”
## Start With The Decision, Not With The Database
Teams often begin with the data they already own. That is understandable, but it is a weak starting point. The better starting point is the business decision or workflow outcome the AI system is meant to support. Once that is clear, the project can ask:
- what signals are needed to make or support that decision
- what outcome or label defines success or failure
- what contextual information is needed to interpret the record correctly
- what evidence will later prove the system is performing acceptably
This is why two AI projects inside the same organization can need very different data, even if they touch the same systems. A churn prediction use case, a fraud alert ranking use case, and a knowledge-assistant use case all pull value from different types of evidence.
## Inputs, Labels, And Evidence Are Different Things
Project teams often blur three categories that should remain separate:
- **inputs**, which are the observable fields or content the system uses
- **labels or target outcomes**, which define what the model is expected to learn or predict
- **business evidence**, which is used to validate whether the system is actually helping in practice
For example, an AI model may use claim details as inputs, historical claim decisions as labels, and later operational metrics such as turnaround time, appeal rate, or manual override rate as business evidence. If the team defines only the inputs, the project may still be unable to train or evaluate the system responsibly.
```mermaid
flowchart LR
    A["Business decision or workflow outcome"] --> B["Required inputs and context"]
    A --> C["Labels or target outcomes"]
    A --> D["Business validation evidence"]
    B --> E["Data planning and feasibility"]
    C --> E
    D --> E
```
The point is not to document every field immediately. The point is to define the categories of evidence that later planning must satisfy.
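To make the separation concrete, the sketch below keeps the three categories as distinct fields of one requirement record. All names here, the dataclasses and the claims example alike, are hypothetical illustrations and not a PMI-CPMAI artifact:

```python
from dataclasses import dataclass, field


@dataclass
class DataRequirement:
    """One required data item tied to a single business decision."""
    name: str
    source_system: str
    purpose: str  # why the project needs this item


@dataclass
class DecisionDataNeeds:
    """Data needs framed around the decision, not around a database."""
    decision: str
    inputs: list[DataRequirement] = field(default_factory=list)
    labels: list[DataRequirement] = field(default_factory=list)
    validation_evidence: list[DataRequirement] = field(default_factory=list)


# Hypothetical claims-triage example: the three categories stay distinct.
claims = DecisionDataNeeds(decision="Route incoming claims for manual review")
claims.inputs.append(
    DataRequirement("claim_details", "claims_db", "observable features for the model"))
claims.labels.append(
    DataRequirement("historical_claim_decision", "claims_db", "target the model learns"))
claims.validation_evidence.append(
    DataRequirement("manual_override_rate", "workflow_logs", "prove the system helps in practice"))
```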
## Data Requirements Should Be Specific Enough To Expose Feasibility Risk
A vague statement such as “we need customer data” or “we need historical cases” is not enough. Stronger data requirements typically clarify:
- the period of history required
- the level of granularity
- the need for complete or partial labels
- whether timeliness or freshness matters
- whether context from other systems is required
- whether humans will need evidence explaining why a record was treated a certain way
That specificity matters because feasibility often breaks on one of these details. A use case may look promising until the team discovers the available data is too stale, too aggregated, inconsistently labeled, or missing the contextual evidence needed for trustworthy evaluation.
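Requirements written at this level of detail can even be checked mechanically against what a source actually provides. The sketch below is only an illustration with hypothetical field names; the useful idea is that every requirement detail becomes a testable feasibility gap:

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SourceProfile:
    """What an available source actually provides (hypothetical fields)."""
    history_start: date
    granularity: str          # e.g. "event", "daily", "monthly"
    label_coverage: float     # fraction of records with usable labels
    max_staleness: timedelta  # how old the freshest record can be


def feasibility_gaps(profile: SourceProfile,
                     required_history: timedelta,
                     required_granularity: str,
                     min_label_coverage: float,
                     max_allowed_staleness: timedelta) -> list[str]:
    """Return the specific requirement details the source fails to meet."""
    gaps = []
    if date.today() - profile.history_start < required_history:
        gaps.append("history window too short")
    if profile.granularity != required_granularity:
        gaps.append(f"granularity is {profile.granularity}, need {required_granularity}")
    if profile.label_coverage < min_label_coverage:
        gaps.append(f"label coverage {profile.label_coverage:.0%} is below target")
    if profile.max_staleness > max_allowed_staleness:
        gaps.append("data too stale for the decision cadence")
    return gaps
```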
## Label Definition Is A Governance Question Too
Labels are not just a technical training detail. They represent institutional judgment. If a project trains on historical outcomes without checking what those outcomes actually mean, it may reproduce inconsistent business behavior or embed old bias into a new system.
That is why label definition usually requires domain experts, data owners, and governance participants. The team should ask:
- what the label really means
- who created it, and under what rules
- whether the historical outcome is trustworthy enough to learn from
- whether the label reflects desired future practice or only past behavior
If the answer is unclear, the project may need relabeling, supplemental review, or an alternative target.
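A lightweight label-provenance audit can surface these questions before training starts. The sketch below assumes each historical record carries hypothetical `label`, `labeled_by`, and `rule_version` fields; real systems will name these differently, if they exist at all:

```python
from collections import Counter


def audit_labels(records: list[dict]) -> dict:
    """Summarize label provenance before trusting historical outcomes."""
    rule_versions = Counter(r["rule_version"] for r in records)
    labeling_sources = Counter(r["labeled_by"] for r in records)
    unlabeled = sum(1 for r in records if r["label"] is None)
    return {
        # Many rule versions suggest inconsistent historical judgment.
        "rule_versions": dict(rule_versions),
        # Mixed human/system origins deserve a closer governance review.
        "labeling_sources": dict(labeling_sources),
        "unlabeled_fraction": unlabeled / max(len(records), 1),
    }
```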
## Business Validation Evidence Belongs In The Requirement Set
Some teams define data only for training and forget the evidence they will need later to validate usefulness. That is a mistake. The project should define what evidence will help answer questions such as:
- Is the AI improving the business outcome it was meant to improve?
- Are users overriding it frequently?
- Is performance stable across relevant cases or groups?
- Are downstream process outcomes actually improving?
This evidence may come from operational logs, review records, audit trails, or later human decisions. It still needs to be planned early because those records often require instrumentation, integration, and control design.
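If that instrumentation exists, the validation signals themselves are often simple to compute. The sketch below assumes hypothetical log fields (`recommended_at`, `actioned_at`, `overridden`) purely for illustration:

```python
from datetime import datetime


def validation_metrics(events: list[dict]) -> dict:
    """Compute simple business-validation signals from operational logs."""
    overrides = sum(1 for e in events if e["overridden"])
    turnaround_hours = [
        (datetime.fromisoformat(e["actioned_at"])
         - datetime.fromisoformat(e["recommended_at"])).total_seconds() / 3600
        for e in events
    ]
    return {
        # Frequent overrides suggest users do not trust or agree with the AI.
        "override_rate": overrides / max(len(events), 1),
        "mean_turnaround_hours": sum(turnaround_hours) / max(len(turnaround_hours), 1),
    }
```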
## Good Requirement Definitions Help Later Phases
Clear data requirements make later phases easier because they guide source inventory, access planning, quality assessment, and evaluation design. They also help leadership understand why a promising use case may still be data-constrained. A disciplined requirement set keeps the project honest: it prevents optimistic approval based only on the idea of AI rather than on the evidence needed to deliver it responsibly.
## Example
A hospital wants AI support for radiology report triage. The project first says it needs imaging records and final case outcomes. That is too broad. A stronger requirement set would define report text, timestamps, ordering context, confirmed downstream disposition, labeling rules for urgency, reviewer override records, and later evidence showing whether triage actually improved time-to-action without creating unsafe misses.
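Written down as a structured requirement set (every field name below is a hypothetical illustration), the stronger version might look like this:

```python
# Hypothetical sketch of the stronger radiology-triage requirement set.
radiology_triage_requirements = {
    "decision": "Prioritize radiology reports for timely review",
    "inputs": ["report_text", "report_timestamp", "ordering_context"],
    "labels": {
        "target": "urgency",
        "labeling_rules": "documented urgency criteria agreed with radiologists",
        "derived_from": "confirmed downstream disposition",
    },
    "validation_evidence": [
        "reviewer_override_records",
        "time_to_action_before_and_after",
        "missed_urgent_case_rate",  # guard against unsafe misses
    ],
}
```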
## Common Pitfalls
- Starting with available data instead of starting with the decision the AI must support.
- Defining inputs but not defining labels or business validation evidence.
- Treating labels as objective facts when they may reflect inconsistent historical judgment.
- Ignoring freshness, granularity, or contextual completeness until late in the project.
- Assuming evaluation evidence will appear automatically after deployment.
## Check Your Understanding
### What is the strongest starting point for defining AI data requirements?
- [x] The business decision or workflow outcome the AI must support
- [ ] The largest data source currently available in the organization
- [ ] The model architecture the technical team prefers
- [ ] The storage platform already approved for the project
> **Explanation:** Data requirements should be driven first by the decision, outcome, and evaluation need, not by convenience or tool choice.
### Which item best represents business validation evidence rather than a model input?
- [ ] The raw transaction fields used to generate a prediction
- [x] The later operational record showing whether the AI-supported action improved the outcome
- [ ] The text embedding produced during preprocessing
- [ ] The feature engineering script used by the data science team
> **Explanation:** Business validation evidence helps prove whether the system actually delivered the intended organizational value.
### Why should label definition involve more than the model team?
- [ ] Because labels are mainly a compute-planning question
- [ ] Because any historical outcome is automatically a valid label
- [x] Because labels reflect business meaning and may embed old policy or judgment problems
- [ ] Because labels only matter after deployment
> **Explanation:** Label meaning and quality are governance and domain questions, not just technical implementation details.
### Which response is usually weakest?
- [ ] Clarifying how freshness and granularity affect feasibility
- [ ] Defining both training needs and later validation evidence
- [ ] Checking whether the label reflects desired future practice
- [x] Assuming the project can approve the use case now and work out missing evidence later if needed
> **Explanation:** Deferring undefined evidence requirements weakens feasibility, governance, and evaluation planning.
## Sample Exam Question
**Scenario:** A retailer wants an AI model to prioritize customer complaints for escalation. During planning, the team identifies message text and complaint category fields, but no one has defined what successful prioritization would look like or what records would later confirm business value.

**Question:** What is the strongest next step in defining data needs?
A. Expand the data requirement set to include target labels, contextual evidence, and later business validation records before treating the use case as fully planned
B. Ask the data team to begin model prototyping because the current fields are enough to prove whether the idea works
C. Wait until deployment to decide what evidence should be tracked because operational data will be more realistic then
D. Focus only on data volume first because the meaning of labels can be corrected after training
**Best answer:** A

**Explanation:** A is best because AI data planning must include not only inputs, but also labels, context, and business validation evidence. Without those definitions, the project cannot judge feasibility, evaluate performance responsibly, or show whether the system actually improves the target workflow.
Why the other options are weaker:
- **B:** Prototyping without defined labels and evaluation evidence may produce misleading progress.
- **C:** Waiting until deployment is too late for responsible planning and instrumentation.
- **D:** Volume matters, but it does not compensate for unclear label meaning or missing outcome evidence.