Data Management¶
Data is a fundamental element of every business and crucial for any AI/ML project. Data is our record of the current state of the business and the history of what has happened, and it is the basis for predicting what may happen in the future.
However, data does nothing on its own. To realise value from data we have to understand it and act on it. One of the biggest and most complex challenges is managing data. Data is inert; it does not organize itself or even explain itself.
So how do we manage data? How do we organize it and attach meaning so that it can easily be used by the business or by a computer program?
The DIKW (Data, Information, Knowledge, Wisdom) pyramid provides a simple visualisation of the value chain that grows from data to wisdom, or integrated knowledge:
- Data is the base with the least amount of perceived usefulness.
- Information has higher value than data.
- Knowledge has higher value than information.
- Wisdom has the highest perceived value of all.
To move up the value chain, data requires something else, such as a program, a machine, or even a person, to add understanding to it so that it becomes Information.
Organizing and classifying information extends the value chain from data and information to knowledge.
At the top of the data value chain is Wisdom. Wisdom comes from combining inert data, the fundamental raw material of the modern digital age, with a series of progressive traits such as:
- perspective.
- context.
- understanding.
- learning.
- the ability to reason.
Data progression¶
Any AI/ML project includes the following phases to create this valuable knowledge:
- Collect.
- Organize.
- Analyze.
- Infuse.
The AI solution progresses through these levels up to Infuse, a state of capability that means an enterprise has taken artificial intelligence beyond a science project. Infusion means that advanced analytical models have been interwoven into the essential fabric of an application or system, thereby driving new or improved business capabilities.
Collect – Making Data Simple and Accessible¶
The first step is Collect, a primitive action that serves as the first element in making data actionable and in helping to drive automation, insights, optimization, and decision-making. Collect is the ability to attach to a data source, whether transient or persistent, real or virtual, while remaining agnostic to its actual location and its originating (underlying) technology. In terms of the DIKW pyramid, data lies below the first rung, which recognizes the inert nature of data.
Properties of data include:
- Structured, semi-structured, unstructured
- Proprietary or open
- In the cloud or on-premises
- Any combination of the above
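As a minimal sketch of what "agnostic to location and technology" can look like in code, the example below hides two differently formatted sources behind one interface. The names `DataSource`, `CsvTextSource`, `JsonLinesSource`, and `collect` are hypothetical and chosen for illustration only; in-memory strings stand in for real files, databases, or cloud stores.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Protocol


class DataSource(Protocol):
    """Hypothetical interface: anything that can yield records as dicts."""

    def records(self) -> Iterator[Dict]: ...


@dataclass
class CsvTextSource:
    """Structured data held as CSV text (stand-in for a file or table)."""

    text: str

    def records(self) -> Iterator[Dict]:
        # csv.DictReader uses the first row as the field names.
        yield from csv.DictReader(io.StringIO(self.text))


@dataclass
class JsonLinesSource:
    """Semi-structured data, one JSON object per line."""

    text: str

    def records(self) -> Iterator[Dict]:
        for line in self.text.splitlines():
            if line.strip():  # skip blank lines
                yield json.loads(line)


def collect(sources: Iterable[DataSource]) -> List[Dict]:
    """Attach to every source and gather its records, whatever the format."""
    return [record for source in sources for record in source.records()]
```

The caller of `collect` never needs to know where a source lives or which format it uses. Note that the CSV source yields every value as a string, so typing and cleaning are left to a later step.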
Organize – Trusted, Governed Analytics¶
The second step is Organize, which is about how an enterprise can make data known, discoverable, usable, and reusable. The ability to organize is a prerequisite to becoming data-centric. Additionally, data of inferior quality, or data that can be misleading to a machine or end user, can be governed so that any use of it is adequately controlled. Ideally, the outcome of Organize is a body of data that is appropriately curated and offers the highest value to an enterprise.
Organize allows data to be:
- Discoverable.
- Cataloged.
- Profiled.
- Categorized.
- Classified.
- Secured (e.g. through policy-based enforcement)
- A source of truth and utility
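The properties listed above can be illustrated with a small sketch that profiles a set of records, stores the result in a catalog entry together with a classification, and applies a simple policy check. All names here (`CatalogEntry`, `profile_records`, `can_read`, and the classification labels) are assumptions made for the example, not part of any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Set


@dataclass
class CatalogEntry:
    """Hypothetical catalog record that makes a dataset discoverable."""

    name: str
    classification: str  # illustrative labels, e.g. "public" or "restricted"
    profile: Dict[str, Dict[str, Any]] = field(default_factory=dict)


def profile_records(records: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Count values, nulls, and observed Python types for each field."""
    columns: Dict[str, Dict[str, Any]] = {}
    for record in records:
        for key, value in record.items():
            stats = columns.setdefault(
                key, {"count": 0, "nulls": 0, "types": set()}
            )
            stats["count"] += 1
            if value is None:
                stats["nulls"] += 1
            else:
                stats["types"].add(type(value).__name__)
    return columns


def can_read(entry: CatalogEntry, clearances: Dict[str, Set[str]], role: str) -> bool:
    """Policy-based enforcement: a role may read only cleared classifications."""
    return entry.classification in clearances.get(role, set())
```

A usage example, with made-up records and roles:

```python
records = [{"id": 1, "email": "a@x.org"}, {"id": 2, "email": None}]
entry = CatalogEntry("customers", "restricted", profile_records(records))
clearances = {"analyst": {"public"}, "steward": {"public", "restricted"}}
print(can_read(entry, clearances, "steward"))  # True
print(can_read(entry, clearances, "analyst"))  # False
```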
Analyze – Insights On-Demand¶
The Analyze step is about how an organization approaches becoming a data-driven enterprise. Analytics can be human-centered or machine-centered. In this regard, the initials AI can be read as Augmented Intelligence in a human-centered context and as Artificial Intelligence in a machine-centered context. Analyze covers a span of techniques and capabilities, from basic reporting and business intelligence to deep learning.
Through data, Analyze allows an organization to:
- Determine what has happened
- Determine what is happening
- Determine what might happen
- Compare against expectations
- Automate and optimize decisions
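The capabilities above can be sketched at the simplest end of that span. The example below summarizes a numeric history (what has happened), extrapolates one step with an ordinary least-squares trend line (what might happen), and flags a value that strays from expectation. The function names and the relative tolerance are assumptions for illustration; a real project would likely use far richer models.

```python
from statistics import mean
from typing import Dict, Sequence


def describe(history: Sequence[float]) -> Dict[str, float]:
    """What has happened: summarize the observed values."""
    return {"latest": history[-1], "mean": mean(history)}


def forecast_next(history: Sequence[float]) -> float:
    """What might happen: extend a least-squares trend line by one step.

    Requires at least two observations, taken at equal intervals.
    """
    n = len(history)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(history)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


def deviates(actual: float, expected: float, tolerance: float) -> bool:
    """Compare against expectations using a relative tolerance."""
    return abs(actual - expected) > tolerance * abs(expected)
```

For instance, with the made-up history `[10, 12, 14, 16]`, `forecast_next` extends the straight-line trend to the next period, and `deviates(actual, forecast, 0.1)` reports whether the value that then arrives differs from that forecast by more than 10 percent.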
Infuse – Operationalize AI with Trust and Transparency¶
The last step, Infuse, is about how an enterprise can use AI as a real-world capability. Operationalizing AI means that models can be adequately managed, so that an inadequately performing model can be rapidly identified and replaced with another model or by some other means. Transparency implies that advanced analytics and AI are not a dark art and that all outcomes can be explained. Trust implies that fairness in all its forms carries through every use of a model.
Infuse allows data to be:
- Used for automation and optimization
- Part of a causal loop of action and feedback
- Exercised in a deployed model
- Used for developing insights and decision-making
- Beneficial to the data-driven organization
- Applied by the data-centric enterprise
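The idea of identifying and replacing an inadequately performing model can be sketched as a small feedback check. The example below scores the active model on recent observations and switches to a fallback when the error exceeds a threshold. The names `mean_abs_error` and `select_model`, the use of mean absolute error, and the threshold are all assumptions for illustration; production systems would typically add logging, fairness checks, and explanations.

```python
from typing import Callable, Sequence

# For this sketch, a model is any function from one number to a prediction.
Model = Callable[[float], float]


def mean_abs_error(
    model: Model, inputs: Sequence[float], targets: Sequence[float]
) -> float:
    """Average absolute gap between predictions and observed outcomes."""
    return sum(abs(model(x) - y) for x, y in zip(inputs, targets)) / len(inputs)


def select_model(
    active: Model,
    fallback: Model,
    inputs: Sequence[float],
    targets: Sequence[float],
    max_error: float,
) -> Model:
    """Keep the active model while it performs adequately, else replace it."""
    if mean_abs_error(active, inputs, targets) <= max_error:
        return active
    return fallback
```

Running `select_model` each time new outcomes arrive closes the loop of action and feedback: deployed predictions are compared with what actually happened, and the result decides which model serves the next request.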