DATA-DRIVEN, BUT FIRST WE MUST TACKLE THE ENTERPRISE DATA QUALITY CHALLENGE

SIZE OF THE PRIZE

Delighting customers, beating your competitors, and understanding where your revenue and costs come from all depend on one thing – making the right decisions. Competitive advantage stems from making a series of correct long-term and short-term decisions.

What to invest in, what to build, who to hire. In the words of Lafley and Martin – “Where to play and how to win” (a topic for a separate article).

As we evolve into a more connected world, there will be a proliferation of connected devices supported by 5G. At the end of 2019, there were approximately 27 billion devices connected to the internet (around 3.5 devices per person); this is forecast to hit 75 billion by 2025 (with a forecast population of 8.1 billion, that is over 9 devices per person, a 2.6x rise in devices per person). (Source: Statista)

It’s safe to say the digital world can be characterised by ubiquitous internet, IoT/connected devices, and AI, wrapped up in the paradigm of web 3.0 (or the intelligent web). (This definitely requires another article to deep dive into.)

To survive and compete effectively in this world, your organisation must be able to make the right decisions, at pace.  

Davenport & Harris, in their 2007 book “Competing on Analytics”, describe the various stages of analytical maturity, moving from backwards-looking reporting (what happened) to predictive modelling (if trends continue, what is likely to happen) and prescriptive analytics (what should I be doing and how do I make it happen).

The McKinsey Global Institute predicted in 2018 that the potential value that AI could generate (or, in my view, displace) in the global economy annually is between 3.5 and 5.8 trillion USD. Machine learning, a branch of AI, relies on data to train algorithms that learn for themselves. Future analytical and decision models are going to be increasingly machine-driven rather than rules-based. We’ve all experienced voice-recognising assistants such as Siri, Alexa and … “hey Google”, and in general they are pretty good. However, this was a very challenging field of computing for a very long time: voice recognition software such as Dragon NaturallySpeaking has been around since 1997, but it was only the recent incorporation of machine learning that brought significant improvements.

Leading organisations in the digital era of the 21st century will be data-driven. This means their data is their competitive differentiator; that is, they have built some offering that induces customer engagement or interaction, and in return they know something about the marketable customer base that their competitors don’t. Take Netflix as an example: they know what each subscriber (and their best friends and cousins) watches, and how long they watch each show for. This in turn allows Netflix to produce television shows or movies with precise knowledge of how many people will be interested in that show. This is a drastic step up from traditional surveys or using meters in a small sample set to measure television audiences.

Source: Adapted from Davenport & Harris - Competing on Analytics

As we’ve said previously, competing is about making the right decisions; however, the quality of your decisions will depend on the quality of your facts. Therefore, organisations need to consider the quality of their data as a stable foundation on which to build analytical capabilities and workloads.

Organisations need to consider their data management maturity in conjunction with their analytical maturity. This means being able to grapple with the engineering challenge of organising data for use as well as being able to model the data to drive useful actions and outcomes.

TACKLING THE ENTERPRISE DQ CHALLENGE

Poor data quality is a symptom of gaps in systems or processes. We all take care to design systems to fit human behaviour, but sometimes people just like using free text fields and that becomes the adopted convention. DQ remediation is the medic or the ambulance at the bottom of the cliff that patches things up; crucial to this capability, however, is the ability to undertake root cause analysis and feed changes back up into overall IT system improvement and process assurance practices.

At Cognitivo, we take an Enterprise Architecture aligned approach to DQ planning. The Enterprise Data Quality problem is a colossal challenge, given data quality defects can arise from any process executed across numerous systems by a wide variety of people.

To prioritise Data Quality efforts, we look to the Enterprise Risk Matrix. This is a 2 x 2 matrix of the material risks facing your organisation. From there, we document the high-level process flows (from both a customer journey and a value chain perspective). The customer journey defines what the customer expects, and the value chain defines how your company organises to deliver (based on departmental functions - after all, a factory / specialisation is still the most efficient configuration to produce goods and services).
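As an illustration only (the process names and scoring scheme below are hypothetical, not taken from any engagement), prioritising DQ effort against a risk matrix can be as simple as scoring each process on likelihood and impact and working down the list:

```python
# Hypothetical illustration: prioritise DQ diagnosis by scoring processes
# against a simple likelihood x impact risk matrix (1 = low, 5 = high).
processes = [
    {"process": "Customer onboarding (KYC)", "likelihood": 4, "impact": 5},
    {"process": "Monthly management reporting", "likelihood": 3, "impact": 3},
    {"process": "Marketing campaign segmentation", "likelihood": 4, "impact": 2},
]

for p in processes:
    p["risk_score"] = p["likelihood"] * p["impact"]

# Address data quality in descending order of material data risk.
for p in sorted(processes, key=lambda x: x["risk_score"], reverse=True):
    print(f'{p["risk_score"]:>2}  {p["process"]}')
```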

From there we define a business conceptual data model that is abstract / semantic in nature and map the processes to the conceptual data entities and attributes. This is important because it provides a common process and data language which we can then correlate to each system (which will have different names for various data attributes). During this phase we map in as many data inputs as possible (e.g. regulatory reports, management reports) and decompose the metrics that go into those reports down to their constituent data attributes.
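A minimal sketch of what such a mapping might look like (the entity, attribute and system names are hypothetical): the conceptual attribute is the shared business language, and each system's physical column name is mapped back to it.

```python
# Hypothetical sketch: one conceptual attribute, its business definition,
# the physical column names it carries in each system, and the reports
# whose metrics decompose down to it.
conceptual_model = {
    "Customer.date_of_birth": {
        "definition": "Customer's date of birth as verified at onboarding",
        "physical_mappings": {
            "crm": "CONTACT.BIRTH_DT",
            "core_banking": "CUST_MASTER.DOB",
            "data_warehouse": "dim_customer.date_of_birth",
        },
        "used_by": ["AML/CTF compliance report", "Customer age segmentation"],
    },
}

# A DQ rule written once against the conceptual attribute can then be
# executed against every physical column it maps to.
for system, column in conceptual_model["Customer.date_of_birth"]["physical_mappings"].items():
    print(f"{system}: {column}")
```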

Next, we map the data flow for each process (and its associated data concepts) through the enterprise integration landscape (at a technical level, we may perform a detailed data lineage mapping, although this is not crucial for analytical DQ checks). Once this is done, we catalogue this information (data sets, involved processes, access rights, etc.) into an information asset register or a source catalogue.
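The catalogue itself need not be elaborate; a minimal, hypothetical register entry covering the fields mentioned above (data set, involved processes, access rights, integration hops) might look like this:

```python
# Hypothetical information asset register entry (all field values illustrative).
asset_register_entry = {
    "data_set": "customer_master",
    "source_system": "core_banking",
    "conceptual_entities": ["Customer"],
    "involved_processes": ["Customer onboarding (KYC)", "AML/CTF reporting"],
    "data_owner": "Head of Retail Operations",
    "access_rights": ["risk_analytics", "compliance_reporting"],
    "integration_hops": ["core_banking", "enterprise_service_bus", "data_warehouse"],
    "refresh_frequency": "daily batch",
}
```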

DATA QUALITY EXECUTION

[Figure: DQ Execution Lifecycle]

DQ Execution sits within an overall Data / IT planning & prioritisation process.

  1. DQ issues raised are prioritised in-line with the organisation’s risk profile.
  2. Issues are diagnosed, quantified, corrected and monitored.
  3. Through root cause analysis, change / enhancements are implemented.

The overall process should sit between your organisation’s business / quality assurance process and your system change / continuous improvement process, linking the two together.

The execution workflow will naturally and logically follow a lean DevOps approach of continuously executing the following steps (a sketch of steps 4 and 5 follows the list):

[Figure: DQ Process Workflow]
  1. Business stakeholders raise DQ issues
  2. Issues are validated within an analytical environment
  3. Correct values are obtained (with sign-off from data owners)
  4. Fixes are deployed into production
  5. Ongoing control checks are deployed to monitor the issues
  6. Checks are incorporated into relevant QA KPIs
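As a minimal sketch of steps 4 and 5 (the check, field names and threshold are hypothetical), the same logic that validated an issue in the analytical environment is promoted into an ongoing control that runs against each production batch and feeds the QA KPIs in step 6:

```python
from datetime import date

# Hypothetical control check promoted from the analytical environment:
# flag records with a missing date of birth in each production batch.
def null_dob_check(records):
    """Return the records that fail the check."""
    return [r for r in records if not r.get("date_of_birth")]

batch = [
    {"customer_id": 1, "date_of_birth": "1980-02-14"},
    {"customer_id": 2, "date_of_birth": None},
]

failures = null_dob_check(batch)
failure_rate = len(failures) / len(batch)
print(f"{date.today()}: null-DOB failure rate {failure_rate:.0%} "
      f"({'breach' if failure_rate > 0.01 else 'within tolerance'})")
```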

Analytics & AI-driven DQ methods

[Figure: DQ Analytical Environment]

In today’s Big Data era, data has increased velocity, variety and volume. Data Quality needs to keep up: it needs to be more machine-driven and leverage modern data engineering platforms that can support “active data risk management” at the pace required.

This means ingesting relevant data, including the creation of new data extracts specifically for data quality testing, into an analytical platform that can support batch and, if required, real-time analytical workloads.

That data platform must also include a rules engine for the execution of semantically driven DQ rules. Compared to our prior experience with enterprise on-premise deployments, current execution times and costs have been reduced drastically through our use of cloud data platforms and open-source analytical tools. An expensive enterprise DQ tool will not help you build relevant data quality business rules aligned to your business processes - there’s no free lunch here.
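As a hedged illustration (this is not the interface of any particular product; the rule structure and names are hypothetical), a semantically driven rule is declared once against a conceptual attribute and the engine resolves it to each system's physical column via the mappings described earlier:

```python
import re

# Hypothetical sketch of a semantically driven DQ rules engine.
rules = [
    {
        "rule_id": "DQ-001",
        "conceptual_attribute": "Customer.email",
        "test": "validity_regex",
        "parameters": {"pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
    },
]

# Conceptual attribute -> physical column per system (from the conceptual model).
physical_mapping = {"Customer.email": {"crm": "CONTACT.EMAIL_ADDR"}}

def run_validity_regex(values, pattern):
    """Return (failed, total) counts for a regex validity test."""
    compiled = re.compile(pattern)
    failures = [v for v in values if v is None or not compiled.match(v)]
    return len(failures), len(values)

# Toy data standing in for an extract from the CRM system.
sample_data = {"crm": {"CONTACT.EMAIL_ADDR": ["a@b.com", "not-an-email", None]}}

for rule in rules:
    for system, column in physical_mapping[rule["conceptual_attribute"]].items():
        failed, total = run_validity_regex(
            sample_data[system][column], rule["parameters"]["pattern"]
        )
        print(f'{rule["rule_id"]} on {system}.{column}: {failed}/{total} records failed')
```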

Data stewards similarly need to be able to create (via a simple graphical user interface) new data quality rules based on the established business conceptual model, but be supported by higher-skilled data analysts who build reusable analytical functions. All of this needs to be done in a collaborative DQ / Analytics workspace.
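The division of labour might look like this (a sketch with hypothetical names): a data analyst builds and registers a reusable function once, and stewards reference it by name from the rule-building interface without writing code.

```python
# Hypothetical sketch: analysts register reusable analytical functions that
# data stewards can attach to conceptual attributes from a GUI.
FUNCTION_REGISTRY = {}

def register(name):
    def decorator(fn):
        FUNCTION_REGISTRY[name] = fn
        return fn
    return decorator

@register("is_valid_australian_postcode")
def is_valid_australian_postcode(value):
    """Built once by an analyst; reused across any rule a steward creates."""
    return isinstance(value, str) and len(value) == 4 and value.isdigit()

# A steward-created rule simply names the function and the attribute it tests.
steward_rule = {"attribute": "Address.postcode", "function": "is_valid_australian_postcode"}

check = FUNCTION_REGISTRY[steward_rule["function"]]
print([check(v) for v in ["2000", "20O0", None]])   # [True, False, False]
```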

[Figure: DQP Screenshot]

Tests for Data Quality

[Figure: DQ Tests]

Within the analytical DQ platform, data stewards will need to deploy a number of tests (a minimal sketch of a few of these follows the list). These include:

  • Validity tests using regex / pattern matching
  • Record count anomaly checks
  • Independent check-sums to reconcile that data flows are complete (e.g. account balances in transactional systems vs compliance reporting systems such as AML/CTF)
  • Customer / entity matching across internal systems as well as 3rd party data for data enrichment
  • Correlating data between structured and unstructured data sources using OCR, text analytics and computer vision (we have experience performing asset data enrichment using 360 degree photos for local governments)
  • Reasonable value checks (record anomaly / outlier, value drift over time)
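A minimal sketch of three of the tests above, with entirely hypothetical data and thresholds: a record count anomaly check against the recent daily average, an independent check-sum reconciling account balances between two systems, and a simple reasonable-value (outlier) check against a historical baseline.

```python
import statistics

# --- Record count anomaly check (hypothetical +/-30% tolerance) ---
daily_counts = [10_120, 10_340, 9_980, 10_210]   # recent daily loads
todays_count = 6_900
baseline = statistics.mean(daily_counts)
count_anomaly = abs(todays_count - baseline) / baseline > 0.30

# --- Independent check-sum: reconcile balances between two systems ---
core_banking_balances = {"A1": 1_500.00, "A2": 320.50, "A3": 99.99}
aml_report_balances = {"A1": 1_500.00, "A2": 320.50}   # account A3 missing
checksum_matches = (
    round(sum(core_banking_balances.values()), 2)
    == round(sum(aml_report_balances.values()), 2)
)

# --- Reasonable value check: flag values far outside the historical range ---
historical_amounts = [42.0, 55.5, 61.2, 48.9, 50.1, 47.3, 53.8]
mean = statistics.mean(historical_amounts)
stdev = statistics.stdev(historical_amounts)
new_amounts = [51.0, 9_999.0]
outliers = [x for x in new_amounts if abs(x - mean) > 3 * stdev]

print(f"count anomaly: {count_anomaly}, checksum matches: {checksum_matches}, outliers: {outliers}")
```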

SUMMARY

Cognitivo’s approach to data quality is to embed DQ within a broader organisational risk (and data risk) management approach. This allows DQ to support existing quality assurance processes, rather than standing up new processes and resources.

The principles of Cognitivo’s DQ approach are:

  • Risk & Policy Based – Identify key processes that possess material data risk as prioritised areas to perform DQ diagnosis and treatment
  • Process (use-case) Centric – Identify data flows that underpin key processes and address data quality across the entire system data flow
  • Metadata Driven – Development or use of a conceptual data model as an abstraction layer to work with business stakeholders to agree definitions and business rules, which are subsequently mapped to physical data models
  • Analytics & ML Enabled – use of data science techniques (such as ML, text analytics, vision) to build industry and organisation specific data matching and data quality diagnosis techniques
  • Embedded in Business-As-Usual – Roll out DQ controls and measurement (dashboards) as part of the organisation’s quality assurance processes, rather than constructing a new data KPI and consequence management framework
