Despite the great potential of AI and the large investments in AI technologies undertaken by industrial enterprises, AI has not yet delivered on the promises in industry practice. The core business of industrial enterprises is not yet AI-enhanced and AI-enabled. AI solutions instead constitute islands for isolated cases—such as the optimization of selected machines in the factory—with varying success.
According to current industry surveys, data issues constitute the main reasons for the insufficient adoption of AI in industrial enterprises.
In general, it's nothing new that data preparation and data quality are key for AI and data analytics, as there is no AI without data. This has been an issue since the early days of business intelligence (BI) and data warehousing. However, the manifold data challenges of AI in industrial enterprises go far beyond detecting and repairing dirty data.
The business of industrial enterprises comprises the engineering and manufacturing of physical goods—for instance, heating systems or electrical drives. For this purpose, industrial enterprises typically operate a manufacturing network of various factories organized into business units.
The IT landscape of industrial enterprises usually comprises different enterprise IT systems, ranging from enterprise resource planning (ERP) and product lifecycle management (PLM) systems to manufacturing execution systems (MES).
In Industry 4.0 and Internet of Things (IoT) applications, industrial enterprises push the digitalization of the industrial value chain. The aim is to integrate data across the value chain and exploit it for competitive advantage.
Hence, the AI enablement of processes and products is of strategic importance. To this end, industrial enterprises have, in recent years, built data lakes, introduced AI tools, and created data science teams.
In this article, I shed some light on some data challenges undermining enterprise-wide AI success and how we can address them in order to scale AI in the enterprise.
Why adopting AI outside of tech is so hard
Why isn’t AI widely used outside consumer internet companies? The top challenges facing AI adoption in other industries include:
1. Small datasets. In a consumer internet company with huge numbers of users, engineers have millions of data points that their AI can learn from. But in other industries, the dataset sizes are much smaller. For example, can you build an AI system that learns to detect a defective automotive component after seeing only 50 examples? Or to detect a rare disease after learning from just 100 diagnoses?
Techniques built for 50 million data points don’t work when you have only 50 data points.
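When examples are scarce, one common mitigation is to stretch the data you do have through augmentation. The sketch below is a hypothetical example: tiny nested lists stand in for real inspection images, and each one is expanded into four variants with simple geometric transforms, quadrupling a 50-example dataset.

```python
def augment_images(images):
    """Expand a tiny labeled dataset with simple geometric transforms.

    Each image (a list of pixel rows) yields four variants: the original,
    a horizontal flip, a vertical flip, and a 180-degree rotation.
    """
    augmented = []
    for img in images:
        h_flip = [row[::-1] for row in img]       # mirror left-right
        v_flip = img[::-1]                        # mirror top-bottom
        rot180 = [row[::-1] for row in img[::-1]] # rotate 180 degrees
        augmented.extend([img, h_flip, v_flip, rot180])
    return augmented

# 50 tiny 2x2 "inspection images" stand in for real defect photos
tiny_dataset = [[[i, i + 1], [i + 2, i + 3]] for i in range(50)]
expanded = augment_images(tiny_dataset)
print(len(expanded))  # 200
```

Augmentation does not replace more real data, but for defect detection with 50 examples it is often the cheapest first step before reaching for transfer learning or synthetic data.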
2. Cost of customization. Consumer internet companies employ dozens or hundreds of skilled engineers to build and maintain monolithic AI systems that create tremendous value — say, an online ad system that generates more than $1 billion in revenue per year.
But in other industries, there are numerous $1-5 million projects, each of which needs a custom AI system. For example, each factory manufacturing a different type of product might require a custom inspection system, and every hospital, with its own way of coding health records, might need its own AI to process its patient data.
The aggregate value of these hundreds of thousands of projects is massive, but the economics of an individual project might not support hiring a large, dedicated AI team to build and maintain it. This problem is exacerbated by the ongoing shortage of AI talent, which further drives up these costs.
3. Gap between proof of concept and production. Even when an AI system works in the lab, a massive amount of engineering is needed to deploy it in production. It is not unusual for teams to celebrate a successful proof of concept, only to realize that they still have another 12-24 months of work before the system can be deployed and maintained.
For AI to realize its full potential, we need a systematic approach to solving these problems across all industries.
The data-centric approach to AI, supported by tools designed for building, deploying, and maintaining AI applications — called machine learning operations (MLOps) platforms — will make this possible.
Companies that adopt this approach faster will have a leg up relative to competitors.
Insular AI: AI is performed in islands
Organizations have implemented a wide variety of AI use cases across the industrial value chain: predictive maintenance for IoT-enabled products, predictive quality for manufacturing process optimization, product lifecycle analytics, and customer sentiment analysis, to name a few.
On the one hand, insular AI fosters the flexibility and explorative nature of use-case implementations. On the other hand, it hinders reuse, standardization, efficiency, and the enterprise-wide application of AI. The latter is what we call "industrialized AI."
Insular AI leads to a globally distributed, polyglot, and heterogeneous enterprise data landscape.
Consequently, industrializing AI requires a systematic analysis of the underlying data challenges. On this basis, an overall solution integrating technical and organizational aspects can be designed to address them. Let us explore some of these data challenges.
Data Challenges of AI
Generally, ensuring data quality for AI is important—for instance, by detecting and cleansing dirty data. Such data quality issues have already been addressed by a wide range of research and tools.
Beyond data quality, however, there are further critical data challenges: data management, data democratization, and data governance for AI.
In contrast to classical BI and reporting, machine learning and data mining impose extended data requirements. They favor the use not only of aggregated, structured data but also of high volumes of both structured and unstructured data in its raw format—for example, for machine learning-based optical inspection.
This data also needs to be processed not only in periodic batches but also in near real-time to provide timely results—for instance, to predict manufacturing quality in real-time. Consequently, AI poses new challenges to data management, data democratization, and data governance as detailed in the following.
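To make the near-real-time requirement concrete, a minimal quality-monitoring loop can be sketched as a sliding-window check over a sensor stream. The readings, window size, and quality limit below are purely illustrative.

```python
from collections import deque

def stream_alerts(readings, window=3, limit=2.0):
    """Flag positions where the sliding-window average exceeds a quality limit.

    Processes readings one at a time, as a streaming consumer would,
    rather than waiting for a periodic batch job.
    """
    recent = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window and sum(recent) / window > limit:
            alerts.append(i)
    return alerts

# Simulated sensor stream: deviation of a machined part from spec (mm)
sensor_stream = [0.5, 1.0, 1.5, 2.5, 3.0, 0.5]
print(stream_alerts(sensor_stream))  # [4]
```

In production, the same windowed logic would typically run on a stream processor rather than an in-memory list, but the batch-versus-streaming distinction is exactly this: the alert fires while parts are still on the line, not in tomorrow's report.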
The data management challenge of AI lies in comprehensively managing data for AI in a heterogeneous and polyglot enterprise data landscape. This particularly refers to data modeling, metadata management, and data architecture for AI. There is no overall metadata management to maintain metadata across the data landscape.
Technical metadata, such as the names of columns and attributes, are mostly stored in the internal data dictionaries of individual storage systems and are not generally accessible. Hence, data lineage and impact analyses are hindered. For instance, in the case of changes in source systems, manually adapting the affected data pipelines across all data lakes without proper lineage metadata is tedious and costly.
Moreover, business metadata on the meaning of data—for example, the meaning of KPIs—is often not systematically managed at all. Thus, missing metadata management significantly hampers data usage for AI.
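To illustrate why lineage metadata matters, a catalog can record each data asset's downstream consumers, which turns impact analysis into a simple graph traversal instead of a manual hunt through pipelines. The system and table names below are hypothetical.

```python
# Downstream dependencies: each asset maps to the consumers that read from it
lineage = {
    "erp.orders":       ["lake.orders_raw"],
    "mes.machine_logs": ["lake.sensor_raw"],
    "lake.orders_raw":  ["mart.order_kpis"],
    "lake.sensor_raw":  ["mart.quality_model_features"],
    "mart.order_kpis":  [],
    "mart.quality_model_features": [],
}

def impact_analysis(changed, lineage):
    """Return every asset transitively affected by a change in `changed`."""
    affected, stack = set(), [changed]
    while stack:
        node = stack.pop()
        for consumer in lineage.get(node, []):
            if consumer not in affected:
                affected.add(consumer)
                stack.append(consumer)
    return sorted(affected)

print(impact_analysis("erp.orders", lineage))
# ['lake.orders_raw', 'mart.order_kpis']
```

With this metadata in place, a schema change in the ERP source immediately reveals which data lake tables and feature marts need adapting; without it, the same question means tedious, costly manual inspection of every pipeline.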
The data democratization challenge of AI lies in making all kinds of data available for AI for all kinds of end users across the entire enterprise. To this end, data provisioning and data engineering as well as data discovery and exploration all play central roles for AI.
The data governance challenge of AI refers to defining roles, decision rights, and responsibilities for the economically effective and compliant use of data for AI. According to our practical investigations, organizational structures for data are only rudimentarily implemented in industrial enterprises and mainly focus on master data and personal data.
Building a Data Ecosystem for Industrial Enterprises
In light of the above challenges, I see the need for a holistic framework that covers both technical and organizational aspects: a data ecosystem. By addressing these data challenges, the data ecosystem paves the way to industrialized AI.
The term data ecosystem refers to the programming languages, packages, algorithms, cloud-computing services, and general infrastructure an organization uses to collect, store, analyze, and leverage data.
The concept of the data ecosystem is explored through the lens of key stages in the data project life cycle: sensing, collection, wrangling, analysis, and storage.
By understanding how each component of your organization’s data ecosystem interacts with other components, you can prepare for these kinds of challenges and identify opportunities for efficiency.
With respect to the data management challenge, the data ecosystem is based on a comprehensive set of data platforms, namely the enterprise data lake, edge data lakes, and the enterprise data marketplace. These platforms define an enterprise data architecture for AI and data analytics, specifically addressing the aspect of data architecture.
For this purpose, the enterprise data lake incorporates the enterprise data warehouse, avoiding two separate enterprise-wide data platforms and corresponding data redundancies. It is based on a unified set of data modeling guidelines and reference data models implemented by data stewards to address the aspect of data modeling.
The aspect of metadata management is addressed by the data catalog as part of the enterprise data marketplace. The data catalog focuses on the acquisition, storage, and provisioning of all kinds of metadata—technical, business, and operational—across all data lakes and source systems.
In this way, it enables overarching lineage analyses and data quality assessments as essential parts of AI use cases—for example, to evaluate the provenance of a dataset in the enterprise data lake. Data catalogs represent a relatively new kind of data management tool and mainly focus on the management of metadata from batch storage systems, such as relational database systems, as detailed in our recent work.
All aspects of the data democratization challenge—namely data provisioning, data engineering, and data discovery and exploration—refer to self-service and metadata management. They are addressed by the enterprise data marketplace based on the data catalog.
The data catalog provides comprehensive metadata management across all data lakes and source systems of the data ecosystem. Thus, it significantly facilitates data engineering as well as data discovery and exploration for all kinds of end users by providing technical and business information on data and its sources as discussed in our recent work.
As data complexity and volumes increase, it becomes increasingly important to build a universal semantic layer so that your business users get a consistent view of all enterprise data and can conduct quick analysis on it. Once you get all your data together and build a semantic layer on it, you enable full, consistent, and quick access to a single source of truth.
This ensures that when one team talks about a particular dimension, then everybody across the enterprise refers to the same thing. Having a high-performant semantic layer in place will allow your business users to take advantage of the data more quickly to get actionable insights from all their data.
The semantic layer takes user-specific results out of being a "one-off" solution on that user's laptop and turns them into an enterprise analytics accelerant, enabling business answers to be discovered at the speed of business questions. The semantic layer becomes the arbiter (a multilingual data translator) for insights discovery between and among all business users of data, within the tools that they are already using.
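Conceptually, a semantic layer is a governed mapping from business terms to physical data and calculations, so every tool computes a KPI the same way. The sketch below uses hypothetical table, column, and metric names.

```python
# One governed definition per business term: a physical source plus a formula.
# Every consumer of "net_revenue" gets the same calculation.
SEMANTIC_LAYER = {
    "net_revenue": {
        "source": "lake.sales_orders",
        "formula": lambda row: row["gross_amount"] - row["discount"],
    },
}

def compute_metric(term, rows):
    """Compute a KPI through the semantic layer's single definition."""
    definition = SEMANTIC_LAYER[term]
    return sum(definition["formula"](row) for row in rows)

rows = [
    {"gross_amount": 100.0, "discount": 10.0},
    {"gross_amount": 50.0,  "discount": 5.0},
]
print(compute_metric("net_revenue", rows))  # 135.0
```

Real semantic layers are far richer (dimensions, joins, access control), but the core idea is the same: the formula lives in one governed place, not in each analyst's spreadsheet.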
In view of the data governance challenge, the data ecosystem defines a set of key roles related to data—namely data owners, data stewards, data engineers, and data scientists. Thus, both aspects—data ownership and data stewardship—are addressed.
An overarching data ownership organization across source systems and data lakes facilitates the compliant and prompt provisioning of source data for AI use cases because approvals and responsibilities for the use of data are clearly defined. Moreover, a data stewardship organization for all kinds of data significantly enhances data quality and reduces data engineering efforts by establishing reference data models and data quality criteria.
Here, the data catalog supports data governance by providing KPIs for data owners and data stewards, such as the number of sources of truth for specific datasets.
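Such a KPI can be derived directly from catalog metadata: count how many systems hold a copy of the same logical dataset, where any count above one signals that a single source of truth is missing. The catalog entries below are hypothetical.

```python
from collections import Counter

# Catalog entries: (logical dataset, system that stores a copy of it)
catalog_entries = [
    ("customer_master", "erp"),
    ("customer_master", "crm"),
    ("customer_master", "lake"),
    ("machine_logs", "mes"),
]

def sources_of_truth(entries):
    """Count how many systems hold each logical dataset.

    A count above 1 flags a governance issue: redundant copies
    and no single source of truth for that dataset.
    """
    return dict(Counter(dataset for dataset, _ in entries))

print(sources_of_truth(catalog_entries))
# {'customer_master': 3, 'machine_logs': 1}
```

A data owner seeing three systems claiming customer master data can then decide which copy is authoritative and have the stewards consolidate the rest.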
From a Model-centric to a Data-centric Approach to AI
Despite the vast potential of artificial intelligence (AI), it hasn’t caught hold in most industries. Sure, it has transformed consumer internet companies such as Google, Meta, Baidu, Apple, and Amazon — all massive and data-rich with hundreds of millions of users.
But for projections that AI will create $13 trillion of value a year to come true, industries such as manufacturing, agriculture, and healthcare still need to find ways to make this technology work for them.
Here’s the problem: The playbook that these consumer internet companies use to build their AI systems — where a single one-size-fits-all AI system can serve massive numbers of users — won’t work for these other industries.
Instead, these legacy industries will need a large number of bespoke solutions that are adapted to their many diverse use cases. This doesn't mean that AI won't work for these industries, however. It just means they need to take a different approach: a data-centric approach to AI.
Data is vital in AI, and a deliberate approach to obtaining good-quality data is crucial, because real-world data is not only error-prone and limited in quantity but also very costly to obtain.
The key idea of data-centric AI is to handle data the same way we would high-quality materials when building a house: We spend relatively more time labeling, augmenting, managing, and curating the data.
To bridge this gap and unleash AI’s full potential, executives in all industries should adopt a new, data-centric approach to building AI. Specifically, they should aim to build AI systems with careful attention to ensuring that the data clearly conveys what they need the AI to learn. This requires focusing on data that covers important cases and is consistently labeled so that the AI can learn from this data what it is supposed to do.
In other words, the key to creating these valuable AI systems is that we need teams that can program with data rather than program with code.
Data-centric AI prioritizes data quality over quantity. Compared to model-centric AI, which seeks to engineer performance gains by iterating on the model and code over a fixed data set, a data-centric approach can help mitigate many of the challenges that arise when deploying AI infrastructure.
At AI’s current level of sophistication, the bottleneck for many applications is getting the right data to feed to the software. We’ve heard about the benefits of big data, but we now know that for many applications, it is more fruitful to focus on making sure we have good data — data that clearly illustrates the concepts we need AI to learn. This means, for example, the data should be reasonably comprehensive in its coverage of important cases and labeled consistently.
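Label consistency, in particular, can be checked mechanically: if the same input appears with conflicting labels, the labeling guideline is ambiguous and the AI receives contradictory signals. A minimal sketch follows; the sample IDs and defect labels are made up.

```python
from collections import defaultdict

def find_label_conflicts(labeled_examples):
    """Return inputs that appear with more than one distinct label."""
    labels_by_input = defaultdict(set)
    for example_id, label in labeled_examples:
        labels_by_input[example_id].add(label)
    return {k: sorted(v) for k, v in labels_by_input.items() if len(v) > 1}

annotations = [
    ("img_001", "scratch"),
    ("img_001", "scratch"),
    ("img_002", "scratch"),
    ("img_002", "dent"),  # same image, conflicting label
    ("img_003", "ok"),
]
print(find_label_conflicts(annotations))
# {'img_002': ['dent', 'scratch']}
```

Surfacing such conflicts to the labelers, and tightening the labeling instructions accordingly, is often a faster route to better accuracy than tuning the model.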
Data is food for AI, and modern AI systems need not only calories, but also high-quality nutrition.
Shifting our focus from software to data offers an important advantage: it relies on the people you already have on staff. In a time of great AI talent shortage, a data-centric approach allows many subject matter experts who have a vast knowledge of their respective industries to contribute to AI system development.
When you want a complete solution that can scale with your needs and your customers' requirements after the POC phase while maintaining high accuracy, I have found it very important to take a data-centric approach.
With a data-centric approach, instead of relying on the model to find inconsistencies in your data via trial and error, you design a system based on the characteristics of data, then use that to train your models. With a data-centric approach, you can still make use of state-of-the-art models with an optimized architecture, but the quality of the data is more important than the quantity.
What’s keeping organizations from unlocking the full value of their data? The answer lies in the traditional ways of viewing data and data management practices which are incomplete for the digital era. We need to reframe the importance of data in driving the AI-empowered organization.
Data is not just a byproduct of the business or something that is IT’s problem to manage. It’s the fuel for economic growth in the 21st century and what differentiates the Analytic Leaders from the masses with unrealized profits.
A holistic data management strategy understands and exploits the unique economic characteristics of data and analytics and takes management action to avoid pitfalls or destroyers of the economic value of data and analytics.
If organizations believe that data is the world’s most valuable resource, then data management must move from a passive, cost-avoidant set of technology functions into a proactive, data monetization business strategy.
Also, here are a few things to keep in mind when approaching an AI project:
- Instead of merely focusing on the quantity of data you collect, also consider the quality, and make sure it clearly illustrates the concepts you need the AI to learn.
- Make sure your team considers taking a data-centric approach rather than a software-centric approach. Many AI engineers, including many with strong academic or research backgrounds, were trained to take a software-centric approach; urge them to adopt data-centric techniques as well.
- For any AI project that you intend to take to production, be sure to plan the deployment process and provide MLOps tools to support it. For example, even while building a proof-of-concept system, urge the teams to begin developing a longer-term plan for data management, deployment, and AI system monitoring and maintenance.
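As one concrete starting point for the monitoring mentioned above, a deployed system can watch each input feature's live distribution drift away from the training distribution. The sketch below measures the shift of the live mean in training standard deviations; the sensor values and the three-sigma threshold are illustrative, not a recommendation.

```python
import statistics

def drift_score(train_values, live_values):
    """Shift of the live mean from the training mean, in training std devs.

    A large score suggests the production data no longer resembles
    what the model was trained on, so its predictions are suspect.
    """
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma

# One input feature of a deployed model, e.g. a machine temperature reading
train = [10.0, 10.5, 9.8, 10.2, 10.1, 9.9]
live  = [12.1, 12.4, 11.9, 12.2]

score = drift_score(train, live)
print(score > 3.0)  # True: flag for investigation or retraining
```

Production MLOps platforms use more robust distribution tests, but even this simple per-feature check catches the common failure mode where a model silently degrades because its inputs have changed.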
As the undeniable fuel for economic growth in the 21st century and the differentiator that sets analytic leaders apart from the masses, data and efficient data management need to become a core business priority.
It's time for a shift in both perspective and implementation to unleash the true and unrealized economic value of your business's data. It is the right time to bring the focus back to data-centric AI in this decade, take the road less traveled, and close the gap with the advancements made in model-centric AI.
References
- "There Is No AI Without Data" by Christoph Gröger
- "The Ugly Truth about Data Management & the Journey to Unleashing the Economic Value of Data"
- "The SAS Data Governance Framework: A Blueprint for Success." SAS (2018)
- "Value Creation from Big Data: Looking Inside the Black Box." Strategic Organization 16, 2 (2018), 105–140
- "State of AI in the Enterprise, 2nd Edition." Deloitte (2018)
- Landing AI, Andrew Ng