Active Metadata Management and The Rise of Intelligent Data Architecture Platforms

Today, organizations cannot afford to wait for data insights; they need to meet business needs and deliver results at the speed of decision-making. However, many data professionals have been overly focused on technology, which can lead to suboptimal and costly choices. To address this, many are adopting a business-outcome-first mindset. This shift requires not only a different thought process but also a fresh approach to technology. One such approach, the “Intelligent Data Architecture Platform” (IDAP), unifies data and metadata, resulting in faster development of data products.

As an intelligent data orchestrator, IDAP uses machine learning (ML) and underpins the metadata collection and discovery needed to perform the required tasks. Here, the metadata powers the automation and orchestration backplane, creating a unified engine that enables data and business teams to build and manage data products collaboratively. Active metadata management (AMM) takes this one step further. Unlike traditional metadata management, AMM analyzes metadata and delivers timely alerts and recommendations for addressing issues such as data pipeline failures and schema drift as they arise. This proactive approach also helps keep the modern data stack healthy and up to date.

More specifically, IDAP includes the following components that work together:

Ingestion and Profiling: Data ingestion is the process of importing or receiving data from various sources into a target system or database for storage, processing, and analysis. This involves extracting data from source systems, transforming it into a usable format, and loading it into the target system, and it is a critical step in creating a reliable and efficient data pipeline. Some data is ingested in batch mode using data movement options like secure FTP, and some sources allow real-time ingestion using pub/sub mechanisms like Apache Kafka or APIs. The IDAP needs to not only manage the varying frequencies at which data is ingested, but also discover its schema and handle changes such as schema drift. Once ingested, data from operational and transactional sources is loaded into a data warehouse or a data lake, where it is integrated and modeled for consumption by downstream systems and data consumers. However, before this data can be used intelligently, it needs to be profiled.
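The schema discovery and drift handling described above can be illustrated with a minimal sketch. It is not how any particular IDAP implements the step; the function names (infer_schema, detect_drift), the registered schema, and the sample batch are all hypothetical, and types are inferred naively from Python values.

```python
from typing import Dict, List


def infer_schema(records: List[dict]) -> Dict[str, str]:
    """Infer a column -> type-name mapping from a batch of records."""
    schema: Dict[str, str] = {}
    for record in records:
        for column, value in record.items():
            if value is None:
                # A null says nothing about the type; just register the column.
                schema.setdefault(column, "unknown")
                continue
            type_name = type(value).__name__
            if schema.get(column, "unknown") == "unknown":
                schema[column] = type_name
            elif schema[column] != type_name:
                schema[column] = "mixed"
    return schema


def detect_drift(expected: Dict[str, str], observed: Dict[str, str]) -> Dict[str, list]:
    """Compare the registered schema with the one observed in the latest batch."""
    shared = set(expected) & set(observed)
    return {
        "added": sorted(set(observed) - set(expected)),
        "removed": sorted(set(expected) - set(observed)),
        # Columns that were all-null in the batch ("unknown") are not flagged.
        "type_changed": sorted(
            col for col in shared
            if observed[col] != "unknown" and expected[col] != observed[col]
        ),
    }


# Hypothetical registered schema and incoming batch.
registered = {"order_id": "int", "amount": "float", "region": "str"}
batch = [
    {"order_id": 1, "amount": "19.99", "currency": "USD"},
    {"order_id": 2, "amount": "5.00", "currency": "EUR"},
]
drift = detect_drift(registered, infer_schema(batch))
print(drift)
```

In this sketch the batch adds a currency column, drops region, and delivers amount as a string, so all three kinds of drift are reported and could feed an alert or a recommendation.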

Conventional systems have provided mechanisms to profile ingested data and extract technical metadata, such as column statistics, schema information, and basic data quality attributes like completeness, uniqueness, and missing values. IDAP does this too, but it also uses ML to build a knowledge graph so it can infer relationships and data quality rules. The approach also helps generate operational metadata, which is information on how and when data was created or transformed.
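As a rough illustration of the basic profiling attributes mentioned above (completeness, uniqueness, missing values), the following sketch computes them per column. The helper names and sample rows are hypothetical, and the ML-driven knowledge graph is out of scope here.

```python
from typing import Any, Dict, List


def profile_column(values: List[Any]) -> Dict[str, Any]:
    """Compute simple technical metadata for one column (values must be hashable)."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    distinct = len(set(non_null))
    return {
        "row_count": total,
        "missing": total - len(non_null),
        # Share of rows that carry a value.
        "completeness": len(non_null) / total if total else 0.0,
        # Share of non-null values that are distinct.
        "uniqueness": distinct / len(non_null) if non_null else 0.0,
    }


def profile_table(rows: List[dict]) -> Dict[str, Dict[str, Any]]:
    """Profile every column that appears in at least one row."""
    columns = sorted({col for row in rows for col in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample data.
rows = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": "a@example.com"},
    {"customer_id": 4, "email": "b@example.com"},
]
profile = profile_table(rows)
for column, stats in profile.items():
    print(column, stats)
```

A fully unique, fully complete column such as customer_id is a plausible key candidate, which is the kind of hint a platform could surface as an inferred rule.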

Activating Metadata: Traditionally, metadata was seen as a static resource, created and stored alongside the data it describes. However, with the increasing complexity and volume of data in modern systems, active metadata management has become essential. It involves treating metadata as a dynamic and valuable asset that can be actively leveraged for various purposes. IDAP activates the metadata so it can travel across modern data tool stacks and actively manage all data workloads. IDAP uses metadata analysis to provide recommendations to data engineers so they can manage data pipelines effectively, to alert them to data quality issues and so increase productivity, and to ensure good data delivery to data consumers.

Curation: Data curation involves the selection, organization, and maintenance of data to ensure its accuracy, reliability, and usefulness for analysis and decision-making. It includes activities such as data cleansing, transformation, and enrichment, as well as metadata creation and documentation. Effective data curation is essential to normalize, standardize, and harmonize datasets so that data-driven projects succeed.

To speed up business-led data product development, the technical metadata, which consists of technical column names, is converted into business-friendly terms to create business metadata. In this step, the business metadata is linked to the technical metadata and added to the business glossary.

Data Quality: Embedding quality checks into data pipelines addresses data inaccuracy, duplication, and inconsistency. By offering this capability, IDAP delivers exceptional data products while enhancing the reliability of data for organizations.
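One way to picture quality checks embedded in a pipeline is a step that runs named checks against each batch and reports the failures before the data moves on. This is a minimal sketch under assumed names (no_missing, unique, run_step); real platforms typically offer far richer rule sets.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Check = Tuple[str, Callable[[List[Row]], bool]]


def no_missing(column: str) -> Check:
    """Check that every row has a non-null value in the column."""
    return (
        f"no_missing({column})",
        lambda rows: all(r.get(column) is not None for r in rows),
    )


def unique(column: str) -> Check:
    """Check that the column contains no duplicate values."""
    def _check(rows: List[Row]) -> bool:
        values = [r.get(column) for r in rows]
        return len(values) == len(set(values))
    return (f"unique({column})", _check)


def run_step(rows: List[Row], checks: List[Check]) -> List[str]:
    """Return the names of failed checks; an empty list means the batch may proceed."""
    return [name for name, check in checks if not check(rows)]


# Hypothetical checks and batches.
checks = [no_missing("order_id"), unique("order_id"), no_missing("amount")]
good = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 7.5}]
bad = [{"order_id": 1, "amount": 10.0}, {"order_id": 1, "amount": None}]

good_failures = run_step(good, checks)
bad_failures = run_step(bad, checks)
print(good_failures)
print(bad_failures)
```

The second batch contains a duplicated order_id and a missing amount, so it fails two checks and would be held back or quarantined rather than delivered to consumers.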

Transformation/Testing: This component is designed to provide an excellent developer experience and boost productivity. A collaborative workspace is used to develop and deploy code, and the IDAP borrows agile and lean best practices from software engineering, including reuse of data transformation code.

Additionally, it uses a no/low-code transformation engine that can be built into the IDAP or integrated with an existing engine to speed up development. Finally, it applies key components of the DevOps philosophy, such as continuous testing and automation, to data management. This discipline is called DataOps, and it is maturing quickly.

Continuous Development and Deployment: DataOps best practices are used in deployment to push code into production in a governed and secure manner. This allows business users to accelerate experimentation by branching and testing new features without introducing breaking changes into the production pipelines. Features can also be rolled back quickly if needed. Finally, the IDAP introduces much-needed A/B testing capabilities into the development of data products.
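The continuous testing idea from DataOps can be sketched as a reusable transformation with a test that must pass before the code is promoted. Everything here is hypothetical: normalize_country, its test, and the promote_if_green gate stand in for whatever transformation engine and deployment tooling a team actually uses.

```python
from typing import Callable, Dict, List


def normalize_country(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Reusable transformation: trim and upper-case the country code."""
    return [{**row, "country": row["country"].strip().upper()} for row in rows]


def test_normalize_country() -> None:
    sample = [{"id": "1", "country": " us "}, {"id": "2", "country": "De"}]
    result = normalize_country(sample)
    assert [r["country"] for r in result] == ["US", "DE"]
    # The input must not be mutated, so the step is safe to reuse.
    assert sample[0]["country"] == " us "


def promote_if_green(tests: List[Callable[[], None]]) -> str:
    """Hypothetical deployment gate: promote only when every test passes."""
    for test in tests:
        try:
            test()
        except AssertionError:
            return "blocked"
    return "promoted"


status = promote_if_green([test_normalize_country])
print(status)
```

Running the same tests on a branch before merging is what lets new features be tried without breaking the production pipeline.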

Observability: Traditional systems were rule-based and generated a large number of notifications, causing “alert fatigue”. IDAP, like other modern observability systems, uses ML to detect anomalies and has an alerting and notification engine to escalate critical issues. This allows the business to identify anomalies proactively and avoid downtime, while handling notifications intelligently to reduce the overload.
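To make the detect-then-escalate flow concrete, here is a deliberately simple sketch. It uses a statistical z-score as a stand-in for the ML-based detection the text describes, and the thresholds, function names, and row counts are assumptions chosen for illustration.

```python
from statistics import mean, stdev
from typing import List, Optional


def anomaly_score(history: List[float], latest: float) -> Optional[float]:
    """Return how many standard deviations `latest` is from the historical mean."""
    if len(history) < 2:
        return None  # Not enough history to judge.
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly constant history: any change at all is maximally unusual.
        return 0.0 if latest == center else float("inf")
    return abs(latest - center) / spread


def classify(history: List[float], latest: float,
             warn_at: float = 3.0, page_at: float = 6.0) -> str:
    """Escalate only large deviations so routine noise does not raise alerts."""
    score = anomaly_score(history, latest)
    if score is None or score < warn_at:
        return "ok"
    return "page" if score >= page_at else "warn"


# Hypothetical daily row counts for one pipeline.
daily_row_counts = [1000, 1020, 990, 1010, 1005, 995, 1015]
print(classify(daily_row_counts, 1012))
print(classify(daily_row_counts, 1050))
print(classify(daily_row_counts, 400))
```

Tiering the response, with most values ignored, some logged as warnings, and only severe deviations paged, is one simple way to reduce alert fatigue compared with a fixed rule on every metric.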

Building Better Business Value Begins by Being “Business Led”

The future belongs to organizations that are led by business outcomes rather than driven by technology. These companies are laser-focused on delivering business value at all times and have an urgency to transform fast, quickly stand up analytics use cases, and continuously innovate. However, this often requires adopting a hybrid approach that integrates the best of centralized infrastructure with domain-driven data product development. The approach also needs to lead with user experience and user needs in mind. Done this way, it helps deliver results faster and aligns well with organizational culture and skills, creating solutions with more value to clients and customers.

Partners who provide an integrated platform that supports active metadata management save their customers time and money while also delivering trusted business outcomes. The time savings come from avoiding the need to integrate several technologies and from making the business significantly more efficient. Organizations can measure the benefits through metrics such as the ratio of successful projects, the number of deployed use cases, and the frequency of new releases, all of which lead to higher trust in data. They can also use the approach to create economies of scale and to avoid unnecessary downtime.

Finally, these products gain from economies of scale: just as an ML model improves with frequent retraining, so do these cloud-native, multi-tenant data frameworks. By flipping the focus from technology to outcomes, organizations that consider IDAP are finally achieving the aspirational goal of becoming truly data driven.

About Modak

Modak is a solutions company dedicated to empowering enterprises to manage and harness their data landscape effectively. It offers a technology-, cloud-, and vendor-agnostic approach to customer datafication initiatives. Leveraging machine learning (ML) techniques, Modak revolutionizes the way both structured and unstructured data are processed, utilized, and shared.

Modak has led multiple customers in reducing their time to value by 5x through Modak’s unique combination of data accelerators, deep data engineering expertise, and delivery methodology to enable multi-year digital transformation. To learn more visit or follow us on LinkedIn and Twitter.