Everybody knows silos are bad for business

Nearly every business, big or small, is bleeding heavily because of data silos. Maintaining data today demands huge spending, much of it on redundant data and on data left isolated because its sources don’t provide a common unique parameter for unification. Data grows daily, and there is no escape from the cost of growing redundancies and data silos. This sub-optimal approach not only costs a lot of money but also creates gaps in achieving the desired results.

Data silos cause many problems. For example, if you are facing higher customer acquisition costs, the likely causes are (a) inaccurate attribution models leading to ineffective ad spend and (b) a lack of customer engagement or irrelevant targeting. Both problems result from data being spread across many sources with no unique tags to merge them. Analysis carried out over individual data-sets is therefore flawed, which results in low effectiveness and wasted resources.

Not maintaining a single source of truth can even hurt your ability to manage and hold on to customers. Customer retention campaigns become less effective when customer data is all over the place, and missing data leads to irrelevant messaging. Maintaining a single source allows the retention marketer to track the costs of each campaign for a single customer, which helps them automate the right discounts and offers based on every associated parameter.

These problems arise because insights and metrics are derived from analytical modules that feed only off the structured sources (EDW). Unstructured data (the data lake), coming from web tracking, log files and social data, is still hard to merge with the structured set, so the analytical modules end up running on two different sources.

To dig into this problem further: the structured data holds identifiable customer information such as profile data, email interactions, transactions, social data obtained by resolving emails, and post-login web/app behavior. The unstructured data holds unidentifiable customer information, such as the website behavior of a customer who chose to browse without logging in, audience profile information coming from an ad network, or other data collected over cloud apps. Because unique tags are lacking, the data within the data lake requires meticulous effort to unify.

Since these data sources cannot be merged, the analytical tools miss the information held in the unstructured store and arrive at less accurate insights. This approach is prevalent in most enterprises: the point applications that offer structured information are ETL’d to the data warehouse, while the unstructured data is kept on Hadoop, with stream and batch processing tools ingesting it into the lake for transformation.


This disconnected data creates gaps in data harmonization, analytics, discovery, mining and any machine learning activity. The data processing layer, or the apps within it, needs to be provided with complete data-sets to get the best result out of the investment. This layer becomes mission-critical when deploying decision processes for machine-driven automation and AI-enabled conversations and engagement.

With data being processed across various systems, serving real-time engagements to customers becomes impossible, making customer engagement irrelevant and less effective. Insights derived from post hoc analysis might have lost their relevance by the time they reach the customer. For example, suppose a customer is recommended a product based on post hoc analysis and decides to buy it. The customer adds it to the cart and continues to browse. Once this is detected, the product should not be recommended again for the rest of the session. An AI-driven system should be able to detect this and behave naturally, but it needs the data to be in one place so it can update when an interaction occurs and learn from it.
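The cart example above can be sketched in a few lines. This is only an illustration under stated assumptions: `SessionState`, `record_add_to_cart` and `live_recommendations` are hypothetical names, not part of any product API, and the batch recommendations stand in for whatever the post hoc analysis produced.

```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Hypothetical per-session record of what the customer has done so far."""
    cart: set = field(default_factory=set)

    def record_add_to_cart(self, product_id: str) -> None:
        # Update the session as soon as the interaction occurs.
        self.cart.add(product_id)


def live_recommendations(batch_recs, session):
    """Filter post hoc recommendations against the live session state."""
    return [p for p in batch_recs if p not in session.cart]


session = SessionState()
batch_recs = ["shoes", "socks", "laces"]  # output of an offline analysis

before = live_recommendations(batch_recs, session)
session.record_add_to_cart("shoes")       # customer adds the recommended item
after = live_recommendations(batch_recs, session)
```

Here `before` still contains "shoes", while `after` no longer does. The filter only works because the session events and the recommendations are readable from the same place at the moment of the interaction.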

In current practice, the data-sets flushed into the data lake are as good as raw data, in the sense that the data relationships have still not been created. The data has already made two hops (to the source database and then to the data lake) and will need another hop to draw out relationships. It then needs to be processed for insights (another hop) and finally delivered to the end user. Traversing so many applications moves you further away from real-time. As Sean Owen of Cloudera puts it, “Real-time systems compute a function of one data element, or a smallish window of recent data asynchronously in near-real-time, probably seconds at most.”

Along with the challenge of plugging together the complex machinery of your data environment comes the cost of time and money spent on tools, integrations and heuristic models, only to get inadequate outputs.

Ideally, businesses need a platform like Plumb5 that can pack both the warehouse and the data lake into one unified source, which can then feed analytical or machine learning modules for better results. The unified source will act as the global metadata repository and create semantic relationships between the two. For example, both the unstructured and structured data of a single customer can be unified using a unique tag served at the point systems.
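As a rough sketch of tag-based unification, assuming every point system stamps its records with the same tag: the field name `customer_tag`, the function `unify_by_tag` and the sample records below are all hypothetical and are not taken from Plumb5.

```python
from collections import defaultdict


def unify_by_tag(structured_rows, unstructured_events, tag_key="customer_tag"):
    """Group warehouse rows and lake events under one record per tag."""
    profiles = defaultdict(lambda: {"profile": {}, "events": []})

    # Structured side: profile fields, transactions, and so on.
    for row in structured_rows:
        fields = {k: v for k, v in row.items() if k != tag_key}
        profiles[row[tag_key]]["profile"].update(fields)

    # Unstructured side: web tracking, logs, ad-network data.
    for event in unstructured_events:
        tag = event.get(tag_key)
        if tag is None:
            continue  # without a tag the event cannot be attributed
        payload = {k: v for k, v in event.items() if k != tag_key}
        profiles[tag]["events"].append(payload)

    return dict(profiles)


structured = [{"customer_tag": "t1", "email": "a@example.com"}]
events = [
    {"customer_tag": "t1", "page": "/pricing"},
    {"page": "/home"},  # anonymous hit with no tag
]
unified = unify_by_tag(structured, events)
```

The untagged event is dropped, which is exactly the information loss described earlier: data without a shared tag never reaches the unified record.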


Businesses can connect their data sources directly to the Plumb5 platform using APIs, or connect their DW or data lake, before they run any of their analytical apps. Plumb5 creates a unified set for each individual customer. This data can be served to third-party analytical models or ML libraries, or the inbuilt ML library can be used, to arrive at predictive insights or decision states for machine automation.

Though tagging pretty much solves the unification problem, it is the design of the unified model that takes precedence when we have to solve for other factors like real-time learning, state automation, multi-dimensional querying or even NLP-based data search.

The unified model in Plumb5 uses a hierarchical logical schema which can double as a network for real-time learning. This allows the unified source to compute data in real time using a propensity scoring model and come up with outputs instantly. These outputs, which are net weights computed from the weights in the scoring model, indicate the state the customer belongs to. Based on these states, machine automation can be configured.
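To make the idea of net weights and states concrete, here is a minimal sketch. It is not Plumb5's scoring model: the event names, the weights in `EVENT_WEIGHTS`, the cut-offs in `STATE_THRESHOLDS` and the state labels are all invented for illustration.

```python
# Hypothetical weight per interaction type (assumed values).
EVENT_WEIGHTS = {
    "page_view": 1.0,
    "add_to_cart": 5.0,
    "purchase": 10.0,
    "unsubscribe": -8.0,
}

# Hypothetical cut-offs, checked from highest to lowest.
STATE_THRESHOLDS = [(15.0, "loyal"), (5.0, "engaged"), (0.0, "browsing")]


def net_weight(events):
    """Sum the weights of a customer's interactions; unknown events count 0."""
    return sum(EVENT_WEIGHTS.get(e, 0.0) for e in events)


def customer_state(events):
    """Map the net weight to the first state whose threshold it meets."""
    score = net_weight(events)
    for threshold, state in STATE_THRESHOLDS:
        if score >= threshold:
            return state
    return "at_risk"  # net weight below every threshold


state = customer_state(["page_view", "add_to_cart"])
```

Because the score is a running sum, it can be updated with each new interaction rather than recomputed in a batch, and an automation rule can then be keyed on the returned state.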


Using this customer-centric schema, the platform allows for quick recommendations using collaborative filtering, as the customer filter is applied at the schema level. This makes it possible to propose real-time recommendations, which are packed alongside the states and communicated during the automation cycle.
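For readers unfamiliar with collaborative filtering, the following is a small user-based variant using Jaccard similarity over purchase sets. It is a generic textbook-style sketch, not the platform's implementation; the function names and the sample `purchases` data are assumptions.

```python
def jaccard(a, b):
    """Overlap between two item sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases, top_n=3):
    """Score items the target lacks by the similarity of customers who own them."""
    owned = purchases[target]
    scores = {}
    for other, items in purchases.items():
        if other == target:
            continue
        sim = jaccard(owned, items)
        if sim == 0.0:
            continue  # nothing in common, so no signal
        for item in items - owned:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return ranked[:top_n]


purchases = {
    "alice": {"shoes", "socks"},
    "bob": {"shoes", "socks", "laces"},
    "carol": {"shoes", "hat"},
    "dave": {"umbrella"},
}
recs = recommend("alice", purchases)
```

Note that the data is already keyed by customer, so each lookup starts from one customer's record. That is the sense in which a customer-centric layout keeps this kind of recommendation cheap.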

With Plumb5, the business can turn the entire data chaos into an organized structure holding both historical and up-to-date customer information, providing a solid foundation for any data-driven analysis and learning.
