Do you need Data Governance over a Data Lake?

There continues to be a lot of excitement about data lakes and the possibilities that they offer, particularly for analytics, data visualisation, AI and machine learning. As such, I’m increasingly being asked whether you really need Data Governance over a data lake. After all, a data lake is a centralised repository that allows you to store all your structured and unstructured data on a scalable basis.

Unlike a data warehouse, in a data lake you can store your data as-is without having to structure it first.  This has resulted in many organisations “dumping” lots of data into data lakes in an uncontrolled and thoughtless manner.  The result is what many people are calling “Data Swamps” which have not provided the amazing insights they hoped for.

So the simple answer to the question is yes: you do need Data Governance over data lakes to prevent them from becoming data swamps that users don’t access because they don’t know what data is there, they can’t find it, or they just don’t trust it. If you have Data Governance in place over your data lake, then you and your users can be confident that it contains clean data which can be found and used appropriately.

But I don’t expect you to just take my word for it; let’s have a look at some of the reasons why you want to implement Data Governance on data being ingested into your data lake:

Data Owners Are Agreed

Data Owners should approve whether the data they own is appropriate to load into the data lake (e.g. is it sensitive data, and should it be anonymised before loading?).

In addition, users of the data lake need to know who to contact if they have any questions about the data and what it can or can’t be used for.

Data Definitions

Whilst data definitions are desirable in all situations, they are even more necessary for data lakes.  In the absence of definitions, users of data in more structured databases can use the context of that data to glean some idea of what the data may be.  As a data lake is by its nature unstructured, there is no such context.

A lack of data definitions means that users may not be able to find or understand the data, or may use the wrong data for their analysis. A data lake could provide a ready source of data, but a lack of understanding about it means that it cannot be used quickly and easily. This means that opportunities are missed and use of the data lake ends up confined to a small number of expert users.

Data Quality Standards

Data Quality Standards enable you to monitor and report on the quality of the data held in the data lake.  While you do not always need perfect data when analysing high volumes, users do need to be aware of the quality of the data. Without standards (and the ability to monitor against them) it will be impossible for users to know whether the data is good enough for their analysis.
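
To make the idea of monitoring against a standard concrete, here is a minimal sketch in Python. It assumes one very simple kind of standard (a minimum completeness percentage per field); the class, the function names, the thresholds and the sample records are all hypothetical and only for illustration. Real standards would be agreed with your Data Owners and would usually cover more than completeness.

```python
from dataclasses import dataclass


@dataclass
class CompletenessStandard:
    """Hypothetical standard: minimum share of records with a value in `field`."""
    field: str
    min_completeness: float  # e.g. 0.90 means 90% of records must be populated


def completeness(records: list[dict], field: str) -> float:
    """Return the fraction of records where `field` is present and not empty."""
    if not records:
        return 0.0
    populated = sum(1 for r in records if r.get(field) not in (None, ""))
    return populated / len(records)


def report(records: list[dict], standards: list[CompletenessStandard]) -> dict:
    """Measure each standard and record whether the data meets it."""
    result = {}
    for s in standards:
        score = completeness(records, s.field)
        result[s.field] = {
            "score": score,
            "target": s.min_completeness,
            "met": score >= s.min_completeness,
        }
    return result


# Illustrative records only, not real data.
customers = [
    {"id": 1, "email": "a@example.com", "date_of_birth": "1980-05-01"},
    {"id": 2, "email": "", "date_of_birth": "1992-11-23"},
    {"id": 3, "email": "c@example.com", "date_of_birth": None},
    {"id": 4, "email": "d@example.com", "date_of_birth": "1975-02-14"},
]
standards = [
    CompletenessStandard("email", 0.90),
    CompletenessStandard("date_of_birth", 0.70),
]
quality_report = report(customers, standards)
```

In this sample both fields are 75% complete, so the email standard is not met and the date of birth standard is. The point is that a user can see the score alongside the target and decide for themselves whether the data is good enough for their analysis.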

Data Cleansing

Any data cleansing done in an automated manner inside the data lake needs to be agreed with Data Owners and Data Consumers. This is to ensure that all such actions comply with the definitions and standards, and that they do not make the data unusable for certain analysis purposes. For example, defaulting missing dates of birth to an agreed date could skew an analysis that involved looking at the ages of customers.
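
The date of birth example is easy to show with a few lines of Python. This is only a sketch with made-up values: the default date of 1 January 1900, the reference date and the four sample records are all assumptions chosen to keep the arithmetic simple.

```python
from datetime import date
from statistics import mean

AS_OF = date(2024, 1, 1)        # fixed reference date so the result is repeatable
DEFAULT_DOB = date(1900, 1, 1)  # hypothetical "agreed" default for missing values


def age_on(dob: date, as_of: date) -> int:
    """Whole years between dob and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (dob.month, dob.day)
    return as_of.year - dob.year - (0 if had_birthday else 1)


# Three known dates of birth and one missing value (illustrative only).
raw_dobs = [date(1984, 1, 1), date(1994, 1, 1), date(2004, 1, 1), None]

# Option A: cleanse by replacing the missing value with the default date.
defaulted = [d if d is not None else DEFAULT_DOB for d in raw_dobs]
mean_age_defaulted = mean([age_on(d, AS_OF) for d in defaulted])

# Option B: leave the value missing and exclude it from the age analysis.
known = [d for d in raw_dobs if d is not None]
mean_age_known = mean([age_on(d, AS_OF) for d in known])
```

With these values the average age is 30 when the missing record is excluded, but 53.5 once it has been defaulted, because the defaulted record is treated as a 124-year-old customer. That is exactly the kind of side effect a Data Consumer needs the chance to object to before the cleansing rule is automated.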

Data Quality Issue Resolution

While there may be some cases where automated data cleansing inside the data lake is appropriate, all identified data quality issues in the data lake should be managed through your existing data quality issue resolution process to ensure that the most appropriate solution is agreed by the Data Owner and the Data Consumers.

Data Lineage

Having data flows documented is always valuable, but in order to meet certain regulatory requirements (including EU GDPR), organisations need to prove that they know where data is and how it flows throughout their company.

Data lineage diagrams are one of the key data governance deliverables. Critical or sensitive data being ingested into the data lake should be documented on data flow diagrams. This will add to the understanding of the Data Consumers by highlighting the source of that data. Such documentation also helps prevent duplicate data being loaded into the data lake in the future.
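
If it helps to picture what sits behind such a diagram, here is a minimal sketch in Python of a lineage register. The dataset names and both functions are hypothetical; the assumption is simply that each dataset is recorded against the sources it is loaded from. A dedicated data catalogue or lineage tool would normally hold this information.

```python
# Hypothetical lineage register: each dataset maps to the sources it is loaded from.
lineage = {
    "lake.customer_360": ["lake.crm_customers", "lake.web_signups"],
    "lake.crm_customers": ["crm.customers"],
    "lake.web_signups": ["web.signups"],
}


def upstream(dataset: str, register: dict[str, list[str]]) -> set[str]:
    """Return every direct and indirect source that feeds `dataset`."""
    seen: set[str] = set()
    stack = list(register.get(dataset, []))
    while stack:
        source = stack.pop()
        if source not in seen:
            seen.add(source)
            stack.extend(register.get(source, []))
    return seen


def datasets_fed_by(source: str, register: dict[str, list[str]]) -> list[str]:
    """Return the datasets loaded directly from `source`, sorted by name."""
    return sorted(d for d, sources in register.items() if source in sources)


# Where does the combined customer view come from?
customer_360_sources = upstream("lake.customer_360", lineage)

# Before loading crm.customers again, check whether it is already in the lake.
existing_loads = datasets_fed_by("crm.customers", lineage)
```

The first lookup answers the Data Consumer's question of where the data originated; the second is the kind of check that stops the same source being ingested twice.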

I hope I have convinced you that if you want a data lake to support your business decisions, then Data Governance is absolutely critical. It may not need to be as granular as the definitions and documentation that you would put in place for a data warehouse, but it is needed to ensure that you create and maintain a data lake and not a data swamp!

Ingesting data into data lakes without first understanding that data is just one of many data governance mistakes that are often made. You can find out the most common mistakes and, more importantly, how to avoid them by downloading my free report here.

Does it have to be called Data Governance?

This is a question that I get asked fairly regularly. After all, it is not an exciting title and in no way conveys the benefits that an organisation can achieve by implementing Data Governance. Sadly, however, there is no easy yes or no answer. There are a number of reasons for this:

  1. Data governance is a misunderstood and misused data management term

Naturally I am biased, but in my view, data governance is the foundation of all other data management disciplines (and of course therefore the most important). But despite an increasing focus on the topic, it remains a largely misunderstood discipline.

On top of this, it is a term which is frequently misused. A few years ago, a number of Data Security software vendors were using the term to describe their products. More recently, the focus on meeting the EU GDPR requirements has led to a lot of confusion as to whether Data Protection and Data Governance are the same thing, and I find that the terms are being used interchangeably. (For the record, having Data Governance in place does help you meet a chunk of the GDPR requirements, but they are not the same thing.)

Having more people talking about Data Governance is definitely a good thing, but unless they all mean the same thing, it leads to much confusion over what data governance really is.

I explored this topic in a bit more detail in this blog: Why are there so many Data Governance Definitions?

In order to understand whether Data Governance is the right name for your organisation to use, I would start by looking at how you define data governance. And this step leads nicely to the next item for consideration.

  2. Sometimes it is right to include things which are not pure data governance in the scope of your data governance initiative.

This is a topic that I covered in my last blog which you can read here.

To summarize that article, it is just not possible to have one or more people focus purely on Data Governance in smaller organisations. It’s a luxury of large organisations to be able to have separate teams responsible for each different data management discipline (e.g. Data Architecture, Data Modelling or Data Security). Going back to my point above, if data governance is the foundation for all other data management disciplines, it is only natural that the line between them can sometimes get a little blurred. As a result, the responsibilities of the Data Governance Team can get expanded.

So consider what is included within the scope of your data governance initiative and decide whether it would be more appropriate to name the initiative, your team, or both something that is more aligned to the wider scope of the initiative and the activities of the team.

Is the name going to make cultural change harder to achieve?

Achieving a sustainable cultural change is one of the biggest challenges in implementing data governance. Insisting on calling it “data governance” could make achieving that cultural change more difficult if the term doesn’t resonate within your organization. This is related to a topic that I explored in another old blog: Do we have to call them Data Owners?

Whether we’re talking about the roles, the team, or even the initiative, the same principles apply. It is better to choose a name that works for the culture in your organization than to waste considerable effort trying to convince people that the “correct” terminology is the only one to use.

It would be my preference to explain that the initiative is to design and implement a Data Governance Framework, but if the primary reason for implementing data governance is to improve the quality of your data, perhaps calling it the “Data Quality Team” and “Data Quality Initiative” would fit better? After all, that very much focuses on the outcome of what you’re doing.  It also addresses the question that everybody asks (or should ask) when approached to get involved in data governance of “why are we doing this,” which is usually followed by “what’s in it for me?”

When having these conversations, I explain the initiative in terms of its outcomes (e.g. better quality data which will lead to more efficient ways of working, reduced costs and better customer service). That is a far easier concept to sell than implementing a governance structure, which can sound dull and boring.

Is the name causing confusion?

In the early days of a data governance initiative, the talk is all about designing and implementing a data governance framework. Once this work has been achieved you start designing and implementing processes which have “Data Quality” in their titles:

  • Data Quality Issue Resolution

  • Data Quality Reporting

I have been fortunate enough to work with organizations in the past who have had both a Data Governance Team (supporting the Data Owners and Data Stewards) and a Data Quality Team (responsible for the processes mentioned above), but that is fairly unusual in my experience. It is more common for the Data Governance Team to support the above processes. So it is worth considering whether it would confuse people if they had to report data quality issues to the Data Governance Team.

In summary, I would not want to miss the opportunity to educate more people on what Data Governance really is. But the banner under which it is delivered can be altered to make your data governance implementation both more successful and more sustainable. So if, having considered all the points above in respect of your organization, you want to call it something else, then that is fine with me.

Deciding what to call your initiative is only the first of many things you need to do to make your Data Governance initiative successful. You can download a free checklist of the things you need to do here. (Don't forget this is a high-level summary view, but everyone who attends either my face-to-face or online training gets a copy of the complete detailed checklist which I use when working with my clients.)