Do you need Data Governance over a Data Lake?
There continues to be a lot of excitement about data lakes and the possibilities that they offer, particularly with analytics, data visualizations, AI and machine learning. As such, I’m increasingly being asked whether you really need Data Governance over a data lake. After all, a data lake is a centralised repository that allows you to store all your structured and unstructured data on a scalable basis.
Unlike a data warehouse, in a data lake you can store your data as-is without having to structure it first. This has resulted in many organisations “dumping” lots of data into data lakes in an uncontrolled and thoughtless manner. The result is what many people are calling “Data Swamps” which have not provided the amazing insights they hoped for.
So the simple answer to the question is yes: you do need Data Governance over data lakes to prevent them from becoming data swamps that users don’t access because they don’t know what data is there, they can’t find it, or they just don’t trust it. If you have Data Governance in place over your data lake, then you and your users can be confident that it contains clean data which can be found and used appropriately.
But I don’t expect you to just take my word for it; let’s have a look at some of the reasons why you want to implement Data Governance on data being ingested into your data lake:
Data Owners Are Agreed
Data Owners should be approving whether the data they own is appropriate to be loaded to the Data Lake e.g. is it sensitive data, should it be anonymised before loading?
In addition, users of the data lake need to know who to contact if they have any questions about the data and what it can or can’t be used for.
Data Definitions
Whilst data definitions are desirable in all situations, they are even more necessary for data lakes. In the absence of definitions, users of data in more structured databases can use the context of that data to glean some idea of what the data may be. As a data lake is by its nature unstructured, there is no such context.
A lack of data definitions means that users may not be able to find or understand the data, or may use the wrong data for their analysis. A data lake could provide a ready source of data, but a lack of understanding about it means that it cannot be used quickly and easily. This means that opportunities are missed and use of the data lake ends up confined to a small number of expert users.
Data Quality Standards
Data Quality Standards enable you to monitor and report on the quality of the data held in the data lake. While you do not always need perfect data when analysing high volumes, users do need to be aware of the quality of the data. Without standards (and the ability to monitor against them) it will be impossible for users to know whether the data is good enough for their analysis.
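To make the idea of monitoring against a standard a little more concrete, here is a minimal sketch in Python. It assumes a very simple standard (a minimum completeness percentage per field); the field names, thresholds and function names are all hypothetical and only there for illustration. Real standards would normally cover much more than completeness.

```python
# Hypothetical quality standard: the minimum share of records in which
# each field must be populated. Field names and thresholds are made up.
STANDARDS = {"customer_id": 1.0, "date_of_birth": 0.9}


def completeness(records, field):
    """Return the share of records where the field is present and not empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def quality_report(records, standards=STANDARDS):
    """Compare measured completeness with the agreed threshold for each field."""
    report = {}
    for field, threshold in standards.items():
        score = completeness(records, field)
        report[field] = {
            "score": score,
            "threshold": threshold,
            "meets_standard": score >= threshold,
        }
    return report


# Illustrative sample only, not real data.
sample = [
    {"customer_id": "C1", "date_of_birth": "1980-05-01"},
    {"customer_id": "C2", "date_of_birth": None},
    {"customer_id": "C3", "date_of_birth": "1992-11-23"},
    {"customer_id": "C4", "date_of_birth": ""},
]

report = quality_report(sample)
for field, result in report.items():
    print(field, result)
```

A report like this does not fix anything by itself, but it gives users of the data lake what they need to decide whether the data is good enough for their particular analysis.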
Data Cleansing
Any data cleansing done in an automated manner inside the data lake needs to be agreed with Data Owners and Data Consumers. This is to ensure that all such actions comply with the definitions and standards, and that they do not make the data unusable for certain analysis purposes. For example, defaulting missing dates of birth to an agreed date could skew an analysis that involved looking at the ages of customers.
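The date of birth example is easy to demonstrate with a few lines of Python. This is only a sketch: the default date, the fixed "as of" date and the handful of customers are all assumptions I have made up to show the effect.

```python
from datetime import date
from statistics import mean

AS_OF = date(2024, 1, 1)        # fixed reference date so the result is reproducible
DEFAULT_DOB = date(1900, 1, 1)  # hypothetical "agreed" default for missing values


def age_on(dob, as_of=AS_OF):
    """Age in whole years on the reference date."""
    had_birthday = (as_of.month, as_of.day) >= (dob.month, dob.day)
    return as_of.year - dob.year - (0 if had_birthday else 1)


# Illustrative customers only; None marks a missing date of birth.
dobs = [date(1980, 1, 1), date(1990, 1, 1), None, date(2000, 1, 1)]

# Option 1: leave missing values out of the analysis.
known_ages = [age_on(d) for d in dobs if d is not None]

# Option 2: automated cleansing replaces missing values with the default.
defaulted_ages = [age_on(d if d is not None else DEFAULT_DOB) for d in dobs]

mean_known = mean(known_ages)
mean_defaulted = mean(defaulted_ages)

print("Average age, missing excluded:", mean_known)
print("Average age, missing defaulted:", mean_defaulted)
```

With these made-up values, a single defaulted record moves the average age from 34 to 56.5, which is exactly the kind of side effect that Data Owners and Data Consumers need to agree on before the rule is automated.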
Data Quality Issue Resolution
While there may be some cases where automated data cleansing inside the data lake may be appropriate, all identified data quality issues in the data lake should be managed through the existing process to ensure that the most appropriate solution is agreed by the Data Owner and the Data Consumers.
Data Lineage
Having data flows documented is always valuable, but in order to meet certain regulatory requirements (including EU GDPR), organisations need to prove that they know where data is and how it flows throughout their company.
Data lineage diagrams are one of the key data governance deliverables. Critical or sensitive data being ingested into the data lake should be documented on data flow diagrams. This will add to the understanding of the Data Consumers by highlighting the source of that data. Such documentation also helps prevent duplicate data being loaded into the data lake in the future.
I hope I have convinced you that if you want a data lake to support your business decisions, then Data Governance is absolutely critical. It may not need to be as granular as the definitions and documentation that you would put in place for a data warehouse, but it is needed to ensure that you create and maintain a data lake and not a data swamp!
Ingesting data into data lakes without first understanding that data is just one of many data governance mistakes that are often made. You can find out the most common mistakes and, more importantly, how to avoid them by downloading my free report here.