Enabling the Data Quality feature in the Data Catalog

Project summary

The Data Quality Score lets teams make efficient, meaningful statements about the quality of data produced by the Data Engineering team. It creates a comprehensive view of the current data quality state and helps teams target their efforts where they will improve data quality the most.

Goal

Help users answer the following questions:

● What is the current quality of a particular data set?

● What is the historical quality of a particular data set?

● What is the quality of all of the data sets that a given user cares about?

My role

Interaction designer, UX architect. I worked closely with data engineering leads to better understand the scope of the project and user pain points.

Research

This was my first project in conjunction with a Data Engineering team, so it was important to me to understand the whole data onboarding process, the pipeline steps, and the overall data flow between teams. It was also important to dig deeper into the specific problem: when do users need a data quality score, how do they use it, and how does it affect their productivity within the organization? What methods do they use to calculate a data quality score?

Method used:

  • User interviews (I performed 4 interviews with data engineering leads and 2 interviews with data modeling leads)
  • Card sorting exercise to understand which metrics are the most important for modelers
  • User journey mapping, which I adapted into something closer to data journey mapping

Research Insights

Pain points

  • The approach to measuring data quality is not systematic
  • Measuring data quality is a manual process
  • There are no standardized dimensions for measuring data quality
  • Coverage of data products by data quality measures is very crude
  • Data consumers have no information about passing / failing sanity checks
  • Different teams use different systems to store sanity checks and their statuses
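The last two pain points can be made concrete: because each team stores sanity checks in its own system with its own status vocabulary, a catalog would first need to normalize those statuses into one shared pass/fail vocabulary before displaying them. The sketch below illustrates that idea; the system names and status strings are hypothetical, not the actual integrations.

```python
# Hypothetical sketch: normalizing sanity-check statuses from different
# team systems into a single pass/fail vocabulary for display in a
# data catalog. System names and status strings are illustrative only.

STATUS_MAP = {
    "team_a": {"OK": "pass", "FAILED": "fail"},
    "team_b": {"green": "pass", "red": "fail", "amber": "fail"},
}

def normalize(system: str, raw_status: str) -> str:
    """Map a system-specific status onto 'pass', 'fail', or 'unknown'."""
    return STATUS_MAP.get(system, {}).get(raw_status, "unknown")

# A mix of checks from known and unknown systems.
checks = [("team_a", "OK"), ("team_b", "red"), ("team_c", "done")]
normalized = [normalize(system, status) for system, status in checks]
```

Unknown systems or statuses fall back to "unknown" rather than silently guessing, so a data consumer can see that a check exists even when its state cannot be interpreted yet.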

Research summary

Before jumping into wireframing, I summarized all the research insights into a readable spreadsheet to validate them with stakeholders and users. We identified six main metrics that have to be displayed for users, each supported by a real-world use case.
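One way such metrics could roll up into a single Data Quality Score is a weighted average across standardized dimensions. The sketch below is purely illustrative: the dimension names and weights are assumptions for the example, not the actual six metrics we shipped.

```python
# Hypothetical sketch: a composite Data Quality Score computed as a
# weighted average of per-dimension scores (each in 0.0-1.0), reported
# as a percentage. Dimension names and weights are illustrative only.

DIMENSION_WEIGHTS = {
    "completeness": 0.25,
    "accuracy": 0.25,
    "timeliness": 0.20,
    "consistency": 0.15,
    "validity": 0.10,
    "uniqueness": 0.05,
}

def quality_score(dimension_scores: dict) -> float:
    """Weighted average over the dimensions present, scaled to 0-100."""
    total = sum(
        DIMENSION_WEIGHTS[name] * score
        for name, score in dimension_scores.items()
        if name in DIMENSION_WEIGHTS
    )
    weight = sum(
        w for name, w in DIMENSION_WEIGHTS.items() if name in dimension_scores
    )
    return 100 * total / weight if weight else 0.0

score = quality_score({
    "completeness": 1.0,
    "accuracy": 0.9,
    "timeliness": 0.8,
    "consistency": 1.0,
    "validity": 0.9,
    "uniqueness": 1.0,
})
```

Normalizing by the sum of the weights actually present means a data set missing one dimension still gets a comparable score instead of being penalized for uncollected metrics.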

Design principles

  • Humble: open to feedback, learns over time.
  • Attentive: user-friendly UI, easy to consume.
  • Effortless: minimalistic design that is easy for users to consume.
  • Match the current UI design.
  • Use the Data Catalog Design System.
  • Visualize data whenever it makes sense.

Wireframing

At this stage I create sitemaps, wireframes, mockups, and other visual design assets. One of the most valuable artifacts here is the wireframe: a low-fidelity representation of the product, which can sometimes double as a prototype. Most of the time I will not land on my ideal product after one wireframe sketch, so this is the best time to grab my sketchbook, a pencil, and a gallon of coffee. In the end, the wireframes should convey the overall direction and description of the user interface. The goal was to create a low-fidelity design good enough to communicate the idea to users. It's a very iterative process that requires a lot of collaboration.

High fidelity mockups

Production

After designing, redesigning, and then redesigning some more, it's a good time to start collaborating with the development team to turn my carefully crafted design into a real working product. It's very important to maintain good collaboration, with check-ins and feedback along the way, because UX is really everyone's job and it only gets better with collaboration.

Testing

After the product launched, it was time for testing, analyzing, checking metrics, and gathering additional feedback from users. This is also a good time to ask myself and my team questions like:

Where did our process go right?

Where did we struggle? 

How are the users responding to the product?

Did we solve their issues and pain points?  

With a solid round of reflection, I gain valuable knowledge that I can leverage to make future projects run even more smoothly.

Outcome

The Data Quality Score was initially implemented only for data sets produced by one team; it will be rolled out to the other teams by the end of this year. The score surfaces all the metrics needed to understand the quality of a data set. Data sets with Data Quality Scores saw increased CTR compared to other data sets, and we started promoting them in search results as well as in the browse tree. We were able to solve our main pain points by measuring data set quality in a more robust and systematic way, and we implemented standardized dimensions for measuring data quality. Information about sanity checks is now available in the Data Catalog, so users no longer have to take an extra step to visit the sanity check repository.