This is Article 19 of the Qlik Sense Data Modeling Course.
Qlik Sense Course – Article 19 of 28
← Previous Article: Flag-Based Modeling
→ Next Article: Error Handling
What is a Data Quality Framework? A systematic approach to validating data quality at EVERY level of the ETL process – from staging through transform to the final data model. Early detection minimizes the effort required for later error correction.
What Will You Learn About the Data Quality Framework?
After this guide, you will be able to:
- Implement and automate Quality Gates on all ETL layers.
- Use Quality Flags for automatic data validation and for **performant** analyses. (Topic of the previous article)
- Build a Quality Dashboard for transparent monitoring.
- Detect critical errors using thresholds and variables.
Time investment: 25 min reading + 4 hrs hands-on
Quick Win: In 20 minutes you’ll have your first quality checks implemented
How Do Quality Gates Work at Each Level?
The problem: Faulty data spreads unnoticed through all layers and corrupts business decisions! Fixing an error in the application is 10x more expensive than fixing it in the load script.
The solution: Quality Gates after EVERY processing step – detect and fix problems early! This three-layer model is the foundation for all robust Qlik architectures.
The three validation layers:
| Layer | Validation Purpose | Examples |
|---|---|---|
| 1. Staging | Structural integrity of the import | File loaded? Row count? Data types? (Primary: Loading Data) |
| 2. Transform | Compliance with business rules | Required fields? Valid values? Duplicates? |
| 3. Model | Relationships & model hygiene | No Synthetic Keys? Referential integrity? |
What Are the 6 Data Quality Dimensions?
Data quality can be divided into six measurable dimensions – in the Qlik script, each dimension represents a group of `If()` checks and flags. These dimensions are defined by the DAMA Data Management Body of Knowledge, the authoritative reference for data quality standards.
How Is Completeness Evaluated?
Are all expected data present (no NULL values in required fields)?
// Check: Required fields populated?
If(IsNull(CustomerID) OR Len(Trim(CustomerID)) = 0, 1, 0) as DQ_MissingCustomerID
How Is Accuracy Ensured?
Are the values correct and within valid ranges (i.e. do they not contradict reality)?
// Check: Amount in reasonable range?
If(Amount <= 0 OR Amount > 1000000, 1, 0) as DQ_InvalidAmount
How Important Is Consistency?
Are values logically consistent across different fields (e.g. totals match)?
// Check: ShipDate not before OrderDate?
If(ShipDate < OrderDate, 1, 0) as DQ_InvalidDateOrder
What Is Validity?
Do values match the expected format (e.g. numeric, date, defined list)?
// Check: Email contains @ and .?
// (Match() compares exact strings - wildcard matching needs WildMatch())
If(NOT WildMatch(Email, '*@*.*'), 1, 0) as DQ_InvalidEmail
What Is Uniqueness?
Are primary keys truly unique with no duplicates?
// Check: Duplicates by OrderID?
// (Qlik script has no HAVING - filter the aggregation in a preceding load)
LOAD * WHERE DuplicateCount > 1;
LOAD OrderID, Count(OrderID) as DuplicateCount
RESIDENT STG_Sales
GROUP BY OrderID;
How Important Is Timeliness?
Is the data current enough for analysis (e.g. load date vs. today)?
// Check: Last update more than 24h ago?
If(Today() - LoadDate > 1, 1, 0) as DQ_DataStale
How to Implement Structural Validation in the Staging Layer?
In the staging layer, right after loading raw data, you check the basic integrity of the import.
// Load raw data
STG_Sales:
LOAD
OrderID,
CustomerID,
ProductID,
OrderDate,
Amount,
Quantity
FROM
(ooxml, embedded labels);
// Quality Check 1: Was the file loaded at all?
LET vRowCount = NoOfRows('STG_Sales');
TRACE [QC] Staging: Loaded $(vRowCount) rows;
IF $(vRowCount) = 0 THEN
    TRACE [QC] Staging: No data loaded! Stopping script.;
    SET vQuality_Staging = 'FAIL';
    EXIT SCRIPT;
ELSE
    SET vQuality_Staging = 'PASS';
ENDIF
Explanation:
- NoOfRows(): Counts records in the loaded table.
- EXIT SCRIPT: Stops on critical error (e.g. no data loaded) – this is part of Error Handling.
How Do I Check the Row Count Range?
Check whether the record count is within the expected range to detect export errors or data corruption early.
// Define expected range
LET vExpectedMin = 1000;
LET vExpectedMax = 1000000;
IF $(vRowCount) < $(vExpectedMin) OR $(vRowCount) > $(vExpectedMax) THEN
TRACE Staging: Row count outside expected range! Check the export.;
SET vQuality_Staging_RowCount = 'WARN';
ELSE
SET vQuality_Staging_RowCount = 'PASS';
ENDIF
How Does Data Type Validation Work?
Data type validation is essential. If a column that should be numeric (e.g. `Amount`) is loaded as text, all aggregations will fail later.
// Check data types
Temp_DataTypeCheck:
LOAD
Count(*) as TotalRows,
Sum(If(NOT IsNum(Amount), 1, 0)) as AmountNotNumeric,
Sum(If(NOT IsNum(Quantity), 1, 0)) as QuantityNotNumeric
RESIDENT STG_Sales;
LET vAmountNotNumeric = Peek('AmountNotNumeric', 0, 'Temp_DataTypeCheck');
TRACE [QC] Staging: Amount NOT numeric in $(vAmountNotNumeric) rows;
DROP TABLE Temp_DataTypeCheck;
Explanation: `IsNum()` checks whether the value is numeric. If `AmountNotNumeric` is greater than 0, you need to correct the load statement.
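"Correcting the load statement" typically means forcing number interpretation with `Num#()`. A minimal sketch, assuming the source delivers German-formatted numbers (the format codes and the `STG_Sales_Typed` table name are assumptions, not from the original):

// Reinterpret text values as proper numbers (format codes are assumptions)
STG_Sales_Typed:
NOCONCATENATE
LOAD
    OrderID,
    CustomerID,
    ProductID,
    OrderDate,
    Num#(Amount, '#.##0,00', ',', '.') as Amount,
    Num#(Quantity, '###0') as Quantity
RESIDENT STG_Sales;
DROP TABLE STG_Sales;
RENAME TABLE STG_Sales_Typed TO STG_Sales;

After this step, re-run the `IsNum()` check; `AmountNotNumeric` should drop to 0 if the format codes match the source.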
How to implement business rules and quality flags in Qlik Sense?
In the transform layer, you validate the logic and create the actual **Quality Flags**. This is the most performant method for error marking (see Flag-Based Modeling).
TRF_Sales:
LOAD
OrderID,
CustomerID,
ProductID,
Date(OrderDate) as OrderDate,
Amount,
Quantity,
// Quality Flags - Completeness
If(IsNull(CustomerID) OR Len(Trim(CustomerID)) = 0, 1, 0)
as DQ_MissingCustomerID,
If(IsNull(ProductID) OR Len(Trim(ProductID)) = 0, 1, 0)
as DQ_MissingProductID,
// Quality Flags - Range Validation
If(Amount <= 0, 1, 0) as DQ_ZeroOrNegativeAmount,
If(Quantity <= 0, 1, 0) as DQ_InvalidQuantity,
// Quality Flags - Consistency
If(OrderDate > Today(), 1, 0) as DQ_FutureOrderDate,
// Composite Quality Flag: Marks every faulty record
If(IsNull(CustomerID) OR Amount <= 0 OR OrderDate > Today() OR Quantity <= 0,
1, 0) as DQ_HasAnyIssue
FROM (qvd);
Advantages of Quality Flags: Instead of deleting faulty records, you mark them. Business users can then use Set Analysis **in the frontend** (e.g. Sum({<DQ_HasAnyIssue={'0'}>} Amount)) to analyze exclusively the “good” data, while the faulty records are preserved for root cause analysis (error debugging). In addition, the Qlik ErrorMode system variable lets you control script behavior when quality gates fail.
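In chart measures, the flags translate directly into KPIs – a small sketch (the field names follow the example above; the measure selection is illustrative):

// KPI: revenue from clean records only
Sum({<DQ_HasAnyIssue = {'0'}>} Amount)

// KPI: quality score as a share of clean records
1 - Sum(DQ_HasAnyIssue) / Count(OrderID)

// KPI: number of records flagged for review
Sum(DQ_HasAnyIssue)

Because the flags are precomputed 0/1 fields, these expressions stay fast even on large models – the core idea of Flag-Based Modeling.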
How Do You Calculate the Quality Summary?
Summarize the error counts to monitor the health of the data.
Quality_Transform:
LOAD
'TRF_Sales' as TableName,
Count(*) as TotalRecords,
Sum(DQ_HasAnyIssue) as RecordsWithIssues,
1 - (Sum(DQ_HasAnyIssue) / Count(*)) as QualityScore
RESIDENT TRF_Sales;
// Log quality score
LET vQualityScore = Peek('QualityScore', 0, 'Quality_Transform');
TRACE [QC] Transform: Quality Score = $(vQualityScore);
// Threshold Check
IF $(vQualityScore) < 0.95 THEN
    TRACE [QC] Transform: Quality Score below 95% threshold!;
    SET vQuality_Transform = 'ALERT';
ELSE
    SET vQuality_Transform = 'PASS';
ENDIF
Explanation: `QualityScore` is 1 minus the error rate, giving you the share of “good” records. When the defined threshold (here 95%) is not met, an **ALERT** is triggered.
How is referential integrity enforced in the model layer?
In the final model, the relationship between fact and dimension tables must be validated. This is primarily about avoiding *orphaned records*. This is closely related to avoiding Synthetic Keys.
// Load dimensions (keep a renamed copy of each key for the Exists() check)
Dim_Customer:
LOAD CustomerID, CustomerID as DimCustomerID, CustomerName
FROM (qvd);
// Dim_Product is loaded the same way, with ProductID as DimProductID
// Facts with Referential Integrity Check
TRF_Sales:
LOAD
    OrderID,
    CustomerID,
    ProductID,
    Amount,
    // Does CustomerID exist in Dim_Customer?
    If(NOT Exists(DimCustomerID, CustomerID), 1, 0) as DQ_OrphanedCustomer,
    // Does ProductID exist in Dim_Product?
    If(NOT Exists(DimProductID, ProductID), 1, 0) as DQ_OrphanedProduct
FROM (qvd);
// Remove the helper fields so they don't clutter the final model
DROP FIELDS DimCustomerID, DimProductID;
Explanation:
- Exists(DimCustomerID, CustomerID): Checks whether the current record’s `CustomerID` occurs among the dimension keys loaded into `DimCustomerID`.
- Why the renamed copy? A plain `Exists(CustomerID, CustomerID)` also “sees” the `CustomerID` values of fact rows loaded earlier in the same LOAD, so a repeated orphan key would only be flagged on its first occurrence.
- DQ_OrphanedCustomer: 1 when the `CustomerID` exists in the fact table but **no** matching counterpart is found in the dimension.
Why is this important? Orphaned records cause NULL associations, which can lead to missing values in your dimensions and **incorrect aggregations**.
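To make orphans visible during the reload, the flags can be aggregated and written to the log – a minimal sketch (the temp table and variable names are illustrative, not from the original):

// Count orphaned records and write the result to the reload log
Temp_OrphanCheck:
LOAD
    Sum(DQ_OrphanedCustomer) as OrphanedCustomers,
    Sum(DQ_OrphanedProduct) as OrphanedProducts
RESIDENT TRF_Sales;
LET vOrphanedCustomers = Peek('OrphanedCustomers', 0, 'Temp_OrphanCheck');
LET vOrphanedProducts = Peek('OrphanedProducts', 0, 'Temp_OrphanCheck');
TRACE [QC] Model: $(vOrphanedCustomers) orphaned customers, $(vOrphanedProducts) orphaned products;
DROP TABLE Temp_OrphanCheck;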
In production environments, quality gates are most effective when embedded in a three-stage ETL architecture, where each layer – staging, transform, and model – validates data at the appropriate level.
How Does Duplicate Detection Work?
Duplicates are one of the most common causes of unreliable metrics. Identify them before processing the data further.
// Identify duplicates (by OrderID)
Temp_DuplicateCheck:
LOAD
OrderID,
Count(*) as DuplicateCount
RESIDENT STG_Sales
GROUP BY OrderID;
// Add duplicate flag
LEFT JOIN (STG_Sales)
LOAD
OrderID,
If(DuplicateCount > 1, 1, 0) as DQ_IsDuplicate,
DuplicateCount
RESIDENT Temp_DuplicateCheck;
DROP TABLE Temp_DuplicateCheck;
// Log duplicates - aggregation functions like Sum() cannot be used
// directly in LET, so aggregate into a temp table first
Temp_DupCount:
LOAD Sum(DQ_IsDuplicate) as TotalDuplicates
RESIDENT STG_Sales;
LET vDuplicateCount = Peek('TotalDuplicates', 0, 'Temp_DupCount');
DROP TABLE Temp_DupCount;
TRACE [QC] Duplicates: Found $(vDuplicateCount) duplicate records;
How Does the Deduplication Strategy Work?
Choose a strategy: the best results usually come from keeping the “newest” record per key:
// Strategy: keep the newest record per OrderID
TRF_Sales_Latest:
NOCONCATENATE
LOAD *, OrderID as OrderID_Seen
RESIDENT STG_Sales
WHERE NOT Exists(OrderID_Seen, OrderID)
ORDER BY OrderID, OrderDate DESC;
DROP FIELD OrderID_Seen;
Explanation: The resident table is read sorted by `OrderID` and descending `OrderDate`; the `Exists()` check lets only the first – and therefore newest – record per `OrderID` through, because every accepted record adds its key to `OrderID_Seen`. (Aggregation functions such as `FirstSortedValue()` cannot be used row by row in a WHERE clause, which is why this sort-and-Exists pattern is needed.) This type of deduplication is essential for invoices and orders.
What Is Cross-Field Validation?
Use consistency checks to uncover logical errors between related fields.
TRF_Orders:
LOAD
OrderID,
Date(OrderDate) as OrderDate,
Date(ShipDate) as ShipDate,
TotalAmount,
// Cross-Field Consistency Flags
If(ShipDate < OrderDate, 1, 0) as DQ_ShipBeforeOrder,
If(Abs(TotalAmount - (OrderAmount + ShippingCost)) > 0.01, 1, 0)
as DQ_TotalMismatch,
//... additional checks
FROM (qvd);
How Do I Integrate the Quality Dashboard and Set Up Alerting?
Collect quality metrics over time (historization) and use Qlik to react automatically to problems.
// Summarize and historize current load quality
//... (Summary logic)...
// Append to historical log
CONCATENATE(Quality_History)
LOAD * RESIDENT Quality_Summary;
// Save historical data
STORE Quality_History INTO (qvd);
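Note that the CONCATENATE above assumes `Quality_History` already exists – on the very first run it does not. A sketch of a first-run guard (the library path `lib://DataFiles/...` is a placeholder assumption):

// Hypothetical path - adapt to your environment
LET vHistoryQvd = 'lib://DataFiles/Quality_History.qvd';
IF NOT IsNull(FileSize('$(vHistoryQvd)')) THEN
    // History exists: load it and append the current run
    Quality_History:
    LOAD * FROM [$(vHistoryQvd)] (qvd);
    CONCATENATE (Quality_History)
    LOAD * RESIDENT Quality_Summary;
ELSE
    // First run: the history starts with the current summary
    Quality_History:
    LOAD * RESIDENT Quality_Summary;
ENDIF
STORE Quality_History INTO [$(vHistoryQvd)] (qvd);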
How Does Automated Alerting with Thresholds Work?
We use the calculated `vQualityScore` from Layer 2 to send an automated notification if quality falls below a threshold.
// Define thresholds
LET vQualityThreshold = 0.95; // 95% quality required
// Calculate quality score
//... (Logic to calculate vQualityScore)...
// Alert Logic
IF $(vQualityScore) < $(vQualityThreshold) THEN
    TRACE Quality Score $(vQualityScore) below threshold!;
    // Trigger automated action
    // Here e.g. Qlik Automate can send an email:
    // EXECUTE;
    TRACE See Qlik Automate Email notifications for setup.;
ENDIF
What Are the Best Practices for the Data Quality Framework?
The following best practices ensure your framework remains efficient and maintainable.
Layered Validation:
- [ ] Staging Layer: Structural checks (File loaded? Row count?)
- [ ] Transform Layer: Business rules (Required fields? Ranges?)
- [ ] Model Layer: Relationships (Data model problems, No orphans?)
Quality Flags:
- [ ] Mark faulty records, DON’T delete them
- [ ] Clear naming convention (DQ_ prefix)
- [ ] Composite flags for “HasAnyIssue”
Quality Logging & Performance:
- [ ] Store quality metrics in QVD (historical data)
- [ ] Quality dashboard for monitoring
- [ ] Apply performance strategies before critical checks
How Does Performance Affect Quality?
Quality checks have a performance overhead (approx. +11% to +55% additional load time) – but the cost of *incorrect* business decisions far exceeds this effort. It’s a necessary trade-off.
| Scenario | Load Time (1M Records) | Overhead |
|---|---|---|
| Without Quality Checks | 45 sec | – |
| With Staging Checks | 50 sec | +11% |
| With Transform Checks | 57 sec | +27% |
| With All Checks | 70 sec | +55% |
Conclusion: +55% load time, but 100% confidence in data quality – a good trade-off! Ideally, keep checks at the transform layer to keep staging performance as fast as possible.
How to troubleshoot Qlik Sense Data Quality Framework?
Problem: Quality checks slow down loads too much
Solution 1: Selective Checks
Don’t check ALL records, just a sample:
// 10% sample for fast checks
WHERE Rand() < 0.1
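A complete sketch of such a sampled check (the 10% rate and the `Temp_SampleCheck` name are illustrative; `Rand()` is evaluated per record, so the sample size varies slightly between runs):

// Run an expensive completeness check on roughly 10% of the rows
Temp_SampleCheck:
LOAD
    Count(OrderID) as SampledRows,
    Sum(If(IsNull(CustomerID) OR Len(Trim(CustomerID)) = 0, 1, 0)) as SampledMissingIDs
RESIDENT STG_Sales
WHERE Rand() < 0.1;
LET vSampledMissing = Peek('SampledMissingIDs', 0, 'Temp_SampleCheck');
TRACE [QC] Sample: $(vSampledMissing) missing CustomerIDs in ~10% sample;
DROP TABLE Temp_SampleCheck;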
Solution 2: Parallel Processing
Use separate QVDs for quality checks – load main data in parallel:
// Main Load
STORE TRF_Sales INTO (qvd);
// Quality Checks in separate job
// Load TRF_Sales.qvd and run checks
Problem: Too many quality flags make the model confusing
Solution: Composite Flags
Combine multiple checks into one flag (in a later RESIDENT or preceding load, since the individual flags must already exist):
If(DQ_MissingCustomerID = 1 OR
   DQ_ZeroOrNegativeAmount = 1 OR
   DQ_FutureOrderDate = 1,
   1, 0) as DQ_HasAnyIssue
Problem: Quality alerts fire too frequently
Solution: Realistic Thresholds
100% quality is unrealistic – set practical thresholds:
// Only alert on actual degradation
IF $(vQualityScore) < 0.95 THEN
// Alert
ENDIF
What Are the Next Steps?
You can now implement systematic data quality gates! Next:
1. Error Handling: What happens when quality checks fail and the script needs to stop? Error Handling shows you how to catch critical errors and terminate the script in a controlled manner.
2. Performance Optimization: To minimize the performance overhead, you need to further optimize the load times of your applications.
How to implement quality gates in an ETL pipeline?
- Error Handling – What to do when quality checks fail?
- Flag-Based Modeling – Flags for ultimate performance and simple Set Analysis