From Messy to Meaningful: How to Handle Inconsistent Data in Insurance

In this article, we look at how organisations can combat issues with incomplete or inconsistent data, walk through specific workarounds for the key issues we’ve seen across the insurance industry, and share some insights for data providers on how they can help improve data quality.

MOT test data is widely used across industries to support operational and strategic decisions.

For example, automotive marketplaces rely on this information to assess the condition and value of vehicles listed for sale. Insurance companies use historical MOT data to evaluate risk profiles and identify patterns in vehicle maintenance that may influence insurance costs. Leasing companies depend on accurate MOT records to monitor compliance, plan maintenance schedules, and ensure vehicle safety. Finally, data analytics providers use the dataset to enrich vehicle history reports and to build scoring models that help other businesses make informed decisions. 

However, without confidence in the completeness and consistency of the dataset, data-driven decision-making becomes unreliable, and derived models or insights could mislead both businesses and consumers. 

Issue 1 - Missing MOT Test Data 

Following the release of the new MOT Test API, a comparison with data from the legacy API shows an incomplete overlap between the two sources. Approximately 2% of historical records that were available via the old API are missing from the new dataset. The reverse is also true: the new API contains MOT test records from previous years that were not accessible through the old API before it was switched off.

This indicates that the migration between the two systems was not fully synchronised, and the new dataset is not as complete or consistent as expected.

The Impact:

  • The dataset loses reliability as it doesn’t represent a complete history of MOT tests for all vehicles.
  • Analytical models relying on this data (e.g., for risk assessment or vehicle condition analysis) may produce biased or incomplete results.
  • Users relying on this data for policy analysis, risk modelling, or public transparency cannot confidently compare results over time, since data gaps distort apparent trends.

The Workaround: 

Perform a reconciliation exercise by joining legacy and new MOT data to classify each record as unchanged, changed, missing from the new dataset, or present only in the new dataset. Calculate completeness metrics (overall match rate, missing count, new-only count) and backfill missing records where possible. Add metadata to indicate each record’s source and document the mitigation steps to ensure transparency for all data users.
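
A minimal sketch of such a reconciliation in pandas, assuming both extracts share a unique test identifier. The file names and the columns test_id, completed_date, and test_result are illustrative, not the provider’s actual schema:

```python
import pandas as pd

# Load the two extracts; file names and column names are illustrative.
legacy = pd.read_csv("mot_legacy.csv")
new = pd.read_csv("mot_new.csv")

# Outer join on the test identifier, recording which side each row came from.
merged = legacy.merge(
    new, on="test_id", how="outer", suffixes=("_legacy", "_new"), indicator=True
)

def classify(row):
    """Label each record for the reconciliation report."""
    if row["_merge"] == "left_only":
        return "missing_in_new"
    if row["_merge"] == "right_only":
        return "new_only"
    # Present in both sources: compare the fields we care about.
    same = (
        row["test_result_legacy"] == row["test_result_new"]
        and row["completed_date_legacy"] == row["completed_date_new"]
    )
    return "unchanged" if same else "changed"

merged["status"] = merged.apply(classify, axis=1)

# Completeness metrics for the reconciliation report.
counts = merged["status"].value_counts()
print(counts)
print(f"Overall match rate: {counts.get('unchanged', 0) / len(merged):.2%}")
```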

Report the issue to the data provider via a service request, sharing specific examples of discrepancies.

Opportunities For Collaboration:

  1. Work with the data provider to review and resolve inconsistencies between the public dataset and the official database.
  2. Encourage inclusion of data quality metrics in future dataset releases. This would make the dataset more transparent and easier to trust by allowing its users to assess how complete and consistent the data is before using it. Examples of such metrics, with a computation sketch after this list, could include:
  • completeness - percentage of records with non-null values for key fields such as test complete date, test result, and odometer reading
  • coverage - confirmation that all vehicles and test periods from the previous release are represented in the current release 
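
A hedged sketch of how these two metrics might be computed with pandas. The column names completedDate, testResult, odometerValue, and registration are assumptions about the schema, not the provider’s documented names:

```python
import pandas as pd

current = pd.read_csv("mot_current_release.csv")
previous = pd.read_csv("mot_previous_release.csv")

# Completeness: share of non-null values in each key field.
key_fields = ["completedDate", "testResult", "odometerValue"]
completeness = current[key_fields].notna().mean()
print(completeness.map("{:.2%}".format))

# Coverage: vehicles present in the previous release but absent from the current one.
missing_vehicles = set(previous["registration"]) - set(current["registration"])
print(f"Vehicles missing from the current release: {len(missing_vehicles)}")
```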

Issue 2 - Lack of Data Standardisation Between Regions 

Following the inclusion of Northern Ireland MOT history data in the dataset, several categorical and date fields now contain inconsistent or non-standardised values representing the same concepts in different formats.

For example, in the primaryColour field, Great Britain (GB) records use values such as "Red", "Blue", "Green", while Northern Ireland (NI) records contain uppercase variants: "RED", "BLUE", "GREEN". 

Another example is the expiryDate field, defined in the data provider’s documentation as “Date the MOT test will expire.” In GB data, this field is populated only for passed tests, which is logical since a failed test cannot have an expiry date. However, in NI data the same field is frequently populated with default or incorrect values, such as the placeholder date 1900-01-01, or is set to the previous test’s completion date, both of which conflict with the intended meaning of the field.

The Impact: 

  • This inconsistency causes duplicate categories representing the same concept (e.g., RED = Red).
  • It increases complexity in downstream data analysis.
  • This issue introduces risks of inaccurate aggregation or filtering.

The Workaround: 

Because the mapping layer on the provider’s side doesn’t standardise all categorical attributes, a consumer-side mapping layer should be implemented to normalise values to a unified format and apply the same logic across both regions represented in the dataset.

Example mapping to keep the new values aligned with those used before the dataset update:

  • RED, Red -> Red
  • BLUE, Blue -> Blue
  • GREEN, Green -> Green

Example of a mapping rule for expiry date: for NI records where the MOT test result is failed, map the expiry date to null.
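
A minimal sketch of such a mapping layer in pandas, applying the colour and expiry-date rules above. The field names primaryColour, expiryDate, and testResult follow the article; the region column and the exact encoding of a failed result are assumptions:

```python
import pandas as pd

def normalise(df: pd.DataFrame) -> pd.DataFrame:
    """Apply region-agnostic standardisation rules to the combined GB + NI dataset."""
    out = df.copy()

    # Normalise colour casing so "RED" and "Red" collapse into one category.
    out["primaryColour"] = out["primaryColour"].str.title()

    # NI placeholder dates conflict with the documented meaning of expiryDate.
    out.loc[out["expiryDate"] == "1900-01-01", "expiryDate"] = pd.NA

    # A failed test cannot have an expiry date, mirroring GB behaviour.
    failed_ni = (out["testResult"].str.lower() == "failed") & (out["region"] == "NI")
    out.loc[failed_ni, "expiryDate"] = pd.NA

    return out
```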

While such mappings ensure consistency for analytical purposes, the approach is manual and requires coordination across teams. Without a shared mapping layer, each team or project ends up maintaining its own set of rules, which leads to duplication of effort, inconsistent mappings, and maintenance challenges.

Opportunities For Collaboration:

  1. Introduce a standardised data dictionary with consistent value sets across datasets.
  2. Promote uniform casing and terminology before dataset publication (e.g., lowercase or Title Case). 

Issue 3 - Outliers in the Dataset 

There are several clear outliers when it comes to vehicle registration years. For instance, some vehicles have a registration year listed as 0013, which is incorrect given that the dataset only covers cars, motorcycles, and vans and their MOT tests in Great Britain (England, Scotland, and Wales) since 2005, and Northern Ireland since 2017. 
Additionally, there are unrealistic or invalid values for engine size, such as -1 or 999999, which are not physically possible for real vehicles.

The Impact: 

  • Such outliers can skew analytical results, particularly when calculating statistics like vehicle age distributions or engine capacity averages.
  • They also undermine confidence in the dataset’s overall reliability, especially for downstream consumers who depend on it for business-critical insights and modelling.

The Workaround: 

Apply manual data cleaning rules on the consumer side to identify and handle such records (e.g., excluding registration years earlier than 1950 or setting impossible engine sizes like -1 or 999999 to null).
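
A sketch of such consumer-side cleaning rules in pandas; the thresholds mirror the examples above, and the column names registrationYear and engineSize are illustrative:

```python
import pandas as pd

def clean_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Null out values that cannot occur on real vehicles."""
    out = df.copy()
    current_year = pd.Timestamp.now().year

    # Registration years such as "0013" are treated as data-entry errors.
    bad_year = (out["registrationYear"] < 1950) | (out["registrationYear"] > current_year)
    out.loc[bad_year, "registrationYear"] = pd.NA

    # Engine sizes like -1 or 999999 cc are not physically possible.
    bad_engine = (out["engineSize"] <= 0) | (out["engineSize"] > 10000)
    out.loc[bad_engine, "engineSize"] = pd.NA

    return out
```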

While this approach enables analysis, it shifts the quality control step to data users and is not scalable, as new anomalies may appear with each dataset release.

In practice, data providers should handle fundamental validity checks (e.g., ensuring year and engine size fields fall within logical ranges), while consumers may apply additional, use-case-specific filtering or modelling thresholds.

Opportunities For Collaboration:

  1. Implement validation rules and quality checks before data publication to help prevent inconsistencies.
  2. Replace invalid numeric values with standardised missing-value indicators (e.g., null or 0).
  3. Apply automated anomaly detection to flag potential outliers early in the data pipeline (see the sketch after this list).
  4. Introduce regular data audits to ensure consistency across historical and newly added records.
  5. Collaborate on defining which validation rules should be handled by the provider (logical impossibilities) versus by data consumers (analytical outliers).
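
As one concrete option for point 3, a simple interquartile-range (IQR) check can flag suspicious numeric values before publication. This is just one technique among many; the column name engineSize is illustrative:

```python
import pandas as pd

def flag_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside the Tukey fences."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

df = pd.read_csv("mot_release.csv")
suspect = df[flag_iqr_outliers(df["engineSize"])]
print(f"{len(suspect)} engine-size values flagged for review")
```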

Issue 4 - Inconsistent Documentation 

The official documentation for the MOT dataset states that the test completed date and test result attributes should always be populated with non-null values; however, some records contain nulls.

According to the documentation, the odometer unit has three valid values: “mi”, “km”, and null. Querying the data, however, reveals a fourth value: “Unknown”.
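
A quick check like the following makes such mismatches visible. The field name odometerUnit is an assumption about the schema; the documented value set follows the text above:

```python
import pandas as pd

df = pd.read_csv("mot_release.csv")

# Values the documentation declares valid (null is handled separately via dropna).
documented = {"mi", "km"}

observed = set(df["odometerUnit"].dropna().unique())
print(f"Undocumented odometer units: {observed - documented}")  # e.g. {'Unknown'}
```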

The Impact: 

  • The mismatch between the dataset and its documentation makes it difficult for users to correctly interpret certain attributes. This inconsistency can lead to inaccurate assumptions, misclassification of records, and potential issues in data pipelines or analytical models that rely on documented metadata.
  • It also increases onboarding time for new users who need to explore the data manually to confirm what is included.

The Workaround: 

The dataset must be manually reviewed in detail to gain a complete understanding of its structure and content before accurate documentation can be produced. This approach is time-consuming, as it needs to be repeated after each data update. As the dataset continues to evolve, maintaining accurate documentation becomes difficult without a reliable source of reference.
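
Parts of this review can be automated. A profiling script run after every release can regenerate the basic facts that the documentation should reflect, such as null rates and observed value sets. A minimal sketch, assuming a flat tabular extract:

```python
import pandas as pd

df = pd.read_csv("mot_release.csv")

# One row per column: data type, null rate, and cardinality.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_pct": df.isna().mean().map("{:.2%}".format),
    "distinct_values": df.nunique(),
})
print(profile)

# For low-cardinality fields, list the actual values so documented value sets can be verified.
for col in df.columns:
    if df[col].nunique() <= 10:
        print(col, sorted(df[col].dropna().astype(str).unique()))
```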

Opportunities For Collaboration:

  1. Update the dataset documentation so that it accurately describes the data as published.
  2. Version-control the documentation so that changes in data are traceable for all consumers. While minor documentation updates (e.g., clarifications, typo fixes) can have their own sub-versions, every major API or dataset release should be accompanied by an updated, version-matched documentation set.

Addressing inconsistent and incomplete data is essential for insurers to maintain reliable analytics and risk assessments. By applying these workarounds and collaborating with data providers, the industry can produce more accurate and actionable insights.

Struggling with data quality challenges? Whether you’re tackling inconsistent datasets or preparing your organisation for the next wave of analytics hurdles, we can help.

Author

Darya Zelenevskaya

Senior Data Analyst – People Lead

Darya is a Senior Data Analyst with more than 14 years of experience in IT, specialising in transforming complex data into actionable insights that drive business performance. With a passion for helping businesses make informed decisions, she works to bridge the gap between data and strategic goals.