Data-diff is a new open source project released by Datafold earlier this week. It validates that data matches across different databases.
It provides a simple CLI for setting up monitoring and alerts, and it can bridge column types that are stored in different formats.
According to the GitHub project page, data-diff can check more than 25 million rows in less than 10 seconds and more than 1 billion rows in 5 minutes, so it remains usable on tables with billions of rows.
It works by breaking the table into smaller segments and then computing a checksum for each segment in both databases. If the checksums for a segment are not equal, it divides that segment into even smaller segments and checksums those in turn, repeating until it finds the rows that differ.
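The segment-and-checksum approach described above can be illustrated with a minimal sketch. It is not data-diff's actual code: the function names, the `threshold` and `bisection_factor` parameters, the in-memory dictionaries standing in for database tables, and the choice of MD5 are all assumptions made for illustration. A real tool would compute the checksums inside each database rather than pulling rows into Python.

```python
import hashlib


def segment_checksum(table, lo, hi):
    """Checksum every row whose integer key falls in [lo, hi)."""
    digest = hashlib.md5()  # arbitrary hash choice for this sketch
    for key in sorted(k for k in table if lo <= k < hi):
        digest.update(f"{key}:{table[key]}\n".encode())
    return digest.hexdigest()


def diff_tables(a, b, lo, hi, threshold=4, bisection_factor=2):
    """Return the keys in [lo, hi) whose rows differ between a and b."""
    # Equal checksums: treat the whole segment as identical and stop.
    if segment_checksum(a, lo, hi) == segment_checksum(b, lo, hi):
        return []
    # Segment is small enough: compare its rows one by one.
    if hi - lo <= max(threshold, 1):
        return [k for k in range(lo, hi) if a.get(k) != b.get(k)]
    # Otherwise split into smaller segments and recurse into each.
    step = -(-(hi - lo) // bisection_factor)  # ceiling division
    diffs = []
    for start in range(lo, hi, step):
        end = min(start + step, hi)
        diffs += diff_tables(a, b, start, end, threshold, bisection_factor)
    return diffs


# Hypothetical data: one changed row and one deleted row in the target.
source = {i: f"row-{i}" for i in range(1000)}
target = dict(source)
target[137] = "changed"
del target[842]

print(diff_tables(source, target, 0, 1000))
```

Because matching segments are discarded after a single checksum comparison, only the segments that contain a difference are ever examined row by row, which is what keeps the work small when the two tables are nearly identical.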
Possible uses highlighted on the project page include validating data migrations, validating data pipelines, alerting on and maintaining data integrity SLOs, debugging complex data pipelines, and building self-healing replication.
“Data-diff meets a need that has not been met before,” said Gleb Mezhanskiy, founder and CEO of Datafold. “Today, every business that owns data replicates it between databases in some way, for example, to consolidate all available data in a data repository or data lake and use it for analytics and machine learning. Data replication at scale is a complex and often error-prone process, and although several vendors and open source tools provide solutions for replication, there were no tools to validate such replication. As a result, engineering teams resorted to manual one-time checks and tedious investigations of discrepancies, and data consumers could not fully trust data replicated from other systems.”
The project is available on GitHub.