January 22, 2026 • Engineering

The Data Audit: How to Spot “Dirty Data” Before You Migrate

Raajshekhar Rajan • ClonePartner Team

The nightmare scenario is always the same.

You spend weeks writing migration scripts. You map every field. You run your tests on a small sample file, and everything looks perfect.

Then, on Go-Live weekend, you run the script on the full production database. It runs for 4 hours… and crashes at 99%.

The culprit? A single row in the Price column containing the text “TBD” instead of a number. Or maybe a duplicate ID in a field that was supposed to be unique.

This is the “Dirty Data” problem. And if you don’t audit your data before you write your scripts, you are building on quicksand.

In this guide, we’ll look at the three “Silent Killers” of migration projects and how to spot them instantly without uploading your sensitive data to the cloud.

Killer #1: The “Placeholder” Date

Legacy systems hate empty fields. Instead of leaving a birthday NULL, old software often auto-fills a “Placeholder Date” to satisfy database constraints.

Common offenders include:

  • 1900-01-01
  • 1970-01-01 (The Unix Epoch)
  • 9999-12-31 (The “Forever” date)

The Risk: If you migrate these values into a modern CRM (like Salesforce or HubSpot), you will trigger thousands of “Happy Birthday” emails to customers whose records claim they were born in 1900.

The Fix: Check the Value Distribution of your date columns. If 40% of your rows have the exact same date, you need to write a transformation rule to convert that specific date to NULL.
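
If you'd rather script this check before writing your transformation rules, here is a minimal pandas sketch. The file name, column name, and the placeholder list are assumptions; swap in the ones from your own export.

```python
# Minimal sketch: spot skewed date distributions and null out known
# placeholder dates. File and column names here are hypothetical.
import pandas as pd

PLACEHOLDER_DATES = {"1900-01-01", "1970-01-01", "9999-12-31"}

df = pd.read_csv("customers.csv", dtype=str)  # keep raw strings while auditing

# Audit: how concentrated is the date column? One value at 40% is a red flag.
print(df["birth_date"].value_counts(normalize=True).head(3))

# Transformation rule: convert the known placeholders to NULL
df["birth_date"] = df["birth_date"].where(
    ~df["birth_date"].isin(PLACEHOLDER_DATES), other=None
)
```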

Killer #2: Low Cardinality in “Unique” Fields

Cardinality refers to the uniqueness of data in a column.

  • High Cardinality: Values are unique (e.g., Social Security Numbers, User IDs).
  • Low Cardinality: Values repeat often (e.g., Country, Status, Gender).

The Risk: You assume the Email column is unique, so you use it as a Primary Key in your new system. But your legacy data contains 50 rows where the email is unknown@example.com or test@test.com.

When you try to import this, your new database will throw a Duplicate Entry Error and reject the entire batch.
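
One way to catch this before Go-Live is to measure cardinality and list the offending duplicates up front. A rough pandas sketch, assuming a contacts.csv with an email column you plan to use as a key:

```python
# Rough sketch of a cardinality audit on a candidate key column.
import pandas as pd

df = pd.read_csv("contacts.csv", dtype=str)

total = len(df)
unique = df["email"].nunique(dropna=True)
print(f"email: {unique} unique values out of {total} rows ({unique / total:.1%})")

# Values that would violate a unique constraint on import
counts = df["email"].value_counts()
print(counts[counts > 1].head(10))  # e.g. unknown@example.com, test@test.com
```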

Killer #3: The “Mixed Type” Column

Excel is notorious for this. You have a column called Phone_Number. It looks like numbers. But row 4,050 contains a note: “Call after 5pm”.

The Risk:

  • SQL Database: A value like “Call after 5pm” will not cast to INT. Crash.
  • JSON API: Sends a string instead of a number, breaking the frontend sorting logic.
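
You can flag these rows before the import ever runs. A hedged sketch in pandas, assuming a Phone_Number column in export.csv; rather than attempting the cast, it simply looks for letters hiding in a supposedly numeric field:

```python
# Sketch: find "mixed type" values, i.e. text notes in a numeric-looking column.
import pandas as pd

df = pd.read_csv("export.csv", dtype=str)

# Any value containing a letter will not survive a cast to a numeric type
bad_rows = df[df["Phone_Number"].str.contains(r"[A-Za-z]", na=False)]

print(f"{len(bad_rows)} suspicious values in Phone_Number")
print(bad_rows["Phone_Number"].head())  # e.g. "Call after 5pm"
```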

Tutorial: How to Audit Data (Privately)

Most data profiling tools are cloud-based. But you can’t just upload your customer’s PII (Personally Identifiable Information) to a random website to check for nulls. That’s a GDPR violation waiting to happen.

We built a Local Data Profiler that runs entirely in your browser. It gives you an X-Ray of your CSV without the data ever leaving your device.

Step 1: Upload & Scan

Drag your CSV export into the Profiler. The tool calculates statistics on the fly using Web Workers.

Step 2: Check “Completeness”

Look at the Null Rate for every column.

  • Scenario: Your target system requires Last Name.
  • Audit: The Profiler shows Last Name is only 85% Complete.
  • Action: You now know you must write a script to fill the missing 15% with a placeholder like “Unknown” (see the sketch below).
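
The same completeness check is easy to reproduce in a script once the audit has told you where to look. A small sketch, assuming a contacts.csv with a last_name column:

```python
# Sketch: per-column completeness, then backfill a required field.
import pandas as pd

df = pd.read_csv("contacts.csv", dtype=str)

completeness = df.notna().mean().sort_values()  # non-null fraction per column
print(completeness)  # e.g. last_name at 0.85 means 15% of rows are missing it

# Action from the audit: fill the gap with an explicit placeholder
df["last_name"] = df["last_name"].fillna("Unknown")
```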

Step 3: Check “Top 5 Values”

Look at the distribution card for your Phone column.

  • Audit: The tool shows the most common value is N/A (appearing 500 times).
  • Action: Add a logic step: IF phone == 'N/A' THEN set NULL.
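
In script form, that rule looks something like the sketch below. The phone column name and the list of sentinel values are assumptions, so extend them with whatever your Top 5 card actually shows.

```python
# Sketch: inspect the most common values, then null out sentinel strings.
import pandas as pd

df = pd.read_csv("contacts.csv", dtype=str)

print(df["phone"].value_counts().head(5))  # e.g. "N/A" appearing 500 times

SENTINELS = {"N/A", "NA", "none", "-"}  # assumed list; extend as needed
df["phone"] = df["phone"].where(~df["phone"].isin(SENTINELS), other=None)
```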

[Try the Enterprise Data Profiler]

  • Spot Nulls & Duplicates
  • Identify “Placeholder” Data
  • 100% Client-Side Privacy

Don’t let a single bad row ruin your Go-Live weekend. Audit first, migrate second.