Wednesday, November 18, 2009

Merge-Match processing – Matching Rule breakdown…

As I stated in a previous blog entry, I would document all of the matching rule and logic structures that I know of and have encountered in the past 20 years or so of my career. Most of these terms and techniques are common to data warehousing. In-fact I have made this list a lot shorter than I was planning and have just kept to the basics for now. Yes, I expect to implement all of the basics below in the API I am creating.

the Structure is simple, all logic used to define a match is based on an archetypical logic structure, either simple logic or conditional. Then you have  the scoring and comparison scoring mechanisms that are used to match disparate data into logic matched items. I am not documenting here how these are done or implemented in code just yet.

Simple Matching Rules

  • Match All             Matches all rows within a match group
  • Match None         Turns off matching.

Conditional Matching Rules

  • Conditional match rules specify the conditions under which records match.
  • A conditional matching rule allows you to combine multiple attribute comparisons. When more than one attribute is involved in a rule, two records are considered to be a match only if all comparisons are potential match’s.

Comparison Scoring that can be used to indicate a match

Similarity Scoring 

  • A minimum similarity score required for two pieces of data which is used as the basis of a potential match. For example: A value of 10 indicates an exact match, and a value of 0 indicates a non-match. There are many ways of calculating this and as always there are great examples and ideas on the best measures all over the web.

Blank Scoring

  • A way of dealing with empty/null values in matching, this may involve automatically treating them as valid matches or as invalid ones. -  also a number value maybe given if part of an elaborate scoring system.

Comparison Scoring

  • Each attribute in a conditional match rule is assigned a comparison algorithm, which specifies how the attribute values are compared. Multiple attributes may be compared in one rule with a separate comparison algorithm selected for each. The real power of complex matching is in the Conditional rule aspect of matching.
  • Example types of Comparison could be:
    • Exact Match – The most common and easiest. Attributes match if their values are exactly the same. For example, "Dog" and "dog!" would not match, because the second string is not capitalized and contains an extra character. Some systems hash this value others just mark up a score in a related table. I aim to support both methods. This type of matching is valid for all data types. 
    • Soundex Comparison - Converts the data to a Soundex representation and then compares the text strings. If the Soundex representations match, then the two attribute values are a potential match. Not a very good way to do things to be honest, and pretty useless in any language other than English.
    • Abbreviation/Acronym Comparison – Quite simply a lookup, these are very domain and data source specific so care should be used. For example, "International Business Machines" would match "IBM", “Management” would match “Mgmt.”, and so on...

No comments: