<objectTemplate>
...
<correlation>
<correlators>
<items>
<item>
<ref>familyName</ref>
<search>
<fuzzy>
<levenshtein>
<threshold>3</threshold>
</levenshtein>
</fuzzy>
</search>
</item>
</items>
</correlators>
</correlation>
</objectTemplate>
Fuzzy Searching
Since 4.6
This functionality is available since version 4.6.
|
Introduction
This feature is available only when using the native repository implementation. |
For an introduction, please see Fuzzy Searching section in the overview document.
Fuzzy Search Methods
Currently, there are two methods available:
Method | Description |
---|---|
Levenshtein edit distance |
Matches according to the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other. (From wikipedia.) |
Trigram similarity |
Matches using the ratio of common trigrams to all trigrams in compared strings.
(See PostgreSQL documentation on |
Using in Correlation
Specification of the Filters
The fuzzy searching filters are specified in search/fuzzy
configuration item.
Let us have look at an example that searches for users having family name within the Levenshtein distance to the provided one of at most 3.
There are the following options available:
Property | Description | Default |
---|---|---|
|
Upper limit on the edit distance to be matched. |
Must be specified. |
|
Is the value of "threshold" meant to be inclusive? |
|
Property | Description | Default |
---|---|---|
|
Lower limit on the similarity to be matched. |
Must be specified. |
|
Is the value of "threshold" meant to be inclusive? |
|
Confidence Values
When using fuzzy search, not all search results are equally relevant. Typically, the higher Levenshtein edit distance, the lower confidence we have in the particular match. On the other hand, the higher trigram similarity value, the higher confidence.
Therefore, midPoint allows to specify a transformation from the fuzzy string metric (edit distance or similarity value) to the confidence value of (0, 1].
There are reasonable expressions that are used by default.
For Levenshtein edit distance, it is 1/(d+1)
, where d
is the value of the distance.
It works like this:
Edit distance | Resulting confidence |
---|---|
0 (exact match) |
1.0 |
1 |
0.5 |
2 |
0.333 |
3 |
0.25 |
4 |
0.2 |
… |
… |
For trigram similarity, it is simply the value of the similarity itself.
The default formulas may or may not fit your needs. You can provide any expression to do the computation. For example, the following code will map distances of 0, 1, 2, and 3 to the confidence values of 1.0, 0.9, 0.8, and 0.7, respectively.
<objectTemplate>
...
<correlation>
<correlators>
<items>
<item>
<ref>familyName</ref>
<search>
<fuzzy>
<levenshtein>
<threshold>3</threshold>
</levenshtein>
</fuzzy>
<confidence>
<expression>
<script>
<code>[1.0, 0.9, 0.8, 0.7][input]</code>
</script>
</expression>
</confidence>
</search>
</item>
</items>
</correlators>
</correlation>
</objectTemplate>
Multiple Correlation Items
If there are multiple correlation items in given correlation rule, their confidences are multiplied. For example:
<objectTemplate>
...
<correlation>
<correlators>
<items>
<item>
<ref>givenName</ref>
<search>
<fuzzy>
<similarity>
<threshold>0.5</threshold>
</similarity>
</fuzzy>
</search>
</item>
<item>
<ref>familyName</ref>
<search>
<fuzzy>
<levenshtein>
<threshold>3</threshold>
</levenshtein>
</fuzzy>
</search>
</item>
</items>
</correlators>
</correlation>
</objectTemplate>
(Here we use the default formulas for confidence values.)
Now, let us assume that a correlation candidate has a given name with the similarity of 0.8, and the family name with an edit distance of 1. Its overall confidence is then computed as:
Property | Fuzzy search metric value | Confidence factor |
---|---|---|
|
0.8 |
0.8 |
|
1 |
0.5 |
Overall confidence |
0.4 (= 0.8 x 0.5) |
The overall confidence value may be later scaled using a custom rule weight when rules are composed together, as described in rule composition document. |
Using in Filters
The use of fuzzy matching outside correlation is highly experimental.
In particular, matching of Here we describe it only for educational purposes - to emphasize the fact that correlation is ultimately implemented using regular queries. |
A query based on the Levenshtein edit distance:
<q:query xmlns:q="http://prism.evolveum.com/xml/ns/public/query-3">
<q:filter>
<q:fuzzyStringMatch>
<q:path>familyName</q:path>
<q:value>gren</q:value>
<q:method>
<q:levenshtein>
<q:threshold>3</q:threshold>
</q:levenshtein>
</q:method>
</q:fuzzyStringMatch>
</q:filter>
</q:query>
A similarity-based filter:
familyName similarity ('gren', 0.5, true)
Limitations
This feature is available only when using the native repository implementation.