How does Screena handle inaccurate data?
First of all, let’s clarify what inaccurate data is. Inaccurate data can be caused by either human or machine errors, and it takes various forms: manual data-entry mistakes (e.g., permuted name fields), missing data entry controls, unstructured or free-format fields, missing or non-standardized information in databases, incompatible formats between data processing systems, etc.
Screena systematically controls the completeness and quality of imported data. For example, Screena ensures dates are always provided in accordance with the ISO 8601 standard. Likewise, countries shall always be imported as ISO 3166 country codes.
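The import-time checks described above can be sketched as follows. This is a minimal illustration, not Screena's actual implementation: the field names (`dateOfBirth`, `country`) and the assumption that the relevant standards are ISO 8601 dates and ISO 3166-1 alpha-2 country codes are ours.

```python
from datetime import date

# Small illustrative subset of ISO 3166-1 alpha-2 codes; a real
# reference list contains 249 entries.
ISO_3166_ALPHA2 = {"US", "FR", "DE", "GB", "JP"}

def is_iso8601_date(value):
    """Return True if value parses as an ISO 8601 calendar date (YYYY-MM-DD)."""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False

def validate_record(record):
    """Collect data-quality issues for one imported record (hypothetical layout)."""
    issues = []
    if not is_iso8601_date(record.get("dateOfBirth", "")):
        issues.append("dateOfBirth is not ISO 8601")
    if record.get("country") not in ISO_3166_ALPHA2:
        issues.append("country is not an ISO 3166-1 alpha-2 code")
    return issues
```

A compliant record such as `{"dateOfBirth": "1985-07-03", "country": "FR"}` yields no issues, while `"03/07/1985"` or `"France"` would be flagged for normalization.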
When the original data does not comply with those standards, Screena attempts to resolve it using dedicated normalization rules and reference libraries. This normalization process harmonizes and transforms data into a format that makes attribute matching consistent. Normalization libraries are enriched with new synonyms or alternative spellings whenever an unknown or incompatible value is encountered.
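A normalization library of this kind can be pictured as a synonym table with an enrichment queue. The dictionary contents and names below are hypothetical, chosen only to illustrate the resolve-or-enrich behavior described above.

```python
# Hypothetical normalization library: maps known synonyms and
# alternative spellings to a canonical ISO 3166 country code.
COUNTRY_SYNONYMS = {
    "france": "FR",
    "republique francaise": "FR",
    "deutschland": "DE",
    "germany": "DE",
}

UNRESOLVED = set()  # unknown values queued for library enrichment

def normalize_country(raw):
    """Resolve a free-format country value to a canonical code, or
    queue it for enrichment when no synonym is known."""
    key = raw.strip().lower()
    code = COUNTRY_SYNONYMS.get(key)
    if code is None:
        # Unknown value: record it so the library can later be
        # enriched with a new synonym or alternative spelling.
        UNRESOLVED.add(raw)
    return code
```

For instance, `normalize_country("Deutschland")` resolves to `"DE"`, while an unseen value like `"Allemagne"` returns `None` and lands in the enrichment queue.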
Screena’s matching algorithms tackle specific data quality issues such as typos, truncated names, out-of-order name elements, and split or concatenated names.
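To give a flavor of how typos and out-of-order name elements can both be tolerated, here is a simple sketch using the standard library's `difflib`. Screena's actual algorithms are proprietary and certainly more sophisticated; this only demonstrates the general technique of token reordering combined with approximate string comparison.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Score two names in [0, 1]. Sorting tokens first means
    out-of-order elements ('Smith John' vs 'John Smith') still align,
    while the sequence ratio tolerates small typos."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()
```

With this sketch, `"John Smith"` vs `"Smith John"` scores 1.0, and a typo such as `"Jonh Smith"` still scores high, whereas an unrelated name scores low.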
Screena’s data model also provides distinct fields to differentiate structured from unstructured information (e.g., parsed names vs. full names, structured addresses vs. free-format addresses).
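Such a data model could be sketched as a record with parallel structured and free-format fields. The field names below are hypothetical and do not necessarily match Screena's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PersonRecord:
    # Structured (parsed) name elements
    first_name: Optional[str] = None
    last_name: Optional[str] = None
    # Unstructured fallback when the name cannot be parsed
    full_name: Optional[str] = None
    # Structured address elements
    street: Optional[str] = None
    city: Optional[str] = None
    country: Optional[str] = None
    # Free-format address fallback
    full_address: Optional[str] = None
```

A record may populate either side: a well-parsed source fills `first_name`/`last_name`, while a legacy source may only provide `full_name`, and matching logic can treat the two differently.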
Distinct matching parameters are available to handle all data quality nuances. For example, it is possible to use the nullMatch parameter to specify how a match should be handled when an attribute associated with an algorithm is either empty or not provided.
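The effect of a nullMatch-style parameter can be sketched as below. The policy names (`DISREGARD`, `MATCH`, `NO_MATCH`) are hypothetical illustrations of the kinds of behaviors such a parameter could select, not Screena's documented values.

```python
from enum import Enum

class NullMatch(Enum):
    """Hypothetical policies for handling an empty/missing attribute."""
    DISREGARD = "disregard"  # ignore the attribute in the decision
    MATCH = "match"          # treat a missing value as a match
    NO_MATCH = "no_match"    # treat a missing value as a mismatch

def attribute_matches(a, b, null_match):
    """Compare one attribute pair, applying the null policy when
    either side is empty or not provided. Returns True/False, or
    None when the attribute is disregarded."""
    if a in (None, "") or b in (None, ""):
        if null_match is NullMatch.DISREGARD:
            return None  # attribute contributes nothing either way
        return null_match is NullMatch.MATCH
    return a == b
```

The policy only kicks in when a value is missing: when both sides are present, an ordinary comparison applies regardless of the setting.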
In other instances, inaccurate dates can be matched within the same year or decade.
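Matching dates within the same year or decade amounts to comparing dates at a configurable granularity, which can be sketched as follows (the tolerance labels are ours, for illustration):

```python
from datetime import date

def dates_match(a, b, tolerance="exact"):
    """Match two dates exactly, within the same calendar year, or
    within the same decade (hypothetical tolerance levels)."""
    if tolerance == "exact":
        return a == b
    if tolerance == "year":
        return a.year == b.year
    if tolerance == "decade":
        return a.year // 10 == b.year // 10
    raise ValueError(f"unknown tolerance: {tolerance}")
```

For example, a date of birth recorded as 1985-07-03 in one system and 1985-01-01 in another would still match at year granularity.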
Similarly, addresses can be matched within the same region or subregion, based on a standard geographic classification of countries.
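Region-level matching can be pictured as a lookup from country code to its region and subregion. The tiny mapping below is a hypothetical excerpt; a real implementation would load a complete geographic reference table.

```python
# Hypothetical excerpt mapping ISO country codes to (region, subregion).
COUNTRY_REGIONS = {
    "FR": ("Europe", "Western Europe"),
    "DE": ("Europe", "Western Europe"),
    "JP": ("Asia", "Eastern Asia"),
}

def countries_match(a, b, level="country"):
    """Match two countries exactly, or loosely at subregion or
    region granularity (illustrative levels)."""
    if level == "country":
        return a == b
    region_a, sub_a = COUNTRY_REGIONS[a]
    region_b, sub_b = COUNTRY_REGIONS[b]
    if level == "subregion":
        return sub_a == sub_b
    if level == "region":
        return region_a == region_b
    raise ValueError(f"unknown level: {level}")
```

Under this sketch, an address recorded in France would still match one recorded in Germany at subregion level, while a Japanese address would not match at any level.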
To achieve greater precision when screening free-format fields, Screena applies advanced text analytics techniques to detect distinct objects (named entities vs. addresses) within the same field, thus preventing irrelevant matches.
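As a deliberately crude stand-in for such text analytics, the heuristic below tags a fragment as an address when it contains address cues. Real systems use trained entity-recognition models rather than keyword rules; this only illustrates why separating object types prevents, say, a street name from matching a person's name.

```python
import re

# Very rough address cues; a real system would use a trained NER model.
ADDRESS_HINTS = re.compile(
    r"\b(street|st\.|avenue|ave\.|road|rd\.|\d{1,5})\b", re.IGNORECASE
)

def classify_fragment(text):
    """Tag a free-format fragment as 'address' or 'named_entity'
    using a naive keyword/number heuristic (illustration only)."""
    return "address" if ADDRESS_HINTS.search(text) else "named_entity"
```

Once fragments are typed, name-matching algorithms only run against named entities, and address tolerance rules only against addresses.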
When it comes to name matching, Screena falls back on generic machine learning models, trained on richer and more comprehensive datasets, whenever no valid culture can be determined with high certainty.
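The fallback decision itself is a simple dispatch on the culture-detection confidence, which can be sketched as below. The model names and the 0.8 threshold are invented for illustration.

```python
def select_model(culture, confidence, threshold=0.8):
    """Pick a culture-specific name-matching model when the detected
    culture is trusted; otherwise fall back to a generic model
    (hypothetical names and threshold)."""
    if culture is None or confidence < threshold:
        return "generic-ml-model"
    return f"{culture}-specific-model"
```

So a name confidently detected as Arabic would be routed to a culture-specific model, while an ambiguous or undetected culture routes to the generic one.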