Entity Resolved Knowledge Graphs
KGC 2024
•
1h 34m
Knowledge graphs have surged in popularity recently, for example in retrieval augmented generation (RAG) methods used to mitigate hallucination in LLMs. Graphs emphasize relationships in data and add semantics, more so than SQL or vector databases. However, data quality issues can degrade linking during KG construction and updating, which makes downstream use cases inaccurate and defeats the point of using a graph. When you have join keys (unique identifiers), building relationships in a graph may be straightforward. Even so, linking errors (duplicate nodes or false matches) can result from: typos or minor differences in attributes like name, address, phone, etc.; family members sharing an email address; duplicate customer entries; and so on.
This masterclass provides a hands-on introduction to what an Entity Resolved Knowledge Graph is, why it’s important, plus proven patterns for deploying entity resolution (ER). We’ll cover how to make graphs more meaningful in data-centric architectures by repairing connected data: unify complex and noisy data from across multiple data sources; consolidate duplicate nodes and reveal hidden connections; and create more accurate, intuitive graphs that provide greater utility.
Course materials leverage open data from US federal agencies (compliance audits, PPP loans, corporate ownership, etc.) layered atop a SafeGraph dataset about Las Vegas metro area businesses. We use open source, showing a Python API for ER in Senzing, constructing a KG in Neo4j from the results, then running graph analytics and visualization to show the before/after effects.
A general architectural pattern for ER is to use multiple levels of detail for graph data. A data graph tier in a high-resolution lower layer tracks provenance, while a knowledge graph tier in a higher layer adds structure and semantics.
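The two-tier pattern can be illustrated with a toy sketch in plain Python. This is not the Senzing API and not the course code: the sample records, the `normalize` and `resolve` helpers, and the exact-match rule on normalized name and address are all assumptions made for illustration, and they are far weaker than a real ER engine. The point is only the shape of the output: source records stay intact in a lower tier, and each resolved entity in the upper tier keeps links back to the records it was built from.

```python
import re
from collections import defaultdict

# Hypothetical records from two made-up sources (not the course datasets).
RECORDS = [
    {"id": "a1", "source": "loans", "name": "Acme Pizza, LLC",
     "address": "12 Main St., Las Vegas"},
    {"id": "b7", "source": "audits", "name": "ACME PIZZA LLC",
     "address": "12 Main St Las Vegas"},
    {"id": "b9", "source": "audits", "name": "Desert Auto Repair",
     "address": "400 Oak Ave, Las Vegas"},
]


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return " ".join(cleaned.split())


def resolve(records):
    """Group records whose normalized name and address match exactly.

    Returns upper-tier entities, each carrying provenance links
    (source, record id) to the lower-tier records it merges.
    """
    clusters = defaultdict(list)
    for rec in records:
        key = (normalize(rec["name"]), normalize(rec["address"]))
        clusters[key].append(rec)

    entities = []
    for n, key in enumerate(sorted(clusters), start=1):
        members = clusters[key]
        entities.append({
            "entity_id": f"E{n}",
            "name": members[0]["name"],
            "resolved_from": [(m["source"], m["id"]) for m in members],
        })
    return entities


if __name__ == "__main__":
    for entity in resolve(RECORDS):
        print(entity["entity_id"], entity["name"], entity["resolved_from"])
```

Here the two Acme records collapse into one entity while both source records remain traceable, which is the property that lets a merge decision be audited or undone later.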
Senzing has a 200 ms SLA for API calls, so when there are audits, feedback from end-users about data privacy updates, etc., these changes propagate rapidly into the graphs used in production, while maintaining provenance and evidence for the merge decisions.
People who have natural language experience may ask how ER differs from named entity recognition (NER). NER provides labels for spans of parsed text in unstructured data, while ER resolves unique identifiers based on free-text name and address fields in structured or semi-structured data. ER provides a means of linking data records, supporting evidence (provenance and data lineage) for the decisions made, and efficient ways to audit and update entities, upstream from the KG.
Typical use cases for Senzing leverage data about people and companies, resolving free-text data from fields related to names and addresses. For example, most of the voter registration records in the US are managed using Senzing. Using plug-ins for custom feature comparators, other use cases have also resolved maritime vessel IDs, vehicle tracking, and more.