Entity Matcher API

A two-stage entity matching system that finds which entities (companies, people, topics) are relevant to a given article (headline + body). In the current deployment, only keyword matching is active; semantic search and cross-encoder verification are implemented but disabled.


Table of contents

  1. Overview
  2. How it works: the pipeline
  3. Stage 1: Keyword matching
  4. Stage 2: Semantic matching (disabled)
  5. Stage 3: Cross-encoder verification (disabled)
  6. API reference
  7. Examples
  8. Configuration
  9. Evaluation and web UI

Overview

Input: An article (headline + body text).
Output: A list of matched entities, each with entity_id, confidence, source, and matched_terms.

The pipeline is staged conceptually, but only Stage 1 (keyword matching) is currently active in production. Stages 2 and 3 remain in the codebase but are not executed.

  1. Keyword stage (active) — If the article contains an entity's defined keyword phrases (after parsing AND/OR rules), that entity is matched with high confidence (0.99).
  2. Semantic stage (disabled) — Would compare the article to entity descriptions via embeddings and suggest candidates when no keyword match is present. This stage is currently turned off.
  3. Verification stage (disabled) — Would run a cross-encoder over borderline semantic matches to confirm or reject them. This stage is also currently turned off.

So: keyword matches are trusted immediately, and at present they are the only kind of match produced by the API.


How it works: the pipeline

Article (headline + body)
         │
         ▼
┌─────────────────────┐
│  1. Keyword stage   │  →  Aho-Corasick + AND/OR group index
│  (phrase match)     │
└─────────────────────┘
          │
          ▼
   MatchResponse { matches: [ EntityMatch, ... ] }

Stage 1: Keyword matching

Goal: Detect entities whose defined keyword expressions appear in the article text.

Entity definitions

Each entity has an EntityKeyword expression in the data (e.g. in entities.csv), for example:

boAt AND ("launch" OR "launched" OR "funding" OR "IPO" OR "market")

Parsing logic (AND/OR and parentheses)

Expressions are parsed into AND-groups of OR-options (respecting parentheses):

  1. Split the expression by AND only when parentheses depth is 0.
  2. For each AND operand, unwrap one pair of outer parentheses if present, then split by OR at top level.
  3. Strip surrounding quotes from each term.
  4. Matching rule: for an entity rule to match, every AND-group must be satisfied by at least one matched option.

Example (matching semantics): the expression boAt AND ("launch" OR "IPO") parses into two AND-groups, {boAt} and {launch, IPO}. An article matches only if at least one option from every AND-group is found: an article containing both "boAt" and "IPO" matches, while one containing only "boAt" does not.

Case and keyword matching rules

We use two Aho-Corasick keyword indices under the hood: one case-sensitive and one case-insensitive. Which index a given term goes into can be forced via the FORCE_CASE_SENSITIVE_TERMS and FORCE_CASE_INSENSITIVE_TERMS options (see Configuration).

Heuristics:

  1. Abbreviation plurals: for short single-token terms (length ≤ 5, e.g. "IPO"), the index also contains the "s" and "es" plural forms ("IPOs", "IPOes"), mapping to the same entity.
  2. Beyond that rule, there is no automatic plural expansion. Plurals of multi-word phrases, or of terms longer than five characters, must appear explicitly in the keyword expression.

Implementation
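
The implementation details are not spelled out here, so the following is a minimal sketch of the parsing and matching rules described above. The function names (split_top_level, parse_rule, rule_matches) are illustrative, not the real module's API:

```python
def split_top_level(expr: str, op: str) -> list[str]:
    """Split expr on ` op ` only where parenthesis depth is 0."""
    parts, depth, start, i = [], 0, 0, 0
    needle = f" {op} "
    while i < len(expr):
        ch = expr[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0 and expr.startswith(needle, i):
            parts.append(expr[start:i])
            i += len(needle)
            start = i
            continue
        i += 1
    parts.append(expr[start:])
    return [p.strip() for p in parts]

def unwrap_outer_parens(s: str) -> str:
    """Remove one pair of outer parentheses if they enclose the whole string."""
    if s.startswith("(") and s.endswith(")"):
        depth = 0
        for i, ch in enumerate(s):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0 and i < len(s) - 1:
                    return s  # the first '(' closes early; not a true outer pair
        return s[1:-1].strip()
    return s

def parse_rule(expr: str) -> list[set[str]]:
    """Parse an EntityKeyword expression into AND-groups of OR-options."""
    groups = []
    for operand in split_top_level(expr, "AND"):
        operand = unwrap_outer_parens(operand)
        options = {t.strip().strip('"') for t in split_top_level(operand, "OR")}
        groups.append(options)
    return groups

def rule_matches(groups: list[set[str]], found_terms: set[str]) -> bool:
    """Every AND-group must be satisfied by at least one matched option."""
    return all(group & found_terms for group in groups)
```

For example, parse_rule('boAt AND ("launch" OR "IPO")') yields [{'boAt'}, {'launch', 'IPO'}]; the real system would feed the individual options into the Aho-Corasick indices and then apply rule_matches to the set of terms found in the article.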

Result: Each entity that has at least one phrase matched in the text is returned with source="keyword" and confidence=0.99. There is no secondary semantic or LLM verification stage in the live system.


Stage 2: Semantic matching (disabled)

Goal: Find entities that are semantically related to the article even when no keyword phrase appears (paraphrases, related concepts).

Algorithm

  1. Entity representation: For each entity we form a text: EntityName + " " + EntityKeyword (name and full keyword expression). This is embedded once at startup.
  2. Embedding model: BAAI/bge-small-en-v1.5 (Sentence Transformers). Embeddings are L2-normalized.
  3. Index: FAISS with IndexFlatIP (inner product = cosine similarity for normalized vectors).
  4. At query time: The article (headline + body) is embedded with the same model. We search the index for top-K (e.g. 20) nearest entities by inner product.
  5. Filtering: Only entities with score ≥ SEMANTIC_THRESHOLD (e.g. 0.70) are considered. Entities already matched in the keyword stage are skipped.
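
FAISS IndexFlatIP over L2-normalized vectors is mathematically just cosine similarity. A NumPy-only sketch of the retrieval and filtering steps above (toy 4-dim vectors stand in for the 384-dim BGE embeddings, and entity 0 stands in for an entity already found by the keyword stage):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy stand-ins for the precomputed entity embeddings.
entity_vecs = l2_normalize(np.array([
    [1.0, 0.0, 0.0, 0.0],   # entity 0
    [0.0, 1.0, 0.0, 0.0],   # entity 1
    [0.0, 0.0, 1.0, 0.0],   # entity 2
]))
article_vec = l2_normalize(np.array([0.3, 0.9, 0.1, 0.0]))

SEMANTIC_THRESHOLD = 0.70
already_matched = {0}  # entities found in the keyword stage are skipped

# Inner product on normalized vectors == cosine similarity (IndexFlatIP).
scores = entity_vecs @ article_vec
candidates = [
    (i, float(s))
    for i, s in sorted(enumerate(scores), key=lambda p: -p[1])
    if s >= SEMANTIC_THRESHOLD and i not in already_matched
]
```

Here only entity 1 survives: entity 0 is skipped as a keyword match, and entity 2 falls below SEMANTIC_THRESHOLD.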

Thresholds

Two thresholds (see Configuration) decide what happens to a semantic candidate:

  1. score ≥ SEMANTIC_ACCEPT (default 0.82) — accepted directly, with source="semantic" and confidence equal to the score.
  2. SEMANTIC_REVIEW (default 0.70) ≤ score < SEMANTIC_ACCEPT — sent to the cross-encoder for verification.

So semantic matching uses a two-tier threshold: high confidence → accept; medium → verify.
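
As a sketch, the two-tier decision (default values from the Configuration section; the function name is illustrative):

```python
SEMANTIC_ACCEPT = 0.82   # defaults from config.py
SEMANTIC_REVIEW = 0.70

def route_semantic_candidate(score: float) -> str:
    """Decide what to do with a semantic candidate based on its score."""
    if score >= SEMANTIC_ACCEPT:
        return "accept"          # returned with source="semantic"
    if score >= SEMANTIC_REVIEW:
        return "review"          # sent to the cross-encoder
    return "reject"
```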


Stage 3: Cross-encoder verification (disabled)

Goal: For entities in the "review" band, decide whether the article is actually about that entity (avoid semantic drift).

Algorithm

  1. For each candidate in the review band, the cross-encoder (CROSS_ENCODER_MODEL, default cross-encoder/ms-marco-MiniLM-L-6-v2) scores the pair (entity name, article snippet).
  2. If the score is ≥ CROSS_ENCODER_ACCEPT_THRESHOLD (default 0.75), the entity is accepted with source="llm" and confidence=0.90; otherwise it is dropped.

This step reduces false positives when the embedding similarity is only moderate.
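
A sketch of the verification flow; score_pair is a hypothetical stand-in for scoring a (entity name, article snippet) pair with the cross-encoder (e.g. sentence_transformers.CrossEncoder.predict in the real system):

```python
CROSS_ENCODER_ACCEPT_THRESHOLD = 0.75  # default from config.py

def verify_candidates(candidates, article_snippet, score_pair,
                      accept=CROSS_ENCODER_ACCEPT_THRESHOLD):
    """Keep only the review-band candidates the cross-encoder confirms."""
    verified = []
    for entity_id, entity_name in candidates:
        if score_pair(entity_name, article_snippet) >= accept:
            verified.append({
                "entity_id": entity_id,
                "confidence": 0.90,   # fixed confidence for verified matches
                "source": "llm",
            })
    return verified
```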


API reference

POST /match-entities

Finds entities relevant to the given article.

Request body (JSON):

Field Type Required Description
headline string Yes Article headline.
body string Yes Article body text.

Response (JSON):

Field Type Description
matches array List of matched entities (see below).

Each element of matches:

Field Type Description
entity_id integer Entity identifier.
confidence float 0.99 for keyword; raw score for semantic; 0.90 for verified semantic.
source string "keyword" | "semantic" | "llm" (llm = cross-encoder verified).
matched_terms array Phrases/terms that triggered the match (keyword phrase, or semantic reason).
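
The response shape above can be mirrored with simple dataclasses (a framework-agnostic sketch; the actual service's model classes may differ):

```python
from dataclasses import dataclass

@dataclass
class EntityMatch:
    entity_id: int
    confidence: float
    source: str               # "keyword" | "semantic" | "llm"
    matched_terms: list[str]

@dataclass
class MatchResponse:
    matches: list[EntityMatch]
```

For example, the keyword match from Example 1 below would be MatchResponse(matches=[EntityMatch(25683, 0.99, "keyword", ["boat launch"])]).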

Examples

Example 1: Keyword match (company + topic)

Entity (conceptual): "boAt" with expression
boAt AND ("launch" OR "launched" OR "funding" OR "IPO" OR "market")

Request:

curl -X POST http://localhost:8000/match-entities \
  -H "Content-Type: application/json" \
  -d '{
    "headline": "boAt launches new smartwatch",
    "body": "The company announced the product at an event in Delhi. Market reaction was positive."
  }'

Response (conceptually):

{
  "matches": [
    {
      "entity_id": 25683,
      "confidence": 0.99,
      "source": "keyword",
      "matched_terms": ["boat launch"]
    }
  ]
}

Here the phrase "boAt launch" (one option from each AND-group, i.e. an element of the groups' Cartesian product) appears in the text, so the entity is matched in the keyword stage with high confidence.


Example 2: Abbreviation plural (IPO → IPOs)

Entity: Has keyword "IPO" (abbreviation: length ≤ 5, single token). The index also contains "IPOs" and "IPOes" mapping to the same entity.
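
A guess at how this abbreviation rule might generate the extra index entries (the single-token/length ≤ 5 condition comes from the description above; the function name is illustrative):

```python
def abbreviation_variants(term: str) -> list[str]:
    """For short single-token terms, also index naive plural forms."""
    if " " not in term and len(term) <= 5:
        return [term, term + "s", term + "es"]
    return [term]
```

So abbreviation_variants("IPO") yields ["IPO", "IPOs", "IPOes"], all mapping to the same entity, while longer terms and multi-word phrases are left untouched.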

Request:

curl -X POST http://localhost:8000/match-entities \
  -H "Content-Type: application/json" \
  -d '{
    "headline": "Renewed interest in IPOs",
    "body": "Investors are returning to the primary market. Many companies are planning IPOs this year."
  }'

Response (conceptually):

{
  "matches": [
    {
      "entity_id": 26103,
      "confidence": 0.99,
      "source": "keyword",
      "matched_terms": ["ipo"]
    }
  ]
}

"IPOs" in the text matches the abbreviation plural rule, so the entity is still returned as a keyword match.


Example 3: Semantic match (no keyword phrase)

Entity: "Technology Development Board" with keywords about grants, RDI fund, etc. Article mentions "TDB" and "first call for proposals" but not the exact phrase.

Request:

curl -X POST http://localhost:8000/match-entities \
  -H "Content-Type: application/json" \
  -d '{
    "headline": "TDB announces first call for proposals",
    "body": "The Technology Development Board has invited applications under the RDI fund..."
  }'

If the embedding of the article is close enough to the entity's (name + keywords), the entity can appear in the semantic stage. If the score is above SEMANTIC_ACCEPT, it is returned with source="semantic" and confidence equal to that score. If it is in the review band, the cross-encoder decides; if it passes, the response has source="llm" and confidence=0.90.


Configuration

Relevant options in config.py:

Option Default Description
ENTITY_FILE "data/entities.csv" Path to entity definitions (EntityId, EntityName, EntityKeyword).
FORCE_CASE_SENSITIVE_TERMS "" Optional comma-separated exact terms to force case-sensitive matching.
FORCE_CASE_INSENSITIVE_TERMS "" Optional comma-separated exact terms to force case-insensitive matching.
SEMANTIC_TOP_K 20 Number of nearest entities to consider from FAISS.
SEMANTIC_THRESHOLD 0.70 Minimum similarity to consider a semantic candidate.
SEMANTIC_ACCEPT 0.82 Score ≥ this → accept without verification.
SEMANTIC_REVIEW 0.70 Score in [REVIEW, ACCEPT) → send to cross-encoder.
CROSS_ENCODER_MODEL "cross-encoder/ms-marco-MiniLM-L-6-v2" Model used for verification.
CROSS_ENCODER_ACCEPT_THRESHOLD 0.75 Cross-encoder score ≥ this → accept entity.

Entity CSV must have columns: EntityId, EntityName, EntityKeyword.
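
For illustration, a row for the boAt entity from Example 1 might look like this (standard CSV quoting shown, with inner double quotes doubled; the real file's quoting convention may differ):

```csv
EntityId,EntityName,EntityKeyword
25683,boAt,"boAt AND (""launch"" OR ""launched"" OR ""funding"" OR ""IPO"" OR ""market"")"
```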


Evaluation and web UI


Summary

Stage What it does Output / source
Keyword AND/OR group index + Aho-Corasick whole-word matching source: "keyword", confidence 0.99
Semantic BGE embeddings + FAISS; two thresholds (accept vs review) source: "semantic" (high) or sent to verification
Verification Cross-encoder on (entity name, article snippet) source: "llm", confidence 0.90

Together, this gives precise phrase-based matches plus broader semantic coverage, with a local cross-encoder to control false positives in the middle band.