Entity Matcher API

The application is a keyword-based entity matcher. It identifies which entities are relevant to a news article using entity keyword expressions loaded from the current keyword snapshot (data/entities_live.json).

Overview

Input: article headline and body text.
Output: a list of matched entities with entity_id, confidence, source, and matched_terms.

The current production pipeline is a single stage:

Article (headline + body)
  -> Keyword matcher
  -> MatchResponse

All returned matches come from keyword detection and are emitted with source="keyword" and confidence=0.99.

Keyword matching

Goal: Detect entities whose defined keyword expressions appear in the article text.

Entity definitions

Each entity has an EntityKeyword expression in the current keyword source, for example:

Parsing logic (AND/OR/NOT and parentheses)

Expressions are parsed into a positive expression and an optional negative expression (respecting parentheses):

  1. Split the expression by AND only when parentheses depth is 0.
  2. For each AND operand, unwrap one pair of outer parentheses if present, then split by OR at top level.
  3. If a top-level NOT is present, everything before it is the positive expression and everything after it is the negative expression.
  4. Strip surrounding quotes from each term.
  5. Matching rule:
     - every positive AND-group must be satisfied by at least one matched option
     - if the negative expression is satisfied, the rule is rejected
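
The parsing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's parser: the function names (split_top_level, unwrap, parse_groups, parse) are hypothetical, only the "positive NOT negative" form is handled, and parentheses inside quoted terms are not supported.

```python
def split_top_level(expr: str, op: str) -> list[str]:
    """Split expr on the operator word op, only at parenthesis depth 0 and outside quotes."""
    parts, depth, in_quote, start, i = [], 0, False, 0, 0
    token = f" {op} "
    while i < len(expr):
        ch = expr[i]
        if ch == '"':
            in_quote = not in_quote
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif depth == 0 and expr.startswith(token, i):
                parts.append(expr[start:i].strip())
                i += len(token)
                start = i
                continue
        i += 1
    parts.append(expr[start:].strip())
    return parts


def unwrap(s: str) -> str:
    """Remove one pair of outer parentheses if they enclose the whole string."""
    s = s.strip()
    if not (s.startswith("(") and s.endswith(")")):
        return s
    depth = 0
    for i, ch in enumerate(s):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0 and i < len(s) - 1:
                return s  # first "(" closes early, e.g. "(a) OR (b)"
    return s[1:-1].strip()


def parse_groups(expr: str) -> list[list[str]]:
    """Return AND-groups; each group is a list of OR options with quotes stripped."""
    groups = []
    for operand in split_top_level(expr, "AND"):
        options = split_top_level(unwrap(operand), "OR")
        groups.append([unwrap(o).strip('"').strip() for o in options])
    return groups


def parse(expr: str) -> tuple[list[list[str]], list[list[str]]]:
    """Split on a top-level NOT, then parse each side into AND-groups."""
    parts = split_top_level(expr, "NOT")
    positive = parse_groups(parts[0])
    negative = parse_groups(parts[1]) if len(parts) > 1 else []
    return positive, negative
```

With this sketch, parse('boAt AND ("launch" OR "IPO")') yields two positive AND-groups, ["boAt"] and ["launch", "IPO"], and an empty negative expression.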

Example (matching semantics):

Example (NOT semantics):
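
As a hedged illustration of both the matching rule and the NOT rule, the sketch below evaluates an already-parsed expression against a text. The evaluate function and its input shape (lists of AND-groups) are assumptions made for this example, and plain case-insensitive substring checks stand in for the real keyword indices.

```python
def evaluate(positive, negative, text):
    """Return (matched, matched_terms) for parsed AND-groups.

    positive / negative: lists of AND-groups, each group a list of OR options.
    Assumption: case-insensitive substring matching, for illustration only.
    """
    lowered = text.lower()

    def satisfied(groups):
        # Every group needs at least one option present; None means "not satisfied".
        hits = []
        for options in groups:
            found = [o for o in options if o.lower() in lowered]
            if not found:
                return None
            hits.extend(found)
        return hits

    matched = satisfied(positive)
    if matched is None:
        return False, []
    # A satisfied negative expression rejects the rule.
    if negative and satisfied(negative) is not None:
        return False, []
    return True, matched
```

For example, with positive groups [["Apple"]] and negative groups [["pie", "orchard"]], the text "Apple unveils phone" matches, while "Apple pie recipe" is rejected because the negative expression is satisfied.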

Case and keyword matching rules

We use two Aho-Corasick keyword indices under the hood:

Heuristics:

There is no automatic plural expansion. If you need plurals like "IPOs" or "LEDs", they must appear explicitly in the keyword expression.
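
To illustrate the idea of two keyword indices, here is a minimal sketch that keeps one case-sensitive and one case-insensitive term set. Everything in it is an assumption for illustration: plain substring checks stand in for the Aho-Corasick automata, the function names are hypothetical, the "all-uppercase terms are case-sensitive" default is only a placeholder heuristic, and word-boundary handling is out of scope.

```python
def build_indices(terms, force_sensitive=frozenset(), force_insensitive=frozenset()):
    """Partition terms into (case_sensitive, case_insensitive) sets.

    Forced lists win over the default. The default used here (all-uppercase
    terms are case-sensitive) is a placeholder, not the project's rule.
    """
    sensitive, insensitive = set(), set()
    for term in terms:
        if term in force_sensitive:
            sensitive.add(term)
        elif term in force_insensitive:
            insensitive.add(term.lower())
        elif term.isupper():
            sensitive.add(term)
        else:
            insensitive.add(term.lower())
    return sensitive, insensitive


def find_terms(text, sensitive, insensitive):
    """Return the sorted terms found in text, honouring each index's case rule."""
    lowered = text.lower()
    hits = {t for t in sensitive if t in text}
    hits |= {t for t in insensitive if t in lowered}
    return sorted(hits)
```

In this sketch, forcing "boAt" to be case-sensitive means the text "A boat on the river" produces no hit for it, while "boAt files for IPO" does.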

Implementation

Result: Each entity that has at least one phrase matched in the text is returned with source="keyword" and confidence=0.99. There is no secondary semantic or LLM verification stage in the live system.

Keyword source and sync

The matcher consumes a local snapshot of the keyword API rather than calling the remote service on each request.

Runtime source order:

  1. data/entities_live.json

Snapshot generation:

Why this design:

Runtime refresh support:

The reload endpoint can fetch the remote full dump, update the snapshot files, and swap the in-memory matcher without restarting the process.
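
One way to swap the in-memory matcher without a restart is to build the replacement completely and only then exchange the reference under a lock. The sketch below shows that pattern; MatcherHolder and try_swap are hypothetical names, and the real reload endpoint may work differently.

```python
import threading


class MatcherHolder:
    """Holds the active matcher and allows it to be replaced atomically."""

    def __init__(self, matcher):
        self._lock = threading.Lock()
        self._matcher = matcher

    def get(self):
        # Request handlers read the current matcher through this accessor.
        with self._lock:
            return self._matcher

    def try_swap(self, build_new) -> bool:
        """Build a new matcher, then swap it in. Keep the old one if building fails."""
        try:
            new_matcher = build_new()  # slow work happens outside the lock
        except Exception:
            return False
        with self._lock:
            self._matcher = new_matcher
        return True
```

Because the new matcher is built before the lock is taken, in-flight requests keep using the old matcher until the swap, and a failed sync leaves the previous matcher in place.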

CSV import rules:

UI CSV import flow:

API reference

POST /match-entities

Finds entities relevant to the given article.

Request body (JSON):

| Field | Type | Required | Description |
|---|---|---|---|
| headline | string | Yes | Article headline. |
| body | string | Yes | Article body text. |

Optional query param:

| Field | Type | Required | Description |
|---|---|---|---|
| mediaType | string | No | Print or Online. If Print, only Print + Both rules are considered. If Online, only Online + Both rules are considered. If omitted, all rules are considered. |
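
The mediaType filter described above can be expressed as a small predicate. This is a sketch under assumptions: rule_applies is a hypothetical name, and it assumes each rule carries a MediaType of exactly "Print", "Online", or "Both".

```python
from typing import Optional


def rule_applies(rule_media_type: str, requested: Optional[str]) -> bool:
    """Decide whether a rule is considered for the requested mediaType.

    requested is None when the query parameter is omitted, in which case
    all rules are considered.
    """
    if requested is None:
        return True
    return rule_media_type in (requested, "Both")
```

For instance, a request with mediaType=Print considers Print and Both rules and skips Online rules.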

Response (JSON):

| Field | Type | Description |
|---|---|---|
| matches | array | List of matched entities (see below). |

Each element of matches:

| Field | Type | Description |
|---|---|---|
| entity_id | integer | Entity identifier. |
| confidence | float | 0.99 for keyword matches. |
| source | string | "keyword". |
| matched_terms | array | Terms that triggered the match. |
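
On the client side, the response can be decoded into typed records. The sketch below uses only the fields documented above; the Match class and parse_match_response function are hypothetical helpers, not part of the service.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    entity_id: int
    confidence: float
    source: str
    matched_terms: list[str]


def parse_match_response(payload: str) -> list[Match]:
    """Decode a /match-entities JSON response body into Match records."""
    data = json.loads(payload)
    return [
        Match(
            entity_id=int(m["entity_id"]),
            confidence=float(m["confidence"]),
            source=str(m["source"]),
            matched_terms=list(m["matched_terms"]),
        )
        for m in data["matches"]
    ]
```

An empty "matches" array decodes to an empty list, which is how a client would detect that no entity matched.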

POST /kalki-match-entities

Evaluates whether the supplied client keyword expressions are relevant to the title and body sections of a single article.

Request body (JSON):

| Field | Type | Required | Description |
|---|---|---|---|
| headline | string | Yes | Article headline. |
| body | string | Yes | Article body text. |
| client_keywords | array[string] | Yes | Keyword expressions using the same AND / OR / parentheses syntax as the entity matcher. |

Supported exclusion syntax:

Response (JSON):

| Field | Type | Description |
|---|---|---|
| IsRelevant | bool | true if any of the section-level relevance flags is true. |
| IsTitleRelevant | bool | true if any client keyword expression matches the headline. |
| IsFirstParaRelevant | bool | true if any client keyword expression matches the first body section. |
| IsRestOfArticleRelevant | bool | true if any client keyword expression matches the full headline + body context. |

Section logic:
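
One plausible reading of the response fields above, sketched in Python. The details are assumptions: section_relevance is a hypothetical name, the first body section is taken to be the text before the first blank line, and the expression evaluator is passed in as a matches(expression, text) callable rather than implemented here.

```python
import re


def section_relevance(headline, body, client_keywords, matches):
    """Compute the four relevance flags for one article.

    matches(expression, text) -> bool is supplied by the caller.
    """
    # Assumption: body sections are separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", body.strip()) if p.strip()]
    first_para = paragraphs[0] if paragraphs else ""
    full_context = f"{headline}\n{body}"

    def any_match(text):
        return any(matches(expr, text) for expr in client_keywords)

    title = any_match(headline)
    first = any_match(first_para)
    rest = any_match(full_context)
    return {
        "IsRelevant": title or first or rest,
        "IsTitleRelevant": title,
        "IsFirstParaRelevant": first,
        "IsRestOfArticleRelevant": rest,
    }
```

Under this reading, a headline match also makes IsRestOfArticleRelevant true, because the rest-of-article check runs on the combined headline and body.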


Examples

Example 1: Keyword match (company + topic)

Entity (conceptual): "boAt" with expression
boAt AND ("launch" OR "launched" OR "funding" OR "IPO" OR "market")

Request:

curl -X POST http://localhost:8000/match-entities \
  -H "Content-Type: application/json" \
  -d '{
    "headline": "boAt launches new smartwatch",
    "body": "The company announced the product at an event in Delhi. Market reaction was positive."
  }'

Response (conceptually):

{
  "matches": [
    {
      "entity_id": 25683,
      "confidence": 0.99,
      "source": "keyword",
      "matched_terms": ["boat launch"]
    }
  ]
}

Here the phrase "boAt launch" (from the Cartesian product of the AND operands' options) appears in the text, so the keyword matcher returns the entity with confidence 0.99.


Example 2: Kalki section relevance

Request:

curl -X POST http://localhost:8000/kalki-match-entities \
  -H "Content-Type: application/json" \
  -d '{
    "headline": "Mahindra reveals XEV 9e Cineluxe Edition at 29.35 lakh",
    "body": "Opening paragraph covers market context only.\n\nMahindra later unveils the luxury special edition XEV 9e Cineluxe Edition with a 500 km range in March 2026.",
    "client_keywords": [
      "(\"Mahindra\") AND (\"XEV 9e Cineluxe Edition\" OR \"29.35\")",
      "(\"XEV 9e\" OR \"Cineluxe Edition\") AND (\"luxury\" OR \"special edition\" OR \"Launches\" OR \"Reveals\" OR \"Unveils\" OR \"29.35\" OR \"Introduces\" OR \"March 2026\" OR \"Exclusive\" OR \"500 km range\")"
    ]
  }'

Response:

{
  "IsRelevant": true,
  "IsTitleRelevant": true,
  "IsFirstParaRelevant": false,
  "IsRestOfArticleRelevant": true
}

The title matches immediately. The opening section does not satisfy the expressions, but the remainder of the body does.


Configuration

Relevant options in config.py:

| Option | Default | Description |
|---|---|---|
| ENTITY_SNAPSHOT_JSON | "data/entities_live.json" | Primary runtime snapshot used by the matcher. |
| ENTITY_SNAPSHOT_META | "data/entities_live_meta.json" | Sync metadata file with row counts and source URL. |
| KEYWORD_API_BASE_URL | UAT keyword API URL | Base URL for the remote keyword API. |
| KEYWORD_API_TIMEOUT_SECONDS | 60 | Timeout for keyword sync API calls. |
| KEYWORD_SYNC_MIN_ROWS | 1000 | Reject suspiciously small snapshots. |
| KEYWORD_SYNC_MIN_RATIO_VS_PREVIOUS | 0.5 | Reject snapshots that drop too far below the previous row count. |
| ADMIN_API_TOKEN | "" | Optional token required in X-Admin-Token for admin endpoints. |
| FORCE_CASE_SENSITIVE_TERMS | "" | Optional comma-separated exact terms to force case-sensitive matching. |
| FORCE_CASE_INSENSITIVE_TERMS | "" | Optional comma-separated exact terms to force case-insensitive matching. |
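
The two sync safety thresholds (KEYWORD_SYNC_MIN_ROWS and KEYWORD_SYNC_MIN_RATIO_VS_PREVIOUS) could be applied along these lines. This is a sketch: snapshot_is_acceptable is a hypothetical function, and the exact boundary behaviour (strict versus inclusive comparison) is an assumption.

```python
from typing import Optional


def snapshot_is_acceptable(
    new_rows: int,
    previous_rows: Optional[int],
    min_rows: int = 1000,
    min_ratio: float = 0.5,
) -> bool:
    """Return False for snapshots that look suspiciously small.

    previous_rows is None when no earlier snapshot metadata exists.
    """
    if new_rows < min_rows:
        return False
    if previous_rows and new_rows < previous_rows * min_ratio:
        return False
    return True
```

With the defaults, a 1200-row snapshot is rejected when the previous one had 3000 rows, since it falls below half the previous count.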

Snapshot rows use: EntityId, EntityName, EntityKeyword. The generated snapshot also stores CreatedOn and MediaType for audit/debug purposes.
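
A minimal sketch of reading snapshot rows into a lookup keyed by entity. It assumes the snapshot is a JSON array of objects with the row fields named above, and that one entity may appear in several rows; parse_snapshot is a hypothetical helper and the real file layout may differ.

```python
import json


def parse_snapshot(text: str) -> dict[int, list[str]]:
    """Map EntityId to its keyword expressions, skipping rows without one."""
    rows = json.loads(text)  # assumption: top-level JSON array of row objects
    entities: dict[int, list[str]] = {}
    for row in rows:
        keyword = (row.get("EntityKeyword") or "").strip()
        if not keyword:
            continue
        entities.setdefault(int(row["EntityId"]), []).append(keyword)
    return entities
```

It could be fed the contents of the file named by ENTITY_SNAPSHOT_JSON, for example parse_snapshot(open("data/entities_live.json", encoding="utf-8").read()).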


Evaluation and web UI


Summary

The system is intentionally simple:

There is no semantic retrieval or secondary verification stage in the current application.