A two-stage entity matching system that finds which entities (companies, people, topics) are relevant to a given article (headline + body). In the current deployment, only keyword matching is active; semantic search and cross-encoder verification are implemented but disabled.
Input: An article (headline + body text).
Output: A list of matched entities, each with entity_id, confidence, source, and matched_terms.
The pipeline is staged conceptually, but only Stage 1 (keyword matching) is currently active in production. Stages 2 and 3 remain in the codebase but are not executed.
So: keyword matches are trusted immediately, and at present they are the only kind of match produced by the API.
Article (headline + body)
│
▼
┌─────────────────────┐
│ 1. Keyword stage │ → Aho-Corasick + AND/OR group index
│ (phrase match) │
└─────────────────────┘
│
▼
MatchResponse { matches: [ EntityMatch, ... ] }
Goal: Detect entities whose defined keyword expressions appear in the article text.
Each entity has an EntityKeyword expression in the data (e.g. in entities.csv), for example:
"IPO" OR "Initial Public Offering" OR "Healthcare" OR "Pharma"boAt AND ("launch" OR "launched" OR "announces" OR "funding" OR "IPO" OR "market" OR ...)Expressions are parsed into AND-groups of OR-options (respecting parentheses):
AND only when parentheses depth is 0.OR at top level.Example (matching semantics):
"Auto" AND ("B2B" OR "startup")["Auto"] and ["B2B", "startup"]"Auto" and at least one of "B2B"/"startup" as whole-word terms.We use two Aho-Corasick keyword indices under the hood:
Heuristics:
"renewable energy""Renewable Energy""Solar Energy Solutions"These match regardless of casing in the article (e.g. "renewable energy", "Renewable energy", "RENEWABLE ENERGY").
Short all-caps acronyms (≤ 3 letters) → case-sensitive
"LED", "EV", "OEM"These only match when the article uses the same casing (e.g. "LED" does not match "led").
Brand-style mixed casing (uppercase after first letter) → case-sensitive
"boAt", "BoAt", "iPhone", "eBay""BoAt" does not match "boat").There is no automatic plural expansion. If you need plurals like "IPOs" or "LEDs", they must appear explicitly in the keyword expression.
"thar" will not match "Sitharaman", "bus" will not match "business").Result: Each entity that has at least one phrase matched in the text is returned with source="keyword" and confidence=0.99. There is no secondary semantic or LLM verification stage in the live system.
Goal: Find entities that are semantically related to the article even when no keyword phrase appears (paraphrases, related concepts).
EntityName + " " + EntityKeyword (name and full keyword expression). This is embedded once at startup.source="semantic" and confidence=score.source="llm" and confidence=0.90.So semantic matching uses a two-tier threshold: high confidence → accept; medium → verify.
Goal: For entities in the "review" band, decide whether the article is actually about that entity (avoid semantic drift).
- Verification uses a local cross-encoder (`cross-encoder/ms-marco-MiniLM-L-6-v2`) that scores a (query, passage) pair. No external LLM API; it runs locally.
- The pair is `(entity_name, article_snippet)`, where the snippet is the first 1500 characters of the article to keep inference fast and within context.
- If the score clears the acceptance threshold, the match is returned with `source="llm"` and `confidence=0.90`; otherwise it is rejected.

This step reduces false positives when the embedding similarity is only moderate.
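The verification step can be sketched as below. Here `score_fn` is a hypothetical stand-in for the real cross-encoder scoring call (e.g. a `CrossEncoder.predict` on the pair); only the snippet truncation and threshold logic come from the description above.

```python
SNIPPET_CHARS = 1500
CROSS_ENCODER_ACCEPT_THRESHOLD = 0.75  # default from config.py

def verify_match(entity_name: str, article_text: str, score_fn) -> bool:
    """Score (entity_name, article_snippet) and accept above threshold.

    score_fn(query, passage) -> float is a stand-in for the
    cross-encoder's scoring call.
    """
    snippet = article_text[:SNIPPET_CHARS]  # keep inference fast
    return score_fn(entity_name, snippet) >= CROSS_ENCODER_ACCEPT_THRESHOLD
```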
`POST /match-entities` — finds entities relevant to the given article.
Request body (JSON):
| Field | Type | Required | Description |
|---|---|---|---|
| `headline` | string | Yes | Article headline. |
| `body` | string | Yes | Article body text. |
Response (JSON):
| Field | Type | Description |
|---|---|---|
| `matches` | array | List of matched entities (see below). |
Each element of matches:
| Field | Type | Description |
|---|---|---|
| `entity_id` | integer | Entity identifier. |
| `confidence` | float | 0.99 for keyword; raw score for semantic; 0.90 for verified semantic. |
| `source` | string | `"keyword"`, `"semantic"`, or `"llm"` (`llm` = cross-encoder verified). |
| `matched_terms` | array | Phrases/terms that triggered the match (keyword phrase, or semantic reason). |
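The response schema maps naturally onto a small dataclass; this is an illustrative sketch with field names taken from the table above, not necessarily the service's actual model classes.

```python
from dataclasses import dataclass, field


@dataclass
class EntityMatch:
    entity_id: int
    confidence: float              # 0.99 keyword, raw score semantic, 0.90 llm
    source: str                    # "keyword" | "semantic" | "llm"
    matched_terms: list[str] = field(default_factory=list)


@dataclass
class MatchResponse:
    matches: list[EntityMatch] = field(default_factory=list)
```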
Entity (conceptual): "boAt" with expression
boAt AND ("launch" OR "launched" OR "funding" OR "IPO" OR "market")
Request:
curl -X POST http://localhost:8000/match-entities \
-H "Content-Type: application/json" \
-d '{
"headline": "boAt launches new smartwatch",
"body": "The company announced the product at an event in Delhi. Market reaction was positive."
}'
Response (conceptually):
{
"matches": [
{
"entity_id": 25683,
"confidence": 0.99,
"source": "keyword",
"matched_terms": ["boat launch"]
}
]
}
Here the phrase "boAt launch" (one of the phrases generated from the Cartesian product of the AND-groups) is found in the text, so the entity is matched in the keyword stage with high confidence.
Entity: has the keyword "IPO" (an abbreviation: single token, length ≤ 5). The index therefore also contains "IPOs" and "IPOes" mapping to the same entity.
Request:
curl -X POST http://localhost:8000/match-entities \
-H "Content-Type: application/json" \
-d '{
"headline": "Renewed interest in IPOs",
"body": "Investors are returning to the primary market. Many companies are planning IPOs this year."
}'
Response (conceptually):
{
"matches": [
{
"entity_id": 26103,
"confidence": 0.99,
"source": "keyword",
"matched_terms": ["ipo"]
}
]
}
"IPOs" in the text matches the abbreviation plural rule, so the entity is still returned as a keyword match.
Entity: "Technology Development Board" with keywords about grants, RDI fund, etc. Article mentions "TDB" and "first call for proposals" but not the exact phrase.
Request:
curl -X POST http://localhost:8000/match-entities \
-H "Content-Type: application/json" \
-d '{
"headline": "TDB announces first call for proposals",
"body": "The Technology Development Board has invited applications under the RDI fund..."
}'
If the embedding of the article is close enough to the entity's (name + keywords), the entity can appear in the semantic stage. If the score is above SEMANTIC_ACCEPT, it is returned with source="semantic" and confidence equal to that score. If it is in the review band, the cross-encoder decides; if it passes, the response has source="llm" and confidence=0.90.
Relevant options in config.py:
| Option | Default | Description |
|---|---|---|
| `ENTITY_FILE` | `"data/entities.csv"` | Path to entity definitions (EntityId, EntityName, EntityKeyword). |
| `FORCE_CASE_SENSITIVE_TERMS` | `""` | Optional comma-separated exact terms to force case-sensitive matching. |
| `FORCE_CASE_INSENSITIVE_TERMS` | `""` | Optional comma-separated exact terms to force case-insensitive matching. |
| `SEMANTIC_TOP_K` | 20 | Number of nearest entities to consider from FAISS. |
| `SEMANTIC_THRESHOLD` | 0.70 | Minimum similarity to consider a semantic candidate. |
| `SEMANTIC_ACCEPT` | 0.82 | Score ≥ this → accept without verification. |
| `SEMANTIC_REVIEW` | 0.70 | Score in [REVIEW, ACCEPT) → send to cross-encoder. |
| `CROSS_ENCODER_MODEL` | `"cross-encoder/ms-marco-MiniLM-L-6-v2"` | Model used for verification. |
| `CROSS_ENCODER_ACCEPT_THRESHOLD` | 0.75 | Cross-encoder score ≥ this → accept entity. |
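A config.py consistent with the table might look like the sketch below. The environment-variable override mechanism is an assumption; only the option names and defaults come from the table.

```python
import os


def _get(name: str, default: str) -> str:
    """Read an option from the environment, falling back to the default."""
    return os.getenv(name, default)


ENTITY_FILE = _get("ENTITY_FILE", "data/entities.csv")
FORCE_CASE_SENSITIVE_TERMS = _get("FORCE_CASE_SENSITIVE_TERMS", "")
FORCE_CASE_INSENSITIVE_TERMS = _get("FORCE_CASE_INSENSITIVE_TERMS", "")
SEMANTIC_TOP_K = int(_get("SEMANTIC_TOP_K", "20"))
SEMANTIC_THRESHOLD = float(_get("SEMANTIC_THRESHOLD", "0.70"))
SEMANTIC_ACCEPT = float(_get("SEMANTIC_ACCEPT", "0.82"))
SEMANTIC_REVIEW = float(_get("SEMANTIC_REVIEW", "0.70"))
CROSS_ENCODER_MODEL = _get(
    "CROSS_ENCODER_MODEL", "cross-encoder/ms-marco-MiniLM-L-6-v2"
)
CROSS_ENCODER_ACCEPT_THRESHOLD = float(
    _get("CROSS_ENCODER_ACCEPT_THRESHOLD", "0.75")
)
```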
Entity CSV must have columns: EntityId, EntityName, EntityKeyword.
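For illustration, here is a hypothetical two-row entities.csv parsed with the stdlib `csv` module (IDs and keyword expressions are made up; note the doubled quotes required inside a quoted CSV field):

```python
import csv
import io

# Hypothetical entities.csv content (illustrative values only)
SAMPLE = '''EntityId,EntityName,EntityKeyword
25683,boAt,"boAt AND (""launch"" OR ""IPO"")"
26103,IPO,"""IPO"" OR ""Initial Public Offering"""
'''

# DictReader maps each row to the required column names
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
```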
From the evaluation UI (`/`) you can upload an evaluation file and run evals; previous runs are listed. "View results" opens the results viewer for that run (`/view?run_id=...`), where you can inspect per-article matches and keyword highlights.

| Stage | What it does | Output / source |
|---|---|---|
| Keyword | AND/OR group index + Aho-Corasick whole-word matching | source: "keyword", confidence 0.99 |
| Semantic | BGE embeddings + FAISS; two thresholds (accept vs review) | source: "semantic" (high) or sent to verification |
| Verification | Cross-encoder on (entity name, article snippet) | source: "llm", confidence 0.90 |
Together, when all stages are enabled, this gives precise phrase-based matches plus broader semantic coverage, with a local cross-encoder to control false positives in the middle band; in the current deployment only the keyword stage is active.