The application is a keyword-based entity matcher. It identifies which entities are relevant to a news article using entity keyword expressions loaded from the current keyword snapshot (data/entities_live.json).
Input: article headline and body text.
Output: a list of matched entities with entity_id, confidence, source, and matched_terms.
The current production pipeline is a single stage:
Article (headline + body)
-> Keyword matcher
-> MatchResponse
All returned matches come from keyword detection and are emitted with:
- source = "keyword"
- confidence = 0.99
Goal: Detect entities whose defined keyword expressions appear in the article text.
Each entity has an EntityKeyword expression in the current keyword source, for example:
"IPO" OR "Initial Public Offering" OR "Healthcare" OR "Pharma"
boAt AND ("launch" OR "launched" OR "announces" OR "funding" OR "IPO" OR "market" OR ...)
("Consumer Loans" OR "consumer credit") NOT ("Sensex" OR "Nifty" OR "BSE")
Expressions are parsed into a positive expression and an optional negative expression (respecting parentheses):
- Split on AND only when parentheses depth is 0.
- Within each AND group, split on OR at top level.
- If NOT is present, everything before it is the positive expression and everything after it is the negative expression.
Example (matching semantics):
- Expression: "Auto" AND ("B2B" OR "startup")
- Parsed groups: ["Auto"] and ["B2B", "startup"]
- Matches when the text contains "Auto" and at least one of "B2B"/"startup" as whole-word terms.
Example (NOT semantics):
- Expression: ("Consumer Loans" OR "consumer credit") NOT ("Sensex" OR "Nifty" OR "BSE")
- Positive: Consumer Loans or consumer credit
- Negative: Sensex, Nifty, or BSE
We use two Aho-Corasick keyword indices under the hood, one for case-insensitive terms and one for case-sensitive terms.
Heuristics:
Ordinary lowercase or title-case terms → case-insensitive
- Examples: "renewable energy", "Renewable Energy", "Solar Energy Solutions"
- These match regardless of casing in the article (e.g. "renewable energy", "Renewable energy", "RENEWABLE ENERGY").
Short all-caps acronyms (≤ 3 letters) → case-sensitive
- Examples: "LED", "EV", "OEM"
- These only match when the article uses the same casing (e.g. "LED" does not match "led").
Brand-style mixed casing (uppercase after first letter) → case-sensitive
- Examples: "boAt", "BoAt", "iPhone", "eBay"
- These only match when the article uses the same casing (e.g. "BoAt" does not match "boat").
There is no automatic plural expansion. If you need plurals like "IPOs" or "LEDs", they must appear explicitly in the keyword expression.
Terms match only as whole words (e.g. "thar" will not match "Sitharaman", "bus" will not match "business").
Result: Each entity that has at least one phrase matched in the text is returned with source="keyword" and confidence=0.99. There is no secondary semantic or LLM verification stage in the live system.
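The parsing and matching rules above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: it uses regular expressions with word-boundary lookarounds in place of the Aho-Corasick indices, all function names (`split_top_level`, `parse`, `is_case_sensitive`, `matches`) are hypothetical, the brand-casing rule is one possible reading of the heuristic, and it assumes that any negative term appearing in the text rejects the match.

```python
import re

def split_top_level(expr: str, op: str) -> list:
    """Split expr on the keyword op, only outside quotes and at parenthesis depth 0."""
    parts, depth, in_quote, start, i = [], 0, False, 0, 0
    while i < len(expr):
        ch = expr[i]
        if ch == '"':
            in_quote = not in_quote
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif depth == 0 and expr.startswith(op, i):
                before = expr[i - 1] if i > 0 else " "
                after = expr[i + len(op)] if i + len(op) < len(expr) else " "
                # Only treat op as an operator when it stands alone as a word.
                if not before.isalnum() and not after.isalnum():
                    parts.append(expr[start:i].strip())
                    i += len(op)
                    start = i
                    continue
        i += 1
    parts.append(expr[start:].strip())
    return parts

def is_balanced(s: str) -> bool:
    """True if parentheses outside quotes are balanced."""
    depth, in_quote = 0, False
    for ch in s:
        if ch == '"':
            in_quote = not in_quote
        elif not in_quote and ch == "(":
            depth += 1
        elif not in_quote and ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def unwrap(s: str) -> str:
    """Remove outer parentheses that enclose the whole string."""
    s = s.strip()
    while s.startswith("(") and s.endswith(")") and is_balanced(s[1:-1]):
        s = s[1:-1].strip()
    return s

def terms_of(group: str) -> list:
    """Return the OR alternatives of one group as plain terms."""
    return [unwrap(t).strip('"') for t in split_top_level(unwrap(group), "OR")]

def parse(expr: str):
    """Return (AND groups of OR terms, negative terms)."""
    positive, *negative = split_top_level(expr, "NOT")
    groups = [terms_of(g) for g in split_top_level(positive, "AND")]
    return groups, [t for n in negative for t in terms_of(n)]

def is_case_sensitive(term: str) -> bool:
    """Assumed reading of the casing heuristics described above."""
    if term.isalpha() and term.isupper() and len(term) <= 3:
        return True  # short all-caps acronym such as "LED"
    # Brand-style: some word has an uppercase letter after its first character.
    return any(any(c.isupper() for c in w[1:]) and not w.isupper() for w in term.split())

def find(term: str, text: str) -> bool:
    """Whole-word search, case-sensitive only when the heuristic says so."""
    flags = 0 if is_case_sensitive(term) else re.IGNORECASE
    return re.search(rf"(?<!\w){re.escape(term)}(?!\w)", text, flags) is not None

def matches(expr: str, text: str) -> bool:
    groups, negative = parse(expr)
    if any(find(t, text) for t in negative):
        return False
    return all(any(find(t, text) for t in group) for group in groups)
```

For example, `parse('"Auto" AND ("B2B" OR "startup")')` yields the groups `["Auto"]` and `["B2B", "startup"]` with no negative terms, and `matches('"bus"', "business")` is false because the term is not a whole word in the text.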
The matcher consumes a local snapshot of the keyword API rather than calling the remote service on each request.
Runtime source order:
- data/entities_live.json
Snapshot generation:
- Remote source: GET /v1/mtrack/keywords?type=All
- From a CSV file: python scripts/sync_keywords.py --csv-path /path/to/entities.csv
- Sync script: scripts/sync_keywords.py
- Snapshot output: data/entities_live.json
- Sync metadata: data/entities_live_meta.json
Why this design:
- /data/entities.csv, generated from the live snapshot on demand
Runtime refresh support:
- GET /admin/keywords/status
- POST /admin/reload-keywords
- POST /api/keywords/import/validate
- POST /api/keywords/import/commit
The reload endpoint can fetch the remote full dump, update the snapshot files, and swap the in-memory matcher without restarting the process.
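As a rough illustration of calling the admin endpoints from a script, the sketch below builds requests with Python's standard `urllib.request`. The base URL is an assumption taken from the curl examples in this document, the helper names are hypothetical, and the response body is returned as raw bytes because its format is not described here. The `X-Admin-Token` header is only sent when a token is supplied.

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: same host/port as the curl examples

def build_admin_request(path: str, method: str, token: str = "") -> urllib.request.Request:
    """Build a request for an admin endpoint, adding X-Admin-Token when a token is given."""
    headers = {"X-Admin-Token": token} if token else {}
    return urllib.request.Request(BASE_URL + path, method=method, headers=headers)

def reload_keywords(token: str = "", timeout: float = 60.0) -> bytes:
    """Trigger POST /admin/reload-keywords and return the raw response body."""
    request = build_admin_request("/admin/reload-keywords", "POST", token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

A status check would follow the same pattern with `build_admin_request("/admin/keywords/status", "GET", token)`.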
CSV import rules:
- Required columns: EntityId, EntityName, EntityKeyword
- Optional columns: CreatedOn, MediaType
- Repeated EntityId rows are intentionally preserved as separate rules for the same entity
UI CSV import flow:
POST /match-entities
Finds entities relevant to the given article.
Request body (JSON):
| Field | Type | Required | Description |
|---|---|---|---|
| headline | string | Yes | Article headline. |
| body | string | Yes | Article body text. |
Optional query param:
| Field | Type | Required | Description |
|---|---|---|---|
| mediaType | string | No | Print or Online. If Print, only Print + Both rules are considered. If Online, only Online + Both rules are considered. If omitted, all rules are considered. |
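The mediaType rule can be expressed as a small filter. This is a sketch under assumptions: rules are represented as dictionaries with a `MediaType` key (mirroring the snapshot field of the same name), the function name `filter_rules` is hypothetical, and values are compared exactly, since the document does not say whether the comparison is case-insensitive.

```python
from typing import Dict, List, Optional

def filter_rules(rules: List[Dict], media_type: Optional[str] = None) -> List[Dict]:
    """Keep only the rules that apply to the requested media type."""
    if media_type is None:
        return list(rules)  # parameter omitted: all rules are considered
    allowed = {media_type, "Both"}  # e.g. Print -> Print + Both
    return [rule for rule in rules if rule.get("MediaType") in allowed]

rules = [
    {"EntityId": 1, "MediaType": "Print"},
    {"EntityId": 2, "MediaType": "Online"},
    {"EntityId": 3, "MediaType": "Both"},
]
print([r["EntityId"] for r in filter_rules(rules, "Print")])   # [1, 3]
print([r["EntityId"] for r in filter_rules(rules, "Online")])  # [2, 3]
print([r["EntityId"] for r in filter_rules(rules)])            # [1, 2, 3]
```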
Response (JSON):
| Field | Type | Description |
|---|---|---|
| matches | array | List of matched entities (see below). |
Each element of matches:
| Field | Type | Description |
|---|---|---|
| entity_id | integer | Entity identifier. |
| confidence | float | 0.99 for keyword matches. |
| source | string | "keyword". |
| matched_terms | array | Terms that triggered the match. |
POST /kalki-match-entities
Evaluates whether the supplied client keyword expressions are relevant to the title and body sections of a single article.
Request body (JSON):
| Field | Type | Required | Description |
|---|---|---|---|
| headline | string | Yes | Article headline. |
| body | string | Yes | Article body text. |
| client_keywords | array[string] | Yes | Keyword expressions using the same AND / OR / parentheses syntax as the entity matcher. |
Supported exclusion syntax:
NOT ("term A" OR "term B")
-"term A" -"term B"
Response (JSON):
| Field | Type | Description |
|---|---|---|
| IsRelevant | bool | true if any of the section-level relevance flags is true. |
| IsTitleRelevant | bool | true if any client keyword expression matches the headline. |
| IsFirstParaRelevant | bool | true if any client keyword expression matches the first body section. |
| IsRestOfArticleRelevant | bool | true if any client keyword expression matches the full headline + body context. |
Section logic:
- Each expression in client_keywords is evaluated independently using the same AND / OR matching semantics as the entity API.
- Expressions are combined with OR: if any expression matches a section, that section is relevant.
- IsRestOfArticleRelevant uses the full article context (headline + body) so a match can span across sections, while IsTitleRelevant and IsFirstParaRelevant remain section-specific signals.
Entity (conceptual): "boAt" with expression
boAt AND ("launch" OR "launched" OR "funding" OR "IPO" OR "market")
Request:
curl -X POST http://localhost:8000/match-entities \
  -H "Content-Type: application/json" \
  -d '{
    "headline": "boAt launches new smartwatch",
    "body": "The company announced the product at an event in Delhi. Market reaction was positive."
  }'
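Response (conceptually):
{
  "matches": [
    {
      "entity_id": 25683,
      "confidence": 0.99,
      "source": "keyword",
      "matched_terms": ["boAt", "market"]
    }
  ]
}
Here "boAt" appears with its exact casing and "Market" satisfies the OR group case-insensitively, so the entity is matched in the keyword stage with high confidence. Note that "launches" does not satisfy "launch", because terms match as whole words and there is no automatic plural expansion.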
Request:
curl -X POST http://localhost:8000/kalki-match-entities \
  -H "Content-Type: application/json" \
  -d '{
    "headline": "Mahindra reveals XEV 9e Cineluxe Edition at 29.35 lakh",
    "body": "Opening paragraph covers market context only.\n\nMahindra later unveils the luxury special edition XEV 9e Cineluxe Edition with a 500 km range in March 2026.",
    "client_keywords": [
      "(\"Mahindra\") AND (\"XEV 9e Cineluxe Edition\" OR \"29.35\")",
      "(\"XEV 9e\" OR \"Cineluxe Edition\") AND (\"luxury\" OR \"special edition\" OR \"Launches\" OR \"Reveals\" OR \"Unveils\" OR \"29.35\" OR \"Introduces\" OR \"March 2026\" OR \"Exclusive\" OR \"500 km range\")"
    ]
  }'
Response:
{
  "IsRelevant": true,
  "IsTitleRelevant": true,
  "IsFirstParaRelevant": false,
  "IsRestOfArticleRelevant": true
}
The title matches immediately. The opening paragraph on its own does not satisfy either expression, but the full article context (headline + body) does.
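The section logic of /kalki-match-entities can be sketched as follows. This is an illustration under assumptions rather than the service's code: the first body section is taken to be the text before the first blank line (as in the example request above), the function name `relevance_flags` is hypothetical, and the expression matcher is passed in as a callable. The `contains` helper is only a stand-in substring check for the demo, not the real AND / OR matcher.

```python
from typing import Callable, Dict, List

def relevance_flags(
    headline: str,
    body: str,
    client_keywords: List[str],
    matches: Callable[[str, str], bool],
) -> Dict[str, bool]:
    """Compute the four relevance flags; expressions are combined with OR per section."""
    first_para = body.split("\n\n", 1)[0]  # assumption: sections are separated by a blank line
    full_text = headline + "\n" + body     # full article context (headline + body)

    def any_match(text: str) -> bool:
        return any(matches(expr, text) for expr in client_keywords)

    title = any_match(headline)
    first = any_match(first_para)
    rest = any_match(full_text)
    return {
        "IsRelevant": title or first or rest,
        "IsTitleRelevant": title,
        "IsFirstParaRelevant": first,
        "IsRestOfArticleRelevant": rest,
    }

def contains(expr: str, text: str) -> bool:
    """Stand-in matcher for the demo: plain case-insensitive substring check."""
    return expr.lower() in text.lower()

flags = relevance_flags(
    "Mahindra reveals XEV 9e Cineluxe Edition",
    "Opening paragraph covers market context only.\n\nMahindra later unveils the luxury special edition.",
    ["xev 9e"],
    contains,
)
print(flags)
```

Passing the matcher as a parameter keeps the section logic separate from the expression semantics, so the same function works with whichever matcher the entity API uses.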
Relevant options in config.py:
| Option | Default | Description |
|---|---|---|
| ENTITY_SNAPSHOT_JSON | "data/entities_live.json" | Primary runtime snapshot used by the matcher. |
| ENTITY_SNAPSHOT_META | "data/entities_live_meta.json" | Sync metadata file with row counts and source URL. |
| KEYWORD_API_BASE_URL | UAT keyword API URL | Base URL for the remote keyword API. |
| KEYWORD_API_TIMEOUT_SECONDS | 60 | Timeout for keyword sync API calls. |
| KEYWORD_SYNC_MIN_ROWS | 1000 | Reject suspiciously small snapshots. |
| KEYWORD_SYNC_MIN_RATIO_VS_PREVIOUS | 0.5 | Reject snapshots that drop too far below the previous row count. |
| ADMIN_API_TOKEN | "" | Optional token required in X-Admin-Token for admin endpoints. |
| FORCE_CASE_SENSITIVE_TERMS | "" | Optional comma-separated exact terms to force case-sensitive matching. |
| FORCE_CASE_INSENSITIVE_TERMS | "" | Optional comma-separated exact terms to force case-insensitive matching. |
Snapshot rows use: EntityId, EntityName, EntityKeyword. The generated snapshot also stores CreatedOn and MediaType for audit/debug purposes.
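The two sync guards, KEYWORD_SYNC_MIN_ROWS and KEYWORD_SYNC_MIN_RATIO_VS_PREVIOUS, can be read as a simple size check on a freshly fetched snapshot. The sketch below is one plausible interpretation, not the project's code: the function name is hypothetical, and the exact comparison operators (strictly below the threshold means rejection) are an assumption.

```python
from typing import Optional

def snapshot_rejection_reason(
    new_rows: int,
    previous_rows: Optional[int] = None,
    min_rows: int = 1000,      # KEYWORD_SYNC_MIN_ROWS default
    min_ratio: float = 0.5,    # KEYWORD_SYNC_MIN_RATIO_VS_PREVIOUS default
) -> Optional[str]:
    """Return a reason to reject the new snapshot, or None if its size looks acceptable."""
    if new_rows < min_rows:
        return f"snapshot has {new_rows} rows, below the minimum of {min_rows}"
    if previous_rows and new_rows < min_ratio * previous_rows:
        return (
            f"snapshot has {new_rows} rows, less than {min_ratio:.0%} "
            f"of the previous {previous_rows}"
        )
    return None

print(snapshot_rejection_reason(500))         # rejected: too few rows
print(snapshot_rejection_reason(1200, 3000))  # rejected: dropped below half of previous
print(snapshot_rejection_reason(1600, 3000))  # None: accepted
```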
- On the home page (/) you can upload an evaluation file and run evals; previous runs are listed. "View results" opens the results viewer for that run (/view?run_id=...), where you can inspect per-article matches and keyword highlights.
- /try sends headline/body to /match-entities and shows detected entities with matched terms.
- /try-kalki sends headline/body/client keyword expressions to /kalki-match-entities and shows the four relevance flags plus the exact request/response payloads.
The system is intentionally simple:
There is no semantic retrieval or secondary verification stage in the current application.