A two-stage entity matching system that finds which entities (companies, people, topics) are relevant to a given article (headline + body). In the current deployment, only keyword matching is active; semantic search and cross-encoder verification are implemented but disabled.
Input: An article (headline + body text).
Output: A list of matched entities, each with entity_id, confidence, source, and matched_terms.
The pipeline is staged conceptually, but only Stage 1 (keyword matching) is currently active in production. Stages 2 and 3 remain in the codebase but are not executed.
So: keyword matches are trusted immediately, and at present they are the only kind of match produced by the API.
Article (headline + body)
│
▼
┌─────────────────────┐
│ 1. Keyword stage │ → Aho-Corasick + AND/OR group index
│ (phrase match) │
└─────────────────────┘
│
▼
MatchResponse { matches: [ EntityMatch, ... ] }
Goal: Detect entities whose defined keyword expressions appear in the article text.
Each entity has an EntityKeyword expression in the data (e.g. in entities.csv), for example:
"IPO" OR "Initial Public Offering" OR "Healthcare" OR "Pharma"boAt AND ("launch" OR "launched" OR "announces" OR "funding" OR "IPO" OR "market" OR ...)Expressions are parsed into AND-groups of OR-options (respecting parentheses):
AND only when parentheses depth is 0.OR at top level.Example (matching semantics):
"Auto" AND ("B2B" OR "startup")["Auto"] and ["B2B", "startup"]"Auto" and at least one of "B2B"/"startup" as whole-word terms.We use two Aho-Corasick keyword indices under the hood:
Heuristics:
"renewable energy""Renewable Energy""Solar Energy Solutions"These match regardless of casing in the article (e.g. "renewable energy", "Renewable energy", "RENEWABLE ENERGY").
Short all-caps acronyms (≤ 3 letters) → case-sensitive
"LED", "EV", "OEM"These only match when the article uses the same casing (e.g. "LED" does not match "led").
Brand-style mixed casing (uppercase after first letter) → case-sensitive
"boAt", "BoAt", "iPhone", "eBay""BoAt" does not match "boat").There is no automatic plural expansion. If you need plurals like "IPOs" or "LEDs", they must appear explicitly in the keyword expression.
"thar" will not match "Sitharaman", "bus" will not match "business").Result: Each entity that has at least one phrase matched in the text is returned with source="keyword" and confidence=0.99. There is no secondary semantic or LLM verification stage in the live system.
Goal: Find entities that are semantically related to the article even when no keyword phrase appears (paraphrases, related concepts).
EntityName + " " + EntityKeyword (name and full keyword expression). This is embedded once at startup.source="semantic" and confidence=score.source="llm" and confidence=0.90.So semantic matching uses a two-tier threshold: high confidence → accept; medium → verify.
Goal: For entities in the "review" band, decide whether the article is actually about that entity (avoid semantic drift).
- Verification uses a local cross-encoder (`cross-encoder/ms-marco-MiniLM-L-6-v2`) that scores a (query, passage) pair. No external LLM API; it runs locally.
- The pair is `(entity_name, article_snippet)`, where the snippet is the first 1500 characters of the article to keep inference fast and within context.
- If the score clears the acceptance threshold, the match is returned with `source="llm"` and `confidence=0.90`; otherwise it is rejected.

This step reduces false positives when the embedding similarity is only moderate.
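The verification step can be sketched as below. Here `score_fn` is a hypothetical stand-in for the real cross-encoder scoring call (e.g. a `CrossEncoder.predict` on the pair); only the snippet truncation and threshold logic come from the description above.

```python
SNIPPET_CHARS = 1500
CROSS_ENCODER_ACCEPT_THRESHOLD = 0.75  # default from config.py

def verify_match(entity_name: str, article_text: str, score_fn) -> bool:
    """Score (entity_name, article_snippet) and accept above threshold.

    score_fn(query, passage) -> float is a stand-in for the
    cross-encoder's scoring call.
    """
    snippet = article_text[:SNIPPET_CHARS]  # keep inference fast
    return score_fn(entity_name, snippet) >= CROSS_ENCODER_ACCEPT_THRESHOLD
```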
`POST /match-entities` — finds entities relevant to the given article.
Request body (JSON):
| Field | Type | Required | Description |
|---|---|---|---|
| `headline` | string | Yes | Article headline. |
| `body` | string | Yes | Article body text. |
Response (JSON):
| Field | Type | Description |
|---|---|---|
| `matches` | array | List of matched entities (see below). |
Each element of matches:
| Field | Type | Description |
|---|---|---|
| `entity_id` | integer | Entity identifier. |
| `confidence` | float | 0.99 for keyword; raw score for semantic; 0.90 for verified semantic. |
| `source` | string | `"keyword"`, `"semantic"`, or `"llm"` (`llm` = cross-encoder verified). |
| `matched_terms` | array | Phrases/terms that triggered the match (keyword phrase, or semantic reason). |
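The response schema maps naturally onto a small dataclass; this is an illustrative sketch with field names taken from the table above, not necessarily the service's actual model classes.

```python
from dataclasses import dataclass, field


@dataclass
class EntityMatch:
    entity_id: int
    confidence: float              # 0.99 keyword, raw score semantic, 0.90 llm
    source: str                    # "keyword" | "semantic" | "llm"
    matched_terms: list[str] = field(default_factory=list)


@dataclass
class MatchResponse:
    matches: list[EntityMatch] = field(default_factory=list)
```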
Entity (conceptual): "boAt" with expression
boAt AND ("launch" OR "launched" OR "funding" OR "IPO" OR "market")
Request:
curl -X POST http://localhost:8000/match-entities \
-H "Content-Type: application/json" \
-d '{
"headline": "boAt launches new smartwatch",
"body": "The company announced the product at an event in Delhi. Market reaction was positive."
}'
Response (conceptually):
{
"matches": [
{
"entity_id": 25683,
"confidence": 0.99,
"source": "keyword",
"matched_terms": ["boat launch"]
}
]
}
Here the phrase "boAt launch" (one of the phrases generated from the Cartesian product of the AND-groups) is found in the text, so the entity is matched in the keyword stage with high confidence.
Entity: has the keyword "IPO" (an abbreviation: single token, length ≤ 5). The index therefore also contains "IPOs" and "IPOes" mapping to the same entity.
Request:
curl -X POST http://localhost:8000/match-entities \
-H "Content-Type: application/json" \
-d '{
"headline": "Renewed interest in IPOs",
"body": "Investors are returning to the primary market. Many companies are planning IPOs this year."
}'
Response (conceptually):
{
"matches": [
{
"entity_id": 26103,
"confidence": 0.99,
"source": "keyword",
"matched_terms": ["ipo"]
}
]
}
"IPOs" in the text matches the abbreviation plural rule, so the entity is still returned as a keyword match.
Entity: "Technology Development Board" with keywords about grants, RDI fund, etc. Article mentions "TDB" and "first call for proposals" but not the exact phrase.
Request:
curl -X POST http://localhost:8000/match-entities \
-H "Content-Type: application/json" \
-d '{
"headline": "TDB announces first call for proposals",
"body": "The Technology Development Board has invited applications under the RDI fund..."
}'
If the embedding of the article is close enough to the entity's (name + keywords), the entity can appear in the semantic stage. If the score is above SEMANTIC_ACCEPT, it is returned with source="semantic" and confidence equal to that score. If it is in the review band, the cross-encoder decides; if it passes, the response has source="llm" and confidence=0.90.
Relevant options in config.py:
| Option | Default | Description |
|---|---|---|
| `ENTITY_FILE` | `"data/entities.csv"` | Path to entity definitions (EntityId, EntityName, EntityKeyword). |
| `FORCE_CASE_SENSITIVE_TERMS` | `""` | Optional comma-separated exact terms to force case-sensitive matching. |
| `FORCE_CASE_INSENSITIVE_TERMS` | `""` | Optional comma-separated exact terms to force case-insensitive matching. |
| `SEMANTIC_TOP_K` | 20 | Number of nearest entities to consider from FAISS. |
| `SEMANTIC_THRESHOLD` | 0.70 | Minimum similarity to consider a semantic candidate. |
| `SEMANTIC_ACCEPT` | 0.82 | Score ≥ this → accept without verification. |
| `SEMANTIC_REVIEW` | 0.70 | Score in [REVIEW, ACCEPT) → send to cross-encoder. |
| `CROSS_ENCODER_MODEL` | `"cross-encoder/ms-marco-MiniLM-L-6-v2"` | Model used for verification. |
| `CROSS_ENCODER_ACCEPT_THRESHOLD` | 0.75 | Cross-encoder score ≥ this → accept entity. |
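A config.py consistent with the table might look like the sketch below. The environment-variable override mechanism is an assumption; only the option names and defaults come from the table.

```python
import os


def _get(name: str, default: str) -> str:
    """Read an option from the environment, falling back to the default."""
    return os.getenv(name, default)


ENTITY_FILE = _get("ENTITY_FILE", "data/entities.csv")
FORCE_CASE_SENSITIVE_TERMS = _get("FORCE_CASE_SENSITIVE_TERMS", "")
FORCE_CASE_INSENSITIVE_TERMS = _get("FORCE_CASE_INSENSITIVE_TERMS", "")
SEMANTIC_TOP_K = int(_get("SEMANTIC_TOP_K", "20"))
SEMANTIC_THRESHOLD = float(_get("SEMANTIC_THRESHOLD", "0.70"))
SEMANTIC_ACCEPT = float(_get("SEMANTIC_ACCEPT", "0.82"))
SEMANTIC_REVIEW = float(_get("SEMANTIC_REVIEW", "0.70"))
CROSS_ENCODER_MODEL = _get(
    "CROSS_ENCODER_MODEL", "cross-encoder/ms-marco-MiniLM-L-6-v2"
)
CROSS_ENCODER_ACCEPT_THRESHOLD = float(
    _get("CROSS_ENCODER_ACCEPT_THRESHOLD", "0.75")
)
```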
Entity CSV must have columns: EntityId, EntityName, EntityKeyword.
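For illustration, here is a hypothetical two-row entities.csv parsed with the stdlib `csv` module (IDs and keyword expressions are made up; note the doubled quotes required inside a quoted CSV field):

```python
import csv
import io

# Hypothetical entities.csv content (illustrative values only)
SAMPLE = '''EntityId,EntityName,EntityKeyword
25683,boAt,"boAt AND (""launch"" OR ""IPO"")"
26103,IPO,"""IPO"" OR ""Initial Public Offering"""
'''

# DictReader maps each row to the required column names
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
```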
From the evaluation UI (`/`) you can upload an evaluation file and run evals; previous runs are listed. "View results" opens the results viewer for that run (`/view?run_id=...`), where you can inspect per-article matches and keyword highlights.

| Stage | What it does | Output / source |
|---|---|---|
| Keyword | AND/OR group index + Aho-Corasick whole-word matching | source: "keyword", confidence 0.99 |
| Semantic | BGE embeddings + FAISS; two thresholds (accept vs review) | source: "semantic" (high) or sent to verification |
| Verification | Cross-encoder on (entity name, article snippet) | source: "llm", confidence 0.90 |
Together, when all stages are enabled, this gives precise phrase-based matches plus broader semantic coverage, with a local cross-encoder to control false positives in the middle band; in the current deployment only the keyword stage is active.