# Reranking Strategies Guide

**Version**: 1.0

**Date**: 2025-11-14

**Module**: `aiecs.application.knowledge_graph.search.reranker_strategies`

## Overview

This guide covers the built-in reranking strategies available in the knowledge graph search system. Each strategy uses different signals to score entity relevance, and they can be combined for optimal results.

## Built-in Strategies

### 1. TextSimilarityReranker

Scores entities based on text similarity using BM25 and Jaccard similarity.

```python
from aiecs.application.knowledge_graph.search.reranker_strategies import TextSimilarityReranker
```

#### How It Works

- **BM25**: Term-based relevance scoring (TF-IDF variant)
- **Jaccard**: Set overlap between query and entity text
- **Combination**: Weighted combination of both scores

#### Constructor

```python
def __init__(
    self,
    bm25_weight: float = 0.7,
    jaccard_weight: float = 0.3,
    text_fields: Optional[List[str]] = None
)
```

**Parameters**:

- `bm25_weight` (float): Weight for BM25 score (default: 0.7)
- `jaccard_weight` (float): Weight for Jaccard score (default: 0.3)
- `text_fields` (Optional[List[str]]): Entity fields to use for text (default: ["name", "description"])

#### Example

```python
# Create text similarity reranker
text_reranker = TextSimilarityReranker(
    bm25_weight=0.7,
    jaccard_weight=0.3,
    text_fields=["name", "description", "content"]
)

# Score entities
scores = await text_reranker.score(
    query="machine learning algorithms",
    entities=search_results
)
```

#### When to Use

- ✅ Query contains specific keywords
- ✅ Exact term matching is important
- ✅ Entity text is rich and descriptive
- ❌ Query is very short or generic
- ❌ Semantic meaning is more important than keywords

#### Performance

- **Speed**: Fast (no external calls)
- **Memory**: Low
- **Accuracy**: Good for keyword-based queries

---

### 2. SemanticReranker

Scores entities based on semantic similarity using vector embeddings.
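
As a rough illustration of the idea behind embedding-based scoring, the sketch below computes cosine similarity between a query vector and an entity vector in plain Python. It is not the library's implementation: `cosine_similarity` and `score_entity` are hypothetical helpers, and the neutral `0.5` returned for a missing vector simply mirrors the fallback this guide describes for `SemanticReranker`.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def score_entity(
    query_vec: Sequence[float],
    entity_vec: Optional[Sequence[float]],
    fallback: float = 0.5,
) -> float:
    """Hypothetical helper: similarity, or a neutral fallback if no vector."""
    if not entity_vec:
        return fallback
    return cosine_similarity(query_vec, entity_vec)


query_vec = [1.0, 0.0]
print(score_entity(query_vec, [2.0, 0.0]))  # same direction -> 1.0
print(score_entity(query_vec, [0.0, 3.0]))  # orthogonal -> 0.0
print(score_entity(query_vec, None))        # missing embedding -> 0.5
```

Cosine similarity ignores vector length, so an entity whose embedding points in the same direction as the query scores 1.0 regardless of magnitude.
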
```python
from aiecs.application.knowledge_graph.search.reranker_strategies import SemanticReranker
```

#### How It Works

- **Embeddings**: Uses entity embedding vectors
- **Similarity**: Computes cosine similarity with query embedding
- **Fallback**: Returns 0.5 if embeddings are missing

#### Constructor

```python
def __init__(self)
```

No parameters required.

#### Example

```python
# Create semantic reranker
semantic_reranker = SemanticReranker()

# Score entities (requires query_embedding)
scores = await semantic_reranker.score(
    query="machine learning",
    entities=search_results,
    query_embedding=[0.1, 0.2, 0.3, ...]  # Query vector
)
```

#### When to Use

- ✅ Semantic meaning is important
- ✅ Query and entities have embeddings
- ✅ Handling synonyms and related concepts
- ✅ Cross-lingual search
- ❌ Embeddings are not available
- ❌ Exact keyword matching is critical

#### Performance

- **Speed**: Fast (vector operations)
- **Memory**: Medium (stores embeddings)
- **Accuracy**: Excellent for semantic queries

---

### 3. StructuralReranker

Scores entities based on graph structure (PageRank, centrality).
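
To make the two structural signals concrete, here is a minimal sketch that computes PageRank by power iteration and a degree-based centrality over a tiny adjacency list, then blends them with 0.7/0.3 weights. Everything here is an assumption for illustration: the function names, the damping factor of 0.85, and the choice to scale both signals by their maximum are not taken from `StructuralReranker`, which works against a `GraphStore`.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # node -> outgoing neighbours (every target must be a key)


def pagerank(graph: Graph, damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank for a non-empty graph."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            for target in graph[node]:
                new_rank[target] += damping * rank[node] / len(graph[node])
        rank = new_rank
    return rank


def degree_centrality(graph: Graph) -> Dict[str, float]:
    """In-degree plus out-degree, scaled by the largest degree (illustrative choice)."""
    degree = {node: len(targets) for node, targets in graph.items()}
    for targets in graph.values():
        for target in targets:
            degree[target] += 1
    max_degree = max(degree.values()) or 1
    return {node: d / max_degree for node, d in degree.items()}


def structural_scores(
    graph: Graph, pagerank_weight: float = 0.7, centrality_weight: float = 0.3
) -> Dict[str, float]:
    """Weighted blend of max-scaled PageRank and degree centrality."""
    pr = pagerank(graph)
    dc = degree_centrality(graph)
    max_pr = max(pr.values())
    return {
        node: pagerank_weight * pr[node] / max_pr + centrality_weight * dc[node]
        for node in graph
    }


# "c" is linked to by every other node, so it should rank first
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = structural_scores(graph)
print(sorted(scores, key=scores.get, reverse=True))
```

Scaling both signals to a common range before blending keeps the weights meaningful; without it, raw PageRank values (which sum to 1 across the graph) would be dwarfed by raw degree counts.
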
```python
from aiecs.application.knowledge_graph.search.reranker_strategies import StructuralReranker
```

#### How It Works

- **PageRank**: Scores based on entity importance in graph
- **Degree Centrality**: Scores based on number of connections
- **Combination**: Weighted combination of both metrics

#### Constructor

```python
def __init__(
    self,
    graph_store: GraphStore,
    pagerank_weight: float = 0.7,
    centrality_weight: float = 0.3,
    use_cache: bool = True
)
```

**Parameters**:

- `graph_store` (GraphStore): Graph storage backend
- `pagerank_weight` (float): Weight for PageRank score (default: 0.7)
- `centrality_weight` (float): Weight for centrality score (default: 0.3)
- `use_cache` (bool): Whether to cache PageRank scores (default: True)

#### Example

```python
# Create structural reranker
structural_reranker = StructuralReranker(
    graph_store=store,
    pagerank_weight=0.7,
    centrality_weight=0.3
)

# Score entities
scores = await structural_reranker.score(
    query="important entities",
    entities=search_results
)
```

#### When to Use

- ✅ Entity importance matters
- ✅ Well-connected entities are more relevant
- ✅ Graph structure is meaningful
- ❌ All entities are equally important
- ❌ Graph is sparse or disconnected

#### Performance

- **Speed**: Medium (requires graph queries)
- **Memory**: Medium (caches PageRank)
- **Accuracy**: Good for authority-based ranking

---

### 4. HybridReranker

Combines text, semantic, and structural signals into a single strategy.
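
The general pattern of normalizing several signals and then blending them with weights can be sketched as follows. This is an illustrative stand-in, not `HybridReranker`'s actual logic: `min_max_normalize` and `combine_signals` are hypothetical names, min-max scaling is one assumed normalization choice, and the raw scores are made up.

```python
from typing import Dict, List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; a flat list maps to a neutral 0.5 (assumption)."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def combine_signals(signals: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-signal normalized scores, one value per entity."""
    normalized = {name: min_max_normalize(vals) for name, vals in signals.items()}
    total_weight = sum(weights[name] for name in normalized)
    count = len(next(iter(normalized.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in normalized) / total_weight
        for i in range(count)
    ]


# Made-up raw scores for three entities, deliberately on different scales
signals = {
    "text": [2.0, 8.0, 5.0],
    "semantic": [0.9, 0.1, 0.5],
    "structural": [0.0, 0.0, 1.0],
}
weights = {"text": 0.4, "semantic": 0.4, "structural": 0.2}

combined = combine_signals(signals, weights)
print(combined)  # the third entity is moderate on text/semantic but wins on structure
```

Normalizing first matters here: without it, the raw text scores (up to 8.0) would swamp the other two signals no matter how the weights were set.
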
```python
from aiecs.application.knowledge_graph.search.reranker_strategies import HybridReranker
```

#### How It Works

- **Multi-Signal**: Combines all three reranking approaches
- **Weighted**: Configurable weights for each signal
- **Normalized**: Normalizes scores before combining

#### Constructor

```python
def __init__(
    self,
    graph_store: GraphStore,
    text_weight: float = 0.4,
    semantic_weight: float = 0.4,
    structural_weight: float = 0.2,
    text_fields: Optional[List[str]] = None
)
```

**Parameters**:

- `graph_store` (GraphStore): Graph storage backend
- `text_weight` (float): Weight for text similarity (default: 0.4)
- `semantic_weight` (float): Weight for semantic similarity (default: 0.4)
- `structural_weight` (float): Weight for structural importance (default: 0.2)
- `text_fields` (Optional[List[str]]): Entity fields for text similarity

#### Example

```python
# Create hybrid reranker
hybrid_reranker = HybridReranker(
    graph_store=store,
    text_weight=0.4,
    semantic_weight=0.4,
    structural_weight=0.2
)

# Score entities
scores = await hybrid_reranker.score(
    query="machine learning",
    entities=search_results,
    query_embedding=[0.1, 0.2, ...]
)
```

#### When to Use

- ✅ Want comprehensive ranking
- ✅ Multiple signals are available
- ✅ Balanced approach is needed
- ❌ Only one signal is available
- ❌ Need fine-grained control over strategies

#### Performance

- **Speed**: Medium (combines all strategies)
- **Memory**: Medium
- **Accuracy**: Excellent for general-purpose ranking

---

## Strategy Comparison

| Strategy | Speed | Memory | Best For | Requires |
|----------|-------|--------|----------|----------|
| TextSimilarity | Fast | Low | Keyword queries | Entity text |
| Semantic | Fast | Medium | Semantic queries | Embeddings |
| Structural | Medium | Medium | Authority ranking | Graph structure |
| Hybrid | Medium | Medium | General purpose | All of the above |

---

## Combining Strategies

### Using ResultReranker

Combine multiple strategies with custom weights:

```python
from aiecs.application.knowledge_graph.search.reranker import (
    ResultReranker,
    ScoreCombinationMethod
)
from aiecs.application.knowledge_graph.search.reranker_strategies import (
    TextSimilarityReranker,
    SemanticReranker,
    StructuralReranker
)

# Create individual strategies
text_reranker = TextSimilarityReranker()
semantic_reranker = SemanticReranker()
structural_reranker = StructuralReranker(graph_store)

# Combine with weighted average
reranker = ResultReranker(
    strategies=[text_reranker, semantic_reranker, structural_reranker],
    combination_method=ScoreCombinationMethod.WEIGHTED_AVERAGE,
    weights={
        "text": 0.4,
        "semantic": 0.4,
        "structural": 0.2
    }
)
```

### Combination Methods

#### 1. Weighted Average (Recommended)

Combines scores using a weighted average:

```python
combination_method=ScoreCombinationMethod.WEIGHTED_AVERAGE
weights={"text": 0.6, "semantic": 0.4}
```

**When to use**: Most cases; it lets you fine-tune the importance of each strategy

#### 2. Reciprocal Rank Fusion (RRF)

Combines based on ranks rather than scores:

```python
combination_method=ScoreCombinationMethod.RRF
```

**When to use**: Scores are on different scales and you want rank-based fusion

#### 3. Max Score

Takes the maximum score across strategies:

```python
combination_method=ScoreCombinationMethod.MAX
```

**When to use**: Want entities that excel in any strategy

#### 4. Min Score

Takes the minimum score across strategies:

```python
combination_method=ScoreCombinationMethod.MIN
```

**When to use**: Want entities that score well in all strategies

---

## Best Practices

### 1. Choose the Right Strategy

```python
# For keyword-heavy queries
if query_has_specific_terms:
    reranker = TextSimilarityReranker()
# For semantic/conceptual queries
elif query_is_conceptual:
    reranker = SemanticReranker()
# For authority-based ranking
elif importance_matters:
    reranker = StructuralReranker(graph_store)
# For general purpose
else:
    reranker = HybridReranker(graph_store)
```

### 2. Tune Weights

Start with default weights and adjust based on results:

```python
# Default balanced weights
weights = {"text": 0.4, "semantic": 0.4, "structural": 0.2}

# Keyword-focused
weights = {"text": 0.7, "semantic": 0.2, "structural": 0.1}

# Semantic-focused
weights = {"text": 0.2, "semantic": 0.7, "structural": 0.1}

# Authority-focused
weights = {"text": 0.3, "semantic": 0.3, "structural": 0.4}
```

### 3. Normalize Scores

Always normalize scores when combining strategies:

```python
reranker = ResultReranker(
    strategies=[...],
    normalize_scores=True,  # Important!
    normalization_method="min_max"
)
```

### 4. Use Top-K Limiting

Limit results for better performance:

```python
reranked = await reranker.rerank(
    query=query,
    entities=entities,
    top_k=20  # Only return top 20
)
```

### 5. Cache When Possible

Enable caching for the structural reranker:

```python
structural_reranker = StructuralReranker(
    graph_store=store,
    use_cache=True  # Cache PageRank scores
)
```

---

## Custom Strategies

### Creating a Custom Strategy

Implement the `RerankerStrategy` interface:

```python
import time
from typing import List

from aiecs.application.knowledge_graph.search.reranker import RerankerStrategy
from aiecs.domain.knowledge_graph.models.entity import Entity


class RecencyReranker(RerankerStrategy):
    """Rerank based on entity recency"""

    @property
    def name(self) -> str:
        return "recency"

    async def score(
        self,
        query: str,
        entities: List[Entity],
        **kwargs
    ) -> List[float]:
        """Score based on creation/update time"""
        scores = []
        for entity in entities:
            # Get timestamp (seconds since epoch) from entity metadata
            timestamp = entity.metadata.get("updated_at", 0)

            # Age in days, clamped so future timestamps cannot push the score above 1
            age_days = max(0.0, (time.time() - timestamp) / 86400)
            score = 1.0 / (1.0 + age_days / 365)  # Halves after one year
            scores.append(score)

        return scores
```

### Using Custom Strategy

```python
# Create custom strategy
recency_reranker = RecencyReranker()

# Use with ResultReranker
reranker = ResultReranker(
    strategies=[text_reranker, recency_reranker],
    weights={"text": 0.7, "recency": 0.3}
)
```

---

## Use Cases

### Use Case 1: Academic Paper Search

**Goal**: Find relevant papers with high citation count

**Strategy**:

```python
# Combine semantic similarity with structural importance
reranker = ResultReranker(
    strategies=[
        SemanticReranker(),
        StructuralReranker(graph_store)  # Citations = high PageRank
    ],
    weights={
        "semantic": 0.6,
        "structural": 0.4  # Emphasize citations
    }
)
```

### Use Case 2: Product Search

**Goal**: Find products matching keywords with good reviews

**Strategy**:

```python
# Combine text matching with custom review score
class ReviewReranker(RerankerStrategy):
    @property
    def name(self) -> str:
        return "reviews"

    async def score(self, query, entities, **kwargs):
        # Assumes review_score is on a 0-5 scale; entities without reviews
        # get a neutral 2.5, which maps to 0.5 after scaling
        return [
            entity.metadata.get("review_score", 2.5) / 5.0
            for entity in entities
        ]

reranker = ResultReranker(
    strategies=[
        TextSimilarityReranker(),
        ReviewReranker()
    ],
    weights={"text": 0.7, "reviews": 0.3}
)
```

### Use Case 3: Expert Finding

**Goal**: Find experts in a domain

**Strategy**:

```python
# Emphasize structural importance (connections, collaborations)
reranker = ResultReranker(
    strategies=[
        SemanticReranker(),
        StructuralReranker(graph_store)
    ],
    weights={
        "semantic": 0.3,
        "structural": 0.7  # Emphasize network position
    }
)
```

### Use Case 4: News Article Search

**Goal**: Find relevant recent articles

**Strategy**:

```python
# Combine relevance with recency
reranker = ResultReranker(
    strategies=[
        TextSimilarityReranker(),
        SemanticReranker(),
        RecencyReranker()
    ],
    weights={
        "text": 0.4,
        "semantic": 0.3,
        "recency": 0.3  # Boost recent articles
    }
)
```

---

## Performance Optimization

### 1. Batch Processing

Process multiple queries concurrently:

```python
import asyncio

async def rerank_batch(queries, entities_list):
    """Rerank multiple query-entity pairs"""
    tasks = [
        reranker.rerank(query, entities)
        for query, entities in zip(queries, entities_list)
    ]
    return await asyncio.gather(*tasks)
```

### 2. Caching

Cache expensive computations:

```python
class CachedStructuralReranker(StructuralReranker):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._pagerank_cache = {}

    async def _get_pagerank_scores(self, entity_ids):
        # Check cache first
        cache_key = tuple(sorted(entity_ids))
        if cache_key in self._pagerank_cache:
            return self._pagerank_cache[cache_key]

        # Compute and cache
        scores = await super()._get_pagerank_scores(entity_ids)
        self._pagerank_cache[cache_key] = scores
        return scores
```

### 3. Parallel Strategy Execution

Execute strategies in parallel:

```python
import asyncio

async def parallel_rerank(query, entities):
    """Execute all strategies in parallel"""
    # Get scores from all strategies concurrently
    score_tasks = [
        strategy.score(query, entities)
        for strategy in reranker.strategies
    ]
    all_scores = await asyncio.gather(*score_tasks)

    # Combine scores
    # ... (combination logic)
```

### 4. Early Stopping

Return a shorter list when the top result is already clear (the full rerank still runs first):

```python
async def rerank_with_early_stop(
    query,
    entities,
    confidence_threshold=0.9
):
    """Return only the top 10 if the top result has high confidence"""
    reranked = await reranker.rerank(query, entities)

    if reranked and reranked[0][1] > confidence_threshold:
        # Top result is very confident, return early
        return reranked[:10]

    return reranked
```

---

## Troubleshooting

### Problem: Low Scores for All Entities

**Cause**: Normalization issue or missing data

**Solution**:

```python
# Check raw scores before normalization
reranker = ResultReranker(
    strategies=[...],
    normalize_scores=False  # Disable to debug
)

# Or use different normalization
reranker = ResultReranker(
    strategies=[...],
    normalization_method="softmax"  # Try different method
)
```

### Problem: One Strategy Dominates

**Cause**: Scores on different scales

**Solution**:

```python
# Always normalize scores
reranker = ResultReranker(
    strategies=[...],
    normalize_scores=True,  # Enable normalization
    normalization_method="min_max"
)

# Or adjust weights
weights = {
    "dominant_strategy": 0.3,  # Reduce weight
    "other_strategy": 0.7  # Increase weight
}
```

### Problem: Slow Performance

**Cause**: Expensive strategy computations

**Solution**:

```python
# Use caching
structural_reranker = StructuralReranker(
    graph_store=store,
    use_cache=True
)

# Limit results early
reranked = await reranker.rerank(
    query=query,
    entities=entities[:100],  # Limit input size
    top_k=20
)

# Use faster strategies
# Replace SemanticReranker with TextSimilarityReranker if embeddings are slow
```

### Problem: Missing Embeddings

**Cause**: Entities don't have embedding vectors

**Solution**:

```python
# Provide fallback score
class SafeSemanticReranker(SemanticReranker):
    async def score(self, query, entities, **kwargs):
        # The query vector arrives as a keyword argument, as in the examples above
        query_emb = kwargs.get("query_embedding")
        scores = []
        for entity in entities:
            if query_emb is not None and entity.embedding:
                # compute_similarity is a placeholder for your own similarity function
                score = compute_similarity(query_emb, entity.embedding)
            else:
                score = 0.5  # Neutral score for missing embeddings
            scores.append(score)
        return scores
```

---

## Testing Strategies

### Unit Testing

```python
import pytest

from aiecs.application.knowledge_graph.search.reranker_strategies import TextSimilarityReranker
from aiecs.domain.knowledge_graph.models.entity import Entity


@pytest.mark.asyncio
async def test_text_similarity_reranker():
    """Test text similarity reranker"""
    reranker = TextSimilarityReranker()

    # Create test entities
    entities = [
        Entity(id="1", name="Machine Learning", description="ML algorithms"),
        Entity(id="2", name="Deep Learning", description="Neural networks"),
        Entity(id="3", name="Cooking", description="Recipes and food")
    ]

    # Score entities
    scores = await reranker.score("machine learning", entities)

    # Verify scores
    assert len(scores) == 3
    assert scores[0] > scores[2]  # ML more relevant than cooking
    assert all(0 <= s <= 1 for s in scores)  # Scores in valid range
```

### Integration Testing

```python
@pytest.mark.asyncio
async def test_result_reranker_integration():
    """Test full reranker pipeline"""
    reranker = ResultReranker(
        strategies=[
            TextSimilarityReranker(),
            SemanticReranker()
        ],
        weights={"text": 0.6, "semantic": 0.4}
    )

    # Rerank entities (test_entities is assumed to be defined elsewhere, e.g. as a fixture)
    reranked = await reranker.rerank(
        query="machine learning",
        entities=test_entities,
        top_k=10
    )

    # Verify results
    assert len(reranked) <= 10
    assert all(isinstance(item, tuple) for item in reranked)
    assert all(0 <= score <= 1 for _, score in reranked)

    # Verify sorted descending
    scores = [score for _, score in reranked]
    assert scores == sorted(scores, reverse=True)
```

---

## Conclusion

The reranking framework provides flexible, composable strategies for improving search result relevance.
Key takeaways:

- ✅ **Choose the right strategy** for your use case
- ✅ **Combine strategies** for better results
- ✅ **Tune weights** based on your data
- ✅ **Normalize scores** when combining
- ✅ **Cache expensive computations**
- ✅ **Test thoroughly** before production

For more information, see the [ResultReranker API Documentation](result-reranker-api.md).