Semantic re-ranking

Keyword search, the approach commonly used in search engines, ranks documents by how many words they share with the user's query, but it usually ignores what those words mean. For example, it would not distinguish between “bank” as a financial institution and “bank” as the edge of a river, because it has no notion of the contextual and semantic differences between words. Semantic search, by contrast, does capture these differences, which yields more precise and relevant results.
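To make the limitation concrete, here is a minimal sketch of word-overlap scoring. The functions `tokenize` and `keyword_overlap` and the sample sentences are hypothetical and written only for this illustration; real keyword engines use more elaborate scoring, but they share the same blind spot.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_overlap(query: str, document: str) -> int:
    """Count how many distinct query words also appear in the document."""
    return len(tokenize(query) & tokenize(document))


query = "bank opening hours"
finance_doc = "The bank closes at five"
river_doc = "We walked along the river bank"

# Both documents share only the word "bank" with the query,
# so pure word overlap cannot tell the two meanings apart.
finance_score = keyword_overlap(query, finance_doc)
river_score = keyword_overlap(query, river_doc)
print(finance_score, river_score)
```

Both documents receive the same score even though only the first one is about a financial institution, which is the gap that a semantic model is meant to close.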

Migrating completely to semantic search can be a challenge for many companies because keyword-based systems are deeply integrated into their infrastructure. One solution is to re-rank the results of the keyword search with a semantic search model based on word embeddings.

Cohere Rerank offers semantic re-ranking that is easy to integrate into existing systems. It needs only a few lines of code, which allows a smooth transition toward more advanced and precise search methods.

To learn more about Cohere Rerank and how it can benefit your search system, I invite you to visit the Cohere blog.

By combining the contextual precision of semantic search with the efficiency of keyword search, we can overcome the limitations inherent in each method and obtain search results that are more relevant, more precise, and richer in context.

Libraries

import cohere
from dotenv import load_dotenv
from langchain.retrievers import BM25Retriever
from langchain.schema import Document

from src.langchain_docs_loader import load_langchain_docs_splitted

load_dotenv()
True

Loading the data

docs = load_langchain_docs_splitted()

Creating the keyword search tool

BM25 is a ranking function used in information retrieval systems to order documents by their relevance to a search query. Unlike basic keyword matching, which only considers whether words are present or absent, BM25 computes a relevance score for each document that takes into account how often each query term appears in the document and how rare the term is across the document collection.
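The scoring idea can be sketched in a few lines of plain Python. This is a textbook form of Okapi BM25 for illustration only, not the code that `BM25Retriever` runs: the function name `bm25_scores`, the whitespace tokenization, the toy corpus, and the defaults `k1=1.5` and `b=0.75` are all assumptions, and libraries use slightly different variants of the rarity (IDF) term, so absolute scores may differ.

```python
import math
from collections import Counter


def bm25_scores(
    query: str, documents: list[str], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score each document against the query with a textbook Okapi BM25 formula.

    Assumes a non-empty list of documents and naive whitespace tokenization.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rarity: terms found in few documents get a higher weight.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Frequency: repeated terms help, with saturation and
            # a penalty for documents longer than average.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "bm25 ranks documents by relevance",
]
scores = bm25_scores("bm25 relevance", corpus)
best = max(range(len(corpus)), key=lambda i: scores[i])
print(corpus[best])
```

Only the third document contains the query terms, so it is the only one with a non-zero score. Note that the score still depends on exact word matches, which is why a semantic step is added later.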

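# Build a BM25 index over the loaded document chunks.
keyword_retriever = BM25Retriever.from_documents(docs)


def keyword_document_search(query: str, k: int) -> list[Document]:
    # Return the k highest-scoring documents for the query.
    keyword_retriever.k = k
    return keyword_retriever.get_relevant_documents(query)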

Searching for relevant documents by keyword

relevant_keyword_documents = keyword_document_search(
    query="How to integrate LCEL into my Retrieval augmented generation system with a keyword search retriever?",
    k=100,
)

print("Keyword search results:")
for i, document in enumerate(relevant_keyword_documents):
    print(f"{i+1}. {document.metadata['source']}")
Keyword search results:
1. https://python.langchain.com/docs/integrations/memory/remembrall
2. https://python.langchain.com/docs/expression_language/
3. https://python.langchain.com/docs/modules/memory/adding_memory
4. https://python.langchain.com/docs/expression_language/cookbook/
5. https://python.langchain.com/docs/modules/memory/
6. https://python.langchain.com/docs/additional_resources/tutorials
7. https://python.langchain.com/docs/additional_resources/tutorials
8. https://python.langchain.com/docs/use_cases/question_answering/how_to/code/twitter-the-algorithm-analysis-deeplake
9. https://python.langchain.com/docs/integrations/memory/motorhead_memory
10. https://python.langchain.com/docs/use_cases/question_answering/how_to/chat_vector_db
11. https://python.langchain.com/docs/guides/deployments/template_repos
12. https://python.langchain.com/docs/integrations/vectorstores/elasticsearch
13. https://python.langchain.com/docs/use_cases/question_answering/how_to/flare
14. https://python.langchain.com/docs/modules/memory/adding_memory
15. https://python.langchain.com/docs/integrations/providers/myscale
16. https://python.langchain.com/docs/guides/langsmith/
17. https://python.langchain.com/docs/use_cases/question_answering/how_to/multi_retrieval_qa_router
18. https://python.langchain.com/docs/use_cases/question_answering/how_to/code/
19. https://python.langchain.com/docs/integrations/document_loaders/reddit
20. https://python.langchain.com/docs/use_cases/more/agents/agent_simulations/multi_player_dnd
21. https://python.langchain.com/docs/modules/agents/agent_types/chat_conversation_agent
22. https://python.langchain.com/docs/expression_language/cookbook/tools
23. https://python.langchain.com/docs/integrations/vectorstores/neo4jvector
24. https://python.langchain.com/docs/integrations/document_loaders/dropbox
25. https://python.langchain.com/docs/integrations/providers/arangodb
26. https://python.langchain.com/docs/additional_resources/youtube
27. https://python.langchain.com/docs/use_cases/more/agents/agent_simulations/characters
28. https://python.langchain.com/docs/use_cases/more/agents/agent_simulations/characters
29. https://python.langchain.com/docs/integrations/chat/fireworks
30. https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa
31. https://python.langchain.com/docs/integrations/vectorstores/timescalevector
32. https://python.langchain.com/docs/integrations/chat/promptlayer_chatopenai
33. https://python.langchain.com/docs/modules/memory/types/buffer
34. https://python.langchain.com/docs/integrations/document_loaders/ifixit
35. https://python.langchain.com/docs/integrations/document_loaders/ifixit
36. https://python.langchain.com/docs/integrations/tools/metaphor_search
37. https://python.langchain.com/docs/integrations/vectorstores/elasticsearch
38. https://python.langchain.com/docs/use_cases/question_answering/how_to/flare
39. https://python.langchain.com/docs/integrations/tools
40. https://python.langchain.com/docs/integrations/chat/fireworks
41. https://python.langchain.com/docs/integrations/document_loaders/rss
42. https://python.langchain.com/docs/integrations/retrievers/docarray_retriever
43. https://python.langchain.com/docs/integrations/retrievers/elastic_search_bm25
44. https://python.langchain.com/docs/integrations/tools/dataforseo
45. https://python.langchain.com/docs/integrations/tools
46. https://python.langchain.com/docs/integrations/llms/fireworks
47. https://python.langchain.com/docs/integrations/callbacks/promptlayer
48. https://python.langchain.com/docs/use_cases/question_answering/how_to/conversational_retrieval_agents
49. https://python.langchain.com/docs/integrations/document_loaders/sitemap
50. https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/supabase_self_query
51. https://python.langchain.com/docs/modules/data_connection/
52. https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/prompts_pipelining
53. https://python.langchain.com/docs/integrations/vectorstores/marqo
54. https://python.langchain.com/docs/guides/privacy/presidio_data_anonymization/reversible
55. https://python.langchain.com/docs/guides/deployments/
56. https://python.langchain.com/docs/use_cases/question_answering/how_to/local_retrieval_qa
57. https://python.langchain.com/docs/integrations/document_loaders/hugging_face_dataset
58. https://python.langchain.com/docs/integrations/memory
59. https://python.langchain.com/docs/integrations/providers/neo4j
60. https://python.langchain.com/docs/modules/data_connection/retrievers/web_research
61. https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/myscale_self_query
62. https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/myscale_self_query
63. https://python.langchain.com/docs/integrations/text_embedding/sagemaker-endpoint
64. https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/timescalevector_self_query
65. https://python.langchain.com/docs/use_cases/more/agents/agent_simulations/camel_role_playing
66. https://python.langchain.com/docs/use_cases/more/agents/agents/camel_role_playing
67. https://python.langchain.com/docs/integrations/providers/yeagerai
68. https://python.langchain.com/docs/integrations/providers/langchain_decorators
69. https://python.langchain.com/docs/modules/data_connection/document_transformers/
70. https://python.langchain.com/docs/integrations/callbacks/streamlit
71. https://python.langchain.com/docs/integrations/llms/
72. https://python.langchain.com/docs/use_cases/chatbots
73. https://python.langchain.com/docs/integrations/providers/gpt4all
74. https://python.langchain.com/docs/use_cases/more/agents/agent_simulations/characters
75. https://python.langchain.com/docs/integrations/llms/replicate
76. https://python.langchain.com/docs/integrations/providers/weaviate
77. https://python.langchain.com/docs/integrations/toolkits/office365
78. https://python.langchain.com/docs/integrations/document_loaders/ifixit
79. https://python.langchain.com/docs/modules/memory/types/vectorstore_retriever_memory
80. https://python.langchain.com/docs/integrations/document_loaders
81. https://python.langchain.com/docs/integrations/llms/octoai
82. https://python.langchain.com/docs/integrations/vectorstores/supabase
83. https://python.langchain.com/docs/integrations/document_loaders/ifixit
84. https://python.langchain.com/docs/modules/agents/how_to/sharedmemory_for_tools
85. https://python.langchain.com/docs/use_cases/extraction
86. https://python.langchain.com/docs/modules/agents/how_to/add_memory_openai_functions
87. https://python.langchain.com/docs/integrations/toolkits/gmail
88. https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/timescalevector_self_query
89. https://python.langchain.com/docs/integrations/vectorstores
90. https://python.langchain.com/docs/integrations/tools/searchapi
91. https://python.langchain.com/docs/modules/data_connection/document_loaders/markdown
92. https://python.langchain.com/docs/use_cases/question_answering/how_to/local_retrieval_qa
93. https://python.langchain.com/docs/guides/safety/moderation
94. https://python.langchain.com/docs/integrations/llms/
95. https://python.langchain.com/docs/integrations/providers/searchapi
96. https://python.langchain.com/docs/integrations/document_loaders/blackboard
97. https://python.langchain.com/docs/integrations/retrievers
98. https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression/
99. https://python.langchain.com/docs/integrations/providers/motherduck
100. https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token

Semantic re-ranking of the relevant documents

Once we have retrieved the most relevant documents for our search query, we can re-rank them with a semantic search model based on word embeddings.
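The re-ranking step itself is independent of the model that produces the scores. As a rough sketch, assuming a hypothetical `rerank` helper and a toy scoring function `toy_score` that merely stands in for a real semantic model, it could look like this:

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_n: int,
) -> list[tuple[int, float]]:
    """Return (candidate index, score) pairs for the top_n candidates, best first."""
    scored = [(index, score(query, text)) for index, text in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_score(query: str, text: str) -> float:
    """Hypothetical stand-in for a semantic model: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split())) / len(query_words)


candidates = [
    "how to bake bread",
    "keyword search retriever setup",
    "semantic search with a retriever",
]
top = rerank("keyword search retriever", candidates, toy_score, top_n=2)
for index, relevance in top:
    print(index, candidates[index])
```

Returning indices rather than texts makes it easy to map each result back to the original document and its metadata. In practice `toy_score` would be replaced by a call to a semantic model.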

In this case, we will use Cohere Rerank to re-rank the documents returned by BM25 and keep the most relevant ones for our search query.

To use Cohere Rerank, you need a Cohere account and an API key, which you can obtain from Cohere.

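# Assumption: with no arguments, the client reads the API key from the
# CO_API_KEY environment variable, loaded here by load_dotenv().
co = cohere.Client()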
reranked_hits = co.rerank(
    query="How to integrate LCEL into my Retrieval augmented generation system with a keyword search retriever?",
    documents=[doc.page_content for doc in relevant_keyword_documents],
    top_n=10,
    model="rerank-multilingual-v2.0",
)

print("Reranked results:")
for hit in reranked_hits:
    print(relevant_keyword_documents[hit.index].metadata["source"])
Reranked results:
https://python.langchain.com/docs/expression_language/cookbook/
https://python.langchain.com/docs/use_cases/question_answering/how_to/flare
https://python.langchain.com/docs/use_cases/question_answering/how_to/chat_vector_db
https://python.langchain.com/docs/modules/data_connection/
https://python.langchain.com/docs/modules/agents/agent_types/chat_conversation_agent
https://python.langchain.com/docs/integrations/vectorstores/neo4jvector
https://python.langchain.com/docs/use_cases/question_answering/how_to/multi_retrieval_qa_router
https://python.langchain.com/docs/integrations/memory/remembrall
https://python.langchain.com/docs/integrations/retrievers
https://python.langchain.com/docs/modules/data_connection/retrievers/web_research