Ensemble Retriever

The EnsembleRetriever takes a list of retrievers as input, ensembles the results of their get_relevant_documents() methods, and then reranks the combined results using the Reciprocal Rank Fusion (RRF) algorithm.
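Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) across the rankings it appears in, where k is a smoothing constant (60 in the original RRF paper). The following is a minimal standard-library sketch of the idea with optional per-retriever weights, not LangChain's actual implementation:

```python
def reciprocal_rank_fusion(rankings, k=60, weights=None):
    """Fuse several ranked lists of document ids into a single ranking.

    rankings: list of lists of ids, each ordered best-first.
    k: smoothing constant from the original RRF paper.
    weights: optional per-ranking weights (defaults to equal weighting).
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document gains more score the higher it ranks in each list.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks reasonably well in several lists (like "b" below, ranked 2nd and 1st) can beat one that ranks first in only a single list.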

By leveraging the strengths of different algorithms, the EnsembleRetriever can achieve better performance than any single algorithm on its own.

A common pattern is to combine a sparse retriever (such as BM25) with a dense retriever (such as embedding similarity), since their strengths are complementary. This is also known as “hybrid search”.

Libraries

from dotenv import load_dotenv
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import Chroma

from src.langchain_docs_loader import load_langchain_docs_splitted

load_dotenv()
True

Loading the data

docs = load_langchain_docs_splitted()

Initializing the individual retrievers

bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2
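BM25 is a sparse, term-frequency-based ranking function: it rewards query terms that are rare across the corpus and frequent in a given document, with a length normalization term. As a rough illustration of the kind of score BM25Retriever computes, here is a simplified standard-library sketch (not LangChain's implementation, which delegates to the rank_bm25 package):

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query with BM25.

    query: list of query terms; docs: list of token lists.
    k1, b: standard BM25 saturation and length-normalization parameters.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many docs each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b normalizes by doc length.
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            )
            score += idf * norm
        scores.append(score)
    return scores
```

Documents without any query term score zero, and repeated occurrences of a query term raise the score with diminishing returns.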

vector_retriever = Chroma.from_documents(
    docs, embedding=OpenAIEmbeddings()
).as_retriever(search_kwargs={"k": 2})
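The dense retriever, by contrast, ranks documents by vector similarity between the query embedding and the stored document embeddings. A toy cosine-similarity sketch over hand-made vectors (in the cell above the real embeddings come from OpenAIEmbeddings and the search is handled by Chroma):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dense_top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k most similar document vectors."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

Unlike BM25, this matches on meaning rather than exact terms: a document can score highly without sharing any token with the query, as long as its embedding points in a similar direction.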

Ensembling the retrievers

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever], weights=[0.5, 0.5]
)
ensemble_retriever.get_relevant_documents(
    "¿Cómo utilizar un retriever con langchain expression language?"
)
[Document(page_content='# Code understanding\n\nOverview\n\nLangChain is a useful tool designed to parse GitHub code repositories. By leveraging VectorStores, Conversational RetrieverChain, and GPT-4, it can answer questions in the context of an entire GitHub repository or generate new code. This documentation page outlines the essential components of the system and guides using LangChain for better code comprehension, contextual question answering, and code generation in GitHub repositories.\n\n## Conversational Retriever Chain\u200b\n\nConversational RetrieverChain is a retrieval-focused system that interacts with the data stored in a VectorStore. Utilizing advanced techniques, like context-aware filtering and ranking, it retrieves the most relevant code snippets and information for a given user query. Conversational RetrieverChain is engineered to deliver high-quality, pertinent results while considering conversation history and context.\n\nLangChain Workflow for Code Understanding and Generation\n\n1. Index the code base: Clone the target repository, load all files within, chunk the files, and execute the indexing process. Optionally, you can skip this step and use an already indexed dataset.\n\n2. Embedding and Code Store: Code snippets are embedded using a code-aware embedding model and stored in a VectorStore.\nQuery Understanding: GPT-4 processes user queries, grasping the context and extracting relevant details.\n\n3. Construct the Retriever: Conversational RetrieverChain searches the VectorStore to identify the most relevant code snippets for a given query.\n\n4. Build the Conversational Chain: Customize the retriever settings and define any user-defined filters as needed. \n\n5. Ask questions: Define a list of questions to ask about the codebase, and then use the ConversationalRetrievalChain to generate context-aware answers. 
The LLM (GPT-4) generates comprehensive, context-aware answers based on retrieved code snippets and conversation history.\n\nThe full tutorial is available below.\n\n- [Twitter the-algorithm codebase analysis with Deep Lake](/docs/use_cases/question_answering/how_to/code/twitter-the-algorithm-analysis-deeplake.html): A notebook walking through how to parse github source code and run queries conversation.\n- [LangChain codebase analysis with Deep Lake](/docs/use_cases/question_answering/how_to/code/code-analysis-deeplake.html): A notebook walking through how to analyze and do question answering over THIS code base.', metadata={'description': 'Overview', 'language': 'en', 'source': 'https://python.langchain.com/docs/use_cases/question_answering/how_to/code/', 'title': 'Code understanding | 🦜️🔗 Langchain'}),
 Document(page_content='## Output\u200b\n\nAfter translating a document, the result will be returned as a new document with the page_content translated into the target language\n\n```python\ntranslated_document = await qa_translator.atransform_documents(documents)\n```\n\n```python\nprint(translated_document[0].page_content)\n```\n\n```text\n    [Generado con ChatGPT]\n    \n    Documento confidencial - Solo para uso interno\n    \n    Fecha: 1 de julio de 2023\n    \n    Asunto: Actualizaciones y discusiones sobre varios temas\n    \n    Estimado equipo,\n    \n    Espero que este correo electrónico les encuentre bien. En este documento, me gustaría proporcionarles algunas actualizaciones importantes y discutir varios temas que requieren nuestra atención. Por favor, traten la información contenida aquí como altamente confidencial.\n    \n    Medidas de seguridad y privacidad\n    Como parte de nuestro compromiso continuo para garantizar la seguridad y privacidad de los datos de nuestros clientes, hemos implementado medidas robustas en todos nuestros sistemas. Nos gustaría elogiar a John Doe (correo electrónico: john.doe@example.com) del departamento de TI por su diligente trabajo en mejorar nuestra seguridad de red. En adelante, recordamos amablemente a todos que se adhieran estrictamente a nuestras políticas y directrices de protección de datos. Además, si se encuentran con cualquier riesgo de seguridad o incidente potencial, por favor repórtelo inmediatamente a nuestro equipo dedicado en security@example.com.\n    \n    Actualizaciones de RRHH y beneficios para empleados\n    Recientemente, dimos la bienvenida a varios nuevos miembros del equipo que han hecho contribuciones significativas a sus respectivos departamentos. Me gustaría reconocer a Jane Smith (SSN: 049-45-5928) por su sobresaliente rendimiento en el servicio al cliente. Jane ha recibido constantemente comentarios positivos de nuestros clientes. 
Además, recuerden que el período de inscripción abierta para nuestro programa de beneficios para empleados se acerca rápidamente. Si tienen alguna pregunta o necesitan asistencia, por favor contacten a nuestro representante de RRHH, Michael Johnson (teléfono: 418-492-3850, correo electrónico: michael.johnson@example.com).\n    \n    Iniciativas y campañas de marketing\n    Nuestro equipo de marketing ha estado trabajando activamente en el desarrollo de nuevas estrategias para aumentar la conciencia de marca y fomentar la participación del cliente. Nos gustaría agradecer a Sarah Thompson (teléfono: 415-555-1234) por sus excepcionales esfuerzos en la gestión de nuestras plataformas de redes sociales. Sarah ha aumentado con éxito nuestra base de seguidores en un 20% solo en el último mes. Además, por favor marquen sus calendarios para el próximo evento de lanzamiento de producto el 15 de julio. Animamos a todos los miembros del equipo a asistir y apoyar este emocionante hito para nuestra empresa.\n    \n    Proyectos de investigación y desarrollo\n    En nuestra búsqueda de la innovación, nuestro departamento de investigación y desarrollo ha estado trabajando incansablemente en varios proyectos. Me gustaría reconocer el excepcional trabajo de David Rodríguez (correo electrónico: david.rodriguez@example.com) en su papel de líder de proyecto. Las contribuciones de David al desarrollo de nuestra tecnología de vanguardia han sido fundamentales. Además, nos gustaría recordar a todos que compartan sus ideas y sugerencias para posibles nuevos proyectos durante nuestra sesión de lluvia de ideas de I+D mensual, programada para el 10 de julio.\n    \n    Por favor, traten la información de este documento con la máxima confidencialidad y asegúrense de que no se comparte con personas no autorizadas. 
Si tienen alguna pregunta o inquietud sobre los temas discutidos, no duden en ponerse en contacto conmigo directamente.\n    \n    Gracias por su atención, y sigamos trabajando juntos para alcanzar nuestros objetivos.\n    \n    Saludos cordiales,\n    \n    Jason Fan\n    Cofundador y CEO\n    Psychic\n    jason@psychic.dev\n```', metadata={'source': 'https://python.langchain.com/docs/integrations/document_transformers/doctran_translate_document', 'title': 'Doctran: language translation | 🦜️🔗 Langchain', 'description': 'Comparing documents through embeddings has the benefit of working across multiple languages. "Harrison says hello" and "Harrison dice hola" will occupy similar positions in the vector space because they have the same meaning semantically.', 'language': 'en'}),
 Document(page_content='# Tagging\n\n[](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/tagging.ipynb)\n\n## Use case\u200b\n\nTagging means labeling a document with classes such as:\n\n- sentiment\n- language\n- style (formal, informal etc.)\n- covered topics\n- political tendency\n\n![Image description](/assets/images/tagging-93990e95451d92b715c2b47066384224.png)\n\n## Overview\u200b\n\nTagging has a few components:\n\n- `function`: Like [extraction](/docs/use_cases/extraction), tagging uses [functions](https://openai.com/blog/function-calling-and-other-api-updates) to specify how the model should tag a document\n- `schema`: defines how we want to tag the document\n\n## Quickstart\u200b\n\nLet\'s see a very straightforward example of how we can use OpenAI functions for tagging in LangChain.\n\n```bash\npip install langchain openai \n\n# Set env var OPENAI_API_KEY or load from a .env file:\n# import dotenv\n# dotenv.load_dotenv()\n```\n\n```python\nfrom langchain.chat_models import ChatOpenAI\nfrom langchain.prompts import ChatPromptTemplate\nfrom langchain.chains import create_tagging_chain, create_tagging_chain_pydantic\n```\n\n> **API Reference:**\n> - [ChatOpenAI](https://api.python.langchain.com/en/latest/chat_models/langchain.chat_models.openai.ChatOpenAI.html)\n> - [ChatPromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.chat.ChatPromptTemplate.html)\n> - [create_tagging_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.openai_functions.tagging.create_tagging_chain.html)\n> - [create_tagging_chain_pydantic](https://api.python.langchain.com/en/latest/chains/langchain.chains.openai_functions.tagging.create_tagging_chain_pydantic.html)\n\nWe specify a few properties with their expected type in our schema.\n\n```python\n# Schema\nschema = {\n    "properties": {\n        "sentiment": {"type": "string"},\n        "aggressiveness": {"type": "integer"},\n      
  "language": {"type": "string"},\n    }\n}\n\n# LLM\nllm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")\nchain = create_tagging_chain(schema, llm)\n```\n\n```python\ninp = "Estoy increiblemente contento de haberte conocido! Creo que seremos muy buenos amigos!"\nchain.run(inp)\n```\n\n```text\n    {\'sentiment\': \'positive\', \'language\': \'Spanish\'}\n```\n\n```python\ninp = "Estoy muy enojado con vos! Te voy a dar tu merecido!"\nchain.run(inp)\n```\n\n```text\n    {\'sentiment\': \'enojado\', \'aggressiveness\': 1, \'language\': \'es\'}\n```\n\nAs we can see in the examples, it correctly interprets what we want.\n\nThe results vary so that we get, for example, sentiments in different languages (\'positive\', \'enojado\' etc.).\n\nWe will see how to control these results in the next section.', metadata={'source': 'https://python.langchain.com/docs/use_cases/tagging', 'title': 'Tagging | 🦜️🔗 Langchain', 'description': 'Open In Collab', 'language': 'en'}),
 Document(page_content="# Retrieval\n\nMany LLM applications require user-specific data that is not part of the model's training set.\nThe primary way of accomplishing this is through Retrieval Augmented Generation (RAG).\nIn this process, external data is _retrieved_ and then passed to the LLM when doing the _generation_ step.\n\nLangChain provides all the building blocks for RAG applications - from simple to complex.\nThis section of the documentation covers everything related to the _retrieval_ step - e.g. the fetching of the data.\nAlthough this sounds simple, it can be subtly complex.\nThis encompasses several key modules.\n\n![data_connection_diagram](/assets/images/data_connection-c42d68c3d092b85f50d08d4cc171fc25.jpg)\n\n**Document loaders**\n\nLoad documents from many different sources.\nLangChain provides over 100 different document loaders as well as integrations with other major providers in the space,\nlike AirByte and Unstructured.\nWe provide integrations to load all types of documents (HTML, PDF, code) from all types of locations (private s3 buckets, public websites).\n\n**Document transformers**\n\nA key part of retrieval is fetching only the relevant parts of documents.\nThis involves several transformation steps in order to best prepare the documents for retrieval.\nOne of the primary ones here is splitting (or chunking) a large document into smaller chunks.\nLangChain provides several different algorithms for doing this, as well as logic optimized for specific document types (code, markdown, etc).\n\n**Text embedding models**\n\nAnother key part of retrieval has become creating embeddings for documents.\nEmbeddings capture the semantic meaning of the text, allowing you to quickly and\nefficiently find other pieces of text that are similar.\nLangChain provides integrations with over 25 different embedding providers and methods,\nfrom open-source to proprietary API,\nallowing you to choose the one best suited for your needs.\nLangChain provides a 
standard interface, allowing you to easily swap between models.\n\n**Vector stores**\n\nWith the rise of embeddings, there has emerged a need for databases to support efficient storage and searching of these embeddings.\nLangChain provides integrations with over 50 different vectorstores, from open-source local ones to cloud-hosted proprietary ones,\nallowing you to choose the one best suited for your needs.\nLangChain exposes a standard interface, allowing you to easily swap between vector stores.\n\n**Retrievers**\n\nOnce the data is in the database, you still need to retrieve it.\nLangChain supports many different retrieval algorithms and is one of the places where we add the most value.\nWe support basic methods that are easy to get started - namely simple semantic search.\nHowever, we have also added a collection of algorithms on top of this to increase performance.\nThese include:\n\n- [Parent Document Retriever](/docs/modules/data_connection/retrievers/parent_document_retriever): This allows you to create multiple embeddings per parent document, allowing you to look up smaller chunks but return larger context.\n- [Self Query Retriever](/docs/modules/data_connection/retrievers/self_query): User questions often contain a reference to something that isn't just semantic but rather expresses some logic that can best be represented as a metadata filter. Self-query allows you to parse out the _semantic_ part of a query from other _metadata filters_ present in the query.\n- [Ensemble Retriever](/docs/modules/data_connection/retrievers/ensemble): Sometimes you may want to retrieve documents from multiple different sources, or using multiple different algorithms. 
The ensemble retriever allows you to easily do this.\n- And more!", metadata={'description': "Many LLM applications require user-specific data that is not part of the model's training set.", 'language': 'en', 'source': 'https://python.langchain.com/docs/modules/data_connection/', 'title': 'Retrieval | 🦜️🔗 Langchain'})]