Vector indexing

Libraries

from dotenv import load_dotenv
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes import SQLRecordManager, index
from langchain.schema import Document
from langchain.vectorstores import Chroma

from src.langchain_docs_loader import load_langchain_docs_splitted

load_dotenv()  # It should output True
True

Loading the data

docs = load_langchain_docs_splitted()
f"Loaded {len(docs)} documents"
'Loaded 2684 documents'
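
Each loaded item is a Document whose metadata is expected to include a "source" entry identifying where the chunk came from; the indexing calls below rely on that field via source_id_key="source". A quick inspection (a sketch only, output not captured here):

docs[0].page_content[:80]   # first characters of the chunk
docs[0].metadata["source"]  # identifier of the origin, used below as the source id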

Initializing the components of an index

To create an index we need to initialize:

  • Vectorstore: stores the vectors for our documents.
  • Record Manager: keeps track of which vectors have been indexed and when.
collection_name = "langchain_docs_index"
embeddings = OpenAIEmbeddings()
vector_store = Chroma(
    collection_name=collection_name,
    embedding_function=embeddings,
)
namespace = f"chroma/{collection_name}"
record_manager = SQLRecordManager(
    namespace=namespace,
    db_url="sqlite:///:memory:",
)
record_manager.create_schema()

It is good practice to build our Record Manager's namespace from the name of our vector store and the name of the collection we are indexing.
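
The in-memory SQLite URL used above is convenient for experimenting, but the record of what has been indexed is lost when the process ends. Below is a minimal sketch of a persistent setup, assuming a local SQLite file (the file name is illustrative):

# Illustrative persistent setup: any SQLAlchemy-compatible db_url should work;
# the file name below is an example, not a requirement.
persistent_record_manager = SQLRecordManager(
    namespace=f"chroma/{collection_name}",
    db_url="sqlite:///record_manager_cache.sql",
)
persistent_record_manager.create_schema()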

Indexing

Utility function to clear our index

def clear_index():
    """Hacky helper method to clear content. See the `full` mode section to to understand why it works."""
    index(
        [],
        record_manager=record_manager,
        vector_store=vector_store,
        cleanup="full",
        source_id_key="source",
    )

Indexing with cleanup mode None

This is the default option. This mode does not remove previously indexed documents. It does, however, de-duplicate documents before indexing them.

index(
    docs_source=docs,
    record_manager=record_manager,
    vector_store=vector_store,
    source_id_key="source",
    cleanup=None,
)
{'num_added': 2601, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}

If we run our data-loading code again some time later and the documents already exist in the index, they will not be re-indexed.

index(
    docs_source=docs,
    record_manager=record_manager,
    vector_store=vector_store,
    source_id_key="source",
    cleanup=None,
)
{'num_added': 0, 'num_updated': 0, 'num_skipped': 2603, 'num_deleted': 0}
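
De-duplication also applies within a single batch: if the same document appears more than once in docs_source, it is only indexed once. A minimal sketch (the "hello" document is illustrative, and the comment describes the expected behaviour rather than a captured output):

duplicated_doc = Document(page_content="Hello World", metadata={"source": "hello"})
index(
    docs_source=[duplicated_doc, duplicated_doc],
    record_manager=record_manager,
    vector_store=vector_store,
    source_id_key="source",
    cleanup=None,
)
# Expected: the duplicate is dropped before embedding, so only one vector is added.
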
clear_index()

Indexing with incremental cleanup

Like cleanup mode None, incremental cleanup de-duplicates documents before indexing them. In addition, if any vector for a given source differs from the one already in the index, the existing vector is replaced with the new one.

index(
    docs_source=docs,
    record_manager=record_manager,
    vector_store=vector_store,
    source_id_key="source",
    cleanup="incremental",
)
Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-text-embedding-ada-002 in organization org-vwqjdaXGZeEg6mWAVSflJXD9 on tokens per min. Limit: 1000000 / min. Current: 917743 / min. Contact us through our help center at help.openai.com if you continue to have issues..
{'num_added': 2601, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}

If we call the function again, but without any documents, nothing will be removed from the index.

index(
    docs_source=[],
    record_manager=record_manager,
    vector_store=vector_store,
    source_id_key="source",
    cleanup="incremental",
)
{'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

If we add a new document, it will be indexed.

index(
    docs_source=[Document(page_content="Hello World", metadata={"source": "hello"})],
    record_manager=record_manager,
    vector_store=vector_store,
    source_id_key="source",
    cleanup="incremental",
)
{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

And if we modify an existing document, the existing vector will be replaced with the new one.

index(
    docs_source=[
        Document(
            page_content="Hello World, from Langchain!", metadata={"source": "hello"}
        )
    ],
    record_manager=record_manager,
    vector_store=vector_store,
    source_id_key="source",
    cleanup="incremental",
)
{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 1}
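
To confirm that the old vector was replaced rather than kept alongside the new one, we can query the vector store directly (a quick check, assuming Chroma's similarity_search with a metadata filter):

results = vector_store.similarity_search(
    "Hello World", k=1, filter={"source": "hello"}
)
# The only document left for the "hello" source should be the updated one:
# "Hello World, from Langchain!"
print(results[0].page_content)
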
clear_index()

Indexing with full cleanup

Any document that is not part of the current load will be removed from the index. This is useful for keeping the index in sync with the documents in the data source. Documents that have not been modified will not be re-indexed.

index(
    docs_source=docs,
    record_manager=record_manager,
    vector_store=vector_store,
    source_id_key="source",
    cleanup="full",
)
Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-text-embedding-ada-002 in organization org-vwqjdaXGZeEg6mWAVSflJXD9 on tokens per min. Limit: 1000000 / min. Current: 893441 / min. Contact us through our help center at help.openai.com if you continue to have issues..
{'num_added': 2601, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}

Let's index the documents again, but only with a small fraction of them.

index(
    docs_source=docs[0:100],
    record_manager=record_manager,
    vector_store=vector_store,
    source_id_key="source",
    cleanup="full",
)
{'num_added': 0, 'num_updated': 0, 'num_skipped': 97, 'num_deleted': 2504}
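
One way to sanity-check the result is to count how many records the Record Manager still tracks (a sketch, assuming its list_keys method):

remaining_keys = record_manager.list_keys()
# Should match the 97 documents kept (skipped) in the partial load above.
print(f"Records still tracked: {len(remaining_keys)}")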

The clear_index function is a use case for full cleanup.

clear_index()

Indexing from a LangChain BaseLoader

LangChain defines the concept of a BaseLoader: classes responsible for loading data from different sources or with specific processing. They can be extended to create custom loaders and, in turn, used to index documents within our ingestion pipeline.

from langchain.document_loaders.base import BaseLoader


class MyDocumentLoader(BaseLoader):
    """Here should be the logic to load the documents from the source.

    The `load` method should return a list of `Document` objects.

    In this example, we will just return the `docs` variable defined above.
    """

    def load(self) -> list[Document]:
        return docs
index(
    docs_source=MyDocumentLoader(),
    record_manager=record_manager,
    vector_store=vector_store,
    source_id_key="source",
    cleanup="full",
)
Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-text-embedding-ada-002 in organization org-vwqjdaXGZeEg6mWAVSflJXD9 on tokens per min. Limit: 1000000 / min. Current: 895998 / min. Contact us through our help center at help.openai.com if you continue to have issues..
{'num_added': 2601, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}
clear_index()
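
For larger corpora, a loader does not have to materialize every document up front. Below is a sketch of a lazy variant, assuming BaseLoader's lazy_load hook; the indexing helper can consume documents from a loader lazily, although the exact fallback behaviour may depend on the LangChain version. The class name is hypothetical.

from typing import Iterator


class MyLazyDocumentLoader(BaseLoader):
    """Illustrative loader that yields documents one at a time instead of
    building the whole list in memory."""

    def lazy_load(self) -> Iterator[Document]:
        # In practice, stream documents from the real data source here.
        for doc in docs:
            yield doc

    def load(self) -> list[Document]:
        return list(self.lazy_load())

It can be passed to index() exactly like MyDocumentLoader above.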