Text extraction based on context

Libraries

import pathlib
import re
from functools import partial
from typing import Generator

from bs4 import BeautifulSoup, Doctype, NavigableString, SoupStrainer, Tag
from dotenv import load_dotenv
from html2text import HTML2Text
from IPython.core.display import Markdown
from langchain.document_loaders import DocugamiLoader, RecursiveUrlLoader

load_dotenv()
True

Web

Dataset and utility function

doc_url = "https://python.langchain.com/docs/get_started/quickstart"

load_documents = partial(
    RecursiveUrlLoader,
    url=doc_url,
    max_depth=3,
    prevent_outside=True,
    check_response_status=True,
)

Text extraction without taking context into account

The first approach to extracting text from a web page is simply to grab the text of every element on the page.

def webpage_text_extractor(html: str) -> str:
    return BeautifulSoup(html, "lxml").get_text(separator="\n", strip=True)


loader = load_documents(
    extractor=webpage_text_extractor,
)

docs_without_data_context = loader.load()
print(docs_without_data_context[0].page_content[:520])
Quickstart | 🦜️🔗 Langchain
Skip to main content
🦜️🔗 LangChain
Docs
Use cases
Integrations
API
Community
Chat our docs
LangSmith
JS/TS Docs
Search
CTRL
K
Get started
Introduction
Installation
Quickstart
Modules
Model I/​O
Retrieval
Chains
Memory
Agents
Callbacks
Modules
LangChain Expression Language
Guides
More
Get started
Quickstart
On this page
Quickstart
Installation
​
To install LangChain run:
Pip
Conda
pip
install
langchain
conda
install
langchain -c conda-forge
For more details, see our
Installation guide
.
En

Text extraction with a bit of context

The LangChain documentation is written in Markdown, so it has a structure that can be exploited to extract the text more precisely. To do that, we will use a library that converts HTML into Markdown.

def markdown_extractor(html: str) -> str:
    html2text = HTML2Text()
    html2text.ignore_links = False
    html2text.ignore_images = False
    return html2text.handle(html)


loader = load_documents(
    extractor=markdown_extractor,
)

docs_with_a_bit_of_context = loader.load()
print(docs_with_a_bit_of_context[0].page_content[:3000])
Skip to main content

[ **🦜️🔗 LangChain**](/)[Docs](/docs/get_started/introduction)[Use
cases](/docs/use_cases/question_answering/)[Integrations](/docs/integrations/providers)[API](https://api.python.langchain.com)[Community](/docs/community)

[Chat our
docs](https://chat.langchain.com)[LangSmith](https://smith.langchain.com)[JS/TS
Docs](https://js.langchain.com/docs)[](https://github.com/langchain-
ai/langchain)

Search

CTRLK

  * [Get started](/docs/get_started)

    * [Introduction](/docs/get_started/introduction)
    * [Installation](/docs/get_started/installation)
    * [Quickstart](/docs/get_started/quickstart)
  * [Modules](/docs/modules/)

    * [Model I/​O](/docs/modules/model_io/)

    * [Retrieval](/docs/modules/data_connection/)

    * [Chains](/docs/modules/chains/)

    * [Memory](/docs/modules/memory/)

    * [Agents](/docs/modules/agents/)

    * [Callbacks](/docs/modules/callbacks/)

    * [Modules](/docs/modules/)
  * [LangChain Expression Language](/docs/expression_language/)

  * [Guides](/docs/guides)

  * [More](/docs/additional_resources)

  * [](/)
  * [Get started](/docs/get_started)
  * Quickstart

On this page

# Quickstart

## Installation​

To install LangChain run:

  * Pip
  * Conda

    
    
    pip install langchain  
    
    
    
    conda install langchain -c conda-forge  
    

For more details, see our [Installation
guide](/docs/get_started/installation.html).

## Environment setup​

Using LangChain will usually require integrations with one or more model
providers, data stores, APIs, etc. For this example, we'll use OpenAI's model
APIs.

First we'll need to install their Python package:

    
    
    pip install openai  
    

Accessing the API requires an API key, which you can get by creating an
account and heading [here](https://platform.openai.com/account/api-keys). Once
we have a key we'll want to set it as an environment variable by running:

    
    
    export OPENAI_API_KEY="..."  
    

If you'd prefer not to set an environment variable you can pass the key in
directly via the `openai_api_key` named parameter when initiating the OpenAI
LLM class:

    
    
    from langchain.llms import OpenAI  
      
    llm = OpenAI(openai_api_key="...")  
    

## Building an application​

Now we can start building our language model application. LangChain provides
many modules that can be used to build language model applications. Modules
can be used as stand-alones in simple applications and they can be combined
for more complex use cases.

The most common and most important chain that LangChain helps create contains
three things:

  * LLM: The language model is the core reasoning engine here. In order to work with LangChain, you need to understand the different types of language models and how to work with them.
  * Prompt Templates: This provides instructions to the language model. This controls what the language model outputs, so understanding how to construct prompts and different prompting strategi

Text extraction taking context into account

Even though converting the HTML to Markdown with a library allowed us to extract the text more precisely, there are still some cases where the text is not extracted correctly.

This is where domain knowledge comes into play. Drawing on what we know about the problem, we can build a function that extracts the text more precisely.

Think of langchain_docs_extractor as a specialized factory worker whose job is to turn raw materials (HTML documents) into a finished product (a clean, formatted string). This worker uses a special tool, get_text, as a machine that processes the raw material into usable pieces, examining each component piece by piece, and applies the same process repeatedly (recursion) to break components down into their simplest form. At the end, it assembles all the processed pieces into a complete product and applies some final polish before the product leaves the factory.

def langchain_docs_extractor(
    html: str,
    include_output_cells: bool,
    path_url: str | None = None,
) -> str:
    soup = BeautifulSoup(
        html,
        "lxml",
        parse_only=SoupStrainer(name="article"),
    )

    # Remove all the tags that are not meaningful for the extraction.
    SCAPE_TAGS = ["nav", "footer", "aside", "script", "style"]
    for tag in soup.find_all(SCAPE_TAGS):
        tag.decompose()

    # get_text() recursively walks the tag's children and yields
    # Markdown-formatted text fragments.
    def get_text(tag: Tag) -> Generator[str, None, None]:
        for child in tag.children:
            if isinstance(child, Doctype):
                continue

            if isinstance(child, NavigableString):
                yield child.get_text()
            elif isinstance(child, Tag):
                if child.name in ["h1", "h2", "h3", "h4", "h5", "h6"]:
                    text = child.get_text(strip=False)

                    if text == "API Reference:":
                        yield f"> **{text}**\n"
                        ul = child.find_next_sibling("ul")
                        if ul is not None and isinstance(ul, Tag):
                            ul.attrs["api_reference"] = "true"
                    else:
                        yield f"{'#' * int(child.name[1:])} "
                        yield from child.get_text(strip=False)

                        if path_url is not None:
                            link = child.find("a")
                            if link is not None:
                                yield f" [](/{path_url}/{link.get('href')})"
                        yield "\n\n"
                elif child.name == "a":
                    yield f"[{child.get_text(strip=False)}]({child.get('href')})"
                elif child.name == "img":
                    yield f"![{child.get('alt', '')}]({child.get('src')})"
                elif child.name in ["strong", "b"]:
                    yield f"**{child.get_text(strip=False)}**"
                elif child.name in ["em", "i"]:
                    yield f"_{child.get_text(strip=False)}_"
                elif child.name == "br":
                    yield "\n"
                elif child.name == "code":
                    parent = child.find_parent()
                    if parent is not None and parent.name == "pre":
                        classes = parent.attrs.get("class", "")

                        language = next(
                            filter(lambda x: re.match(r"language-\w+", x), classes),
                            None,
                        )
                        if language is None:
                            language = ""
                        else:
                            language = language.split("-")[1]

                        if language in ["pycon", "text"] and not include_output_cells:
                            continue

                        lines: list[str] = []
                        for span in child.find_all("span", class_="token-line"):
                            line_content = "".join(
                                token.get_text() for token in span.find_all("span")
                            )
                            lines.append(line_content)

                        code_content = "\n".join(lines)
                        yield f"```{language}\n{code_content}\n```\n\n"
                    else:
                        yield f"`{child.get_text(strip=False)}`"

                elif child.name == "p":
                    yield from get_text(child)
                    yield "\n\n"
                elif child.name == "ul":
                    if "api_reference" in child.attrs:
                        for li in child.find_all("li", recursive=False):
                            yield "> - "
                            yield from get_text(li)
                            yield "\n"
                    else:
                        for li in child.find_all("li", recursive=False):
                            yield "- "
                            yield from get_text(li)
                            yield "\n"
                    yield "\n\n"
                elif child.name == "ol":
                    for i, li in enumerate(child.find_all("li", recursive=False)):
                        yield f"{i + 1}. "
                        yield from get_text(li)
                        yield "\n\n"
                elif child.name == "div" and "tabs-container" in child.attrs.get(
                    "class", [""]
                ):
                    tabs = child.find_all("li", {"role": "tab"})
                    tab_panels = child.find_all("div", {"role": "tabpanel"})
                    for tab, tab_panel in zip(tabs, tab_panels):
                        tab_name = tab.get_text(strip=True)
                        yield f"{tab_name}\n"
                        yield from get_text(tab_panel)
                elif child.name == "table":
                    thead = child.find("thead")
                    header_exists = isinstance(thead, Tag)
                    if header_exists:
                        headers = thead.find_all("th")
                        if headers:
                            yield "| "
                            yield " | ".join(header.get_text() for header in headers)
                            yield " |\n"
                            yield "| "
                            yield " | ".join("----" for _ in headers)
                            yield " |\n"

                    tbody = child.find("tbody")
                    tbody_exists = isinstance(tbody, Tag)
                    if tbody_exists:
                        for row in tbody.find_all("tr"):
                            yield "| "
                            yield " | ".join(
                                cell.get_text(strip=True) for cell in row.find_all("td")
                            )
                            yield " |\n"

                    yield "\n\n"
                elif child.name in ["button"]:
                    continue
                else:
                    yield from get_text(child)

    joined = "".join(get_text(soup))
    return re.sub(r"\n\n+", "\n\n", joined).strip()
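
A quick sanity check of the extractor on a tiny hand-written snippet helps confirm the behavior before crawling the whole site. This is a minimal sketch; the HTML and the expected output are illustrative and not part of the original dataset.

# Hypothetical snippet: an <article> with a heading and inline code.
sample_html = (
    "<article>"
    "<h2>Install</h2>"
    "<p>Run <code>pip install langchain</code> to get started.</p>"
    "</article>"
)

print(langchain_docs_extractor(sample_html, include_output_cells=True))
## Install

Run `pip install langchain` to get started.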


loader = load_documents(
    extractor=partial(
        langchain_docs_extractor,
        include_output_cells=True,
    ),
)

docs_with_data_context = loader.load()
print(docs_with_data_context[0].page_content[:3000])
# Quickstart

## Installation​

To install LangChain run:

Pip
```bash
pip install langchain
```

Conda
```bash
conda install langchain -c conda-forge
```

For more details, see our [Installation guide](/docs/get_started/installation.html).

## Environment setup​

Using LangChain will usually require integrations with one or more model providers, data stores, APIs, etc. For this example, we'll use OpenAI's model APIs.

First we'll need to install their Python package:

```bash
pip install openai
```

Accessing the API requires an API key, which you can get by creating an account and heading [here](https://platform.openai.com/account/api-keys). Once we have a key we'll want to set it as an environment variable by running:

```bash
export OPENAI_API_KEY="..."
```

If you'd prefer not to set an environment variable you can pass the key in directly via the `openai_api_key` named parameter when initiating the OpenAI LLM class:

```python
from langchain.llms import OpenAI

llm = OpenAI(openai_api_key="...")
```

## Building an application​

Now we can start building our language model application. LangChain provides many modules that can be used to build language model applications.
Modules can be used as stand-alones in simple applications and they can be combined for more complex use cases.

The most common and most important chain that LangChain helps create contains three things:

- LLM: The language model is the core reasoning engine here. In order to work with LangChain, you need to understand the different types of language models and how to work with them.
- Prompt Templates: This provides instructions to the language model. This controls what the language model outputs, so understanding how to construct prompts and different prompting strategies is crucial.
- Output Parsers: These translate the raw response from the LLM to a more workable format, making it easy to use the output downstream.

In this getting started guide we will cover those three components by themselves, and then go over how to combine all of them.
Understanding these concepts will set you up well for being able to use and customize LangChain applications.
Most LangChain applications allow you to configure the LLM and/or the prompt used, so knowing how to take advantage of this will be a big enabler.

## LLMs​

There are two types of language models, which in LangChain are called:

- LLMs: this is a language model which takes a string as input and returns a string
- ChatModels: this is a language model which takes a list of messages as input and returns a message

The input/output for LLMs is simple and easy to understand - a string.
But what about ChatModels? The input there is a list of `ChatMessage`s, and the output is a single `ChatMessage`.
A `ChatMessage` has two required components:

- `content`: This is the content of the message.
- `role`: This is the role of the entity from which the `ChatMessage` is coming from.

LangChain provides several objects to easily disting

The output is now in Markdown format, so it can be viewed in any text editor or on GitHub, and it gives the information a clearer, more accessible structure. This organization makes it possible to split the text with greater precision, which in turn makes it easier to obtain more pertinent and relevant information; see the splitting sketch after the rendered output below.

Markdown(docs_with_data_context[0].page_content)

Quickstart

Installation​

To install LangChain run:

Pip

pip install langchain

Conda

conda install langchain -c conda-forge

For more details, see our Installation guide.

Environment setup​

Using LangChain will usually require integrations with one or more model providers, data stores, APIs, etc. For this example, we’ll use OpenAI’s model APIs.

First we’ll need to install their Python package:

pip install openai

Accessing the API requires an API key, which you can get by creating an account and heading here. Once we have a key we’ll want to set it as an environment variable by running:

export OPENAI_API_KEY="..."

If you’d prefer not to set an environment variable you can pass the key in directly via the openai_api_key named parameter when initiating the OpenAI LLM class:

from langchain.llms import OpenAI

llm = OpenAI(openai_api_key="...")

Building an application​

Now we can start building our language model application. LangChain provides many modules that can be used to build language model applications. Modules can be used as stand-alones in simple applications and they can be combined for more complex use cases.

The most common and most important chain that LangChain helps create contains three things:

  • LLM: The language model is the core reasoning engine here. In order to work with LangChain, you need to understand the different types of language models and how to work with them.
  • Prompt Templates: This provides instructions to the language model. This controls what the language model outputs, so understanding how to construct prompts and different prompting strategies is crucial.
  • Output Parsers: These translate the raw response from the LLM to a more workable format, making it easy to use the output downstream.

In this getting started guide we will cover those three components by themselves, and then go over how to combine all of them. Understanding these concepts will set you up well for being able to use and customize LangChain applications. Most LangChain applications allow you to configure the LLM and/or the prompt used, so knowing how to take advantage of this will be a big enabler.

LLMs​

There are two types of language models, which in LangChain are called:

  • LLMs: this is a language model which takes a string as input and returns a string
  • ChatModels: this is a language model which takes a list of messages as input and returns a message

The input/output for LLMs is simple and easy to understand - a string. But what about ChatModels? The input there is a list of ChatMessages, and the output is a single ChatMessage. A ChatMessage has two required components:

  • content: This is the content of the message.
  • role: This is the role of the entity from which the ChatMessage is coming from.

LangChain provides several objects to easily distinguish between different roles:

  • HumanMessage: A ChatMessage coming from a human/user.
  • AIMessage: A ChatMessage coming from an AI/assistant.
  • SystemMessage: A ChatMessage coming from the system.
  • FunctionMessage: A ChatMessage coming from a function call.

If none of those roles sound right, there is also a ChatMessage class where you can specify the role manually. For more information on how to use these different messages most effectively, see our prompting guide.

LangChain provides a standard interface for both, but it’s useful to understand this difference in order to construct prompts for a given language model. The standard interface that LangChain provides has two methods:

  • predict: Takes in a string, returns a string
  • predict_messages: Takes in a list of messages, returns a message.

Let’s see how to work with these different types of models and these different types of inputs. First, let’s import an LLM and a ChatModel.

from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI

llm = OpenAI()
chat_model = ChatOpenAI()

llm.predict("hi!")
>>> "Hi"

chat_model.predict("hi!")
>>> "Hi"

The OpenAI and ChatOpenAI objects are basically just configuration objects. You can initialize them with parameters like temperature and others, and pass them around.

Next, let’s use the predict method to run over a string input.

text = "What would be a good company name for a company that makes colorful socks?"

llm.predict(text)
# >> Feetful of Fun

chat_model.predict(text)
# >> Socks O'Color

Finally, let’s use the predict_messages method to run over a list of messages.

from langchain.schema import HumanMessage

text = "What would be a good company name for a company that makes colorful socks?"
messages = [HumanMessage(content=text)]

llm.predict_messages(messages)
# >> Feetful of Fun

chat_model.predict_messages(messages)
# >> Socks O'Color

For both these methods, you can also pass in parameters as key word arguments. For example, you could pass in temperature=0 to adjust the temperature that is used from what the object was configured with. Whatever values are passed in during run time will always override what the object was configured with.

Prompt templates​

Most LLM applications do not pass user input directly into an LLM. Usually they will add the user input to a larger piece of text, called a prompt template, that provides additional context on the specific task at hand.

In the previous example, the text we passed to the model contained instructions to generate a company name. For our application, it’d be great if the user only had to provide the description of a company/product, without having to worry about giving the model instructions.

PromptTemplates help with exactly this! They bundle up all the logic for going from user input into a fully formatted prompt. This can start off very simple - for example, a prompt to produce the above string would just be:

from langchain.prompts import PromptTemplate

prompt = PromptTemplate.from_template("What is a good name for a company that makes {product}?")
prompt.format(product="colorful socks")
What is a good name for a company that makes colorful socks?

However, the advantages of using these over raw string formatting are several. You can “partial” out variables - e.g. you can format only some of the variables at a time. You can compose them together, easily combining different templates into a single prompt. For explanations of these functionalities, see the section on prompts for more detail.

PromptTemplates can also be used to produce a list of messages. In this case, the prompt not only contains information about the content, but also each message (its role, its position in the list, etc) Here, what happens most often is a ChatPromptTemplate is a list of ChatMessageTemplates. Each ChatMessageTemplate contains instructions for how to format that ChatMessage - its role, and then also its content. Let’s take a look at this below:

from langchain.prompts.chat import ChatPromptTemplate

template = "You are a helpful assistant that translates {input_language} to {output_language}."
human_template = "{text}"

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", template),
    ("human", human_template),
])

chat_prompt.format_messages(input_language="English", output_language="French", text="I love programming.")
[
    SystemMessage(content="You are a helpful assistant that translates English to French.", additional_kwargs={}),
    HumanMessage(content="I love programming.")
]

ChatPromptTemplates can also be constructed in other ways - see the section on prompts for more detail.

Output parsers​

OutputParsers convert the raw output of an LLM into a format that can be used downstream. There are few main type of OutputParsers, including:

  • Convert text from LLM -> structured information (e.g. JSON)
  • Convert a ChatMessage into just a string
  • Convert the extra information returned from a call besides the message (like OpenAI function invocation) into a string.

For full information on this, see the section on output parsers

In this getting started guide, we will write our own output parser - one that converts a comma separated list into a list.

from langchain.schema import BaseOutputParser

class CommaSeparatedListOutputParser(BaseOutputParser):
    """Parse the output of an LLM call to a comma-separated list."""

    def parse(self, text: str):
        """Parse the output of an LLM call."""
        return text.strip().split(", ")

CommaSeparatedListOutputParser().parse("hi, bye")
# >> ['hi', 'bye']

PromptTemplate + LLM + OutputParser​

We can now combine all these into one chain. This chain will take input variables, pass those to a prompt template to create a prompt, pass the prompt to a language model, and then pass the output through an (optional) output parser. This is a convenient way to bundle up a modular piece of logic. Let’s see it in action!

from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import ChatPromptTemplate
from langchain.schema import BaseOutputParser

class CommaSeparatedListOutputParser(BaseOutputParser):
    """Parse the output of an LLM call to a comma-separated list."""

    def parse(self, text: str):
        """Parse the output of an LLM call."""
        return text.strip().split(", ")

template = """You are a helpful assistant who generates comma separated lists.
A user will pass in a category, and you should generate 5 objects in that category in a comma separated list.
ONLY return a comma separated list, and nothing more."""
human_template = "{text}"

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", template),
    ("human", human_template),
])
chain = chat_prompt | ChatOpenAI() | CommaSeparatedListOutputParser()
chain.invoke({"text": "colors"})
# >> ['red', 'blue', 'green', 'yellow', 'orange']

Note that we are using the | syntax to join these components together. This | syntax is called the LangChain Expression Language. To learn more about this syntax, read the documentation here.

Next steps​

This is it! We’ve now gone over how to create the core building block of LangChain applications. There is a lot more nuance in all these components (LLMs, prompts, output parsers) and a lot more different components to learn about as well. To continue on your journey:
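
Because the extracted content is now well-formed Markdown, its headers can drive the chunking step directly. Below is a minimal sketch using LangChain's MarkdownHeaderTextSplitter; the metadata key names ("h1", "h2") are arbitrary labels chosen here for illustration.

from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split on the Markdown heading levels so every chunk keeps
# the section headers it belongs to as metadata.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")],
)

chunks = splitter.split_text(docs_with_data_context[0].page_content)
for chunk in chunks[:3]:
    print(chunk.metadata, "|", chunk.page_content[:60])

Each chunk now carries its heading path in its metadata, which is what makes the context-aware extraction pay off at retrieval time.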

PDF / DOCX / DOC

Test dataset

In this example, we will use some sample files provided by Docugami. These files are the product of extracting text from real documents, specifically PDF files of commercial lease agreements.

lease_data_dir = pathlib.Path("../data/docugami/commercial_lease")
lease_files = list(lease_data_dir.glob("*.xml"))
lease_files
[PosixPath('../data/docugami/commercial_lease/TruTone Lane 6.xml'),
 PosixPath('../data/docugami/commercial_lease/TruTone Lane 5.xml'),
 PosixPath('../data/docugami/commercial_lease/TruTone Lane 4.xml'),
 PosixPath('../data/docugami/commercial_lease/TruTone Lane 1.xml'),
 PosixPath('../data/docugami/commercial_lease/TruTone Lane 3.xml'),
 PosixPath('../data/docugami/commercial_lease/TruTone Lane 2.xml')]

Now, let's load the sample documents and see what properties they have.

loader = DocugamiLoader(
    docset_id=None,
    access_token=None,
    document_ids=None,
    file_paths=lease_files,
)

lease_docs = loader.load()
f"Loaded {len(lease_docs)} documents."
'Loaded 1108 documents.'
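
The six XML files expand into more than a thousand chunks. A quick tally per source file, sketched with the standard library, shows how the loader splits each lease:

from collections import Counter

# Count how many chunks each source document produced.
Counter(doc.metadata["source"] for doc in lease_docs)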

The metadata obtained from each document includes the following elements:

  • id, source_id and name: These fields uniquely identify the document and the text fragment extracted from it.
  • xpath: The XPath of the extracted fragment within the XML representation of the document. This field is useful for referencing direct quotes of the actual fragment inside the XML document.
  • structure: The structural attributes of the fragment, such as p, h1, div, table, td, and so on. Useful for filtering certain kinds of fragments when needed; see the filtering sketch after the metadata output below.
  • tag: The semantic tag of the fragment. It is generated with several techniques, both generative and extractive, to determine the meaning of the fragment in question.
lease_docs[0].metadata
{'xpath': '/dg:chunk/docset:OFFICELEASEAGREEMENT-section/docset:OFFICELEASEAGREEMENT-section/docset:OFFICELEASEAGREEMENT/docset:Lease',
 'id': 'TruTone Lane 6.xml',
 'name': 'TruTone Lane 6.xml',
 'source': 'TruTone Lane 6.xml',
 'structure': 'p',
 'tag': 'Lease'}
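
As an illustration of filtering on the structure attribute, the sketch below keeps only table-like chunks. Treating the field as space-separated labels is an assumption about how multiple structural values are encoded; adjust it to your data.

# Keep only the chunks whose structural label includes a table element.
table_chunks = [
    doc for doc in lease_docs if "table" in doc.metadata["structure"].split()
]
f"{len(table_chunks)} table chunks out of {len(lease_docs)} total."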

Docugami can also help extract metadata specific to each chunk or fragment of our documents. Below is an example of how this metadata is extracted and represented:

{
    'xpath': '/docset:OFFICELEASEAGREEMENT-section/docset:OFFICELEASEAGREEMENT/docset:LeaseParties',
    'id': 'v1bvgaozfkak',
    'source': 'TruTone Lane 2.docx',
    'structure': 'p',
    'tag': 'LeaseParties',
    'Lease Date': 'April 24 \n\n ,',
    'Landlord': 'BUBBA CENTER PARTNERSHIP',
    'Tenant': 'Truetone Lane LLC',
    'Lease Parties': 'This OFFICE LEASE AGREEMENT (this "Lease") is made and entered into by and between BUBBA CENTER PARTNERSHIP ("Landlord"), and Truetone Lane LLC, a Delaware limited liability company ("Tenant").'
}

Additional metadata like the fields shown above can be extremely useful when implementing self-retrievers, which will be explored in detail later on.

Load your own documents

If you prefer to use your own documents, you can upload them through the Docugami UI. Once uploaded, you will need to assign each one to a docset. A docset is a set of documents that share a similar structure. For example, commercial lease agreements generally have similar structures, so they can be grouped into a single docset.

After creating your docset, the uploaded documents will be processed and made available through the Docugami API.

To retrieve the ids of your documents and their corresponding docsets, you can run the following command:

curl --header "Authorization: Bearer {YOUR_DOCUGAMI_TOKEN}" \
  https://api.docugami.com/v1preview1/documents

This command gives you access to the relevant information, streamlining the administration and organization of your documents within Docugami.
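
If you would rather stay in Python, a rough equivalent using requests looks like the sketch below; the DOCUGAMI_API_KEY variable name and the shape of the JSON response (a "documents" key) are assumptions, so check the Docugami API reference.

import os

import requests

# List the documents available to your Docugami account.
response = requests.get(
    "https://api.docugami.com/v1preview1/documents",
    headers={"Authorization": f"Bearer {os.environ['DOCUGAMI_API_KEY']}"},
)
response.raise_for_status()
for document in response.json().get("documents", []):  # assumed response shape
    print(document.get("id"), document.get("name"))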

Once you have the ids of your documents or docsets, you can use them to access your documents' content through LangChain's DocugamiLoader. This lets you manipulate and manage your documents within your application.

loader = DocugamiLoader(
    docset_id="xpfpiyl7cep2",
    document_ids=None,
    file_paths=None,
)

papers_docs = loader.load()
lost_in_the_middle_paper_docs = [
    doc for doc in papers_docs if doc.metadata["source"] == "2307.03172.pdf"
]
for doc in lost_in_the_middle_paper_docs:
    print(doc.metadata["tag"])
LostintheMiddle
chunk
chunk
Abstract
chunk
AnImportantAndFlexibleBuildingBlock
TheseUse-cases
chunk
chunk
chunk
Figure1
chunk
Transformers
Extended-contextLanguageModels
TrolledExperiments
ADistinctiveU-shapedPerformance
_5-turboS
LanguageModels
LanguageModels
ACaseStudy
_2LanguageModels
IncreasingLanguageModelMaximumContext
OurGoal
chunk
ModelPerformance
OurMulti-documentQuestion
ThisTask
Naturalquestions-open
RandomDocuments
AHigh-qualityAnswer
Asian
chunk
chunk
AHigh-qualityAnswer
chunk
Document
Norwegian
Question
chunk
Figure3
TheInputContextLength
Kandpal
SearchResults
AHigh-qualityAnswer
chunk
Asian
Question
chunk
Figure4
TheNaturalquestionsAnnotations
chunk
AMaximumContextLength
chunk
td
td
td
td
td
td
td
td
td
td
td
td
td
_7-cell
ExtracttheValueCorrespondingtotheSpecifiedKeyintheJSONObjectBelowVa
td
ExtracttheValueCorrespondingtotheSpecifiedKeyintheJSONObjectBelowEx
ExtracttheValueCorrespondingtotheSpecifiedKeyintheJSONObjectBelow
td
td
td
td
td
td
td
td
td
td
td
td
td
td
td
td
td
td
td
td
td
td
td
td
Figure9-cell
td
td
td
The-art
AMaximumContextLength
ClosedModels
Gpt-35-turbo
TheAnthropicApi
InputContexts
ModelPerformance
ASubset
Figure6
Contexts
ModelPerformance
PerformanceDecrease
Extended-contextModels
LanguageModels
OurSyntheticKey-valueRetrievalTask
OurSyntheticKey-valueRetrievalTask
Figure8
PresentPotentialConfounders
Key-valueRetrievalPerformance
chunk
TheSyntheticKey-valueRetrievalTask
TheKey-valueRetrievalTask
The140Key-valueSetting
chunk
OurMulti-documentQuestion
chunk
TheOpenModels
OurExperiments
tr
chunk
TheEnd
TheSameSetting
TheModels
Mpt-30bMpt-30b-instruct
chunk
Formation
TheseObservations
chunk
PracticalSettings
Models
Figure
td
chunk
chunk
ARichLine
chunk
ThePioneeringWork
TheU-shapedCurve
ASeries
Instruction-tuning
SewonMin
AviArampatzis
IzBeltagy
HyungWonChung
ZihangDai
MichaøDaniluk
TriDao
chunk
chunk
DanielYFu
AlbertGu
chunk
Long-textUnderstanding
GautierIzacard
GautierIzacard
NikhilKandpal
UrvashiKhandelwal
chunk
Field
KentonLee
DachengLi
AlexMallen
SewonMin
BennetBMurdockJr
JoeOConnor
DimitrisPapailiopoulos
chunk
HaoPeng
FabioPetroni
MichaelPoli
OfirPress
OfirPress
GuanghuiQin
chunk
IneLee
YoavLevine
chunk
chunk
ChinnadhuraiSankar
TimoSchick
chunk
OmerLevy
UriShaham
VatsalSharan
WeijiaShi
chunk
chunk
KalpeshKrishna
YiTay
YiTay
MostafaDehghani
HugoTouvron
AshishVaswani
SinongWang
chunk
chunk
chunk
chunk
ManzilZaheer
chunk
PastWork
chunk
chunk
td
td
Figure15
Multi-documentQuestion
chunk
chunk
td
td
chunk
OurPrompt
chunk
ASubset
chunk
Figure17
_1st5th10th15th20thPosition
td
chunk
Table
td
td
td
td
td
Table1
Table
td
td
td
td
td
Table2
td
td
td
td
td
td
td
td
td
td
Table3
td
td
td
td
td
td
td
td
td
td
Table4
chunk
ModelPerformance
td
td
td
td
td
td
td
Table
td
td
td
td
td
td
td
Table
td
td
td
td
td
td
td
td
td
td
Table