# Ingesting & Managing Documents
Documents can be ingested in several ways:
* Using the `/ingest` API (see the example below)
* Using the Gradio UI
* Using the Bulk Local Ingestion functionality (check the next section)
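For the API route, here is a minimal `curl` sketch (assuming the server listens on `localhost:8001` as in the UI section below; the exact route, shown here as `/v1/ingest/file`, may vary between versions, so check the API reference):
```bash
# Upload a single document for ingestion via the HTTP API
curl -F 'file=@/path/to/document.pdf' http://localhost:8001/v1/ingest/file
```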
## Bulk Local Ingestion
When you are running PrivateGPT in a fully local setup, you can ingest an entire folder for convenience (containing
PDFs, text files, etc.)
and optionally watch it for changes with the command:
```bash
make ingest /path/to/folder -- --watch
```
To log the processed and failed files to an additional file, use:
```bash
make ingest /path/to/folder -- --watch --log-file /path/to/log/file.log
```
**Note for Windows Users:** Depending on your Windows version and whether you are using PowerShell to execute
PrivateGPT API calls, you may need to include the parameter name when passing the folder path:
```bash
make ingest arg=/path/to/folder -- --watch --log-file /path/to/log/file.log
```
After ingestion is complete, you should be able to chat with your documents
by navigating to http://localhost:8001 and using the option `Query documents`,
or using the completions / chat API.
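If you prefer the API, here is a minimal `curl` sketch of a document-grounded chat request (assuming the OpenAI-style `/v1/chat/completions` route with PrivateGPT's `use_context` flag; check the API reference for the exact schema):
```bash
# Ask a question that is answered using the ingested documents as context
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Summarize my documents"}],
        "use_context": true,
        "include_sources": true
      }'
```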
## Ingestion troubleshooting
### Running out of memory
To avoid running out of memory, you should ingest your documents without the LLM loaded in your (video) memory.
To do so, change your configuration to set `llm.mode: mock`.
You can also use the existing `PGPT_PROFILES=mock` profile, which sets the following configuration for you:
```yaml
llm:
  mode: mock
embedding:
  mode: local
```
This configuration allows you to use hardware acceleration for creating embeddings while avoiding loading the full LLM into (video) memory.
Once your documents are ingested, you can set the `llm.mode` value back to `local` (or your previous custom value).
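For example, to run the folder-ingestion script with the mock profile active (the same script used in the benchmark below):
```bash
# Ingest without loading the LLM; only the embedding model is used
PGPT_PROFILES=mock python ./scripts/ingest_folder.py /path/to/folder
```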
### Ingestion speed
The ingestion speed depends on the number of documents you are ingesting and the size of each document.
To speed up the ingestion, you can change the ingestion mode in the configuration.
The following ingestion modes exist:
* `simple`: historic behavior, ingests one document at a time, sequentially
* `batch`: reads, parses, and embeds multiple documents using batches (batch read, then batch parse, then batch embed)
* `parallel`: reads, parses, and embeds multiple documents in parallel. This is the fastest ingestion mode for local setups.
* `pipeline`: an alternative to `parallel`.
To change the ingestion mode, you can use the `embedding.ingest_mode` configuration value. The default value is `simple`.
To configure the number of workers used for parallel or batched ingestion, use
the `embedding.count_workers` configuration value. If you set this value too high, you might run out of
memory, so be mindful when setting it. The default value is `2`.
For `batch` mode, you can safely set this value to the number of threads available on your CPU without
running out of memory. For `parallel` mode, be more careful and set it to a lower value.
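To check how many threads your CPU exposes, you can use standard OS commands (not part of PrivateGPT):
```bash
nproc              # Linux: number of available CPU threads
sysctl -n hw.ncpu  # macOS equivalent
```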
The configuration below should be enough for users who want to push their hardware harder:
```yaml
embedding:
  ingest_mode: parallel
  count_workers: 4
```
If your hardware is powerful enough and you are ingesting heavy documents, you can increase the number of workers.
It is recommended to run your own tests to find the optimal value for your hardware.
If you have a `bash` shell, you can use this set of commands to run your own benchmark:
```bash
# Wipe your local data, to put yourself in a clean state
# This will delete all your ingested documents
make wipe
time PGPT_PROFILES=mock python ./scripts/ingest_folder.py ~/my-dir/to-ingest/
```
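To compare ingestion modes on your own data, one possible extension of the benchmark above is the sketch below (assuming you have created hypothetical `settings-simple.yaml`, `settings-batch.yaml`, and `settings-parallel.yaml` profile files that each set `embedding.ingest_mode`, and that your version accepts comma-separated `PGPT_PROFILES`):
```bash
# Time the same folder with each ingestion mode, starting clean every run
for mode in simple batch parallel; do
  make wipe  # deletes all ingested documents!
  echo "ingest_mode=$mode"
  time PGPT_PROFILES="mock,$mode" python ./scripts/ingest_folder.py ~/my-dir/to-ingest/
done
```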
## Supported file formats
privateGPT by default supports all the file formats that contain plain text (for example, `.txt` files, `.html`, etc.).
However, these text-based file formats are only treated as text files and are not pre-processed in any other way.
It also supports the following file formats:
* `.hwp`
* `.pdf`
* `.docx`
* `.pptx`
* `.ppt`
* `.pptm`
* `.jpg`
* `.png`
* `.jpeg`
* `.mp3`
* `.mp4`
* `.csv`
* `.epub`
* `.md`
* `.mbox`
* `.ipynb`
* `.json`
**Please note the following nuance**: while `privateGPT` supports these file formats, it **might** require additional
dependencies to be installed in your Python virtual environment.
For example, if you try to ingest `.epub` files, `privateGPT` might fail to do so and will instead display an
explanatory error asking you to install the dependencies needed to support this file format.
**Other file formats might work**, but they will be treated as plain text
files (in other words, they will be ingested as `.txt` files).
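If you want to check in advance which files in a folder match the formats listed above, here is a small shell sketch (extend the pattern list with any other extensions from the list; the path is just an example):
```bash
# List files whose extensions are among the explicitly supported formats
find /path/to/folder -type f \( -iname '*.pdf' -o -iname '*.docx' \
  -o -iname '*.epub' -o -iname '*.md' \) -print
```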