
GPTScript Knowledge Tool v0.3 Introduces One-Time Configuration for Embedding Model Providers

July 31, 2024 by thorsten-klein

OK, the title may be a little exaggerated, but still: in v0.3.0 our little Knowledge tool gained the ability to use various embedding models, or rather embedding model providers. GPTScript is our natural-language approach to programming: you write tools as prompts in English (or your native language), and the AI strings them together and executes them (view initial blog here for a refresh).

What that means is that you can now define any number of embedding model providers in a new config file and, for every knowledge base (dataset) you create, choose any one of them.

Important note: you can only use one embedding model per dataset. You cannot ingest one file with model A and another one with model B, as that would screw up the vector space, because different models produce different embedding types and vector dimensionalities.

As of v0.3.0 we tested the knowledge tool with the following providers and models (with no judgement of how well each of them works in terms of accuracy and performance):

  • OpenAI (default) with text-embedding-ada-002
  • Azure OpenAI with text-embedding-ada-002
  • LM-Studio (local) with CompendiumLabs/bge-large-en-v1.5-gguf
    • Note: This is not a built-in provider, but it can still be used by setting the corresponding OpenAI variables.
    • Warning: At the time of testing, LM-Studio didn’t support parallel calls to the embeddings endpoint, so it was pretty slow and we had to set parallelism to 1 via VS_CHROMEM_EMBEDDING_PARALLEL_THREAD=1.
  • Cohere with embed-english-v3.0
  • Jina with embed-english-v3.0
  • LocalAI with bert-cpp-minilm-v6
  • Mistral with mistral-embed
  • Mixedbread with all-MiniLM-L6-v2
  • Ollama with mxbai-embed-large
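
The LM-Studio notes in the list above could translate into a shell setup roughly like the following. This is only a sketch: the three OPENAI_* variable names, the local URL and the placeholder key are assumptions that do not come from this post, so check them against the knowledge documentation and your LM-Studio server settings. Only VS_CHROMEM_EMBEDDING_PARALLEL_THREAD is taken from the warning above.

```shell
# ASSUMPTION: these three variable names are not confirmed by this post.
# They stand in for "the corresponding OpenAI variables".
export OPENAI_BASE_URL="http://localhost:1234/v1"   # assumed address of the local LM-Studio server
export OPENAI_API_KEY="lm-studio"                   # placeholder value, assuming the local server ignores it
export OPENAI_EMBEDDING_MODEL="CompendiumLabs/bge-large-en-v1.5-gguf"

# From the warning above: one embedding request at a time.
export VS_CHROMEM_EMBEDDING_PARALLEL_THREAD=1

echo "embedding parallelism: $VS_CHROMEM_EMBEDDING_PARALLEL_THREAD"
```

With these exported, the default openai provider would be pointed at the local endpoint, so no extra provider flag should be needed.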

Local Models

I tested the locally running models, especially via LM-Studio and Ollama, on my development laptop with an i7-1260P and 64GB RAM. With Ollama, ingesting a 509-page PDF file took about 5 minutes. This was without enabling parallelism on Ollama, and I bet there are some settings I can tweak, as my laptop was using pretty few resources.

Using a Model Provider other than OpenAI

With the command-line flag --embedding-model-provider or the related environment variable KNOW_EMBEDDING_MODEL_PROVIDER, you can switch between configured providers, even if you didn’t define them in a config file. By default, this is set to openai (but you could modify the OpenAI environment variables to point it to any OpenAI API compatible endpoint as well).

Now let’s say you also have access to Google Vertex AI, have all the environment variables configured (at least VERTEX_API_KEY and VERTEX_PROJECT_ID are required in this case), and want to ingest into a newly created dataset using the Vertex provider.
You would do this as shown below:

export VERTEX_API_KEY="my-super-secret-key"
export VERTEX_PROJECT_ID="my-google-project"

knowledge ingest -d my-vertex-powered-dataset --embedding-model-provider=vertex path/to/some/files

# or alternatively
export KNOW_EMBEDDING_MODEL_PROVIDER="vertex"
knowledge ingest -d my-vertex-powered-dataset path/to/more/files
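
Since a forgotten variable only shows up once ingestion starts, a small pre-flight check can help. The sketch below is not part of the knowledge tool; require_env is a hypothetical helper name, and the check only looks at whether the two Vertex variables mentioned above are non-empty.

```shell
# Hypothetical helper: succeeds only if every named variable is set and non-empty.
# The function body runs in a subshell, so its working variables do not leak.
require_env() (
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
)

# Only start ingesting once both Vertex variables are present.
if require_env VERTEX_API_KEY VERTEX_PROJECT_ID; then
  echo "Vertex environment looks complete"
  # knowledge ingest -d my-vertex-powered-dataset --embedding-model-provider=vertex path/to/some/files
fi
```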

Configuring multiple Model Providers

Obviously, with environment variables alone you can configure each provider only once.
You could work around that with dotenv files that hold multiple settings, e.g. two different variations for Vertex, using different projects or models – or different Ollama servers, whatever.
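
To make the dotenv approach concrete, here is one way it could look when the files are loaded by the shell itself. The file names, the project IDs and the with_env helper are made up for illustration; only KNOW_EMBEDDING_MODEL_PROVIDER and VERTEX_PROJECT_ID come from this post. The API key is assumed to be exported separately.

```shell
# Two hypothetical dotenv files, one per Vertex project.
workdir="$(mktemp -d)"
cat > "$workdir/vertex-a.env" <<'EOF'
KNOW_EMBEDDING_MODEL_PROVIDER=vertex
VERTEX_PROJECT_ID=project-a
EOF
cat > "$workdir/vertex-b.env" <<'EOF'
KNOW_EMBEDDING_MODEL_PROVIDER=vertex
VERTEX_PROJECT_ID=project-b
EOF

# Hypothetical helper: run a command with the variables from one dotenv file exported.
# The subshell keeps those settings from leaking into the calling shell.
with_env() (
  envfile="$1"; shift
  set -a          # export every variable assigned from here on
  . "$envfile"
  set +a
  "$@"
)

with_env "$workdir/vertex-a.env" sh -c 'echo "$VERTEX_PROJECT_ID"'   # prints project-a
with_env "$workdir/vertex-b.env" sh -c 'echo "$VERTEX_PROJECT_ID"'   # prints project-b

# Real use would look something like:
# with_env "$workdir/vertex-a.env" knowledge ingest -d dataset-a path/to/files
```

It works, but every variation needs its own file and you have to remember which file belongs to which dataset.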

This can be quite cumbersome… but don’t worry, here comes the YAML (you’ll love it – but you may use JSON as well) config file where you can define as many providers as you want and can give them different names.

Here you can see an example config that defines 3 different provider configs of which two are using the same provider type, which wouldn’t be possible with just environment variables:

embeddings:
  providers:
    - name: my-vertex # custom name
      type: vertex # one of the provider types as shown in the list further up
      config:
        apiKey: ${SOME_VERTEX_API_KEY} # environment variables will get expanded
        model: text-embedding-004
    - name: ollama-1
      type: ollama
      config:
        model: mxbai-embed-large
    - name: ollama-2
      type: ollama
      config:
        model: nomic-embed-text

With this config file we can now reference any provider by our custom name:

knowledge ingest -c /path/to/config.yaml --embedding-model-provider="ollama-1" -d my-ollama-1-dataset path/to/files
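
Because a dataset must always be ingested with the same embedding model, it can be handy to record which named provider belongs to which dataset. The wrapper below is a sketch under that assumption: provider_for and ingest_cmd are hypothetical helpers, the dataset names other than my-ollama-1-dataset are invented, and the function only prints the command (a dry run) instead of calling the tool.

```shell
# Hypothetical mapping from dataset name to a provider name from the config file.
provider_for() {
  case "$1" in
    my-ollama-1-dataset) echo "ollama-1" ;;
    my-ollama-2-dataset) echo "ollama-2" ;;
    my-vertex-dataset)   echo "my-vertex" ;;
    *) echo "no provider recorded for dataset: $1" >&2; return 1 ;;
  esac
}

# Print the ingest command for a dataset; drop the leading `echo` to run it for real.
ingest_cmd() {
  dataset="$1"; shift
  provider="$(provider_for "$dataset")" || return 1
  echo knowledge ingest -c /path/to/config.yaml --embedding-model-provider="$provider" -d "$dataset" "$@"
}

ingest_cmd my-ollama-1-dataset path/to/files
# prints: knowledge ingest -c /path/to/config.yaml --embedding-model-provider=ollama-1 -d my-ollama-1-dataset path/to/files
```

An unknown dataset name makes the wrapper fail instead of silently falling back to the default provider.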

More about the Config File and Embedding Model Providers

You can find more up-to-date information on this new config file and the embedding model providers we integrated in the knowledge documentation.

Outlook

With this new setup, it will soon be possible to share knowledge bases / datasets with the embedding model provider information attached to them. That finally removes a hurdle: today, ingesting into an imported dataset may require some time figuring out exactly which model was used.
With the information about the originally used provider and model attached to the dataset, the knowledge tool just has to get your API token (from the environment, or by asking you for it) and you’re set up, without any further configuration.

This will make sharing datasets and collaboratively building knowledge bases even easier!
