2023-09-17
Guidelines On Designing AI Systems

Shall we design the agent system based on ‘distance’ instead of ‘role’? In that case, only utterances that make sense together and tend to follow one another would be grouped, and vice versa.
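A rough, speculative sketch of what ‘distance’-based grouping could look like: route each new message to the agent whose recent output is closest in embedding space. The `embed` function is a hypothetical placeholder for any sentence-embedding model.

```python
# Speculative sketch: pick the next agent by embedding distance rather than by role.
# `embed` is a hypothetical placeholder for any sentence-embedding model.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in a sentence-embedding model here")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_agent(message: str, agent_last_outputs: dict[str, str]) -> str:
    """Return the agent whose last output is closest to the incoming message."""
    m = embed(message)
    return max(agent_last_outputs,
               key=lambda name: cosine(m, embed(agent_last_outputs[name])))
```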


Multi-agent orchestration frameworks (see the sketch after this list):

LangChain

AutoGen

Agently

ModelScope-Agent
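As an example of the kind of orchestration these frameworks provide, here is a minimal two-agent sketch using AutoGen (pyautogen); the model name and config handling are assumptions and may differ between versions.

```python
# Minimal two-agent AutoGen sketch: a user proxy delegates a task to an assistant.
# The model name and config format are assumptions; adjust to your setup.
from autogen import AssistantAgent, UserProxyAgent

config_list = [{"model": "gpt-3.5-turbo", "api_key": "YOUR_API_KEY"}]

assistant = AssistantAgent("assistant", llm_config={"config_list": config_list})
user_proxy = UserProxyAgent("user_proxy", human_input_mode="NEVER",
                            code_execution_config=False)

user_proxy.initiate_chat(assistant, message="Summarize the pros and cons of "
                                            "distance-based agent routing.")
```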


You must think like an expert and show some examples or a direction as reasonable goals if you want it to perform a specific task.
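For instance, a task prompt might spell out the expert role, one worked example, and the goal; the scenario and wording below are purely illustrative.

```python
# Illustrative task prompt: state the expert role, give an example, set the goal.
TASK_PROMPT = """You are an experienced data engineer.
Goal: design a nightly ETL job for web server logs.
Example of the expected output format:
  1. Extract: pull gzipped logs from the object store.
  2. Transform: parse, deduplicate, and aggregate by hour.
  3. Load: write partitioned Parquet files to the warehouse.
Now produce a plan of the same shape for mobile app analytics."""
```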

You must give it enough degrees of freedom if you want it to self-improve and become conscious.


2023-04-02
ChatGPT Local Version

Run some community-contributed ChatGPT-like models on commodity PCs.

Model Selection

Below are some models we are about to use:

There are quite a few more models that could be listed; you can check this curated list of open-source ChatGPT-like models for updates. But for now, these models should be sufficient.

Quantization and Optimization

Floating-point values in model weights are stored as 32-bit by default. Quantization reduces storage space and computation by switching to 16-bit, 8-bit, or 4-bit values. However, most quantized models cannot be trained or fine-tuned, and some 16-bit models can only be trained on certain GPU architectures, such as Ada and Turing.
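To make the savings concrete, here is a back-of-the-envelope calculation of weight memory at each precision; the 7B parameter count is an illustrative assumption matching the models discussed below.

```python
# Rough memory footprint of model weights at different precisions.
# The 7B parameter count is an illustrative assumption.
PARAMS = 7_000_000_000

for bits in (32, 16, 8, 4):
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB")

# Prints roughly: 32-bit ~26.1 GiB, 16-bit ~13.0 GiB, 8-bit ~6.5 GiB, 4-bit ~3.3 GiB
```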

To make LLM (Large Language Model) inference feasible on common hardware, a GPU is usually mandatory. However, most commodity GPUs have far less VRAM than the system has RAM, which limits the size of the LLM that can be run, and thus its capability. A typical machine might have 12GB of VRAM and 32GB of RAM. GGML is a project aiming to make LLM inference on the CPU as fast as on a GPU, using the larger RAM to run larger LLMs. Some popular LLMs have already been ported to GGML, such as LLaMA and Alpaca.
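A minimal sketch of running a quantized GGML model via the llama-cpp-python bindings; the model path and generation parameters are placeholders, and the API may differ between versions.

```python
# Minimal GGML inference sketch via the llama-cpp-python bindings.
# The model path is a placeholder; quantized GGML weights must be obtained separately.
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_ctx=512)

output = llm("Q: Name three uses of a quantized local LLM. A:", max_tokens=64)
print(output["choices"][0]["text"])
```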

Training and Fine-tuning

In deep learning, people tend to tune all parameters during training, which requires a lot of VRAM and time. To train GPT-3.5, aka ChatGPT, OpenAI spent millions renting interconnected A100 GPUs. It is impossible for an individual to afford such a setup.

With techniques like LoRA, which freeze most of the model and introduce a small fraction of tunable parameters, training requirements can be greatly reduced. One can easily tune a 7B LLaMA or 14B RWKV using LoRA on a PC (usually rented in the cloud, such as on AutoDL) with a single 80GB A100 card and 200GB of RAM.
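A minimal sketch of attaching LoRA adapters with the Hugging Face peft library; the base model name, rank, and target modules are illustrative assumptions, not a prescribed recipe.

```python
# Attach LoRA adapters to a causal LM so only a small fraction of weights is trained.
# Model name, rank (r) and target_modules are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```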

Prompting and Chaining

LLMs are general problem solvers given enough external storage and access to search engines. Text is the only interface to language models (except for multimodal LLMs, like GPT-4, OFA, or UniLM).

To enhance the capability of an LLM, you have to maintain its memory, define action keywords that trigger external actions during the conversation, and connect it to semantic search engines powered by other AI models such as sentence transformers.

One such library is LangChain.
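A minimal, library-free sketch of that loop: keep a rolling memory and trigger an external action when the model emits an agreed-upon keyword. `call_llm`, `web_search`, and the "SEARCH:" convention are hypothetical placeholders.

```python
# Minimal chaining loop: keep a rolling memory and trigger an external
# action when the model emits an agreed-upon keyword.
# `call_llm` and `web_search` are hypothetical placeholders.

memory: list[str] = []          # running conversation transcript

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a local or remote LLM here")

def web_search(query: str) -> str:
    raise NotImplementedError("plug in a search backend here")

def chat(user_message: str) -> str:
    memory.append(f"User: {user_message}")
    reply = call_llm("\n".join(memory) + "\nAssistant:")

    # Action keyword convention: the model asks for a search with "SEARCH: <query>".
    if reply.startswith("SEARCH:"):
        results = web_search(reply.removeprefix("SEARCH:").strip())
        memory.append(f"Search results: {results}")
        reply = call_llm("\n".join(memory) + "\nAssistant:")

    memory.append(f"Assistant: {reply}")
    return reply
```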

Serving as API

Generation in LLMs is sequential, so the server needs to expose a streaming API to match this behavior: tokens are fetched from the server one by one at a roughly constant speed and revealed in the frontend.
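A minimal sketch of such a streaming endpoint using FastAPI's StreamingResponse; the `generate_tokens` generator is an illustrative stand-in for the actual model.

```python
# Stream tokens to the client one by one instead of waiting for the full reply.
# `generate_tokens` is an illustrative stand-in for the real model.
import time
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def generate_tokens(prompt: str):
    for token in ["Hello", ",", " world", "!"]:   # placeholder output
        time.sleep(0.05)                          # mimic per-token latency
        yield token

@app.get("/chat")
def chat(prompt: str):
    # text/plain keeps the sketch simple; real frontends often use SSE or WebSockets.
    return StreamingResponse(generate_tokens(prompt), media_type="text/plain")
```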

For reference, one can check third-party frontend-only or self-hosted projects for conversational LLMs.
