2023-05-04
Agi That Controls Computer

make a specialized (RPA-focused) tokenizer and embedding for this new model. add new words to the tokenizer.
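a minimal sketch with huggingface transformers; the RPA action token names here are placeholders of our own invention, not a standard:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# hypothetical RPA action tokens (placeholder names)
new_words = ["<mouse_move>", "<mouse_click>", "<key_press>", "<wait>"]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

num_added = tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match
print(f"added {num_added} tokens")
```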


you can just boot an ubuntu/kali/parrot iso live, without installing.

but that would be embarrassing for us. we need to check for that option.


use ChatGPT-derived projects for localized propaganda on CyberGod and The Frozen Forest.

obs remote control

using obs-websocket you can do real scripting from python. but spin up obs first (with websocket-related command-line arguments).

launch obs minimized with obs --minimize-to-tray, or run it headless under xvfb.
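a minimal control sketch using the obsws-python bindings; the host, port and password are assumptions that must match whatever you pass to obs on startup:

```python
import obsws_python as obs

# connect to the obs-websocket server running alongside OBS
client = obs.ReqClient(host="localhost", port=4455, password="secret")

client.start_record()   # begin recording the current scene
# ... drive the desktop here ...
client.stop_record()    # the response carries the output file path
```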

you can also write and load scripts for obs that run on custom intervals and conditions.

audio recording

your OS may go silent if you want to record audio from the “speakers”


with pyaudio on macos, you need BlackHole to route all audio into a virtual device, where it can be recorded.

on Linux, you need an audio loopback device.

run: sudo modprobe snd-aloop

use hw:1,0 or “Analog Device Output” as the silent output/speaker, and hw:1,1 or “Analog Device Input” for recording.
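a recording sketch with pyaudio; the index of the loopback capture device (hw:1,1) varies per machine, so it is an assumption here and worth listing first:

```python
import pyaudio, wave

pa = pyaudio.PyAudio()
# list devices to locate the loopback capture side (hw:1,1)
for i in range(pa.get_device_count()):
    print(i, pa.get_device_info_by_index(i)["name"])

LOOPBACK_INDEX = 1  # assumption: set to the index printed above
stream = pa.open(format=pyaudio.paInt16, channels=2, rate=44100,
                 input=True, input_device_index=LOOPBACK_INDEX,
                 frames_per_buffer=1024)
frames = [stream.read(1024) for _ in range(int(44100 / 1024 * 5))]  # ~5 seconds

with wave.open("loopback.wav", "wb") as f:
    f.setnchannels(2)
    f.setsampwidth(pa.get_sample_size(pyaudio.paInt16))
    f.setframerate(44100)
    f.writeframes(b"".join(frames))
stream.stop_stream(); stream.close(); pa.terminate()
```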

benchmarks

it is always a mystery for us to develop the right ML model. however, we can set up guidelines for good performance on a specific task.

automate the benchmark and set up metrics. that leaves more room for trials and imagination.

encoding

use hfft/rfft to transform multipart inputs (special bits, different parts of the mouse coords (x, y, dx, dy))
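a toy numpy sketch of the idea; the packing order of the parts is arbitrary here:

```python
import numpy as np

# pack special bits and mouse parts into one real-valued vector
special_bits = np.array([1, 0, 1, 1], dtype=float)
mouse = np.array([512, 384, -3, 7], dtype=float)  # x, y, dx, dy
parts = np.concatenate([special_bits, mouse])

spectrum = np.fft.rfft(parts)                     # complex, length n//2 + 1
recovered = np.fft.irfft(spectrum, n=parts.size)  # round-trips losslessly
assert np.allclose(recovered, parts)
```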

if you want to use complex numbers as RNN input, you may need to swap ViT for ComplexConv2D, though maybe only in a few layers.


libraries that handle complex neural networks (a usage sketch follows the list):

complexPyTorch

pytorch-complex
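a minimal sketch with complexPyTorch, assuming the ComplexConv2d / complex_relu API as shown in its README:

```python
import torch
from complexPyTorch.complexLayers import ComplexConv2d, ComplexLinear
from complexPyTorch.complexFunctions import complex_relu

conv = ComplexConv2d(1, 8, kernel_size=3, padding=1)
fc = ComplexLinear(8 * 16 * 16, 10)

x = torch.randn(4, 1, 16, 16, dtype=torch.complex64)  # complex input batch
h = complex_relu(conv(x))
out = fc(h.reshape(4, -1))   # complex logits
print(out.abs().shape)       # take magnitude for real-valued scores
```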

multimodal

does our model have to output multimodal data?

if you combine some “special” bits with the token embedding by ihfft, you may have to retrain the entire damn network. also, to make way for the special bits, you may have to introduce an extra linear layer.


some may prefer “LoRA”, introducing only a few tunable params to change the overall output?


we may not annotate anything in our dataset. instead, we will set goals and make multiple interfaces for our model to explore.


you can add a task-specific embedding before passing input to the main model, then subtract that task-specific embedding before passing the output to the classification model.
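a hypothetical sketch of that add-then-subtract trick; task_emb and both model stubs are placeholders:

```python
import torch

d_model = 256
task_emb = torch.nn.Parameter(torch.randn(d_model) * 0.02)  # learned per task
main_model = torch.nn.Identity()            # placeholder for the main model
classifier = torch.nn.Linear(d_model, 10)   # placeholder classification head

x = torch.randn(4, 16, d_model)        # (batch, seq, d_model)
h = main_model(x + task_emb)           # add the task embedding going in
logits = classifier(h - task_emb)      # subtract it before classification
```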

file sharing and communication

make sure you don’t share important files as read/write on the VM.


you may host some “execution server” on UTM VMs. you may expose your very large hard disk using a WebDAV server. i think x11vnc and other vnc servers may suffice for linux, but we always want to capture the real operational data, including human operation/intervention, not just what is visible in the VNC protocol.


WebDAV servers:

wsgidav (python)

```bash
wsgidav --host=192.168.64.1 --port=8081 --root="/Volumes/Toshiba XG3/works/agi_computer_control"  --auth=anonymous
```

webdav-cli (nodejs)

```bash
webdav-cli --host=192.168.64.1 --port=8081 --username=root --password=root --path="/Volumes/Toshiba XG3/works/agi_computer_control"
```

video recording

for the Ubuntu ARM VM, mss fails on wayland but pyautogui works in both cases. write one python script that pipes raw images to ffmpeg (spawned from the shell) for a better compression ratio. the final video is not “time-accurate”: it is frame by frame, matched with timestamps.
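a sketch of that pipe; note that pyautogui.size() may differ from the actual screenshot size on HiDPI screens, so the frame geometry here is an assumption:

```python
import subprocess, time
import pyautogui

w, h = pyautogui.size()
fps = 5
ff = subprocess.Popen([
    "ffmpeg", "-y",
    "-f", "rawvideo", "-pix_fmt", "rgb24",
    "-s", f"{w}x{h}", "-r", str(fps),
    "-i", "-",                                   # raw frames on stdin
    "-c:v", "libx264", "-preset", "veryfast",
    "-pix_fmt", "yuv420p",
    "out.mp4",
], stdin=subprocess.PIPE)

with open("timestamps.txt", "w") as ts:
    for _ in range(fps * 10):                    # ~10 seconds of capture
        frame = pyautogui.screenshot()           # PIL Image, RGB
        ff.stdin.write(frame.tobytes())
        ts.write(f"{time.time()}\n")             # match frames to wall time
ff.stdin.close()
ff.wait()
```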


force ubuntu to use xorg: sudo vim /etc/gdm3/custom.conf and uncomment WaylandEnable=false

resize UTM VM disks

you need to first resize the virtio disk in the UTM settings, then resize the partition with gparted, then update the device mapper


2023-04-02
Chatgpt Local Version

Run some community-contributed ChatGPT-like models on commodity PCs.

Model Selection

Below are some models we are about to use:

There are quite a few more models to be listed. You can check this curated open-source ChatGPT-like model list for updates. But for now, these models shall be sufficient.

Quantization and Optimization

Floating-point values in model weights are stored as 32-bit. Quantization can reduce storage space and computation by switching to 16-bit, 8-bit or 4-bit values. However, most quantized models cannot be trained or fine-tuned; some 16-bit models can only be trained on certain GPU architectures, such as Ada and Turing.
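One common route is loading in 8-bit via transformers with bitsandbytes; the model name below is just an example repo:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "decapoda-research/llama-7b-hf"  # example repo name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # bitsandbytes 8-bit quantization at load time
    device_map="auto",   # spill layers across GPU/CPU as needed
)
```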

To make LLM (Large Language Model) inference feasible on common hardware, a GPU is usually mandatory. However, most commodity GPUs have less VRAM than the machine has RAM, limiting the size of the LLM that can be run, and thus its capability. A typical PC might have 12GB of VRAM but 32GB of RAM. GGML is a project aiming to make LLM inference on CPU as fast as on GPU, exploiting the larger RAM to run larger LLMs. Some popular LLMs, like LLaMA and Alpaca, have already been ported to GGML.
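A CPU inference sketch with the llama-cpp-python bindings over a GGML model file; the model path is an assumption:

```python
from llama_cpp import Llama

# path to a GGML-quantized LLaMA weights file (assumed to exist locally)
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")
out = llm("Q: Name the planets in the solar system. A:",
          max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```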

Training and Fine-tuning

In deep learning, people tend to tune all parameters during training, requiring much VRAM and time. To train GPT-3.5 aka ChatGPT, OpenAI spent millions renting interconnected A100 GPUs. No individual can afford such a setup.

With technologies like LoRA, which freezes most of the model and introduces a small fraction of tunable parameters, training requirements can be greatly reduced. One can easily tune a 7B LLaMA or 14B RWKV with LoRA on a machine (usually rented in the cloud, such as on AutoDL) with a single 80GB A100 card and 200GB of RAM.
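A LoRA setup sketch with the peft library; the target module names follow LLaMA's attention projection naming and are an assumption for other architectures:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # LLaMA attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a tiny fraction of the full model
```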

Prompting and Chaining

LLMs are general problem solvers given enough external storage and access to search engines. Text is the only interface to language models (though not to multimodal LLMs, like GPT-4, OFA or UniLM).

To enhance the capability of LLMs, you have to maintain their memory, define action keywords and trigger external actions during the conversation, and connect them to semantic search engines powered by other AI models, like sentence transformers.

One such library is LangChain.
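A conversational memory sketch matching LangChain's early-2023 API (the library moves fast, so treat this as a sketch):

```python
from langchain.llms import OpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

llm = OpenAI(temperature=0)  # any LLM wrapper works here
chain = ConversationChain(llm=llm, memory=ConversationBufferMemory())
print(chain.predict(input="Hi, remember that my VM is at 192.168.64.1"))
print(chain.predict(input="What is my VM's address?"))  # recalled from memory
```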

Serving as API

Generation in LLMs is sequential, so the server needs to expose a streaming API to match this behavior. Tokens are fetched from the server one by one as they are generated and revealed incrementally in the frontend.
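A minimal token-streaming endpoint sketch using FastAPI; the token generator is a stand-in for an actual model:

```python
import time
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def fake_tokens():                     # stand-in for real LLM generation
    for tok in "hello from a local LLM".split():
        time.sleep(0.1)                # generation is sequential
        yield tok + " "

@app.get("/chat")
def chat():
    # chunked response: the frontend reveals tokens as they arrive
    return StreamingResponse(fake_tokens(), media_type="text/plain")
```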

One can check third-party frontend-only or self-hosted projects for conversational LLMs for reference.
