Agi That Controls Computer
This article provides detailed information on a wide range of topics, including creating tokenizers and embeddings using ChatGPT-derived projects, remotely controlling OBS, handling complex neural networks, generating multimodal data output, utilizing LoRA for AI model performance improvement, annotating datasets, developing task-specific embeddings in classification models, working with various libraries for managing neural networks, considering VNC server factors, using webdav-cli
, recording video on Ubuntu ARM VM, adjusting display settings in UTM VMs, and resizing UTM VM disks via virtio disk resizing and Gparted.
make specialized (in RPA) tokenizer and embedding for this new model. add new words to the tokenizer.
you can just boot ubuntu/kali/parrot iso without installing.
but that would make us embarrasing. we need to check for the option.
use ChatGPT-derived projects for localized propaganda on CyberGod and The Frozen Forest.
obs remote control
using obs-websocket you can use python to do real scripting. but first spin up obs first (with websocket related commandline arguments)
launch obs in minimized way obs --minimize-to-tray
or just using xvfb.
you can also write and load scripts for obs, run on custom intervals and conditions.
audio recording
your OS may go slient if you want to record audio from “speakers”
using pyaudio, on macos, you need blackhole for sending all audio to oblivion, thus able to be recorded.
on Linux, you need audio loopback device.
run: sudo modprobe snd-aloop
you use hw:1:0
or “Analog Device Output” for slient output/speaker, and use hw:1:1
or “Analog Device Input” for recording.
benchmarks
it is always a mystery for us to develop the right ML model. however, we can setup guidelines of good performance over specific task.
automate the benchmark, setup metrics. there could be more room for trials and imagination.
encoding
use hfft/rfft to transform multipart inputs (special bits, different part of mouse coords (x, y, dx, dy))
if you want to use complex number as RNN input, you may need to swap ViT for ComplexConv2D, but maybe you just need a few.
libraries that handle complex neural networks:
multimodal
do our model have to output multimodal data?
if you combine some “special” bits along with token embeding by ihfft, you may have to retrain the entire damn network. also in order to make way for special bits, you may have to introduce extra linear layer.
some may prefer “LoRA”? by only introducing few tunable params and changing the overall output?
we may not annotate anything in our dataset. in contrast, we will set goals and make multiple interfaces for our model to explore.
you can add special task specific embedding before passing to main model, then minus that task specific embedding after passing to classification model.
file sharing and communication
make sure you don’t share important files as read/write on VM.
you may host some “execution server” on UTM VMs. you may expose your very large hard disk using WebDAV server. i think x11vnc and other vnc server may suffice for linux, but we always want to listen to the real operational data, including human operation/intervention, not just those in VNC protocols.
WebDAV servers:
wsgidav (python)
1 | wsgidav --host=192.168.64.1 --port=8081 --root="/Volumes/Toshiba XG3/works/agi_computer_control" --auth=anonymous |
webdav-cli (nodejs)
1 | webdav-cli --host=192.168.64.1 --port=8081 --username=root --password=root --path="/Volumes/Toshiba XG3/works/agi_computer_control" |
video recording
for Ubuntu ARM VM, mss
failed on wayland but pyautogui
works in both cases. write one python script to pipe raw images to ffmpeg for better compression ratio by shell. the final video is not “time-accurate”. it is frame by frame, matched with timestamps.
forcing ubuntu to use xorg by: sudo vim /etc/gdm3/custom.conf
resize UTM VM disks
you need to first resize the virtio disk in utm setting, then resize partition by using gparted, then update the device mapper