Autonomous Machines & Society.

2023-05-21
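Summer Cooling Devices

operating the computer while standing requires a pedal-style treadmill


circulating-water mattress

circulating-water vest

seat cushion and backrest with built-in fans

floor tiles

humidifier fan

water-cooled air conditioner, automatic water refill, industrial

dehumidifier with automatic drainage

potassium permanganate

soft-plus-rigid air conditioner outlet deflector, heavy-duty tape, waterproof

portable dual-duct air conditioner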


2023-05-04
Agi That Controls Computer

make a specialized (for RPA) tokenizer and embedding for this new model, and add new words to the tokenizer.


you can just boot ubuntu/kali/parrot iso without installing.

but that would be embarrassing for us. we need to check for the option.


use ChatGPT-derived projects for localized propaganda on CyberGod and The Frozen Forest.

obs remote control

using obs-websocket you can use python to do real scripting. but first spin up obs (with websocket-related command-line arguments).

launch obs minimized with obs --minimize-to-tray, or just use xvfb.

you can also write and load scripts for obs, run on custom intervals and conditions.

audio recording

your OS may go silent if you want to record audio from “speakers”


using pyaudio on macos, you need blackhole, a virtual audio device: all audio is sent into oblivion instead of the speakers, and can therefore be recorded.

on Linux, you need an audio loopback device.

run: sudo modprobe snd-aloop

use hw:1,0 or “Analog Device Output” as the silent output/speaker, and hw:1,1 or “Analog Device Input” for recording (the card number may differ on your machine).
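a minimal sketch of picking those devices by name with pyaudio. the helper `pick_device` is hypothetical; it only relies on the `name`, `index`, `maxInputChannels` and `maxOutputChannels` keys that pyaudio's `get_device_info_by_index` returns. the keywords you pass in are assumptions that depend on how your system names the loopback card.

```python
def pick_device(devices, keyword, need_input):
    """Return the index of the first device whose name contains keyword.

    devices: list of dicts shaped like PyAudio's get_device_info_by_index() output.
    need_input: True to require input channels (recording), False for output channels.
    """
    channel_key = "maxInputChannels" if need_input else "maxOutputChannels"
    for info in devices:
        if keyword.lower() in info["name"].lower() and info[channel_key] > 0:
            return info["index"]
    return None


def list_devices():
    """Query the real devices. PyAudio is imported lazily so pick_device works without it."""
    import pyaudio

    pa = pyaudio.PyAudio()
    try:
        return [pa.get_device_info_by_index(i) for i in range(pa.get_device_count())]
    finally:
        pa.terminate()
```

for example, `pick_device(list_devices(), "hw:1,1", need_input=True)` would give the index to pass as `input_device_index` when opening the recording stream, assuming the loopback card shows up as card 1.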

benchmarks

developing the right ML model is always a mystery to us. however, we can set up guidelines for what counts as good performance on specific tasks.

automate the benchmark and set up metrics. there could be more room for trials and imagination.

encoding

use hfft/rfft to transform multipart inputs (special bits, different parts of mouse coords (x, y, dx, dy))
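a minimal sketch of the idea, assuming numpy and a made-up input layout of [special bit, x, y, dx, dy]. `rfft` turns the real-valued vector into a shorter complex one and `irfft` recovers it. the function names and the zero padding are assumptions for illustration only.

```python
import numpy as np


def encode_event(special_bit, x, y, dx, dy):
    """Pack one mouse event into a complex vector via the real FFT.

    The field order is a hypothetical layout, zero-padded to an even length.
    """
    real_parts = np.array([special_bit, x, y, dx, dy, 0.0], dtype=np.float64)
    return np.fft.rfft(real_parts)  # complex vector of length 6 // 2 + 1 = 4


def decode_event(spectrum, length=6):
    """Invert encode_event; length must match the padded input length."""
    return np.fft.irfft(spectrum, n=length)[:5]
```

`np.fft.ihfft` and `np.fft.hfft` form the analogous pair with the conjugated, rescaled convention, so the same round trip works with them.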

if you want to use complex numbers as RNN input, you may need to swap ViT for ComplexConv2D, but maybe you just need a few.


libraries that handle complex neural networks:

complexPyTorch

pytorch-complex

multimodal

does our model have to output multimodal data?

if you combine some “special” bits with the token embedding by ihfft, you may have to retrain the entire damn network. also, in order to make way for the special bits, you may have to introduce an extra linear layer.


some may prefer “LoRA”, which introduces only a few tunable params and changes the overall output?


we may not annotate anything in our dataset. in contrast, we will set goals and make multiple interfaces for our model to explore.


you can add a special task-specific embedding before passing to the main model, then subtract that task-specific embedding after passing to the classification model.

file sharing and communication

make sure you don’t share important files as read/write on VM.


you may host some “execution server” on UTM VMs. you may expose your very large hard disk using a WebDAV server. i think x11vnc and other vnc servers may suffice for linux, but we always want to listen to the real operational data, including human operation/intervention, not just what the VNC protocol carries.


WebDAV servers:

wsgidav (python)

wsgidav --host=192.168.64.1 --port=8081 --root="/Volumes/Toshiba XG3/works/agi_computer_control"  --auth=anonymous

webdav-cli (nodejs)

webdav-cli --host=192.168.64.1 --port=8081 --username=root --password=root --path="/Volumes/Toshiba XG3/works/agi_computer_control"

video recording

on the Ubuntu ARM VM, mss fails on wayland but pyautogui works on both wayland and xorg. write one python script that pipes raw images to an ffmpeg process for a better compression ratio. the final video is not “time-accurate”: it is recorded frame by frame, and each frame is matched with a timestamp.
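a rough sketch of such a script, assuming ffmpeg with libx264 is on the PATH and pyautogui has a display to capture. the function names and the choice of H.264 are assumptions, not a record of the script actually used.

```python
import subprocess
import time


def build_ffmpeg_cmd(width, height, fps, out_path):
    """Build an ffmpeg command that reads raw RGB frames from stdin and encodes H.264."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",  # frames arrive on stdin
        # libx264 with yuv420p needs an even width and height
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        out_path,
    ]


def record(out_path, frames, fps=10):
    """Grab screenshots with pyautogui and pipe them to ffmpeg.

    Returns the capture timestamps so each frame can be matched to a time later.
    """
    import pyautogui  # imported lazily because it needs a display

    proc = None
    timestamps = []
    for _ in range(frames):
        image = pyautogui.screenshot().convert("RGB")
        timestamps.append(time.time())
        if proc is None:
            # size the encoder from the first real frame, not from the reported screen size
            width, height = image.size
            proc = subprocess.Popen(
                build_ffmpeg_cmd(width, height, fps, out_path), stdin=subprocess.PIPE
            )
        proc.stdin.write(image.tobytes())
    if proc is not None:
        proc.stdin.close()
        proc.wait()
    return timestamps
```

the timestamps list is what makes up for the video not being time-accurate: save it next to the video and look up frame i at `timestamps[i]`.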


force ubuntu to use xorg by setting WaylandEnable=false in the gdm config: sudo vim /etc/gdm3/custom.conf

resize UTM VM disks

you need to first resize the virtio disk in the UTM settings, then resize the partition using gparted, then update the device mapper


2023-04-05
Comprehensive Page Dump From Multiple Devices

as i said, taking notes manually is always a bad idea if you can just do it with an LLM, since web browsing is just like playing video games.

to get latest personal news (like https://github.com/James4Ever0.private.atom?token=<token>), trends from github, use some rss feeds.

to monitor discord channels, QQ chats and wechat messages, and filter the links out of them.

to dump pages and clipboards from devices, which is what we are trying to do here.
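a small sketch of the first two items using only the standard library: reading entries out of an Atom feed and filtering links out of chat messages. the function names and the URL pattern are assumptions; fetching the feed and hooking into the chat apps are left out.

```python
import re
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def parse_atom(xml_text):
    """Return (title, link) pairs for every entry in an Atom feed."""
    root = ET.fromstring(xml_text)
    items = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        link = entry.find("atom:link", ATOM_NS)
        href = link.get("href", "") if link is not None else ""
        items.append((title.strip(), href))
    return items


def extract_links(messages):
    """Collect unique URLs from chat messages, keeping first-seen order."""
    seen = []
    for message in messages:
        for url in URL_RE.findall(message):
            url = url.rstrip(".,;:!?)")  # drop trailing punctuation picked up by the pattern
            if url not in seen:
                seen.append(url)
    return seen
```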


better keep it tight. when doing so, you want to make sure it is absolutely private. nothing is public.

you can use a filter or something. you can also just keep it discreet.

Webpage dumps

use “bookmark all tabs” or “copy all tab urls” browser extension, or your OG lazero extension

macOS

check for “Comprehensive Research” series


you can only bookmark all tabs and tab groups within a single window.

for example, if you have two windows open, you need to do it twice to save all links to your computer.

Kali

check “Healthcare and Watch Cases” under firefox bookmarks.

Android

to install a plugin on kiwi browser, you (maybe) need to unpack and install it manually.

you need chrome extension v3 or below. pack the extension manually if needed. must install from .crx file.

query: site:github.com chrome copy all tabs

use this or this or this.

if you want markdown-style, use this or this or this (you need to modify this as it is the only hope now) or this. query: site:github.com chrome copy all tabs markdown.

you click every tab manually, in order to copy them. goddamn the android!


for firefox, you search in the official addon store.


clipboard data and tab records dumped at ~/works/modifier_clipboard_and_browser_tabs_dumps


remember to clean up the damn phone. it is running way too slow.


2023-04-04
Faster Python

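recent python releases have better performance (CPython 3.11 is on average about 25% faster than 3.10 on the pyperformance benchmark suite, according to its release notes).

pypy is fast.

codon uses python syntax and compiles it into a static executable.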


2023-04-04
Rsibreak, Break Reminder

Smart Watch

do not mix the water with the watch. you have been warned.


Prefer WearOS watches like LG W100. Found 3D printable case on Shapeways but not downloadable. Shapeways provides service for printing. Shapeways builds ShapeJS which can construct 3D models with code.


Use Pixel2Mesh++ to recover 3D meshes from multiple images (dumped at ~/Desktop/works/shapeways_reconstruct_from_image_lg_w100) of different viewpoints. Determine the size of the mesh after measuring the watch or looking up its specs.


if you want an iwatch (apple watch) instead, remember to buy some apple gift cards for buying watchOS apps. remember to ask about battery life, since older watches tend to die halfway through the day. buy an iphone 6s or newer with ios 14 or newer to manage and install apps on the watch. no need for 3d modeling since there are plenty of tough-tested cases around.

DIY

if you want to do it on your own, you have to know how to send notifications on different operating systems.


on macOS:

osascript -e 'display notification "This message should be showing on the notification" with title "Coding Tips"'

terminal-notifier (brew installable)

alerter


on linux:

notify-send "Dinner ready!"

using remind:

remind "I'm still here" now
remind "Time to wake up!" in 5 minutes
remind "Dinner" in 1 hour
remind "Take a break" at noon
remind "It's Friday pints time!" at 17:00


on windows:

msg /SERVER:DestinationPC * /TIME:60 "This is the message to be sent to a PC named DestinationPC and closes in 60 seconds."

notify-send-for-Windows (needs AHK)

tutorial
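the three commands above can be wrapped into one python helper. this is only a sketch: `notification_command` and `notify` are hypothetical names, and the Windows branch assumes the local `msg` tool is available and sends to every session on the machine.

```python
import platform
import subprocess


def notification_command(title, message, system=None):
    """Return the argv list that shows a desktop notification on the given OS."""
    system = system or platform.system()
    if system == "Darwin":
        def esc(text):
            # escape backslashes and double quotes for an AppleScript string literal
            return text.replace("\\", "\\\\").replace('"', '\\"')
        script = f'display notification "{esc(message)}" with title "{esc(title)}"'
        return ["osascript", "-e", script]
    if system == "Linux":
        return ["notify-send", title, message]
    if system == "Windows":
        return ["msg", "*", "/TIME:60", f"{title}: {message}"]
    raise ValueError(f"unsupported system: {system}")


def notify(title, message):
    """Show the notification on the current machine."""
    subprocess.run(notification_command(title, message), check=True)
```

a break reminder is then just a loop that sleeps for the work interval and calls `notify("Break", "Stand up and stretch")`.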

break reminder tools

RSIBreak is for linux, and it does not work well.

stretchly has an online version. on macOS make sure your browser is allowed to post notifications.

“Drink.” on mac app store is a water drinking reminder for macOS.

BreakTimer has windows, macOS and linux versions. on linux you had better use the snap or appimage version.


2023-04-03
Webproxy, Clash, Proxy.Py

somebody hates clash and proxy.py, now we proxy websites directly in another website:


Ultraviolet by Titanium Network

visit Holy Unblocker for demonstration

holyub-alike websites


github topic: webproxy


2023-04-03
Markdown to PDF

Gist on using python3 and PhantomJS

npm package: markdown-pdf


2023-04-03
Ask Chatgpt How To Create An Ai Model For Controlling Computer Under Human Instruction

question

how do i design an ai model using gpt, rnn, lstm, vit to accept computer display screenshots and human instructions as input, keyboard and mouse events as output, trained on random keystrokes and mouse events, also human instructions paired with human demonstrations? show me python code

answers

Claude+

That would be an extremely complex AI system to build. Some high level steps would be:
Collect a large dataset of screenshots, human instructions, and corresponding keyboard/mouse interactions. This dataset would be crucial for training.
Use a CNN to extract visual features from the screenshots. Something like VGG16 or ResNet would work well here.
Use an RNN/LSTM to encode the sequence of instructions. BERT or RoBERTa would also be good options here.
Combine the visual and textual features. Options include concatenation, addition, or a more complex fusion method.
Use a Transformer like GPT-3 or Visual-Transformers to generate the keyboard/mouse interactions from the combined visual+textual features.
Train the entire model end-to-end to minimize the difference between the generated interactions and the ground truth interactions from your dataset.
This would likely take a team of ML engineers and researchers to develop, and would require lots of time, data, and compute.
Some sample Python code would be:
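import torch
import torch.nn as nn
import torchvision.models as models
from transformers import RobertaModel, RobertaTokenizer, GPT2LMHeadModel
# Feature Extractors
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = nn.Identity()  # expose the 2048-d pooled features instead of ImageNet logits
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base")
# Fusion (output width matches GPT-2's 768-d embedding size)
fusion = nn.Sequential(nn.Linear(2048 + 768, 1024), nn.ReLU(), nn.Linear(1024, 768))
# Generation Model
generator = GPT2LMHeadModel.from_pretrained("gpt2")
# Forward pass; img is a float tensor of shape (B, 3, 224, 224), instructions is a list of B strings
img_features = cnn(img)
tokens = tokenizer(instructions, return_tensors="pt", padding=True, truncation=True)
instruction_features = roberta(**tokens)[0][:, 0, :]  # <s> token embedding, shape (B, 768)
combined_features = fusion(torch.cat((img_features, instruction_features), dim=1))
preds = generator(inputs_embeds=combined_features.unsqueeze(1))[0]  # logits, shape (B, 1, vocab)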
This is just a brief sample, but hopefully gives you an idea of the components and flow. Let me know if you have any other questions!

GPT4

Designing an AI model that combines GPT, RNN, LSTM, and ViT for the task you described is complex and requires deep understanding of these architectures. However, I’ll provide you with a high-level overview and a basic structure of the code to help you get started.

  1. First, process the input data:
     • For computer display screenshots, use a Vision Transformer (ViT) model to extract features.
     • For human instructions, use a GPT model to encode the natural language instructions.

  2. Combine the features from both the ViT and GPT models.

  3. Use an LSTM or RNN to process the combined features and maintain the temporal context.

  4. Finally, decode the output of the LSTM/RNN into keyboard and mouse events.

Here’s a basic structure of the code:

import torch
import torchvision
from transformers import GPT2Tokenizer, GPT2Model
# Load pretrained models
gpt_model = GPT2Model.from_pretrained('gpt2')
vit_model = torchvision.models.vit_b_16(weights=torchvision.models.ViT_B_16_Weights.DEFAULT)
vit_model.heads = torch.nn.Identity()  # drop the ImageNet classifier to expose the 768-d class-token feature
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token, which padding=True needs
class CustomModel(torch.nn.Module):
    def __init__(self, gpt_model, vit_model, hidden_size, output_size):
        super().__init__()
        self.gpt_model = gpt_model
        self.vit_model = vit_model
        # GPT-2 base and ViT-B/16 each give 768-d features, so the concatenation is 1536-d
        self.rnn = torch.nn.LSTM(input_size=768 * 2, hidden_size=hidden_size, batch_first=True)
        self.fc = torch.nn.Linear(hidden_size, output_size)
    def forward(self, instructions, screenshots):
        # Encode instructions using GPT2
        inputs = tokenizer(instructions, return_tensors="pt", padding=True, truncation=True)
        gpt_output = self.gpt_model(**inputs)
        # GPT-2 is causal, so summarize with the last non-padding token (assumes right padding)
        last_token = inputs["attention_mask"].sum(dim=1) - 1
        gpt_features = gpt_output.last_hidden_state[torch.arange(last_token.shape[0]), last_token]
        # Encode screenshots using ViT; expects a float tensor of shape (batch, 3, 224, 224)
        vit_features = self.vit_model(screenshots)
        # Combine features
        combined_features = torch.cat((gpt_features, vit_features), dim=1)
        # Process combined features through an RNN
        rnn_output, _ = self.rnn(combined_features.unsqueeze(1))
        # Decode the output into keyboard and mouse events
        output = self.fc(rnn_output.squeeze(1))
        return output
# Instantiate the custom model
model = CustomModel(gpt_model, vit_model, hidden_size=1024, output_size=256)
# Train and evaluate the model using your dataset

You’ll need to collect a dataset with computer display screenshots, human instructions, and paired keyboard and mouse events. Then, preprocess the data, split it into training and validation sets, and train the model using an appropriate loss function and optimizer.

Keep in mind that this is just a starting point, and you may need to experiment with different architectures, loss functions, and hyperparameters to achieve the desired performance.


2023-04-02
Chatgpt Local Version

Run some community-contributed ChatGPT-like models on commodity PCs.

Model Selection

Below are some models we are about to use:

There are quite a few more models to be listed. You can check this curated open-sourced ChatGPT-like model list for updates. But for now, these models shall be sufficient.

Quantization and Optimization

Floating-point values in model weights are usually stored as 32-bit. Quantization can reduce storage space and computation by switching to 16-bit, 8-bit or 4-bit values. However, most quantized models cannot be trained or fine-tuned, and some 16-bit models can only be trained on certain GPU architectures, such as Ada and Turing.
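To make the idea concrete, here is a minimal sketch of symmetric absmax 8-bit quantization in plain Python. It is one simple scheme chosen for illustration, not the method any particular model or library uses, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    # fall back to a scale of 1.0 when every weight is zero
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]
```

Each weight is then stored as one signed byte plus a single shared scale, and the rounding error per weight is at most half of that scale.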

To make LLM (Large Language Model) inference feasible on common hardware, a GPU is usually mandatory. However, most commodity GPUs have less VRAM than the machine has RAM, which limits the size, and thus the capability, of the LLM that can be run. A typical PC might have 12GB of VRAM and 32GB of RAM. GGML is a project aiming to make LLM inference on CPU as fast as on GPU, using the larger RAM to run larger LLMs. Some popular LLMs, like LLaMA and Alpaca, have already been ported to GGML.
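A back-of-the-envelope sketch of why this matters: weight storage is roughly the parameter count times the bits per weight. The helper names are hypothetical, and the estimate ignores activations, the KV cache and runtime overhead.

```python
def model_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


def fits(num_params, bits_per_weight, memory_gb):
    """True if the weights alone fit in the given amount of memory."""
    return model_size_gb(num_params, bits_per_weight) <= memory_gb
```

By this estimate a 7B model needs about 28GB at 32-bit, 14GB at 16-bit and 3.5GB at 4-bit, so only the quantized versions fit in 12GB of VRAM.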

Training and Fine-tuning

In deep learning, people tend to tune all parameters during training, which requires a lot of VRAM and time. To train GPT-3.5 (the model behind ChatGPT), OpenAI reportedly spent millions renting interconnected A100 GPUs. No individual can afford that.

With techniques like LoRA, which freeze most of the model and introduce a small fraction of tunable parameters, training requirements can be greatly reduced. One can tune 7B LLaMA or 14B RWKV using LoRA on a single machine (usually rented in the cloud, such as on AutoDL) with one 80GB A100 card and 200GB of RAM.
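A minimal sketch of the LoRA idea in plain Python, assuming the usual formulation where a frozen weight matrix W gets a low-rank update scaled by alpha / rank. The function names and the list-of-lists matrices are only for illustration.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters of a rank-r update: B is d_out x r, A is r x d_in."""
    return rank * (d_out + d_in)


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]
```

For a 4096 x 4096 layer with rank 8, the update has 65,536 trainable parameters against 16,777,216 in the full matrix, which is 1/256 of the original. With B initialized to zeros the adapted layer starts out identical to the frozen one.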

Prompting and Chaining

LLMs are general problem solvers given enough external storage and access to search engines. Text is the only interface to language models (except for multimodal LLMs such as GPT4, OFA or UniLM).

To enhance the capability of an LLM, you have to maintain its memory, define action keywords that trigger external actions during the conversation, and connect it to semantic search engines powered by other AI models such as sentence transformers.

One such library is LangChain.
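The loop behind this can be sketched without any library. The following is a hypothetical illustration, not LangChain's API: the `ACTION: name(argument)` keyword format, the `run_turn` function and the memory layout are all assumptions.

```python
import re

# Hypothetical action keyword format: a line such as "ACTION: search(some query)"
ACTION_RE = re.compile(r"^ACTION:\s*(\w+)\((.*)\)\s*$", re.MULTILINE)


def run_turn(llm, tools, memory, user_message, max_steps=3):
    """Run one conversation turn, executing any tool calls the model requests.

    llm: callable taking the full prompt string and returning the model's text.
    tools: dict mapping action names to callables that take a string argument.
    memory: list of past lines, mutated in place so later turns can see them.
    """
    memory.append(f"USER: {user_message}")
    reply = ""
    for _ in range(max_steps):
        reply = llm("\n".join(memory))
        memory.append(f"ASSISTANT: {reply}")
        match = ACTION_RE.search(reply)
        if match is None or match.group(1) not in tools:
            return reply  # plain answer, no known action requested
        name, argument = match.groups()
        memory.append(f"RESULT: {tools[name](argument)}")
    return reply
```

Here `tools` could map `"search"` to a semantic search function, and `memory` plays the role of the conversation history that is fed back to the model on every step.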

Serving as API

Generation in LLMs is sequential, so the server needs to expose a streaming API to match this behavior. Tokens are fetched from the server one by one at a roughly constant speed and revealed in the frontend as they arrive.
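A minimal sketch of that behavior using a plain Python generator that emits server-sent-event style lines. The `[DONE]` end marker and both function names are assumptions, and tokens containing newlines would need extra escaping.

```python
import time


def stream_tokens(tokens, delay=0.0):
    """Yield tokens one at a time as "data: ..." events, pausing delay seconds between them."""
    for token in tokens:
        yield f"data: {token}\n\n"
        time.sleep(delay)
    yield "data: [DONE]\n\n"  # assumed end-of-stream marker


def collect(stream):
    """What a frontend would do: append each token to the visible text as it arrives."""
    text = ""
    for event in stream:
        payload = event[len("data: "):].rstrip("\n")
        if payload == "[DONE]":
            break
        text += payload
    return text
```

A web framework's streaming response can wrap a generator like `stream_tokens` directly, with the model's token iterator in place of the list.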

One can check third-party frontend-only or self-hosted projects for conversational LLMs for reference.


2023-04-02
Fastapi, Celery, Task Queue, Websocket

enable the jinja2 render options trim_blocks and lstrip_blocks to avoid the whitespace and indentation nightmare.
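a small sketch of what the two options do, assuming jinja2 is installed. the template and the `render` helper are made up for illustration.

```python
from jinja2 import Environment

# Hypothetical template that generates python code; the block tags are indented for readability.
TEMPLATE = (
    "def handlers():\n"
    "    {% for name in names %}\n"
    '    register("{{ name }}")\n'
    "    {% endfor %}\n"
)


def render(names, tidy=True):
    """Render TEMPLATE with or without trim_blocks and lstrip_blocks."""
    env = Environment(trim_blocks=tidy, lstrip_blocks=tidy)
    return env.from_string(TEMPLATE).render(names=names)
```

with both options on, the lines holding only `{% for %}` and `{% endfor %}` vanish from the output; with the defaults they leave whitespace-only lines between the generated statements.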


always remember to import uvicorn if you want to run without the uvicorn executable


generate nodejs client from openapi.json

fastapi-code-generator to generate python code


create doc inside code: adding metadata


to share a lock across processes, use a redis lock or filelock.

to share a lock across forked processes in the same worker, use multiprocessing.Lock()
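a minimal sketch of the forked case using only the standard library. it assumes a POSIX system where the "fork" start method is available; the function names are hypothetical.

```python
import multiprocessing as mp


def _worker(lock, counter, increments):
    """Increment the shared counter, holding the lock for each read-modify-write."""
    for _ in range(increments):
        with lock:
            counter.value += 1


def run_forked_counter(workers=4, increments=1000):
    """Fork several processes that share one lock and one counter; return the final count."""
    ctx = mp.get_context("fork")  # POSIX only
    lock = ctx.Lock()
    counter = ctx.Value("i", 0, lock=False)  # raw shared int, protected by our own lock
    procs = [
        ctx.Process(target=_worker, args=(lock, counter, increments))
        for _ in range(workers)
    ]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return counter.value
```

the lock has to be created before the fork and handed to the children; processes that were started independently cannot see it, which is where a redis lock or filelock comes in.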


fastapi can generate openapi json and doc page

websockets are async. will it block the server?

using websocket in fastapi

celery advance usage

celery and fastapi

happened to find akismet (proprietary wordpress spam protection). oss alternatives are:
