Autonomous Machines & Society.

2024-03-16
Staying Focused

People who need to stay focused for long periods:

  • Writers

  • Hackers

  • Programmers

  • Professional Gamers


Find out what they take and which methods they use to stay focused and productive.


Sugar-free energy drinks can be better than Red Bull.


2024-03-14
Cybergod-Like Agents, General Computer Control

matmul-free llm

https://arxiv.org/abs/2406.02528


https://github.com/KingNishHF/OpenGPT-4o

aider coding assist devon alternative

kyutai moshi gpt4o alternative

firefunction v2 function calling llm

codegeex4-all-9b


https://github.com/lavague-ai/LaVague

https://github.com/Upsonic/Tiger

computer agents:

https://github.com/slavakurilyak/awesome-ai-agents


gui agent model trained on gui-world

gui agent datasets on huggingface

autocoder with pretrained models, has access to terminal:

https://github.com/bin123apple/AutoCoder


You can label the GUI manually: write a comment for each UI element, along with the exact execution steps needed to operate it.
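Such manual labels can be kept in a simple machine-readable file. A minimal sketch (the element names, fields, and steps here are hypothetical, not from any specific project):

```python
import json

# hypothetical hand-written label file: one entry per UI element, with a
# human comment and the exact execution steps needed to operate it
labels = {
    "login_button": {
        "bbox": [412, 380, 520, 416],  # x1, y1, x2, y2 in screen pixels
        "comment": "submits the login form",
        "steps": ["move cursor to bbox center", "left click", "wait 2s"],
    }
}
serialized = json.dumps(labels, indent=2)
print(len(labels))  # 1
```

An agent can then be prompted with the comments and replay the steps verbatim.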


GUI detection algorithm:

https://github.com/MulongXie/UIED


minified segment anything model:

https://github.com/xinghaochen/TinySAM


https://github.com/graylan0/gptcomputer

https://github.com/patterns-complexity/gpt-pc-control

https://github.com/b5marwan/gpt-vision-agent

https://github.com/rogeriochaves/driver

https://github.com/s-a-ng/control-pc-with-gpt4-vision


gpt related:

https://github.com/szczyglis-dev/py-gpt

https://github.com/EwingYangs/awesome-open-gpt


gpt-4o is gaining popularity in computer control.

https://github.com/CK92149/GPTComputerAutomation

https://github.com/onuratakan/gpt-computer-assistant

https://github.com/kyegomez/GPT4o


terminal controlling agent:

https://github.com/greshake/Alice


Simulated computer control environments:

https://github.com/xlang-ai/OSWorld


Multi-agent framework, routing:

https://python.langchain.com/v0.1/docs/langgraph


Devin open source alternative:

https://github.com/entropy-research/Devon

https://github.com/stitionai/devika

https://github.com/semanser/codel


Web browsing agent:

https://github.com/THUDM/AutoWebGLM


Agent-Eval-Refine contains models for GUI captioning, an iOS-finetuned CogAgent, and several GUI agent datasets.


ScreenAgent includes a list of related computer-control papers and projects, along with a self-trained model on Hugging Face.

Similar projects:

https://github.com/TobiasNorlund/UI-Act

Listed projects:

https://github.com/x-plug/mobileagent

https://github.com/google-research/google-research/tree/master/screen2words

https://github.com/rainyugg/blip-adapter

https://github.com/imnearth/coat

https://github.com/xbmxb/aagent

https://github.com/princeton-nlp/ptp

https://github.com/njucckevin/seeclick

https://github.com/thudm/autowebglm

https://github.com/OS-Copilot/OS-Copilot

Environments:

https://github.com/google-deepmind/android_env

https://github.com/x-lance/mobile-env

Datasets:

https://github.com/google-research-datasets/screen_qa


Open-Interface utilizes GPT-4V to control the computer interface.


Devin is an AI agent that can solve many real-world GitHub issues, with access to a browser, terminal and code editor.

Cradle is a general computer-controlling agent developed to play Red Dead Redemption II.

Pythagora, aka GPT Pilot, is a true AI developer that writes code, debugs it, and talks to you when it needs to.




GPA-LM: a list of game playing agents


2024-03-13
Do Not Use Better_Exceptions With Multithreading Or Multiprocessing

When using it with things like sse_starlette, tracebacks from child processes/threads will be invisible.
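One workaround is to capture the child's traceback explicitly rather than relying on the (possibly patched) excepthook. A minimal sketch, assuming a thread target; the wrapper and names are illustrative, not part of better_exceptions:

```python
import threading
import traceback

captured = []

def with_traceback(fn):
    # wrap a thread/process target so its traceback is captured explicitly
    # instead of disappearing or bypassing better_exceptions formatting
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            captured.append(traceback.format_exc())
    return wrapper

def boom():
    raise ValueError("inside child thread")

t = threading.Thread(target=with_traceback(boom))
t.start()
t.join()
print(captured[0].splitlines()[-1])  # ValueError: inside child thread
```

The captured string can then be logged or re-rendered however you like.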


2024-03-10
AI Assisted Content Creation, Gameplay Video Recording, Trending Topics

code to video

https://github.com/redotvideo/revideo


fishaudio voice cloning

omniparse data serialization


Video understanding and video embedding can be achieved with ViViT (in huggingface).

Video generation agent tutorial

MoneyPrinterTurbo


Mini Gemini


Use enhancr for frame interpolation, super resolution and scaling. The pro version contains faster models.

The app is built using electron forge.

Interpolation gets worse with higher resolution, that’s why I wouldn’t upscale first.

enhancr is built upon the following models:

Interpolation

RIFE (NCNN) - megvii-research/ECCV2022-RIFE - powered by styler00dollar/VapourSynth-RIFE-NCNN-Vulkan

RIFE (TensorRT) - megvii-research/ECCV2022-RIFE - powered by AmusementClub/vs-mlrt & styler00dollar/VSGAN-tensorrt-docker

GMFSS - Union (PyTorch/TensorRT) - 98mxr/GMFSS_Union - powered by HolyWu/vs-gmfss_union

GMFSS - Fortuna (PyTorch/TensorRT) - 98mxr/GMFSS_Fortuna - powered by HolyWu/vs-gmfss_fortuna

CAIN (NCNN) - myungsub/CAIN - powered by mafiosnik/vsynth-cain-NCNN-vulkan (unreleased)

CAIN (DirectML) - myungsub/CAIN - powered by AmusementClub/vs-mlrt

CAIN (TensorRT) - myungsub/CAIN - powered by HubertSotnowski/cain-TensorRT

Upscaling

ShuffleCUGAN (NCNN) - styler00dollar/VSGAN-tensorrt-docker - powered by AmusementClub/vs-mlrt

ShuffleCUGAN (TensorRT) - styler00dollar/VSGAN-tensorrt-docker - powered by AmusementClub/vs-mlrt

RealESRGAN (NCNN) - xinntao/Real-ESRGAN - powered by AmusementClub/vs-mlrt

RealESRGAN (DirectML) - xinntao/Real-ESRGAN - powered by AmusementClub/vs-mlrt

RealESRGAN (TensorRT) - xinntao/Real-ESRGAN - powered by AmusementClub/vs-mlrt

RealCUGAN (TensorRT) - bilibili/ailab/Real-CUGAN - powered by AmusementClub/vs-mlrt

SwinIR (TensorRT) - JingyunLiang/SwinIR - powered by mafiosnik777/SwinIR-TensorRT (unreleased)

Restoration

DPIR (DirectML) - cszn/DPIR - powered by AmusementClub/vs-mlrt

DPIR (TensorRT) - cszn/DPIR - powered by AmusementClub/vs-mlrt

SCUNet (TensorRT) - cszn/SCUNet - powered by mafiosnik777/SCUNet-TensorRT (unreleased)


Kdenlive has many video editing features, like automatic scene split and video stabilization.


To extract existing hard-coded subtitles in videos, use videosubfinder, which is used in Cradle, a Red Dead Redemption II agent.


To check whether audio was actually recorded, we can inspect the amplitude instead of listening.

ffprobe -f lavfi -i "amovie=<audio_or_video_filepath>,astats=metadata=1:reset=1" -show_entries frame=pkt_pts_time:frame_tags=lavfi.astats.Overall.RMS_level -of default=noprint_wrappers=1:nokey=1 -sexagesimal -v error
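The command prints alternating timestamp and RMS-level lines (dB, `-inf` for pure silence). A small parser sketch, assuming that alternating layout and a hypothetical silence threshold:

```python
def audio_present(ffprobe_lines, silence_db=-60.0):
    # every second line is an RMS level in dB ("-inf" for pure silence),
    # assuming the alternating time/RMS layout of the ffprobe command above
    rms = []
    for line in ffprobe_lines[1::2]:
        rms.append(float("-inf") if line.strip() == "-inf" else float(line))
    return any(v > silence_db for v in rms)

sample = ["0:00:00.000000", "-inf", "0:00:00.021333", "-23.5"]
print(audio_present(sample))  # True
```

Feed it the captured stdout lines of ffprobe to get a yes/no answer without listening.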


AI toolbox: a comprehensive content creation toolbox with links to related projects


Use streamlit to write interactive interfaces for video labeling, editing and registration, and for tracking viewer counts.

A gridded image can be used for image-selection prompting and image condensation: putting multiple images together saves processing power during tasks like video rating.
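Computing the grid layout is simple. A sketch that picks a near-square grid and returns where each frame should be pasted (tile sizes here are arbitrary examples):

```python
import math

def grid_layout(n_images, tile_w, tile_h):
    # choose a near-square grid and return the top-left corner of each tile,
    # so n_images frames can be pasted into one condensed image
    cols = math.ceil(math.sqrt(n_images))
    rows = math.ceil(n_images / cols)
    positions = [((i % cols) * tile_w, (i // cols) * tile_h) for i in range(n_images)]
    return (cols * tile_w, rows * tile_h), positions

canvas_size, positions = grid_layout(6, 256, 144)
print(canvas_size)      # (768, 288): 3 columns x 2 rows
print(positions[:3])    # [(0, 0), (256, 0), (512, 0)]
```

The canvas size and positions can then be used with any image library to paste the frames.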

When you play video games on low-end devices, you can turn down the resolution and image quality to maintain 30 FPS.

If you change screen resolution during screen recording, you might lose your view.

Train a video grading system with recent and relevant video grades, and when evaluating, put grading context into the prompt, so the system generalizes.

Get system-predicted labels of video content and train a label predictor from them, providing necessary context about the test video to improve the grading system's accuracy.

Taskmatrix is a multimodal agent framework suitable for multiple types of image editing, using diffusion models.

You can learn what viewers are craving via recommendation engines, dynamic posts and the latest bangumi releases.

Post the same content across multiple platforms to increase view counts.


2024-03-07
Using Nginx As An Application Remapper

For more accurate results, use the nginx GeoIP2 module.

To block access by IP origin while excluding certain ranges, you can do this:

apt install -y libnginx-mod-http-geoip libgeoip

http {
    geoip_country /usr/share/GeoIP/GeoIP.dat;
    geoip_proxy <internal_ip_ranges>;
    geo $external_ip {
        default 1;
        <custom_exclusion_range> 0;
    }
    log_format geologfmt '$remote_addr - $remote_user <$geoip_country_code> [$time_local] '
                         '"$request" $status $body_bytes_sent '
                         '"$http_referer" "$http_user_agent"';
    access_log /var/log/nginx/access.log geologfmt;
}
server {
    # boolean AND in nginx: no native operator, so chain "if" blocks
    # better done with njs instead
    set $a 0;
    set $b 0;
    if ($geoip_country_code != "<YOUR_COUNTRY_CODE>") {
        set $a 1;
    }
    if ($external_ip) {
        set $a 1$a;
    }
    if ($a = 11) {
        set $b 1;
    }
    if ($b) { return 444; }
}

Remember to restart the nginx service afterwards.


Enable HTTP basic auth:

sudo htpasswd -c /etc/nginx/passwd <username>

server {
    auth_basic "<auth_window_title>";
    auth_basic_user_file passwd;
}


Run nginx with debug info:

# within /etc/nginx/nginx.conf
http {
    access_log /var/log/nginx/access.log;
    # only error_log supports the debug level (nginx must be built with --with-debug)
    error_log /var/log/nginx/error.log debug;
}


Remap a range of ports to sub-URLs:

location ~ /server/(1[0-9][0-9][0-9][0-9]) {
    proxy_pass http://localhost:$1/;
}
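The regex above matches five-digit ports 10000-19999. A quick sanity check of the pattern itself:

```python
import re

# same pattern as the nginx location above: "1" followed by four digits
pat = re.compile(r"/server/(1[0-9][0-9][0-9][0-9])")
print(pat.search("/server/10443/status").group(1))  # 10443
print(pat.search("/server/9000/status"))            # None: port not in 10000-19999
```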


To handle CORS errors, one can write:

location /api/ {
    # add_header Access-Control-Allow-Origin *;
    add_header Access-Control-Allow-Origin $http_origin;
    add_header 'Access-Control-Allow-Headers' 'Content-Type';
    if ($request_method = OPTIONS) {
        add_header Access-Control-Allow-Origin *;
        add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS';
        add_header 'Access-Control-Allow-Headers' 'Content-Type, x-requested-with';
        add_header 'Access-Control-Max-Age' 1728000;
        add_header 'Content-Type' 'text/plain; charset=utf-8';
        return 204;
    }
    proxy_pass http://localhost:7861/;
    # proxy_set_header X-Forwarded-Prefix /api;
    sub_filter "openapi.json" "api/openapi.json";
    # sub_filter "SwaggerUIBundle({" "SwaggerUIBundle({ basePath: '/api', 'servers': [{url:'/api'}],";
    sub_filter "static-offline-docs" "api/static-offline-docs";
    sub_filter_once off;
}


After installing nginx, a default page is created under /var/www/html and config files are at /etc/nginx.

Run sudo systemctl restart nginx after config modification.


Edit the file at /etc/nginx/sites-available/default:

server {
    listen 80;
    server_name localhost;
    location /app1 {
        proxy_pass http://localhost:8000;
    }
    location /app2 { # websocket support
        proxy_pass http://localhost:8001;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Origin ""; # to prevent unwanted 403
    }
}


If you want to route FastAPI docs to nginx, you have to rewrite contents.

server {
    location /vllm/openapi.json {
        proxy_pass http://localhost:8000/openapi.json;
        sub_filter "\"paths\"" "\"servers\": [{\"url\": \"/vllm\"}], \"paths\"";
        sub_filter_types application/json;
    }
    location /vllm/ {
        proxy_pass http://localhost:8000/;
        sub_filter "openapi.json" "vllm/openapi.json";
        sub_filter_once off;
    }
}
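An alternative that avoids the sub_filter rewriting, assuming the upstream app is FastAPI served by uvicorn: tell the app it is mounted under a prefix, so it generates /vllm-prefixed docs URLs itself. A minimal config fragment:

```shell
# serve the FastAPI app so its generated docs and openapi.json use the /vllm prefix
uvicorn app:app --port 8000 --root-path /vllm
```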


2024-03-05
Goovis The Right Way

Goovis is not well suited for use in bed, because you get sleepy and tired easily there.


A recliner's backrest cannot effectively support your head once your back rests fully against it; a headrest is needed.

Adjust the headrest to the bulge at the back of the skull, not the neck, otherwise your head will get uncomfortable.

Lean fully against the recliner's backrest, otherwise blood circulation in your hips will suffer.


Do not play games like Cyberpunk 2077; they cause visual fatigue.

Turn the screen brightness down, whether on a regular monitor or the headset.


2024-03-05
Routing Requests With Flask, With Extra Authentication Headers

Run the app like this:

pip3 install gunicorn
gunicorn --bind localhost:8001 <file_name_without_extension>:app

import argparse
import json

import requests
from flask import Flask, Response, request

parser = argparse.ArgumentParser(description='Argument Parser with Default Parameters')
parser.add_argument('--source_port', type=int, default=8000, help='Source Port Number')
args, _ = parser.parse_known_args()
print("Source Port:", args.source_port)
source_port = args.source_port

app = Flask(__name__)
sess = requests.Session()

AUTH_TOKEN = "auth_token"
AUTH_HEADER_KEY = "Auth"
GET = "GET"
POST = "POST"
ALLOWED_METHODS = [GET, POST]

@app.route('/', defaults={'path': ''}, methods=ALLOWED_METHODS)
@app.route('/<path:path>', methods=ALLOWED_METHODS)
def chat_completions(path):
    # full_path is prefixed with "/"
    url = f'http://localhost:{source_port}{request.full_path}'
    request_headers = dict(request.headers)
    auth = request_headers.get(AUTH_HEADER_KEY, None)
    if auth != AUTH_TOKEN:
        return Response(json.dumps({"state": "unauthorized", "message": "Unauthorized access"}), status=401)
    no_auth_headers = {k: v for k, v in request_headers.items() if k != AUTH_HEADER_KEY}
    if request.method == GET:
        response = sess.get(url, stream=True, headers=no_auth_headers)
    else:
        # note: requests ignores json= when data= is non-empty; form data is not forwarded
        response = sess.post(url, stream=True, headers=no_auth_headers,
                             data=request.data, files=request.files,
                             json=request.get_json(silent=True))
    def generate():
        for chunk in response.iter_content(chunk_size=1024):
            yield chunk
    return Response(generate(), content_type=response.headers['content-type'], headers=dict(response.headers))

if __name__ == '__main__':
    app.run()


2024-03-01
Serve Models From Replicate Locally

Replicate internally uses Cog for packaging and serving large AI models in Docker containers. Currently Cog only supports macOS and Linux.

According to the docs, it offers nearly the same functionality as Replicate, such as API calls and fine-tuning.


You may connect your local LLM to VSCode using Continue, an open-source Copilot alternative.


2024-03-01
Image And Audio Feature Extraction For Language Models

Use IPython instead of python to test this code; it gives better parameter hints and dynamic completion, just like what you saw in brownie.

Use dataset.set_transform (or with_transform) instead of dataset.map to save loading time.

Image processing

Many language models resize, reshape and pad the input image into a 224x224 square and feed it directly into a ViT.

To simplify the pipeline, we recommend sampling the image into fixed-size square patches, like 2x2, 4x4, etc.

Or you can skip the ViT embedding part and just use Fuyu-8b, or take its architecture FuyuForCausalLM and processor FuyuProcessor, since it supports arbitrarily sized images.

from transformers import FuyuProcessor, FuyuForCausalLM
from PIL import Image
import requests

model_name = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_name)
model = FuyuForCausalLM.from_pretrained(model_name)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "Generate a coco-style caption.\n"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=7)
generation_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generation_text)

Split image into patches

Images are usually large, so we need to split them.

You have three ways to split an image.

Patchify

The split patch indices are prepended to the shape instead of appended.

import numpy as np
from patchify import patchify
image = np.random.rand(512,512,3)
patches = patchify(image, (128,128,3), step=128)
print(patches.shape) # (4, 4, 1, 128, 128, 3)

Torch unfold

It works by sliding over the target dimension and appending a new dimension holding the extracted windows.

import torch
image = torch.rand(512,512,3)
patches = image.unfold(0, 128, 128).unfold(1, 128, 128).unfold(2, 3, 3)
print(patches.shape) # torch.Size([4, 4, 1, 128, 128, 3])

EMPatches

import numpy as np
from empatches import EMPatches
image = np.random.rand(512, 512, 3)
emp = EMPatches()
patches, indices = emp.extract_patches(image, patchsize = 128, overlap = 0)
print(patches) # a list of numpy arrays, total 16 items
print(indices) # [(x_start, x_end, y_start, y_end), ...], total 16 items

Convert fixed-size patches into embeddings

The embeddings from ViT cannot be used directly by an LLM. Instead, use LayerNorm and Dense layers as simple adaptors.

The first token is the class token: it is randomly initialized and processed along with the patch tokens, and its output serves as a summary of the full image, so it can be extracted as the image embedding.

Proof:

224x224 is the shape of input image
16x16 is the patch size
224/16 = 14
14*14 + 1 = 197
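The arithmetic above can be checked directly:

```python
image_size = 224   # input image side length
patch_size = 16    # ViT patch side length
patches_per_side = image_size // patch_size   # 14
num_tokens = patches_per_side ** 2 + 1        # 196 patch tokens + 1 class token
print(num_tokens)  # 197
```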

import torch
import transformers

# not torch.randn (that samples from a normal distribution)
image = torch.rand(3, 224, 224)  # CHW, values in [0, 1]
model_name = "google/vit-base-patch16-224-in21k"
processor = transformers.AutoImageProcessor.from_pretrained(model_name)
# do_rescale=False because the values are already in [0, 1];
# the processor can also handle PIL images: processor(pil_image, return_tensors="pt")
inputs = processor(image, do_rescale=False, return_tensors="pt")
model = transformers.ViTModel.from_pretrained(model_name)
outputs = model(pixel_values=inputs.pixel_values)
embeddings = outputs.last_hidden_state[:, 0, :]  # class token, torch.Size([1, 768])

Audio processing

A field related to speaker diarization and useful in video processing is visual entity recognition, which can help you identify anime or movie characters across different frames.

When unsure, the agent shall consult online search engines, subtitles and existing recognized entities for classification. If a dataset is successfully created, one can train a YOLO model to speed up the process, used alongside popular person/anime head detection models.

In most videos, speakers and visuals are aligned, so you can first identify speakers and then derive character identities. Remember that long-form diarization needs a special pipeline that shares speaker features for cross-audio diarization.


For multilingual contexts, use speaker-detection models like pyannote. Diart is a speech processing library based on it that can be used in real time, with training pipelines for speaker diarization and voice activity detection.

Whisper-streaming uses the LocalAgreement algorithm to segment chunks of audio and merge common patterns.


The Whisper architecture comprises an audio encoder and a transcription decoder. The encoder output is fed into every cross-attention layer of the decoder. For feature extraction, you only need the encoder.


You pass a single-channel audio amplitude array to audio feature extractors, which assume a predetermined sample rate. If the sample rates mismatch, you need to resample the audio.
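Resampling is normally done with a DSP library (librosa, torchaudio); purely to illustrate the idea, here is a naive linear-interpolation sketch in plain Python:

```python
def resample_linear(samples, src_rate, dst_rate):
    # naive linear-interpolation resampler: fine for illustration,
    # use a proper DSP library (librosa/torchaudio) in practice
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional source index
        j = int(pos)
        frac = pos - j
        right = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1 - frac) + right * frac)
    return out

x = [0.0, 1.0, 0.0, -1.0] * 100   # 400 samples of a toy waveform
y = resample_linear(x, 44100, 16000)
print(len(y))  # 145
```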


Different audio transformers choose different context window sizes. Like LLMs, they can be streamed. However during training they must use a fixed context size.

For Whisper, the context size is 30 seconds. Configurable at: transformers.WhisperFeatureExtractor(chunk_length=30, ...)

For AST, it is 10.24 seconds; more details on input and output sizes are in the model documentation. Configurable at: transformers.ASTFeatureExtractor(max_length=1024, ...)

These numbers can be found in the respective processor parameters.
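As a sanity check on these numbers, assuming the standard 16 kHz sample rate and 10 ms (160-sample) hop length used by these feature extractors:

```python
sample_rate = 16000      # Hz
hop_length = 160         # samples per frame, i.e. 10 ms
whisper_chunk_seconds = 30
whisper_frames = whisper_chunk_seconds * sample_rate // hop_length
ast_max_length = 1024    # frames
ast_chunk_seconds = ast_max_length * hop_length / sample_rate
print(whisper_frames)     # 3000 frames per 30 s Whisper chunk
print(ast_chunk_seconds)  # 10.24 s per AST chunk
```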

from transformers import AutoProcessor, ASTModel
import torch
from datasets import load_dataset

dataset_name = "hf-internal-testing/librispeech_asr_demo"
model_name = "MIT/ast-finetuned-audioset-10-10-0.4593"
dataset = load_dataset(dataset_name, 'clean', split="validation")
sampling_rate = dataset.features["audio"].sampling_rate
processor = AutoProcessor.from_pretrained(model_name)
model = ASTModel.from_pretrained(model_name)
audio_array = dataset[0]["audio"]["array"]
inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
pooler_output = outputs["pooler_output"]


2024-02-29
The Way To Prosper And Serve

Do not use a Wi-Fi connection for servers, since it can lead to Wi-Fi card overheating. Always use Ethernet.


Server-grade CPUs like AMD EPYC are capable of AI inference and support a massive amount of RAM.


Keep your machine plugged, 24/7, even if it is just a Raspberry Pi.

You will often face problems like connection errors, unexpected program exits, system freezes, etc. Your purpose is to keep adding features to the system and never let it power down.
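One common way to survive program crashes is a systemd unit with automatic restart. A minimal config sketch; the unit name and paths are hypothetical:

```ini
# /etc/systemd/system/myapp.service (hypothetical)
[Unit]
Description=always-on app server

[Service]
ExecStart=/usr/bin/python3 /opt/myapp/main.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now myapp` so it also comes back after reboots.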

If you cannot afford the power consumption, start with a small server and ramp it up over time.


A Celeron mini PC is very cheap. However, an SSD and ample RAM are required for quick responses.


You need OCuLink for eGPU connection.


When using smartphones or laptops as servers, if no software-level charging limit is supported, you need a hardware switch controller, controlled over USB or HomeAssistant-compatible protocols.
