People who need to stay focused for a long time:
Writers
Hackers
Programmers
Professional Gamers
Find out what they take and which methods they use to stay focused and productive.
Sugar-free energy drinks can be a better choice than Red Bull.
matmul-free llm
https://arxiv.org/abs/2406.02528
https://github.com/KingNishHF/OpenGPT-4o
aider: coding assistant, a Devin alternative
Kyutai Moshi: a GPT-4o alternative
FireFunction v2: function-calling LLM
codegeex4-all-9b
https://github.com/lavague-ai/LaVague
https://github.com/Upsonic/Tiger
computer agents:
https://github.com/slavakurilyak/awesome-ai-agents
GUI agent model trained on GUI-World
GUI agent datasets on Hugging Face
AutoCoder: works with pretrained models and has access to the terminal:
https://github.com/bin123apple/AutoCoder
You can label the GUI manually: write a comment for each UI element and spell out the exact execution steps.
GUI detection algorithm:
https://github.com/MulongXie/UIED
minified segment anything model:
https://github.com/xinghaochen/TinySAM
https://github.com/graylan0/gptcomputer
https://github.com/patterns-complexity/gpt-pc-control
https://github.com/b5marwan/gpt-vision-agent
https://github.com/rogeriochaves/driver
https://github.com/s-a-ng/control-pc-with-gpt4-vision
gpt related:
https://github.com/szczyglis-dev/py-gpt
https://github.com/EwingYangs/awesome-open-gpt
GPT-4o is gaining popularity in computer control.
https://github.com/CK92149/GPTComputerAutomation
https://github.com/onuratakan/gpt-computer-assistant
https://github.com/kyegomez/GPT4o
terminal controlling agent:
https://github.com/greshake/Alice
Simulated computer control environments:
https://github.com/xlang-ai/OSWorld
Multi-agent framework, routing:
https://python.langchain.com/v0.1/docs/langgraph
Devin open source alternative:
https://github.com/entropy-research/Devon
https://github.com/stitionai/devika
https://github.com/semanser/codel
Web browsing agent:
https://github.com/THUDM/AutoWebGLM
Agent-Eval-Refine contains models for GUI captioning, an iOS-finetuned CogAgent, and several GUI agent datasets.
ScreenAgent collects many related computer-control papers and projects, along with a self-trained model on Hugging Face.
Similar projects:
https://github.com/TobiasNorlund/UI-Act
Listed projects:
https://github.com/x-plug/mobileagent
https://github.com/google-research/google-research/tree/master/screen2words
https://github.com/rainyugg/blip-adapter
https://github.com/imnearth/coat
https://github.com/xbmxb/aagent
https://github.com/princeton-nlp/ptp
https://github.com/njucckevin/seeclick
https://github.com/thudm/autowebglm
https://github.com/OS-Copilot/OS-Copilot
Environments:
https://github.com/google-deepmind/android_env
https://github.com/x-lance/mobile-env
Datasets:
https://github.com/google-research-datasets/screen_qa
Open-Interface utilizes GPT-4V to control the computer interface.
Devin is an AI agent that can solve many real-world GitHub issues, with access to a browser, a terminal, and a code editor.
Cradle is a general computer-controlling agent developed to play Red Dead Redemption 2.
Pythagora, aka GPT Pilot, is a "true AI developer" that writes code, debugs it, and talks to you when it needs to.
Devin open source counterparts:
GPA-LM: a list of game-playing agents
When using it with things like sse_starlette, the traceback of a child process/thread will be invisible.
code to video
https://github.com/redotvideo/revideo
fishaudio voice cloning
omniparse data serialization
Video understanding and video embedding can be achieved with ViViT (available in Hugging Face Transformers).
Video generation agent tutorial
Use enhancr for frame interpolation, super-resolution, and upscaling. The pro version contains faster models.
The app is built using Electron Forge.
Interpolation gets worse at higher resolutions, which is why I wouldn't upscale first.
enhancr is built upon the following models:
RIFE (NCNN) - megvii-research/ECCV2022-RIFE - powered by styler00dollar/VapourSynth-RIFE-NCNN-Vulkan
RIFE (TensorRT) - megvii-research/ECCV2022-RIFE - powered by AmusementClub/vs-mlrt & styler00dollar/VSGAN-tensorrt-docker
GMFSS - Union (PyTorch/TensorRT) - 98mxr/GMFSS_Union - powered by HolyWu/vs-gmfss_union
GMFSS - Fortuna (PyTorch/TensorRT) - 98mxr/GMFSS_Fortuna - powered by HolyWu/vs-gmfss_fortuna
CAIN (NCNN) - myungsub/CAIN - powered by mafiosnik/vsynth-cain-NCNN-vulkan (unreleased)
CAIN (DirectML) - myungsub/CAIN - powered by AmusementClub/vs-mlrt
CAIN (TensorRT) - myungsub/CAIN - powered by HubertSotnowski/cain-TensorRT
ShuffleCUGAN (NCNN) - styler00dollar/VSGAN-tensorrt-docker - powered by AmusementClub/vs-mlrt
ShuffleCUGAN (TensorRT) - styler00dollar/VSGAN-tensorrt-docker - powered by AmusementClub/vs-mlrt
RealESRGAN (NCNN) - xinntao/Real-ESRGAN - powered by AmusementClub/vs-mlrt
RealESRGAN (DirectML) - xinntao/Real-ESRGAN - powered by AmusementClub/vs-mlrt
RealESRGAN (TensorRT) - xinntao/Real-ESRGAN - powered by AmusementClub/vs-mlrt
RealCUGAN (TensorRT) - bilibili/ailab/Real-CUGAN - powered by AmusementClub/vs-mlrt
SwinIR (TensorRT) - JingyunLiang/SwinIR - powered by mafiosnik777/SwinIR-TensorRT (unreleased)
DPIR (DirectML) - cszn/DPIR - powered by AmusementClub/vs-mlrt
DPIR (TensorRT) - cszn/DPIR - powered by AmusementClub/vs-mlrt
SCUNet (TensorRT) - cszn/SCUNet - powered by mafiosnik777/SCUNet-TensorRT (unreleased)
Kdenlive has many video-editing features, like automatic scene splitting and video stabilization.
To extract hard-coded subtitles from videos, use VideoSubFinder, which is used in Cradle, the Red Dead Redemption 2 agent.
To check whether audio was recorded, we can view the amplitude instead of listening:
```bash
ffprobe -f lavfi -i "amovie=<audio_or_video_filepath>,astats=metadata=1:reset=1" -show_entries frame=pkt_pts_time:frame_tags=lavfi.astats.Overall.RMS_level -of default=noprint_wrappers=1:nokey=1 -sexagesimal -v error
```
AI toolbox: a comprehensive content creation toolbox with links to related projects
Use streamlit to write interactive interfaces for video labeling, editing, and registration, and to track viewer counts.
Gridded images can be used for image-selection prompting and image condensation: putting multiple images together saves processing power during tasks like video rating.
When you play video games on low-end devices, you can turn down the resolution and image quality to maintain 30 FPS.
If you change the screen resolution during screen recording, you might lose your view.
Train a video-grading system on recent and relevant video grades, and when evaluating, put the grading context into the prompt so the system generalizes.
Get the system's predicted labels for video content and train a label predictor from them; this provides the necessary context about the test video and improves the grading system's accuracy.
TaskMatrix is a multimodal agent framework suitable for multiple types of image editing using diffusion models.
You can learn what viewers are craving via recommendation engines, trending posts, and the latest bangumi releases.
Post the same content across multiple platforms to increase view counts.
For more accurate results, use nginx-geoip2.
To disable access by IP origin while excluding certain ranges, first install the legacy GeoIP module:
```bash
apt install -y libnginx-mod-http-geoip libgeoip
```
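A minimal sketch of the http block; the database path, the blocked country code XX, and the whitelisted range are placeholders:

```nginx
http {
    geoip_country /usr/share/GeoIP/GeoIP.dat;

    # trusted ranges bypass the country filter
    geo $whitelisted {
        default        0;
        203.0.113.0/24 1;
    }

    # deny requests from country XX unless whitelisted
    map "$whitelisted:$geoip_country_code" $deny {
        default  0;
        "0:XX"   1;
    }

    server {
        listen 80;
        if ($deny) { return 403; }
    }
}
```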
Remember to restart the nginx service afterwards.
Enable HTTP basic auth (htpasswd comes with the apache2-utils package):
```bash
sudo htpasswd -c /etc/nginx/passwd <username>
```
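Then reference the password file in a server block; a minimal sketch (the upstream address is an example):

```nginx
server {
    listen 80;

    location / {
        auth_basic           "Restricted";
        auth_basic_user_file /etc/nginx/passwd;
        proxy_pass           http://127.0.0.1:8000;
    }
}
```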
Run nginx with debug info:
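A minimal sketch; debug-level logging requires an nginx binary built with --with-debug:

```nginx
# within /etc/nginx/nginx.conf
error_log /var/log/nginx/error.log debug;
```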
Remap a range of ports to sub-URLs:
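A sketch that forwards /server/1xxxx/... to the matching local port; passing the URL tail and query string through is an assumption:

```nginx
location ~ ^/server/(1[0-9][0-9][0-9][0-9])/(.*)$ {
    # /server/10080/foo -> http://127.0.0.1:10080/foo
    proxy_pass http://127.0.0.1:$1/$2$is_args$args;
}
```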
To handle CORS errors, one can write:
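A sketch of one way to do it (the allowed origin "*" and the upstream are examples; a real config may need more headers):

```nginx
location /api/ {
    # answer preflight requests directly
    if ($request_method = OPTIONS) {
        add_header Access-Control-Allow-Origin  "*";
        add_header Access-Control-Allow-Methods "GET, POST, OPTIONS";
        add_header Access-Control-Allow-Headers "Authorization, Content-Type";
        return 204;
    }

    add_header Access-Control-Allow-Origin "*" always;
    proxy_pass http://127.0.0.1:8000/;
}
```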
After installing nginx, a default page is created under /var/www/html and config files are at /etc/nginx.
Run sudo systemctl restart nginx after config modification.
Edit the file at /etc/nginx/sites-available/default:
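A minimal reverse-proxy sketch for that file (the backend port is an example):

```nginx
server {
    listen 80 default_server;
    server_name _;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host            $host;
        proxy_set_header X-Real-IP       $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```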
If you want to route the FastAPI docs through nginx under a sub-path, the content has to be rewritten, or the app has to be told about the prefix.
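A sketch of one common approach: proxy under a sub-path and pass the prefix to FastAPI via root_path, so /docs can locate openapi.json (the /myapp path and port are examples):

```nginx
server {
    listen 80;

    location /myapp/ {
        proxy_pass http://127.0.0.1:8000/;
        proxy_set_header Host $host;
    }
}
```

Then start the app with uvicorn main:app --root-path /myapp (or the equivalent FastAPI(root_path=...)) so the generated docs URLs match the proxied prefix.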
Goovis is not suitable for use in bed, because you will easily fall asleep and get tired.
When a recliner's backrest fully supports your back, it cannot effectively support your head; a headrest is needed.
Adjust the headrest to the bulge of the skull, not the neck; otherwise your head will get uncomfortable.
Make sure your back is fully against the recliner's backrest; otherwise blood circulation in your hips will suffer.
Don't play games like Cyberpunk 2077; they cause visual fatigue.
Screen brightness needs to be turned down, whether on a regular monitor or a head-mounted display.
Install Gunicorn, then launch it from a script:

```bash
pip3 install gunicorn
```
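A sketch of such a launcher; the main:app module path, the defaults, and the uvicorn worker class are assumptions:

```python
import argparse
import subprocess

parser = argparse.ArgumentParser(description="Serve an ASGI app with gunicorn")
parser.add_argument("--host", default="127.0.0.1")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument("--workers", type=int, default=4)
args = parser.parse_args()

# equivalent to: gunicorn main:app -k uvicorn.workers.UvicornWorker -w 4 -b host:port
subprocess.run([
    "gunicorn", "main:app",
    "-k", "uvicorn.workers.UvicornWorker",
    "-w", str(args.workers),
    "-b", f"{args.host}:{args.port}",
])
```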
Replicate internally uses Cog for packaging and serving large AI models in Docker containers. Currently it only supports macOS and Linux.
According to the docs, it offers nearly the same functionality as Replicate, such as API calls and fine-tuning.
You may connect your local LLM to VSCode using Continue, an open-source Copilot alternative.
Use ipython instead of python to test this code and get better parameter hints and dynamic hints, just like in brownie.
Use dataset.transform instead of dataset.map to save loading time.
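This presumably refers to the lazy set_transform/with_transform API in Hugging Face datasets; a toy sketch of the difference:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["hello", "world"]})

def upper(batch):
    return {"text": [t.upper() for t in batch["text"]]}

# map() eagerly rewrites (and caches) the whole dataset before first use
eager = ds.map(upper, batched=True)

# set_transform() is applied lazily on access, so there is no upfront pass
ds.set_transform(upper)
print(ds[0]["text"])  # "HELLO"
```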
Many language models resize, reshape, and pad the input image into a 224x224 square and feed it directly into a ViT.
To simplify the pipeline, we recommend sampling the image into fixed-size square patches, like 2x2, 4x4, etc.
Or you can skip the ViT embedding part and just use Fuyu-8b, or take its architecture FuyuForCausalLM and processor FuyuProcessor, because it supports arbitrarily sized images.
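A usage sketch along the lines of the adept/fuyu-8b model card (the prompt and image are examples):

```python
from transformers import FuyuProcessor, FuyuForCausalLM
from PIL import Image

model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id)
model = FuyuForCausalLM.from_pretrained(model_id, device_map="cuda:0")

image = Image.open("screenshot.png")  # arbitrary-sized input image
inputs = processor(text="Describe this image:\n", images=image, return_tensors="pt").to("cuda:0")

outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```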
Usually images are large, so we need to split them.
You have three ways to split an image.
The split indices are put in front instead of appended at the back.
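A sketch of the first way, plain numpy reshape + transpose with an example patch size of 16; note the patch indices land in the leading axes, matching the remark above:

```python
import numpy as np

p = 16
img = np.zeros((224, 224, 3))                   # H, W, C
h, w, c = img.shape
patches = img.reshape(h // p, p, w // p, p, c)  # (14, 16, 14, 16, 3)
patches = patches.transpose(0, 2, 1, 3, 4)      # (14, 14, 16, 16, 3)
patches = patches.reshape(-1, p, p, c)          # (196, 16, 16, 3)
print(patches.shape)
```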
unfold works by expanding the target dimension and appending a new dimension corresponding to it.
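A sketch of the second way with Tensor.unfold (each call slices one dimension into windows and appends the window dimension at the end):

```python
import torch

p = 16
img = torch.zeros(3, 224, 224)                 # C, H, W
patches = img.unfold(1, p, p).unfold(2, p, p)  # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4)       # (14, 14, 3, 16, 16)
patches = patches.reshape(-1, 3, p, p)         # (196, 3, 16, 16)
print(patches.shape)
```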
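A sketch of a third way, assuming numpy stride tricks (sliding_window_view) were intended:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

p = 16
img = np.zeros((224, 224))
windows = sliding_window_view(img, (p, p))[::p, ::p]  # (14, 14, 16, 16)
patches = windows.reshape(-1, p, p)                   # (196, 16, 16)
print(patches.shape)
```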
The embeddings from ViT cannot be used directly by an LLM. Instead, use LayerNorm and Dense layers as simple adaptors.
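A minimal adaptor sketch; the dimensions (ViT-Base's 768 into a 4096-dim LLM) are examples:

```python
import torch.nn as nn

class VisionAdaptor(nn.Module):
    def __init__(self, vit_dim=768, llm_dim=4096):
        super().__init__()
        self.norm = nn.LayerNorm(vit_dim)
        self.proj = nn.Linear(vit_dim, llm_dim)  # the "Dense" layer

    def forward(self, x):                        # x: (batch, seq, vit_dim)
        return self.proj(self.norm(x))
```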
The first token is the class token: randomly initialized, processed along with the rest by the transformer, and output as a summary of the full image; it can be extracted as the image embedding.
Proof:
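The arithmetic, assuming the standard ViT-Base/16 configuration:

```python
image_size = 224                             # 224x224 is the shape of the input image
patch_size = 16                              # ViT-Base/16 patch size
patches_per_side = image_size // patch_size  # 14
num_patches = patches_per_side ** 2          # 196
seq_len = num_patches + 1                    # 197 = 196 patches + 1 class token
print(seq_len)
```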
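A sketch that verifies the sequence length and extracts the class-token embedding with Hugging Face's ViTModel:

```python
import torch
from transformers import ViTModel

model = ViTModel.from_pretrained("google/vit-base-patch16-224")
pixel_values = torch.zeros(1, 3, 224, 224)
out = model(pixel_values=pixel_values)
print(out.last_hidden_state.shape)           # torch.Size([1, 197, 768])
cls_embedding = out.last_hidden_state[:, 0]  # class token as image embedding
```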
A useful field related to speaker diarization in video processing is visual entity recognition, which can help you identify anime or movie characters across frames.
When unsure, the agent shall consult online search engines, subtitles, and already-recognized entities for classification. Once a dataset is successfully created, one can train a YOLO model to speed up the process, used alongside popular person/anime head detection models.
In most videos, speakers and visuals are aligned, so you can first identify speakers and then derive character identities. Remember that you need a special pipeline for long-form diarization, sharing speaker features for cross-audio diarization.
For multilingual contexts, you may want to use speaker-diarization models like pyannote. Diart is a speech processing library built on it that can be used in real time, with training pipelines for speaker diarization and voice activity detection.
Whisper-streaming uses the LocalAgreement algorithm to segment chunks of audio and merge common patterns.
The Whisper architecture comprises an audio encoder and a transcription decoder. The output of the encoder is fed into every cross-attention layer of the decoder. For feature extraction, you only need the encoder.
You pass a single-channel audio amplitude array to an audio feature extractor with a predetermined sample rate. If the sample rates mismatch, you need to resample the audio.
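A resampling sketch with librosa (the file name and 16 kHz target are examples):

```python
import librosa

audio, sr = librosa.load("clip.wav", sr=None)                    # keep the native rate
audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000) # match the extractor
```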
Different audio transformers choose different context window sizes. Like LLMs, they can be streamed; during training, however, they must use a fixed context size.
For Whisper, the context size is 30 seconds. Configurable at: transformers.WhisperFeatureExtractor(chunk_length=30, ...)
For AST, it is 10.24 seconds. You can find more info about input and output sizes here. Configurable at: transformers.ASTFeatureExtractor(max_length=1024, ...)
These numbers can be found in the respective processor parameters.
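A sketch with the AudioSet-finetuned AST checkpoint, showing the fixed 1024-frame input:

```python
import numpy as np
from transformers import AutoProcessor, ASTModel

ckpt = "MIT/ast-finetuned-audioset-10-10-0.4593"
processor = AutoProcessor.from_pretrained(ckpt)
model = ASTModel.from_pretrained(ckpt)

audio = np.zeros(16000 * 5)                  # 5 s of 16 kHz audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
print(inputs["input_values"].shape)          # (1, 1024, 128): max_length x mel bins
out = model(**inputs)
print(out.last_hidden_state.shape)           # (batch, seq, hidden)
```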
Do not use a Wi-Fi connection for servers, since this will lead to Wi-Fi card overheating. Always use Ethernet.
Server-grade CPUs like AMD EPYC are capable of AI inference and can host a massive amount of RAM.
Keep your machine plugged in, 24/7, even if it is just a Raspberry Pi.
You will often face problems like connection errors, unexpected program exits, system freezes, etc. Your purpose is to add more features to the system and never let it power down.
If you cannot afford the power consumption, start with a small server and ramp it up over a long time.
Celeron mini-PCs are very cheap; however, an SSD and plenty of RAM are required for quick responses.
You need OCuLink for eGPU connection.
When using smartphones or laptops as servers, if no software-level charging limit is supported, you need a hardware switch controller, driven over USB or Home Assistant-compatible protocols.