Multimodal Autoregressive Unsupervised Learning

multimodal unsupervised learning
Kaggle login security
GPT2 and LLaMA token encoding
ViT adaptation
Genie architecture
image/video editing with prompts
warnings against NTFS in Linux
PyTorch GPU installation
environment variable settings
manual token encoding
embedding customization
multimodal learning applications
The article delves into various topics including multimodal unsupervised learning, Kaggle login security, token encoding with GPT2 and LLaMA, ViT adaptation, Genie architecture, prompt-based image/video editing, and cautions against NTFS usage in Linux. It also offers detailed instructions on PyTorch GPU installation, environment variable settings, manual token encoding, custom embedding, and exploring multimodal learning applications.
Published February 28, 2024


It is so good that I can log in to Kaggle on a smartphone.


Instruction-following image editing can be done via prompt engineering (like Mini DALLE3) or with a multimodal model (like CogCoM).


Recently, search-engine and browser-augmented generation has become popular. It is a great tool for creating video scripts.


Google has released a new architecture called Genie, which can learn a latent action space using only unsupervised training on video.


Do not ever use NTFS in Linux. If you unfortunately run into an inaccessible-disk problem when accessing NTFS disks, first run chkdsk /f on the disk, then reboot into Windows twice. The /f parameter is important!

After fixing the disk, copy all files to another place, format the disk as ext4 or xfs, then restore the files.
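
A sketch of that migration, assuming the NTFS partition is /dev/sdb1 mounted at /mnt/ntfs and /backup has enough space:

rsync -a /mnt/ntfs/ /backup/ntfs-copy/   # copy everything off the NTFS disk
sudo umount /mnt/ntfs
sudo mkfs.ext4 /dev/sdb1                 # or mkfs.xfs for xfs
sudo mount /dev/sdb1 /mnt/ntfs
rsync -a /backup/ntfs-copy/ /mnt/ntfs/   # copy the files back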


GPT2 models from Hugging Face accept inputs_embeds as a parameter of both the instance (forward) call and the generate method.
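
For example, a minimal sketch (the gpt2 checkpoint and prompt are placeholders; recent transformers versions support inputs_embeds in generate for decoder-only models):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The quick brown fox", return_tensors="pt").input_ids
embeds = model.transformer.wte(ids)  # look up the token embeddings

out = model(inputs_embeds=embeds)  # instance (forward) call
gen = model.generate(inputs_embeds=embeds, max_new_tokens=20,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(gen[0], skip_special_tokens=True))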

Typically, to adapt a ViT to an LLM you need a LayerNorm and a linear projection layer.
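
A minimal sketch of such an adapter; the class name and hidden sizes (768 for the ViT, 4096 for the LLM) are illustrative assumptions:

import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vit_dim=768, llm_dim=4096):
        super().__init__()
        self.norm = nn.LayerNorm(vit_dim)
        self.proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, vit_features):
        # vit_features: (batch, num_patches, vit_dim) -> (batch, num_patches, llm_dim)
        return self.proj(self.norm(vit_features))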

You cannot pass custom embeddings into the text-generation pipeline; call model.generate with inputs_embeds directly instead, as shown above.


To encode tokens manually in GPT2:

inputs_embeds = model.transformer.wte(input_ids)

In LLaMA:

def embed_tokens(self, token_ids):
    if hasattr(self.llama_model.base_model, "model"):
        # with LoRA: PEFT wraps the model, adding an extra .model level
        embeds = self.llama_model.base_model.model.model.embed_tokens(token_ids)
    else:
        embeds = self.llama_model.base_model.embed_tokens(token_ids)
    return embeds

Install the GPU build of PyTorch with conda:

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
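
A quick sanity check that the GPU build works (assumes a CUDA-capable GPU and driver are installed):

import torch

print(torch.cuda.is_available())      # True when the GPU build and driver work
print(torch.version.cuda)             # CUDA version the build targets, e.g. 12.1
print(torch.cuda.get_device_name(0))  # name of the first GPU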

Set HF_HUB_OFFLINE=1 when loading local models to prevent network access.
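
The variable should be set before the hub libraries are imported; a minimal sketch (the local model path is a placeholder):

import os
os.environ["HF_HUB_OFFLINE"] = "1"  # must precede the transformers import

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("./local-model-dir")  # placeholder path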


Set Environment="OLLAMA_MODELS=<model_storage_path>" in the Ollama systemd service file. Remember to change the username and user group too, and set appropriate permissions on the model storage path.
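
A sketch of the relevant [Service] entries, assuming the models are stored under /data/ollama/models and the service runs as user and group ollama:

[Service]
Environment="OLLAMA_MODELS=/data/ollama/models"
User=ollama
Group=ollama

After editing, apply the change with systemctl daemon-reload and restart the service.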

In Windows, set it in the system environment variables, or create a directory symlink with:

mklink /D C:\Users\<User>\.ollama\models E:\AI\Ollama\Models

Note that mklink fails if the link path already exists, so move the original models directory out of the way first.