Use `ipython` instead of `python` to test this code; you get better parameter hints and dynamic completion, just like what you saw in brownie.
Use `dataset.with_transform` (or `dataset.set_transform`) instead of `dataset.map` to save loading time: the transform runs lazily when rows are accessed instead of eagerly over the whole dataset.
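A minimal sketch of the lazy-transform pattern with Hugging Face datasets; the `beans` dataset, the torchvision resize and the column names are assumptions for illustration:

```python
from datasets import load_dataset
from torchvision import transforms

resize = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

def preprocess(batch):
    # runs lazily on each accessed batch instead of eagerly over the whole dataset
    batch["pixel_values"] = [resize(img.convert("RGB")) for img in batch["image"]]
    return batch

ds = load_dataset("beans", split="train")   # any image dataset with an "image" column
ds = ds.with_transform(preprocess)          # nothing is precomputed here
sample = ds[0]                              # preprocess is applied at access time
```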
Image processing
Many vision-language models resize, reshape and pad the input image into a 224x224 square and feed it into a ViT directly.
To simplify the pipeline, we recommend splitting the image into a fixed grid of square patches, such as 2x2 or 4x4.
Alternatively, you can skip the ViT embedding step and use Fuyu-8b directly, or adopt its architecture `FuyuForCausalLM` and processor `FuyuProcessor`, since it supports arbitrarily sized images.
```python
from transformers import FuyuProcessor, FuyuForCausalLM
```
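A minimal sketch of running Fuyu-8b on a single image; the `adept/fuyu-8b` checkpoint, the prompt and the image path are assumptions:

```python
from PIL import Image
from transformers import FuyuProcessor, FuyuForCausalLM

processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b", device_map="auto")

image = Image.open("frame.png")                       # hypothetical input image
prompt = "Describe this image:\n"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64)
# decode only the newly generated tokens
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```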
Split image into patches
Images are usually large, so we need to split them. Here are three ways to split an image into patches.
Patchify
Note that patchify puts the patch-grid indices at the front of the output shape instead of appending them at the back, as the sketch below shows.
```python
import numpy as np
```
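A minimal sketch using the `patchify` package; the 224x224 image and the 16x16 patch size are assumptions:

```python
import numpy as np
from patchify import patchify

image = np.random.rand(224, 224, 3)              # stand-in for a real image
patches = patchify(image, (16, 16, 3), step=16)  # non-overlapping 16x16 patches
print(patches.shape)                             # (14, 14, 1, 16, 16, 3): grid indices come first
```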
Torch unfold
`Tensor.unfold` slides a window along the target dimension: that dimension shrinks to the number of windows, and a new dimension holding the window contents is appended at the end.
```python
import torch
```
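A minimal sketch; the 224x224 input and the 16x16 window and stride are assumptions:

```python
import torch

x = torch.rand(1, 3, 224, 224)                   # (batch, channels, height, width)
patches = x.unfold(2, 16, 16).unfold(3, 16, 16)  # unfold height, then width
print(patches.shape)                             # torch.Size([1, 3, 14, 14, 16, 16])
```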
EMPatches
```python
import numpy as np
```
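A minimal sketch assuming the `empatches` package from PyPI; the patch size and overlap are arbitrary choices. EMPatches records the patch indices so overlapping patches can later be merged back into the full image.

```python
import numpy as np
from empatches import EMPatches

image = np.random.rand(224, 224, 3)
emp = EMPatches()
# returns the patches plus the indices needed to merge them back later
patches, indices = emp.extract_patches(image, patchsize=32, overlap=0.2)
print(len(patches), patches[0].shape)
```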
Convert fixed-size patches into embeddings
The embeddings from the ViT cannot be consumed directly by the LLM. Instead, use a `LayerNorm` followed by a `Dense` (linear) layer as a simple adaptor that projects them into the LLM's embedding space.
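A minimal PyTorch sketch of such an adaptor; the ViT width of 768 and the LLM width of 4096 are assumptions:

```python
import torch
import torch.nn as nn

class VisionAdaptor(nn.Module):
    def __init__(self, vit_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.norm = nn.LayerNorm(vit_dim)        # normalize ViT token embeddings
        self.proj = nn.Linear(vit_dim, llm_dim)  # project into the LLM embedding space

    def forward(self, vit_tokens: torch.Tensor) -> torch.Tensor:
        # vit_tokens: (batch, num_tokens, vit_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(self.norm(vit_tokens))

adaptor = VisionAdaptor()
print(adaptor(torch.rand(1, 197, 768)).shape)    # torch.Size([1, 197, 4096])
```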
The first token is the class token: it is randomly initialized, processed through the transformer along with the patch tokens, and its output acts as a summary of the full image, so it can be extracted as the image embedding.
Proof: the input image is 224x224; with 16x16 patches (standard ViT-Base/16) that gives 224/16 = 14, i.e. 14 x 14 = 196 patch tokens, yet the encoder outputs 197 tokens, so the extra one is the class token.
```python
import torch
```
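A quick shape check, assuming the `google/vit-base-patch16-224-in21k` checkpoint:

```python
import torch
from transformers import ViTModel

model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
pixel_values = torch.rand(1, 3, 224, 224)        # stand-in for a preprocessed image

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

print(outputs.last_hidden_state.shape)           # torch.Size([1, 197, 768]) = 196 patches + 1 class token
cls_embedding = outputs.last_hidden_state[:, 0]  # the class-token summary of the image
```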
Audio processing
A useful field related to speaker diarization in video processing is visual entity recognition, which can help you identify anime or movie characters across different frames.
When unsure, the agent should consult online search engines, subtitles and already-recognized entities for classification. Once a labeled dataset has been created, you can train a YOLO model to speed up the process and use it alongside popular person/anime-head detection models.
In most videos the speakers and the visuals are aligned, so you can first identify speakers and then derive character identities from them. Remember that long recordings need a dedicated diarization pipeline, and cross-audio diarization requires sharing speaker features across files.
For multilingual content, use speaker diarization models such as pyannote. Diart is a speech processing library built on top of it; it runs in real time and ships training pipelines for speaker diarization and voice activity detection.
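A minimal offline diarization sketch with pyannote.audio; the `pyannote/speaker-diarization-3.1` pipeline name, the audio path and the token value are assumptions (the gated checkpoint needs a Hugging Face access token):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_xxx",              # hypothetical Hugging Face access token
)

diarization = pipeline("audio.wav")       # hypothetical single-channel audio file
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```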
Whisper-streaming uses the LocalAgreement algorithm: it repeatedly re-transcribes a growing audio buffer and only commits the transcript prefix that consecutive hypotheses agree on.
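A toy illustration of the LocalAgreement idea, not whisper-streaming's actual implementation: commit the longest common prefix of two consecutive hypotheses.

```python
def local_agreement(prev_hypothesis: list[str], new_hypothesis: list[str]) -> list[str]:
    """Return the longest common prefix of two consecutive transcription hypotheses."""
    committed = []
    for prev_word, new_word in zip(prev_hypothesis, new_hypothesis):
        if prev_word != new_word:
            break
        committed.append(prev_word)
    return committed

print(local_agreement(["the", "cat", "sat", "on"], ["the", "cat", "sits"]))  # ['the', 'cat']
```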
The Whisper architecture is composed of an audio encoder and a transcription decoder. The output of the encoder is fed into every cross-attention layer of the decoder. For feature extraction you only need the encoder.
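A minimal sketch of encoder-only feature extraction; the `openai/whisper-base` checkpoint and the fake 5-second waveform are assumptions:

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base")

waveform = torch.randn(16000 * 5).numpy()        # 5 seconds of fake 16 kHz mono audio
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # run only the encoder: log-mel features in, hidden states out
    encoder_outputs = model.encoder(inputs.input_features)

print(encoder_outputs.last_hidden_state.shape)   # torch.Size([1, 1500, 512]) for whisper-base
```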
You pass a single-channel audio amplitude array to the audio feature extractor together with its expected sample rate. If the sample rates do not match, you need to resample the audio first.
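For example, librosa can load and resample in one step (the file name is hypothetical):

```python
import librosa

# load as a mono float array, resampled to 16 kHz to match the feature extractor
waveform, sample_rate = librosa.load("audio.wav", sr=16000, mono=True)
```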
Different audio transformers choose different context window sizes. Like LLMs, they can be streamed at inference time; during training, however, they must use a fixed context size.
For Whisper, the context size is 30 seconds, configurable via `transformers.WhisperFeatureExtractor(chunk_length=30, ...)`.
For AST, it is 10.24 seconds (1024 frames of 10 ms each), configurable via `transformers.ASTFeatureExtractor(max_length=1024, ...)`; more details on its input and output sizes are in the AST documentation.
These numbers can be found in the respective feature extractors' parameters.
```python
from transformers import AutoProcessor, ASTModel
```
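A minimal sketch of AST feature extraction; the `MIT/ast-finetuned-audioset-10-10-0.4593` checkpoint and the fake waveform are assumptions:

```python
import torch
from transformers import AutoProcessor, ASTModel

processor = AutoProcessor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
model = ASTModel.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

waveform = torch.randn(16000 * 3).numpy()     # 3 seconds of fake 16 kHz mono audio
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)        # (batch, num_patches + 2 special tokens, hidden size)
```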