Python programming
2024-04-23
2024-03-01
Use ipython instead of python to test this code; you get better parameter hints and dynamic completion, just like what you saw in brownie.
Use dataset.transform instead of dataset.map to save loading time.
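A minimal sketch with Hugging Face datasets; note that the method is named set_transform / with_transform in current versions (the beans dataset and the resize step are assumptions for illustration):

```python
from datasets import load_dataset

ds = load_dataset("beans", split="train")  # assumed example dataset

def preprocess(batch):
    # runs lazily per accessed batch; nothing is precomputed the way map() would
    batch["pixel_values"] = [img.resize((224, 224)) for img in batch["image"]]
    return batch

ds.set_transform(preprocess)
sample = ds[0]  # preprocess runs here, on the fly
```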
Image processing
Many language models resize, reshape & pad the input image into a 224x224 square and feed it into a ViT directly.
To simplify the pipeline, we would recommend sampling the image into fixed-size square patches, like 2x2, 4x4, etc.
Or you can skip the ViT embedding part and just use Fuyu-8b, or take its architecture FuyuForCausalLM and processor FuyuProcessor, because it supports arbitrarily sized images.
```python
from transformers import FuyuProcessor, FuyuForCausalLM
```
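A hedged loading sketch (adept/fuyu-8b is the public checkpoint; the image file and prompt are assumptions):

```python
from PIL import Image
from transformers import FuyuProcessor, FuyuForCausalLM

processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b")  # ~8B params, needs plenty of memory

image = Image.open("example.png")  # hypothetical file; any size works, no 224x224 resize
inputs = processor(text="Describe this image:\n", images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```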
Split image into patches
Usually images are large, so we need to split them.
There are three ways to split an image.
Patchify
The patch grid indices are placed in the leading dimensions of the output instead of being appended at the back.
```python
import numpy as np
```
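A minimal sketch of patchify (pip install patchify; the 32x32 patch size is an assumption), showing the leading grid indices:

```python
import numpy as np
from patchify import patchify

image = np.zeros((224, 224, 3), dtype=np.uint8)   # dummy RGB image
patches = patchify(image, (32, 32, 3), step=32)   # step == patch size, so no overlap
print(patches.shape)  # (7, 7, 1, 32, 32, 3): grid indices lead, patch dims trail
```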
Torch unfold
It works by sliding a window along the target dimension, shrinking that dimension to the number of windows and appending a new trailing dimension that holds each window.
```python
import torch
```
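A minimal Tensor.unfold sketch (the 32x32 patch size is an assumption):

```python
import torch

img = torch.randn(3, 224, 224)                     # C, H, W
patches = img.unfold(1, 32, 32).unfold(2, 32, 32)  # unfold H, then W
print(patches.shape)                               # torch.Size([3, 7, 7, 32, 32])
# each unfold shrinks the target dim to the window count (224 -> 7)
# and appends a trailing dim of the window size (32)
```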
EMPatches
```python
import numpy as np
```
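A hedged EMPatches sketch (pip install empatches; the patch size, overlap and merge mode are assumptions from its README):

```python
import numpy as np
from empatches import EMPatches

img = np.random.rand(224, 224, 3)
emp = EMPatches()
patches, indices = emp.extract_patches(img, patchsize=32, overlap=0.2)
# indices record where each patch came from, so patches can be stitched back
merged = emp.merge_patches(patches, indices, mode="avg")
```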
Convert fixed-size patches into embeddings
The embeddings from a ViT cannot be used directly by an LLM. Instead, use LayerNorm and Dense (linear) layers as simple adapters.
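A minimal adapter sketch (both widths are assumptions: 768 is ViT-base's hidden size, 4096 is a typical LLM hidden size):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """LayerNorm + Linear projection from ViT width to LLM width."""
    def __init__(self, vit_dim=768, llm_dim=4096):
        super().__init__()
        self.norm = nn.LayerNorm(vit_dim)
        self.proj = nn.Linear(vit_dim, llm_dim)  # the "Dense" layer

    def forward(self, x):  # x: (batch, tokens, vit_dim)
        return self.proj(self.norm(x))
```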
The first token is the class token: it is randomly initialized, processed along with the rest of the sequence by the transformer, and its output serves as a summary of the full image, so it can be extracted as the image embedding.
Proof:
```
224x224 is the shape of the input image
16x16 is the patch size: 224/16 = 14, and 14*14 = 196 patches
196 patches + 1 class token = 197 tokens in the ViT input sequence
```
```python
import torch
```
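A sketch that checks the arithmetic with transformers (the google/vit-base-patch16-224 checkpoint is an assumption):

```python
import torch
from transformers import ViTModel

model = ViTModel.from_pretrained("google/vit-base-patch16-224")
out = model(pixel_values=torch.randn(1, 3, 224, 224))
print(out.last_hidden_state.shape)  # torch.Size([1, 197, 768]): 196 patches + 1 class token
cls_embedding = out.last_hidden_state[:, 0]  # class-token summary, usable as image embedding
```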
Audio processing
A useful field related to speaker diarization in video processing is visual entity recognition, which can help you identify anime or movie characters across different frames.
When unsure, the agent shall consult online search engines, subtitles and already-recognized entities for classification. If a dataset is successfully created, one can train a YOLO model to speed up the process, used along with popular person/anime head detection models.
In most videos speakers and visuals are aligned, so you can first identify the speakers and then derive the character identities. Remember that long-form diarization needs a special pipeline that shares speaker features across audio files (cross-audio diarization).
For multilingual contexts, use speaker detection models like pyannote. Diart is a speech-processing library built on top of it that runs in real time, with training pipelines for speaker diarization and voice activity detection.
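A minimal pyannote.audio sketch (the pipeline id, token and audio file are assumptions; the checkpoint is gated and needs Hugging Face access approval):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # hypothetical token
)
diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```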
Whisper-streaming uses the LocalAgreement algorithm to segment chunks of audio and commit the prefixes that consecutive transcriptions agree on.
The Whisper architecture is composed of an audio encoder and a transcription decoder. The output of the encoder is fed into every cross-attention layer of the decoder. For feature extraction, you only need the encoder.
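A sketch of encoder-only feature extraction with transformers (the openai/whisper-base checkpoint and the fake audio are assumptions):

```python
import numpy as np
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base")

audio = np.random.randn(16000 * 30).astype(np.float32)  # 30 s of fake mono audio at 16 kHz
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
features = model.encoder(inputs.input_features).last_hidden_state
print(features.shape)  # (1, 1500, 512) for whisper-base
```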
You pass a single-channel audio amplitude array to the feature extractor, which expects a predetermined sample rate. If the sample rates mismatch, you need to resample the audio first.
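A resampling sketch with librosa (the file name and the 16 kHz target rate are assumptions):

```python
import librosa

y, sr = librosa.load("audio.wav", sr=None, mono=True)  # keep native rate, downmix to mono
if sr != 16000:
    y = librosa.resample(y, orig_sr=sr, target_sr=16000)
```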
Different audio transformers choose different context window sizes. Like LLMs, they can be streamed; during training, however, they must use a fixed context size.
For Whisper, the context size is 30 seconds. Configurable at: transformers.WhisperFeatureExtractor(chunk_length=30, ...)
For AST, it is 10.24 seconds. You can find more info about input and output sizes here. Configurable at: transformers.ASTFeatureExtractor(max_length=1024, ...)
These numbers can be found in the respective processors' parameters.
```python
from transformers import AutoProcessor, ASTModel
```
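A fuller AST sketch (the AudioSet checkpoint is the common public one; the fake audio is an assumption):

```python
import numpy as np
from transformers import AutoProcessor, ASTModel

processor = AutoProcessor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
model = ASTModel.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

audio = np.random.randn(16000 * 5).astype(np.float32)  # 5 s of fake mono audio at 16 kHz
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")  # pads/crops to 1024 frames
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```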
2023-07-05
windows has encoding issues in the python interpreter.
run like this:
```bash
python -X utf8=1 <args>
```
2023-02-19
install and use pdoc3
```bash
pip install pdoc3
```
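a hedged usage sketch (the module name example_docstring follows the 2023-02-17 entry below; the flags are pdoc3's documented ones):

```bash
# HTML docs
pdoc3 --html example_docstring
# pandoc-ready markdown for PDF generation
pdoc3 --pdf example_docstring > example_docstring.md
```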
install and use pandoc. on its homepage we find some slideshow backends like reveal.js, dzslides, s5, slideous and slidy (alternatives to microsoft powerpoint; they may help rendering video, or let's use libreoffice instead? or some dedicated video editing library like moviepy)
```bash
# let's convert the html version of
```
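the rest of that block is lost; a plausible reconstruction given the surrounding notes (file names assumed):

```bash
pandoc example_docstring.html -f html -t docx -o example_docstring.docx
```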
remove unwanted parts from the html (beautifulsoup), and split the index from the main content (split, then concat with docxcompose)
for composing docx by hand, use python-docx. for template-based docx generation, use docxtpl
to insert a page break into the converted docx, there are two ways (maybe), as sketched after this list:
change the css in the original html code
insert a page break while concatenating
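a sketch of the second way, with docxcompose + python-docx (file names assumed):

```python
from docx import Document
from docxcompose.composer import Composer

master = Document("index.docx")         # hypothetical first part
master.add_page_break()                 # python-docx appends the break at the end
composer = Composer(master)
composer.append(Document("main.docx"))  # hypothetical second part
composer.save("combined.docx")
```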
2023-02-17
sample pandoc front matter emitted by pdoc3 --pdf:

```yaml
description: |
    API documentation for modules: example_docstring.
lang: en
classoption: oneside
geometry: margin=1in
papersize: a4
linkcolor: blue
links-as-notes: true
...
```

Module example_docstring
2023-02-10
packages like EdgeGPT may update overnight. mirrors won't keep up; you need to fetch from the official package index.
to set the index:
```bash
pip config set global.index-url https://pypi.org/simple
```
to use the index temporarily:
```bash
pip install <package> -i https://pypi.org/simple
```
2023-01-30
set up python with the appropriate version on the client's computer by script, not with PyInstaller, which takes a huge amount of time to compile and (visibly) a huge amount of disk space.
take notes while doing work.
ask for appropriate compensation for any work.
2022-09-08
sample code for jpype:
```python
from jpype import *
```
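a fuller hedged sketch (classpath and default JVM assumed):

```python
import jpype
import jpype.imports

jpype.startJVM(classpath=["."])  # classpath is an assumption
from java.lang import System     # jpype.imports makes Java packages importable
System.out.println("hello from java")
jpype.shutdownJVM()
```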
sample for pyjnius:
```python
import jnius_config
```
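a fuller hedged sketch (classpath assumed; jnius_config must run before importing jnius):

```python
import jnius_config
jnius_config.set_classpath(".")  # must be set before the jnius import below

from jnius import autoclass

System = autoclass("java.lang.System")
System.out.println("hello from java")
```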
2022-02-10
we first see the world, get the observation and respond in the form of content. it is a feedback loop.
to search for components in videos, first take screenshots, then do an image search, then use the keywords to find the source video.
breakdown approach:
granularize every step, showing all possibilities to get content created, and then optimize it using standards.
filter approach:
establish some topics, create topic specific approaches to arrange the content, choose the best among all topics.
are they compatible? are you sure it is modular, scalable and extensible?
for novices, they have a few unpolished ideas waiting to be realized in code. but this lacks the feedback loop, so you are unable to adjust yourself according to the reaction. the breakdown approach must be used to automate the optimization, while the topic-based approach is simpler at first hand.
to avoid copyright issues, search on google.
the topic-based approach assumes the public always has something in common, and thus you only search for specific things at first hand. topics are easy to control, static and consistent. the breakdown approach is where the evolution begins.
let's assume our topic is about pets on weibo. pets come in different kinds, and the content creators differ from each other. all we do is download and upload. we get descriptions from our viewers, video play counts and various feedback. we improve the sourcing with that feedback, searching for more untouched content and more mixes like video/audio crossing.
the breakdown approach is demonstrated first-hand with our actor-critic model. we first view all possible posts from all sources, find what's interesting and repost it to our target platform. this is likely to be cheating. we then again choose our sources and our approach of modification based on feedback. topics are generated from the very first step.
the model of interests, which generates the topic, is the key to the breakdown approach. we eventually have to construct a breakdown approach to boost our searches in every aspect. feedback is one of the key features. we eventually have to view the content with the machine. I suggest using the breakdown approach now.
anatomy of the post:
first, the post must be postable; this is our mandatory, first-order requirement. it should not be taken down or banned for a long time. ban detection is required and usually simple to test against.
second, it should be the most profitable. we only prefer tasks that give the most output. occasionally we choose something fresh despite lower expectations.
third, it should be resourceful. consistently pinning the audience through a series of videos undoubtedly takes competence. this can be reached by utilizing our creativity engine based on comments and imagination, realizing the unrealized.
I have not yet found anything systematic giving the full detail of such an automated content creation system; we only pick up the pieces. it is important to make the entire design flexible and create miniature tests to fabricate the system. like any other famous writer/director, you can only name it, not reproduce it.
hands on the approach: whether it is inspired by anyone or anything, it is time to begin, to complete the feedback loop.
not a pipe, but a loop.
we demonstrate the loop using fake data first, then the real ones. maybe the initial topic is also meant to be fake data. real-world data is too stochastic for us to imagine; better to construct something specific.