Python programming
2024-04-23
2024-03-01
Use ipython instead of python to test this code; you get better parameter hints and dynamic completion, just like what you saw in brownie.
Use dataset.transform instead of dataset.map to save loading time.
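A minimal sketch with Hugging Face datasets; note that the method is named set_transform / with_transform in current versions (the beans dataset and the resize step are assumptions for illustration):

```python
from datasets import load_dataset

ds = load_dataset("beans", split="train")  # assumed example dataset

def preprocess(batch):
    # runs lazily per accessed batch; nothing is precomputed the way map() would
    batch["pixel_values"] = [img.resize((224, 224)) for img in batch["image"]]
    return batch

ds.set_transform(preprocess)
sample = ds[0]  # preprocess runs here, on the fly
```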
Image processing
Many language models resize, reshape & pad the input image into a 224x224 square and feed it into a ViT directly.
To simplify the pipeline, we would recommend sampling the image into fixed-size square patches, like 2x2, 4x4, etc.
Or you can skip the ViT embedding part and just use Fuyu-8b, or take its architecture FuyuForCausalLM and processor FuyuProcessor, because it supports arbitrarily sized images.
```python
from transformers import FuyuProcessor, FuyuForCausalLM
```
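A hedged loading sketch (adept/fuyu-8b is the public checkpoint; the image file and prompt are assumptions):

```python
from PIL import Image
from transformers import FuyuProcessor, FuyuForCausalLM

processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b")  # ~8B params, needs plenty of memory

image = Image.open("example.png")  # hypothetical file; any size works, no 224x224 resize
inputs = processor(text="Describe this image:\n", images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```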
Split image into patches
Usually images are large, so we need to split them.
There are three ways to split an image.
Patchify
The patch grid indices are placed in the leading dimensions of the output instead of being appended at the back.
```python
import numpy as np
```
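A minimal sketch of patchify (pip install patchify; the 32x32 patch size is an assumption), showing the leading grid indices:

```python
import numpy as np
from patchify import patchify

image = np.zeros((224, 224, 3), dtype=np.uint8)   # dummy RGB image
patches = patchify(image, (32, 32, 3), step=32)   # step == patch size, so no overlap
print(patches.shape)  # (7, 7, 1, 32, 32, 3): grid indices lead, patch dims trail
```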
Torch unfold
It works by sliding a window along the target dimension, shrinking that dimension to the number of windows and appending a new trailing dimension that holds each window.
```python
import torch
```
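A minimal Tensor.unfold sketch (the 32x32 patch size is an assumption):

```python
import torch

img = torch.randn(3, 224, 224)                     # C, H, W
patches = img.unfold(1, 32, 32).unfold(2, 32, 32)  # unfold H, then W
print(patches.shape)                               # torch.Size([3, 7, 7, 32, 32])
# each unfold shrinks the target dim to the window count (224 -> 7)
# and appends a trailing dim of the window size (32)
```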
EMPatches
```python
import numpy as np
```
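A hedged EMPatches sketch (pip install empatches; the patch size, overlap and merge mode are assumptions from its README):

```python
import numpy as np
from empatches import EMPatches

img = np.random.rand(224, 224, 3)
emp = EMPatches()
patches, indices = emp.extract_patches(img, patchsize=32, overlap=0.2)
# indices record where each patch came from, so patches can be stitched back
merged = emp.merge_patches(patches, indices, mode="avg")
```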
Convert fixed-size patches into embeddings
The embeddings from a ViT cannot be used directly by an LLM. Instead, use LayerNorm and Dense (linear) layers as simple adapters.
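A minimal adapter sketch (both widths are assumptions: 768 is ViT-base's hidden size, 4096 is a typical LLM hidden size):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """LayerNorm + Linear projection from ViT width to LLM width."""
    def __init__(self, vit_dim=768, llm_dim=4096):
        super().__init__()
        self.norm = nn.LayerNorm(vit_dim)
        self.proj = nn.Linear(vit_dim, llm_dim)  # the "Dense" layer

    def forward(self, x):  # x: (batch, tokens, vit_dim)
        return self.proj(self.norm(x))
```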
The first token is the class token: it is randomly initialized, processed along with the rest of the sequence by the transformer, and its output serves as a summary of the full image, so it can be extracted as the image embedding.
Proof:
```
224x224 is the shape of the input image
16x16 is the patch size: 224/16 = 14, and 14*14 = 196 patches
196 patches + 1 class token = 197 tokens in the ViT input sequence
```
```python
import torch
```
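A sketch that checks the arithmetic with transformers (the google/vit-base-patch16-224 checkpoint is an assumption):

```python
import torch
from transformers import ViTModel

model = ViTModel.from_pretrained("google/vit-base-patch16-224")
out = model(pixel_values=torch.randn(1, 3, 224, 224))
print(out.last_hidden_state.shape)  # torch.Size([1, 197, 768]): 196 patches + 1 class token
cls_embedding = out.last_hidden_state[:, 0]  # class-token summary, usable as image embedding
```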
Audio processing
A useful field related to speaker diarization in video processing is visual entity recognition, which can help you identify anime or movie characters across different frames.
When unsure, the agent shall consult online search engines, subtitles and already-recognized entities for classification. If a dataset is successfully created, one can train a YOLO model to speed up the process, used along with popular person/anime head detection models.
In most videos speakers and visuals are aligned, so you can first identify the speakers and then derive the character identities. Remember that long-form diarization needs a special pipeline that shares speaker features across audio files (cross-audio diarization).
For multilingual contexts, use speaker detection models like pyannote. Diart is a speech-processing library built on top of it that runs in real time, with training pipelines for speaker diarization and voice activity detection.
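A minimal pyannote.audio sketch (the pipeline id, token and audio file are assumptions; the checkpoint is gated and needs Hugging Face access approval):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # hypothetical token
)
diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```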
Whisper-streaming uses the LocalAgreement algorithm to segment chunks of audio and commit the prefixes that consecutive transcriptions agree on.
The Whisper architecture is composed of an audio encoder and a transcription decoder. The output of the encoder is fed into every cross-attention layer of the decoder. For feature extraction, you only need the encoder.
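A sketch of encoder-only feature extraction with transformers (the openai/whisper-base checkpoint and the fake audio are assumptions):

```python
import numpy as np
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base")

audio = np.random.randn(16000 * 30).astype(np.float32)  # 30 s of fake mono audio at 16 kHz
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
features = model.encoder(inputs.input_features).last_hidden_state
print(features.shape)  # (1, 1500, 512) for whisper-base
```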
You pass a single-channel audio amplitude array to the feature extractor, which expects a predetermined sample rate. If the sample rates mismatch, you need to resample the audio first.
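A resampling sketch with librosa (the file name and the 16 kHz target rate are assumptions):

```python
import librosa

y, sr = librosa.load("audio.wav", sr=None, mono=True)  # keep native rate, downmix to mono
if sr != 16000:
    y = librosa.resample(y, orig_sr=sr, target_sr=16000)
```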
Different audio transformers choose different context window sizes. Like LLMs, they can be streamed; during training, however, they must use a fixed context size.
For Whisper, the context size is 30 seconds. Configurable at: transformers.WhisperFeatureExtractor(chunk_length=30, ...)
For AST, it is 10.24 seconds. You can find more info about input and output sizes here. Configurable at: transformers.ASTFeatureExtractor(max_length=1024, ...)
These numbers can be found in the respective processors' parameters.
```python
from transformers import AutoProcessor, ASTModel
```
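A fuller AST sketch (the AudioSet checkpoint is the common public one; the fake audio is an assumption):

```python
import numpy as np
from transformers import AutoProcessor, ASTModel

processor = AutoProcessor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
model = ASTModel.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

audio = np.random.randn(16000 * 5).astype(np.float32)  # 5 s of fake mono audio at 16 kHz
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")  # pads/crops to 1024 frames
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```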
2023-07-05
windows has encoding issues in the python interpreter.
run like this:
```bash
python -X utf8=1 <args>
```
2023-02-19
install and use pdoc3
```bash
pip install pdoc3
```
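a hedged usage sketch (the module name example_docstring follows the 2023-02-17 entry below; the flags are pdoc3's documented ones):

```bash
# HTML docs
pdoc3 --html example_docstring
# pandoc-ready markdown for PDF generation
pdoc3 --pdf example_docstring > example_docstring.md
```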
install and use pandoc. on its homepage we find some slideshow backends like reveal.js, dzslides, s5, slideous and slidy (alternatives to microsoft powerpoint; they may help rendering video, or let's use libreoffice instead? or some dedicated video editing library like moviepy)
```bash
# let's convert the html version of
```
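the rest of that block is lost; a plausible reconstruction given the surrounding notes (file names assumed):

```bash
pandoc example_docstring.html -f html -t docx -o example_docstring.docx
```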
remove unwanted parts from the html (beautifulsoup), and split the index from the main content (split, then concat with docxcompose)
for composing docx by hand, use python-docx. for template-based docx generation, use docxtpl
to insert a page break into the converted docx, there are two ways (maybe), as sketched after this list:
change the css in the original html code
insert a page break while concatenating
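a sketch of the second way, with docxcompose + python-docx (file names assumed):

```python
from docx import Document
from docxcompose.composer import Composer

master = Document("index.docx")         # hypothetical first part
master.add_page_break()                 # python-docx appends the break at the end
composer = Composer(master)
composer.append(Document("main.docx"))  # hypothetical second part
composer.save("combined.docx")
```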
2023-02-17
sample pandoc front matter emitted by pdoc3 --pdf:

```yaml
description: |
    API documentation for modules: example_docstring.
lang: en
classoption: oneside
geometry: margin=1in
papersize: a4
linkcolor: blue
links-as-notes: true
...
```

Module example_docstring
2023-02-10
packages like EdgeGPT may update overnight. mirrors won't keep up; you need to fetch from the official package index.
to set the index:
```bash
pip config set global.index-url https://pypi.org/simple
```
to use the index temporarily:
```bash
pip install <package> -i https://pypi.org/simple
```
2023-01-30
set up python with the appropriate version on the client's computer by script, not with PyInstaller, which takes a huge amount of time to compile and (visibly) a huge amount of disk space.
take notes while doing work.
ask for appropriate compensation for any work.
2022-09-08
sample code for jpype:
```python
from jpype import *
```
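a fuller hedged sketch (classpath and default JVM assumed):

```python
import jpype
import jpype.imports

jpype.startJVM(classpath=["."])  # classpath is an assumption
from java.lang import System     # jpype.imports makes Java packages importable
System.out.println("hello from java")
jpype.shutdownJVM()
```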
sample for pyjnius:
```python
import jnius_config
```
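a fuller hedged sketch (classpath assumed; jnius_config must run before importing jnius):

```python
import jnius_config
jnius_config.set_classpath(".")  # must be set before the jnius import below

from jnius import autoclass

System = autoclass("java.lang.System")
System.out.println("hello from java")
```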
2022-02-10
we first see the world, get the observation and respond in the form of content. it is a feedback loop.
to search for components in videos, first take screenshots, then do an image search, then use the keywords to find the source video.
breakdown approach:
granularize every step, showing all possibilities to get content created, and then optimize it using standards.
filter approach:
establish some topics, create topic specific approaches to arrange the content, choose the best among all topics.
are they compatible? are you sure it is modular, scalable and extensible?
for novices, they have a few unpolished ideas waiting to be realized in code. but this lacks the feedback loop, so you are unable to adjust yourself according to the reaction. the breakdown approach must be used to automate the optimization, while the topic-based approach is simpler at first hand.
to avoid copyright issues, search on google.
the topic-based approach assumes the public always has something in common, and thus you only search for specific things at first hand. topics are easy to control, static and consistent. the breakdown approach is where the evolution begins.
let's assume our topic is about pets on weibo. pets come in different kinds, and the content creators differ from each other. all we do is download and upload. we get descriptions from our viewers, video play counts and various feedback. we improve the sourcing with that feedback, searching for more untouched content and more mixes like video/audio crossing.
the breakdown approach is demonstrated first-hand with our actor-critic model. we first view all possible posts from all sources, find what's interesting and repost it to our target platform. this is likely to be cheating. we then again choose our sources and our approach of modification based on feedback. topics are generated from the very first step.
the model of interests, which generates the topic, is the key to the breakdown approach. we eventually have to construct a breakdown approach to boost our searches in every aspect. feedback is one of the key features. we eventually have to view the content with the machine. I suggest using the breakdown approach now.
anatomy of the post:
first, the post must be postable; this is our mandatory, first-order requirement. it should not be taken down or banned for a long time. ban detection is required and usually simple to test against.
second, it should be the most profitable. we only prefer tasks that give the most output. occasionally we choose something fresh despite lower expectations.
third, it should be resourceful. consistently pinning the audience through a series of videos undoubtedly takes competence. this can be reached by utilizing our creativity engine based on comments and imagination, realizing the unrealized.
I have not yet found anything systematic giving the full detail of such an automated content creation system; we only pick up the pieces. it is important to make the entire design flexible and create miniature tests to fabricate the system. like any other famous writer/director, you can only name it, not reproduce it.
hands on the approach: whether it is inspired by anyone or anything, it is time to begin, to complete the feedback loop.
not a pipe, but a loop.
we demonstrate the loop using fake data first, then the real ones. maybe the initial topic is also meant to be fake data. real-world data is too stochastic for us to imagine; better to construct something specific.