2023-12-20
RAG in My Mind

RankRAG

Stanford STORM 2.0 writer

togethercomputer MoA

starrag

HippoRAG

https://github.com/microsoft/graphrag

https://github.com/danielmiessler/fabric

https://github.com/infiniflow/ragflow

https://github.com/Jenqyang/LLM-Powered-RAG-System

https://github.com/lamini-ai/Lamini-Memory-Tuning


llm generates images for content

llm generates tags & categories for content

llm generates embeddings for content

llm generates query words

llm generates query images/audio

system performs full-text search

system performs vector search

llm generates relevance or preference judgments

llm generates potential queries for content

system updates relevance based on llm preference (hybrid-retrieval sketch below)
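A minimal sketch of the hybrid-retrieval step above: full-text search and vector search run side by side, and the two rankings are merged with reciprocal rank fusion. llm_embed is a hypothetical stand-in for whatever embedding model you use; everything else is plain Python.

import math
from collections import defaultdict
def llm_embed(text):
    # placeholder: a real system would call an embedding model here
    vec = [0.0] * 64
    for i, ch in enumerate(text.lower()):
        vec[i % 64] += ord(ch) / 1000.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]
def full_text_rank(query, docs):
    # toy full-text scoring: count how many query terms each doc contains
    terms = set(query.lower().split())
    return sorted(docs, key=lambda d: sum(t in d.lower() for t in terms), reverse=True)
def vector_rank(query, docs):
    # cosine similarity between normalized embeddings
    q = llm_embed(query)
    return sorted(docs, key=lambda d: sum(a * b for a, b in zip(q, llm_embed(d))), reverse=True)
def reciprocal_rank_fusion(rankings, k=60):
    # standard RRF: each ranking contributes 1 / (k + rank) per document
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)
docs = ["how to train a reward model", "vector search with embeddings", "full text search basics"]
query = "embedding based search"
print(reciprocal_rank_fusion([full_text_rank(query, docs), vector_rank(query, docs)]))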

Read More

2022-12-06
chatgpt

GPT-4 is out.


Three mirror sites in China:

https://chat.forchange.cn

https://aigcfun.com

https://ai.askai.top


Besides decent processors, ample RAM, and an optimized runtime, storing the model weights on SSDs helps load LLMs fast.


ColossalAI now supports ChatGPT-style training on a single GPU, using open-source code.

Check Humata for paper QA and information extraction/language understanding from PDF files.


The syntax of ChatGPT's responses is obviously Markdown.

To avoid being blocked by ChatGPT just because we are using the static IP of the corporate Wi-Fi, we can connect through a phone's hotspot.

Microsoft's EdgeGPT requires opening the new Bing in the Edge browser and joining the waitlist; there is a third-party API here.

Merlin is an extension based on ChatGPT, available for free in all countries, with 11 free queries per day. Pro subscriptions are incoming.

Rallio67 builds datasets for RLHF and has released multiple ChatGPT-like models on Hugging Face, namely joi, chip, and rosey, all based on Pythia or NeoX-20B. LAION people tend to offload parts of these huge models to CPU in order to run them properly.

KoboldAI treats OPT and GPT-Neo as generic LMs; special models (like the NSFW ones) may serve some purposes better.

There are many alternatives, but most are specialized in marketing and content generation; some are ChatGPT replicas, like ChatSonic (with Google knowledge) and YouChat (from you.com, awesome!).

Open Assistant now has a data-collection website, on which you can only perform the given tasks and earn points (working for free? nah?).

It is advised to run this ChatGPT program through libraries instead of manually, to prevent issues.

My account has been banned for trying ChatGPT. Though it is not going to be free forever, you need to moderate your input (multi-language support, not only English but Chinese) using some API to prevent similar incidents. Also, some topics outside the blacklist are banned intentionally, so you need to check whether the model is really producing an answer; if not, you should avoid the topic or change the way you ask.

Moderation can be done via the official OpenAI API or the (free) Perspective API, or via projects like content moderation deeplearning, bert text moderation, bert-base-uncased-hatexplain, toxic-bert, copilot-toxicity, and multilingual-hate-speech-robacofi; train on datasets like hate_speech_offensive, toxicity (by surge-ai, a dataset-labelling workforce), and multilingual-hate-speech. A sketch of the OpenAI route follows.
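A minimal sketch of input moderation via the official OpenAI moderation endpoint, assuming the pre-1.0 openai Python SDK that was current when this note was written (the API key comes from the environment):

import os
import openai
openai.api_key = os.environ["OPENAI_API_KEY"]
def is_flagged(text):
    # ask the moderation endpoint whether the input violates the usage policy
    response = openai.Moderation.create(input=text)
    return response["results"][0]["flagged"]
print(is_flagged("some user input to check before sending to the chat model"))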

From my point of view, this is a service you cannot replicate at home; it either requires smaller models with a different architecture, or crowd-sourced computational power.

It is said that ChatGPT is powered by Ray, which increases parallelism.

bigscience petals colab and petals repo

discord chatroom for reproducing chatgpt

Since many different models are derived from the same original pretrained language model, OpenDelta can save disk space by freezing the main parameters and tuning only a few of them.
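The idea, sketched in plain PyTorch rather than OpenDelta's actual API: freeze the shared backbone and train only a tiny delta module, so each derived model only needs to store the delta.

import torch
import torch.nn as nn
backbone = nn.TransformerEncoderLayer(d_model=64, nhead=4)  # stand-in for a big pretrained LM
for p in backbone.parameters():
    p.requires_grad = False                      # shared weights: frozen, stored once
delta = nn.Linear(64, 64)                        # tiny task-specific module
optimizer = torch.optim.Adam(delta.parameters(), lr=1e-3)
x = torch.randn(10, 8, 64)                       # (seq, batch, dim) dummy input
hidden = backbone(x)
out = hidden + delta(hidden)                     # residual delta on top of frozen features
loss = out.pow(2).mean()                         # dummy objective
loss.backward()
optimizer.step()
torch.save(delta.state_dict(), "delta_only.pt")  # kilobytes instead of gigabytes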

This GPT seems really good; currently API access only.

But it is provided by OpenAI, which is no longer so "open" in the sense of "open-source".

Stability AI is providing alternative open-source implementations of SOTA AI algorithms; its ecosystem includes CarperAI, EleutherAI, DreamStudio, Harmonai (audio), and LAION (datasets and projects).

viable approaches to chatgpt

From my point of view, ChatGPT is just specialized in chat, or socialized, in other words.

The Elo rating system is the one from Facebook's origin story (Facemash in The Social Network) and from many zero-sum games; basically, it is a relative rating system. To build such a rating system effectively, one should use it along with classifiers and embeddings. The update rule is sketched below.
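The standard Elo update, for reference; each comparison is a match, and the winner takes points from the loser in proportion to how surprising the result was:

def elo_update(r_winner, r_loser, k=32.0):
    # expected score of the winner under the logistic Elo model
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta
print(elo_update(1500.0, 1500.0))  # equal players: (1516.0, 1484.0)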

According to the training processes of InstructGPT and WebGPT, GPT learns more by interacting with people (multi-turn QA), doing self-examination (learning a reward model), and performing actions (searching and quoting on the web).

RLHF

chainer, prompt engineering

awesome chatgpt prompts

langchain extends LLMs with advanced prompts, LLM wrappers, actions, databases, and memories

RL algorithms, tools for providing feedback

Awesome-RLHF paper and code about RLHF

openai baselines

stable-baselines 3

SetFit

Efficient few-shot learning with Sentence Transformers, used by FewShotRLGPT (no updates till now?)

RLHF models

non-language models

image_to_text_rlhf

algorithm-distillation-rlhf

language models

chatrwkv pure RNN language model, with Chinese support

lamda-rlhf-chatgpt

blenderbot2 a bot which can search the internet; blenderbot3 is US-only. Install ParlAI, then clone ParlAI_SearchEngine. tutorial

promptCLUE based on T5, created by clueai, trained on pCLUE

openassistant

openchatgpt-neox-125m trained on chatgpt prompts, can be tested here, trained from pythia

copycat chatgpt replicate

medicine-chatgpt (ugh, sick of COVID-19)

baby-rlhf both cartpole and language model

rlhf-shapespeare

textrl (100+ stars)

PaLM-RLHF claims RETRO will be integrated soon?

RL4LMs with multiple rl methods

minRLHF

webgpt-cli interfaces the openai api to browse the web and answer questions

lm-human-preferences by openai

rlhf-magic using trlx (supports GPT3-like models) which has PPO and ILQL (as trainable model)

trl only has PPO on GPT2

Tk-Instruct T5 trained on the Natural Instructions dataset. Is it trained with RLHF?

datasets

whisperhub collection of chatgpt prompts by plugin

hh-rlhf

instructgpt samples

natural instructions

dataset building tools

open-chatgpt-prompt-collective

crowd-kit purifies noisy data

promptsource

reward models

rankgen scores model generations given a prefix (or prompt)

electra-webgpt-rm and electra-large-reward-model are based on the ELECTRA discriminator

GPT3-like models

galactica is opt trained on scientific data

bloomz and mt0 trained on xP3 (multilingual prompts and code)

T0PP T0 optimized for zero-shot prompts, despite being much smaller than GPT-3

RETRO another model with GPT-3 capabilities with fewer parameters?

gpt3 is gpt2 with sparse attention, which enables it to generate long sequences

Diffusion-LM

PaLM

metaseq provides OPT, which is basically GPT3

GPT-JT altered in many ways, trained on natural instructions huggingface space

GPT-Neo

GPT-J

GPT-NeoX

Bloom large language model by bigscience

autonomous learning

autonomous-learning-library doc and repo

Gu-X doing god-knows-what experiments

analysis on how to make such a model

gpt3 is capable of imitation (because it is trained unsupervised).

But! If you want to get things done (when you really need it!), you'd better want some aligned AI.

two similar models by openai: webgpt and instructgpt

about instructgpt

It is first fine-tuned on supervised datasets; then a reward model is trained; then the reward model scores responses to prompts during reinforcement learning with PPO. The reward-model loss is sketched below.
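The reward-model step can be made concrete with the pairwise loss from the InstructGPT paper, sketched here in PyTorch: the model is trained so the chosen response scores higher than the rejected one.

import torch
import torch.nn.functional as F
def pairwise_rm_loss(r_chosen, r_rejected):
    # r_chosen / r_rejected: scalar reward scores per comparison, shape (batch,)
    # loss = -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(r_chosen - r_rejected).mean()
r_chosen = torch.tensor([1.2, 0.3])
r_rejected = torch.tensor([0.4, 0.5])
print(pairwise_rm_loss(r_chosen, r_rejected))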

details on webgpt environment

Guess: create states by performing actions, then generate templates that let the model fill in the blanks.

Our text-based web-browsing environment is written mostly in Python with some JavaScript. For a
high-level overview, see Section 2. Further details are as follows:
• When a search is performed, we send the query to the Microsoft Bing Web Search API, and
convert this to a simplified web page of results.
• When a link to a new page is clicked, we call a Node.js script that fetches the HTML of the
web page and simplifies it using Mozilla’s Readability.js.
• We remove any search results or links to reddit.com or quora.com, to prevent the model
copying answers from those sites.
• We take the simplified HTML and convert links to the special format
【<link ID>†<link text>†<destination domain>】, or
【<link ID>†<link text>】 if the destination and source domains are the same. Here,
the link ID is the index of the link on the page, which is also used for the link-clicking
command. We use special characters such as 【 and 】 because they are rare and encoded
in the same few ways by the tokenizer, and if they appear in the page text then we replace
them by similar alternatives.
• We convert superscripts and subscripts to text using ^ and _, and convert images to the
special format [Image: <alt text>], or [Image] if there is no alt text.
• We convert the remaining HTML to text using html2text.
• For text-based content types other than HTML, we use the raw text. For PDFs, we convert
them to text using pdfminer.six. For all other content types, and for errors and timeouts, we
use an error message.
• We censor any pages that contain a 10-gram overlap with the question (or reference answer,
if provided) to prevent the model from cheating, and use an error message instead.
• We convert the title of the page to text using the format <page title> (<page domain>).
For search results pages, we use Search results for: <query>.
• When a find in page or quote action is performed, we compare the text from the command
against the page text with any links stripped (i.e., including only the text from each link).
We also ignore case. For quoting, we also ignore whitespace, and allow the abbreviated
format <start text>━<end text> to save tokens.
• During browsing, the state of the browser is converted to text as shown in Figure 1(b).
For the answering phase (the last step of the episode), we convert the question to
text using the format <question>■, and follow this by each of the collected quotes
in the format [<quote number>] <quote page title> (<quote page domain>)
<double new line><quote extract>■.
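A rough sketch of the link-conversion rule quoted above, using Python and BeautifulSoup (the paper's actual pipeline uses Node.js and Readability.js; only the 【<link ID>†<link text>†<destination domain>】 format is reproduced here):

from urllib.parse import urlparse
from bs4 import BeautifulSoup
def convert_links(html, source_domain):
    soup = BeautifulSoup(html, "html.parser")
    for link_id, a in enumerate(soup.find_all("a")):
        # relative links stay on the source domain
        dest = urlparse(a.get("href", "")).netloc or source_domain
        if dest == source_domain:
            a.replace_with(f"【{link_id}†{a.get_text()}】")
        else:
            a.replace_with(f"【{link_id}†{a.get_text()}†{dest}】")
    return soup.get_text()
html = '<p>See <a href="https://example.org/x">this page</a> and <a href="/local">that one</a>.</p>'
print(convert_links(html, "example.com"))
# See 【0†this page†example.org】 and 【1†that one】.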

voice assistants

voice assistant in cpp

ChatWaifu with anime voice, ChatWaifu with live2d

hacking

give long-term memory and external resources to gpt3

write backend logic with gpt

hackgpt exploit vulnerabilities

vulchatgpt ida plugin for reverse engineering

chatgpt-universe things related to chatgpt

galgame using chatgpt

Taking notes

12.27: updated a more streamlined app

Strongly recommended to deploy it on a server

Hugging Face reference: https://huggingface.co/spaces/Mahiruoshi/Lovelive-Nijigasaku-Chat-iSTFT-GPT3

GitHub: https://github.com/Paraworks/vits_with_chatgpt-gpt3

Download: https://drive.google.com/drive/folders/1vtootVMQ7wTOQwd15nJe6akzJUYNOw4d?usp=share_link

You can first try deploying it on a server; afterwards you can unzip it into a folder and run the exe directly (on macOS and Android you need to compile it yourself with Ren'Py)

Get an api-key at https://beta.openai.com/account/api-keys

Just fill in the parameters as shown

Character IDs are usually numbers starting from 0; my model goes up to 12

API deployment: put inference_api.py into your vits directory, then edit the paths to config and checkpoint.pth inside the file. This is much simpler than the desktop app, and you can design your own. (Warning: shaky code written with three months of coding experience.)

——————————————————————————————————————————————————

The ChatGPT deployment method was updated on 12.26 (latter part of the video)

VITS reference: https://github.com/CjangCjengh/vits

On the server side, iSTFT-VITS is recommended: https://github.com/innnky/MB-iSTFT-VITS

Model repository: https://github.com/CjangCjengh/TTSModels

You can also use mine: https://huggingface.co/spaces/Mahiruoshi/MIT-VITS-Nijigaku

ChatGPT reference: https://github.com/rawandahmad698/PyChatGPT

Demo video (pure server API, GPT-3): https://www.bilibili.com/video/BV1hP4y1B7wH/?spm_id_from=333.999.0.0&vd_source=7e8cf9f5c840ec4789ccb5657b2f0512

Honoka's voice comes from the μ's model by @Freeze_Phoenix

GPT-3 loading reference: @ぶらぶら散策中

chatgpt use cases curated list

DAILA uses ChatGPT to identify function calls in decompilers

awesome transformer language models a huge collection of transformer-based LMs and huge models by megacorps, with some introduction and analogies to chatgpt

huggingface blog on RLHF containing similar projects and source code

bilibili sends me lots of videos (and articles) on hacking and AI (including ChatGPT) via its Android app. I recommend scraping this source and collecting transcriptions and screenshots for search and content generation.

Bilibili has content on AV evasion, i.e. bypassing antivirus software

Analysis of how ChatGPT works

Connecting ChatGPT to a search engine

Download links:

github: https://github.com/josStorer/chat-gpt-search-engine-extension/releases/

百度网盘: https://pan.baidu.com/s/1MnFJTDIatyIIPr5kUMWsAw?pwd=1111

Extraction code: 1111

Original project: https://github.com/wong2/chat-gpt-google-extension

My fork, which adds support for multiple search engines: https://github.com/josStorer/chat-gpt-search-engine-extension

PR: https://github.com/wong2/chat-gpt-google-extension/pull/31

The earlier issue where Baidu required a manual refresh has been fixed

access via api

https://github.com/altryne/chatGPT-telegram-bot

https://github.com/taranjeet/chatgpt-api

https://github.com/acheong08/ChatGPT

https://github.com/vincelwt/chatgpt-mac

https://github.com/transitive-bullshit/chatgpt-api

https://github.com/rawandahmad698/PyChatGPT

models like chatgpt

lfqa retrieval-based generative QA

lm-human-preferences by openai

trl Train transformer language models with reinforcement learning based on gpt2

trlx A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF) by CarperAI

RL4LMs A modular RL library to fine-tune language models to human preferences

PaLM-rlhf-pytorch saying this is basically chatgpt with palm

gpt-gmlp saying this design integrates gpt with gMLPs, so it uses less RAM and can be trained on a single gpu

WebGPT

tk-instruct with all models by allenai can be multilingual, trained on natural instructions

there's a ghosted repo named instructgpt-pytorch found in Bing but no cache preserved; also an empty repo called InstructFNet, wtf?

AidMe Code and experiments of the article AidMe: User-in-the-loop Adaptive Intent Detection for Instructable Digital Assistants

cheese Used for adaptive human in the loop evaluation of language and embedding models.

Kelpie Explainable AI framework for interpreting Link Predictions on Knowledge Graphs

GrIPS Gradient-free, Edit-based Instruction Search for Prompting Large Language Models

squeakily nlp datasets cleaner

gpt-j

super big bilingual model GLM-130B

multi-modal deeplearning paper collections

bloom a huge model like gpt-3

notice, gpt-2 is somewhat inferior to gpt-3 since it has fewer parameters

dialogue-generation Generating responses with pretrained XLNet and GPT-2 in PyTorch.

personaGPT Implementation of PersonaGPT Dialog Model

DialoGPT Large-scale pretraining for dialogue

Read More

2022-10-29
Keyword Extraction, Topic Modeling, Sentence Embedding

language models

allennlp-models

bert lang street

recommendation

deepmatch

fuzzywuzzy or thefuzz

fzf a command-line fuzzy matcher

iterfzf as a fzf python binding and its related projects

rapidfuzz

stopwords

from nltk.corpus import stopwords
# requires nltk.download("stopwords") once
print(stopwords.words("english")[:5])  # ['i', 'me', 'my', 'myself', 'we']

stopwordsiso in python

summarization

sumy Simple library and command line utility for extracting summary from HTML pages or plain texts

pytextrank Python implementation of TextRank as a spaCy pipeline extension, for graph-based natural language work plus related knowledge-graph practices; used for phrase extraction and lightweight extractive summarization of text documents

summa TextRank implementation for text summarization and keyword extraction in Python 3, with optimizations on the similarity function.
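A quick usage sketch for summa, assuming pip install summa:

from summa import summarizer, keywords
text = ("Automatic summarization is the process of shortening a text "
        "document with software. TextRank builds a graph of sentences "
        "and ranks them, similar to PageRank. The top-ranked sentences "
        "form the extractive summary.")
print(summarizer.summarize(text, ratio=0.4))  # extractive summary
print(keywords.keywords(text))                # keyword extraction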

keyword extraction

rake-nltk RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm which tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text (usage sketch after the jieba snippet below).

multi-rake Multilingual Rapid Automatic Keyword Extraction (RAKE) for Python

yake Unsupervised Approach for Automatic Keyword Extraction using Text Features

tutorial and libraries

keybert uses sentence transformer to do the job

kwx

pke Python Keyphrase Extraction module

import jieba.analyse as ana
# methods under ana:
# ['analyzer', 'default_textrank', 'default_tfidf', 'extract_tags', 'set_idf_path', 'set_stop_words', 'textrank', 'tfidf']
# e.g. top-5 TF-IDF keywords from a Chinese sentence:
print(ana.extract_tags("自然语言处理是人工智能的一个重要方向", topK=5))
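And the RAKE approach via rake-nltk, assuming pip install rake_nltk and the NLTK stopword data:

from rake_nltk import Rake
r = Rake()  # uses NLTK English stopwords and punctuation by default
r.extract_keywords_from_text(
    "Rapid Automatic Keyword Extraction determines key phrases by "
    "analyzing word frequency and co-occurrence with other words.")
print(r.get_ranked_phrases()[:3])  # top-ranked key phrases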

Read More

2022-08-08
Identifying the Language of a Video

speechbrain has features of Speech Recognition, Speaker Recognition, Speech Enhancement, Speech Processing, Multi Microphone Processing, Text-to-Speech, and also supports Spoken Language Understanding, Language Modeling, Diarization, Speech Translation, Language Identification, Voice Activity Detection, Sound classification, Grapheme-to-Phoneme, and many others.

Overview

Language in a video comes in two forms: subtitles rendered on the frames, and speech.

The problems involved are therefore: language classification of text in images, and language classification of audio.

音频识别

online speech recognition

pip install SpeechRecognition
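A minimal sketch with the SpeechRecognition package: transcribe a WAV file through the free Google Web Speech API (network required; the file name is just an example):

import speech_recognition as sr
r = sr.Recognizer()
with sr.AudioFile("clip.wav") as source:
    audio = r.record(source)                        # read the whole file
print(r.recognize_google(audio, language="zh-CN"))  # or "en-US", etc.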

offline, you need to provide a language id:

https://pypi.org/project/automatic-speech-recognition/

use paddlespeech if possible, for Chinese and English

Image language recognition

use Google Cloud to detect the language of text in an image:

https://github.com/deduced/ml-ocr-lang-detection

Detects and Recognizes text and font language in an image

https://github.com/JAIJANYANI/Language-Detection-in-Image

Language classification of text in frames can be implemented with easyocr by loading multiple models, e.g. Chinese + English + Japanese; other languages are probably not very popular on bilibili, at most add Korean. A sketch follows.
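A minimal easyocr sketch for the frame-text case; note easyocr restricts which languages can be combined in one reader (Simplified Chinese + English is a valid pair):

import easyocr
reader = easyocr.Reader(["ch_sim", "en"])      # downloads models on first run
for bbox, text, confidence in reader.readtext("frame.png"):
    print(text, confidence)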

Sentences can also be extracted from the video description, title, and links; classify the language of each sentence to decide which OCR model to use. Note that the description language may differ from the language of the text in the video frames.

Wolfram Language provides an image classifier whose results can be quite interesting; it can be combined with Apple's image attention-region (saliency) generator.

ImageIdentify[pictureObj]

This function also supports subcategory classification and multiple outputs; see the documentation for details.

https://www.imageidentify.com/about/how-it-works

Wolfram supports CloudDeploy to the Wolfram Cloud, but that may not work here.

Text language identification and classification

lingua performs well on short text and can be used from Java or Kotlin

libraries supporting detection of different languages:

cld2, whose output contains useful vectors of text spans; python binding

>>> import pycld2 as cld2
>>> text_content = """ A accès aux chiens et aux frontaux qui lui ont été il peut consulter et modifier ses collections et exporter Cet article concerne le pays européen aujourd’hui appelé République française.
Pour d’autres usages du nom France, Pour une aide rapide et effective, veuiller trouver votre aide dans le menu ci-dessus.
Welcome, to this world of Data Scientist. Today is a lovely day."""
>>> _, _, _, detected_language = cld2.detect(text_content, returnVectors=True)
>>> print(detected_language)
((0, 323, 'FRENCH', 'fr'), (323, 64, 'ENGLISH', 'en'))

the original cld3 is designed for Chromium and relies on Chromium code to run

official cld3 python bindings

additional Python language-detection libraries, from GeeksforGeeks:

textblob is a natural language processing toolkit

from textblob import TextBlob
text = "это компьютерный портал для гиков. It was a beautiful day ."
lang = TextBlob(text)
print(lang.detect_language())
# ru
# note: detect_language() relied on the Google Translate API and has been
# deprecated and removed in newer textblob releases; prefer langdetect below

langid performs well on short text

textcat (R package)

Google's language-detection library in Python: langdetect (see the sketch below)
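A langdetect usage sketch, assuming pip install langdetect:

from langdetect import DetectorFactory, detect
DetectorFactory.seed = 0   # langdetect is probabilistic; fix the seed for stable output
print(detect("War doesn't show who's right, just who's left."))  # 'en'
print(detect("这是一个中文句子"))  # 'zh-cn'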

javascript:

https://github.com/wooorm/franc

python version of franc:

pyfranc

whatlang.org provides whatlang-rs as a Rust package, and whatlang-py as Python bindings

Read More

2022-07-18
Sentence Word Order Corrector

Design a model that accepts a fixed-length word-type sequence and outputs a word-order token; the token is then used to decode the final word sequence, somewhat like a convolution but different.

The input can be either scrambled sentences or correct sentences; a decoding-side sketch follows below.
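One way to read this design: the model classifies a scrambled sentence into a permutation token, and decoding just applies that permutation. This sketches the decoding side only; the classifier itself is the part left to design.

from itertools import permutations
SEQ_LEN = 4
PERM_VOCAB = list(permutations(range(SEQ_LEN)))   # token id -> word-order permutation
def decode(words, order_token):
    # reorder the input words according to the predicted permutation token
    perm = PERM_VOCAB[order_token]
    return [words[i] for i in perm]
scrambled = ["quickly", "runs", "dog", "the"]
token = PERM_VOCAB.index((3, 2, 1, 0))            # pretend the model predicted this
print(decode(scrambled, token))                   # ['the', 'dog', 'runs', 'quickly']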

Looking for an English word-order corrector (grammar).

Read More

2022-07-13
Topic Generation: Topic Discovery, Trend Discovery, Hot-Topic Detection

Read More

2022-05-29
Mastering Text Classification: Exploring NLP Techniques with BERT-NER, ALBERT-NER, GPT2, and More

GAN for NLP text generation

GAN Journey:

https://github.com/nutllwhy/gan-journey

NLPGNN:

https://github.com/kyzhouhzau/NLPGNN

Examples (See tests for more details):

BERT-NER (Chinese and English Version)

BERT-CRF-NER (Chinese and English Version)

BERT-CLS (Chinese and English Version)

ALBERT-NER (Chinese and English Version)

ALBERT-CLS (Chinese and English Version)

GPT2-generation (English Version)

Bilstm+Attention (Chinese and English Version)

TextCNN (Chinese and English Version)

GCN, GAN, GIN, GraphSAGE (based on message passing)

TextGCN and TextSAGE for text classification

Read More