KoboldAI treats OPT and GPT-Neo as generic LMs; specialized models (like the NSFW-tuned ones) may serve some purposes better.
There are many alternatives, but most are specialized in marketing and content generation; some are ChatGPT replicas, like ChatSonic (with Google knowledge) and YouChat (from you.com (awesome!)).
Open Assistant now has a data-collection website, on which you can only perform the tasks you are given and earn points (working for free? nah?).
It is advised to run this ChatGPT program through libraries instead of manually, to prevent issues.
My account has been banned for probing ChatGPT. Since it is not going to be free forever, you need to moderate your input (with multi-language support, not only English but also Chinese) using some moderation API to prevent similar incidents. Also, some topics outside the blacklist are banned intentionally, so you need to check whether the model is really producing an answer; if not, you should avoid the topic or change the way you ask.
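For instance, OpenAI's own moderation endpoint can pre-screen prompts. A minimal sketch, using the pre-1.0 `openai` SDK style (newer SDKs expose the same endpoint as `client.moderations.create`):

```python
import openai  # pre-1.0 SDK; expects OPENAI_API_KEY in the environment

# score a prompt against OpenAI's moderation categories before sending it on
resp = openai.Moderation.create(input="prompt text to pre-screen")
result = resp["results"][0]
if result["flagged"]:
    blocked = [name for name, hit in result["categories"].items() if hit]
    print("blocked categories:", blocked)
```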
From my point of view, this is a service you cannot replicate at home: it would require either smaller models with a different architecture, or crowd-sourced computational power.
It is said that ChatGPT is powered by Ray, which increases parallelism.
Since many different models are derived from the same original pretrained language model, OpenDelta can save disk space by freezing the main parameters and tuning only a few of them.
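A sketch along the lines of OpenDelta's README (the `fc2` module name and the exact method signatures are from memory of their docs and may differ by version):

```python
from transformers import AutoModelForSequenceClassification
from opendelta import LoraModel  # pip install opendelta

# attach small LoRA "delta" modules to a shared, frozen backbone; per task
# you only train and store the deltas, not another full copy of the model
backbone = AutoModelForSequenceClassification.from_pretrained("facebook/bart-base")
delta_model = LoraModel(backbone_model=backbone, modified_modules=["fc2"])
delta_model.freeze_module(exclude=["deltas"], set_state_dict=True)
delta_model.log()  # prints which parameters remain trainable

# ... fine-tune as usual, then persist only the delta weights:
delta_model.save_finetuned("bart-base-lora-task1")
```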
This GPT seems really good; currently API access only.
But it is provided by OpenAI, which is no longer so “open” in the sense of “open-source”.
From my point of view, ChatGPT is just specialized in chat, or “socialized” in other words.
The Elo rating system is the key behind Facebook's Facemash origin story and many zero-sum games; basically, it is a relative rating system driven by pairwise outcomes. To build such a rating system effectively, one should use it along with classifiers and embeddings.
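For reference, the core Elo update is tiny; a self-contained sketch of the standard formula:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update; score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

print(elo_update(1500, 1500, 1.0))  # evenly matched, A wins: (1516.0, 1484.0)
```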
Judging from the training processes of InstructGPT and WebGPT, we know that GPT learns more by interacting with people (multi-turn QA), doing self-examination (learning a reward model), and performing actions (searching and quoting on the web).
GPT-3 is capable of imitation (because it is trained unsupervised).
But! If you want to get things done (when you really need it!), you'd better want some aligned AI.
two similar models by OpenAI: WebGPT and InstructGPT
about InstructGPT
It is first fine-tuned on supervised datasets; then a reward model is trained; then the reward model scores responses to prompts while the policy is optimized via reinforcement learning with PPO.
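A minimal sketch of the middle step, the reward model's pairwise objective. The tiny linear model and random "embeddings" are placeholders; InstructGPT uses a full transformer with a scalar head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Placeholder for a transformer backbone with a scalar reward head."""
    def __init__(self, dim=16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):
        return self.score(x).squeeze(-1)

rm = TinyRewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

# stand-ins for pooled embeddings of human-ranked (chosen, rejected) responses
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

# pairwise ranking loss: push r(chosen) above r(rejected)
loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```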
details on the WebGPT environment
Guess: create states by performing actions, then generate templates that let the model fill in the blanks.
Our text-based web-browsing environment is written mostly in Python with some JavaScript. For a high-level overview, see Section 2. Further details are as follows:
• When a search is performed, we send the query to the Microsoft Bing Web Search API, and convert this to a simplified web page of results.
• When a link to a new page is clicked, we call a Node.js script that fetches the HTML of the web page and simplifies it using Mozilla’s Readability.js.
• We remove any search results or links to reddit.com or quora.com, to prevent the model copying answers from those sites.
• We take the simplified HTML and convert links to the special format 【<link ID>†<link text>†<destination domain>】, or 【<link ID>†<link text>】 if the destination and source domains are the same. Here, the link ID is the index of the link on the page, which is also used for the link-clicking command. We use special characters such as 【 and 】 because they are rare and encoded in the same few ways by the tokenizer, and if they appear in the page text then we replace them by similar alternatives.
• We convert superscripts and subscripts to text using ^ and _, and convert images to the special format [Image: <alt text>], or [Image] if there is no alt text.
• We convert the remaining HTML to text using html2text.
• For text-based content types other than HTML, we use the raw text. For PDFs, we convert them to text using pdfminer.six. For all other content types, and for errors and timeouts, we use an error message.
• We censor any pages that contain a 10-gram overlap with the question (or reference answer, if provided) to prevent the model from cheating, and use an error message instead.
• We convert the title of the page to text using the format <page title> (<page domain>). For search results pages, we use Search results for: <query>.
• When a find in page or quote action is performed, we compare the text from the command against the page text with any links stripped (i.e., including only the text from each link). We also ignore case. For quoting, we also ignore whitespace, and allow the abbreviated format <start text>━<end text> to save tokens.
• During browsing, the state of the browser is converted to text as shown in Figure 1(b). For the answering phase (the last step of the episode), we convert the question to text using the format <question>■, and follow this by each of the collected quotes in the format [<quote number>] <quote page title> (<quote page domain>) <double new line><quote extract>■.
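The link-conversion rule above is easy to picture with a toy example. A sketch of my own (the regex anchor matching and the `simplify_links` helper are hypothetical, not WebGPT's actual code):

```python
import re
from urllib.parse import urlparse

def simplify_links(html, source_domain):
    """Rewrite anchors into 【<link ID>†<link text>†<destination domain>】,
    omitting the domain when it matches the source page."""
    counter = {"next_id": 0}

    def repl(match):
        href, text = match.group(1), match.group(2)
        dest = urlparse(href).netloc or source_domain
        link_id = counter["next_id"]
        counter["next_id"] += 1
        if dest == source_domain:
            return f"【{link_id}†{text}】"
        return f"【{link_id}†{text}†{dest}】"

    return re.sub(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>', repl, html)

print(simplify_links('See the <a href="https://example.org/docs">docs</a> page.',
                     "example.com"))
# -> See the 【0†docs†example.org】 page.
```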
awesome transformer language models A huge collection of transformer-based LMs, covering the huge models by megacorps, with some introduction and analogies regarding ChatGPT.
Bilibili sends me lots of videos (and articles) on hacking and AI (including ChatGPT) via its Android app. I recommend scraping this source and collecting transcriptions and screenshots for search and content generation.
sumy Simple library and command line utility for extracting summary from HTML pages or plain texts
pytextrank Python implementation of TextRank as a spaCy pipeline extension, for graph-based natural language work plus related knowledge-graph practices; used for phrase extraction and lightweight extractive summarization of text documents.
summa TextRank implementation for text summarization and keyword extraction in Python 3, with optimizations to the similarity function (usage sketch below).
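For example, summa's two entry points side by side (the filename is a placeholder):

```python
from summa import keywords, summarizer

text = open("article.txt").read()  # any plain-text document

print(summarizer.summarize(text, ratio=0.2))  # extractive TextRank summary
print(keywords.keywords(text, words=5))       # top-5 keywords
```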
keyword extraction
rake-nltk RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm which tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text (see the sketch after this list).
multi-rake Multilingual Rapid Automatic Keyword Extraction (RAKE) for Python
yake Unsupervised Approach for Automatic Keyword Extraction using Text Features
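A quick rake-nltk sketch; it needs NLTK's stopwords and punkt data downloaded once:

```python
import nltk
from rake_nltk import Rake

nltk.download("stopwords")  # one-time setup
nltk.download("punkt")

r = Rake()  # defaults to NLTK English stopwords and punctuation
r.extract_keywords_from_text(
    "Keyword extraction finds the phrases that best describe a document."
)
print(r.get_ranked_phrases_with_scores())  # [(score, phrase), ...]
```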
speechbrain offers Speech Recognition, Speaker Recognition, Speech Enhancement, Speech Processing, Multi-Microphone Processing, and Text-to-Speech, and also supports Spoken Language Understanding, Language Modeling, Diarization, Speech Translation, Language Identification, Voice Activity Detection, Sound Classification, Grapheme-to-Phoneme, and many other tasks.
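For instance, pretrained ASR takes a few lines. A sketch using the `speechbrain.pretrained` interface from the 0.5.x API (moved to `speechbrain.inference` in newer releases); the audio path is a placeholder:

```python
from speechbrain.pretrained import EncoderDecoderASR

asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_asr",
)
print(asr.transcribe_file("my_audio.wav"))
```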
```python
>>> import pycld2 as cld2
>>> text_content = """
... A accès aux chiens et aux frontaux qui lui ont été il peut consulter et modifier ses collections et exporter
... Cet article concerne le pays européen aujourd’hui appelé République française. Pour d’autres usages du nom France,
... Pour une aide rapide et effective, veuiller trouver votre aide dans le menu ci-dessus.
... Welcome, to this world of Data Scientist. Today is a lovely day."""
>>> _, _, _, detected_language = cld2.detect(text_content, returnVectors=True)
>>> print(detected_language)
((0, 323, 'FRENCH', 'fr'), (323, 64, 'ENGLISH', 'en'))
```
The original cld3 is designed for Chromium and relies on Chromium code to run.
```python
from textblob import TextBlob

text = "это компьютерный портал для гиков. It was a beautiful day."
blob = TextBlob(text)
print(blob.detect_language())  # 'ru' (the dominant language)
# note: detect_language() relies on the Google Translate API (network access
# required) and is deprecated in newer textblob releases.
```
Design a model that accepts a fixed-length sequence of word types and outputs a word-order token; the token is used to decode the final word sequence, somewhat like a convolution, but different.
The input can be either scrambled sentences or correct sentences.
Looking for an English word-order corrector (grammar).
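A tiny sketch of how such training pairs and "word-order tokens" could be generated; everything here is hypothetical, illustrating the idea above rather than any existing tool:

```python
import random

def make_pair(sentence, rng):
    """Scramble a sentence and compute the permutation that restores it.
    The permutation plays the role of the word-order token above."""
    words = sentence.split()
    idx = list(range(len(words)))
    rng.shuffle(idx)
    scrambled = [words[i] for i in idx]
    # order[j] = position in `scrambled` of the j-th original word
    order = [idx.index(j) for j in range(len(words))]
    return scrambled, order

rng = random.Random(0)
scrambled, order = make_pair("the quick brown fox jumps", rng)
print(scrambled, order)
print(" ".join(scrambled[p] for p in order))  # -> the quick brown fox jumps
```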