functions:
Sentiment Analysis
Stemming
Part-of-Speech Tagging and Chunking
Phrase Extraction & Named Entity Recognition
each method is throttled to 1000 calls per day per IP.
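since each method is capped at 1000 calls per day per IP, a client can keep its own count and stop before the server starts rejecting requests. a minimal sketch, assuming a rolling 24-hour window on the client side (the server's actual reset schedule is not stated in these notes); `DailyCallBudget` and `try_acquire` are hypothetical names, not part of any API:

```python
import time


class DailyCallBudget:
    """Client-side counter that refuses calls once a daily quota is used up."""

    def __init__(self, limit=1000, clock=time.time):
        self.limit = limit            # calls allowed per window (1000 per the note above)
        self.clock = clock            # injectable clock, handy for testing
        self.window_start = clock()
        self.used = 0

    def try_acquire(self):
        """Return True and count the call if quota remains, else False."""
        now = self.clock()
        if now - self.window_start >= 86400:   # 24 hours elapsed: start a new window
            self.window_start = now
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


budget = DailyCallBudget(limit=3)
results = [budget.try_acquire() for _ in range(4)]
print(results)  # [True, True, True, False]
```

call `try_acquire()` before each request and skip (or queue) the request when it returns False.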
he has recently been interacting with racketeers on wechat; find out how to add new friends (and groups, if any) on wechat.
the bilibili user and his repo
video transfer based on DCT-Net, for video content spinning (视频洗稿) and pseudo-original content (伪原创)
AntiFraudChatBot is a wechaty bot that uses Yuan 1.0, a very large megatron-based model, to chat with racketeers; the model is only freely available for three months (30k api calls). another application: AI murder-mystery games (AI剧本杀)
megatron deepspeed enables training large models on cheap hardware
essaykillerbrain is another project he has been involved in, which contains EssayKiller_V2 EssayKiller_V1 EssayTopicPredict WrittenBrainBase
fzf a commandline fuzzy matcher
iterfzf as a fzf python binding and its related projects
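the core idea of a fuzzy matcher can be sketched as an in-order subsequence test. this is only an illustration in plain python: fzf's real matching and scoring are more elaborate, and `fuzzy_match` / `fuzzy_filter` are hypothetical helpers, not part of fzf or iterfzf:

```python
def fuzzy_match(pattern, candidate):
    """True if the characters of pattern appear in candidate in order (case-insensitive)."""
    remaining = iter(candidate.lower())
    # `ch in remaining` consumes the iterator up to the first match,
    # so later characters must be found after earlier ones.
    return all(ch in remaining for ch in pattern.lower())


def fuzzy_filter(pattern, candidates):
    """Keep matching candidates, shortest first as a crude relevance proxy."""
    hits = [c for c in candidates if fuzzy_match(pattern, c)]
    return sorted(hits, key=len)


files = ["main.py", "README.md", "setup.py", "docs/manual.md"]
print(fuzzy_filter("md", files))  # ['README.md', 'docs/manual.md']
```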
from nltk.corpus import stopwords
stopwordsiso in python
sumy Simple library and command line utility for extracting summary from HTML pages or plain texts
pytextrank Python implementation of TextRank as a spaCy pipeline extension, for graph-based natural language work plus related knowledge graph practices; used for phrase extraction and lightweight extractive summarization of text documents
summa TextRank implementation for text summarization and keyword extraction in Python 3, with optimizations on the similarity function.
rake-nltk RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm which tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text.
multi-rake Multilingual Rapid Automatic Keyword Extraction (RAKE) for Python
yake Unsupervised Approach for Automatic Keyword Extraction using Text Features
keybert uses sentence-transformers embeddings to extract keywords
pke Python Keyphrase Extraction module
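the RAKE idea described above (word frequency plus co-occurrence within candidate phrases) can be sketched in plain python. this is a simplified illustration, not the code of rake-nltk or multi-rake; the stop word list is a tiny assumed sample and `candidate_phrases` / `rake` are hypothetical names:

```python
import re
from collections import defaultdict

# Tiny illustrative stop word list (an assumption; real use needs a full list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "for", "to", "on", "with", "by"}


def candidate_phrases(text):
    """Split text into runs of non-stop-words, breaking at punctuation and stop words."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake(text):
    """Score each candidate phrase as the sum of degree(w) / frequency(w) over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)   # co-occurrence count, including the word itself
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: -kv[1])


text = ("Keyword extraction is a task in natural language processing. "
        "Keyword extraction of documents is useful.")
for phrase, score in rake(text):
    print(f"{score:.1f}  {phrase}")
```

longer phrases made of words that rarely appear alone score highest, which is why "natural language processing" outranks "keyword extraction" in this sample.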
import jieba.analyse as ana
autograd and xla (Accelerated Linear Algebra)
With its updated version of Autograd, JAX can automatically differentiate native Python and NumPy functions. It can differentiate through loops, branches, recursion, and closures, and it can take derivatives of derivatives of derivatives. It supports reverse-mode differentiation (a.k.a. backpropagation) via grad as well as forward-mode differentiation, and the two can be composed arbitrarily to any order.
XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes.
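forward-mode differentiation, mentioned above, can be illustrated without JAX using dual numbers. the sketch below is plain python and is not how JAX is implemented; `Dual` and `derivative` are hypothetical names, and only first derivatives of `+`, `*` and `sin` are supported:

```python
import math


class Dual:
    """Number val + dot*eps with eps**2 == 0; `dot` carries the derivative."""

    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val, self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__

    def __gt__(self, other):
        # Comparisons look only at the value, so branches work as usual.
        return self.val > Dual.lift(other).val


def sin(x):
    x = Dual.lift(x)
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)   # chain rule


def derivative(f):
    """Return a function computing df/dx by seeding the tangent with 1."""
    def df(x):
        return f(Dual(x, 1.0)).dot
    return df


def f(x):
    # A loop and a branch, to mirror the claim that control flow is differentiable.
    y = x
    for _ in range(3):
        y = y * x if x > 0 else y + x
    return y + sin(x)


print(derivative(f)(2.0))   # for x > 0, f(x) = x**4 + sin(x)
```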
probabilistic programming
pyro implementation in numpy, alpha stage
machine learning in python
install official python bindings:
pip install -U libsvm-official
third-party python libsvm package installed by:
pip install libsvm
opennlp may use onnx runtime (unverified), and may support m1 inference.
opennlp is written in java. after installing openjdk on macos with homebrew, run this to ensure openjdk is detected:
sudo ln -sfn $(brew --prefix)/opt/openjdk/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk.jdk
opennlp has a language detector for 103 languages, including chinese. opennlp has a sentence detector (separator) which could be trained on chinese (maybe?)
in order to use opennlp with less code written, here’s how to invoke java from kotlin
found in manning's article about better search engine suggestions. in this example it is used with lucene, which has image retrieval (LIRE) capability. lucene is also available as lucene.net in dotnet/c#.
to install lucene.net:
dotnet add package Lucene.Net --prerelease
deep learning library for java
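gradient boosting trains an ensemble of decision trees one after another, each correcting the errors of the previous ones; it is used for classification and regression models.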
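a minimal sketch of the boosting loop, assuming squared-error loss, 1-D inputs, and depth-1 trees (stumps). this is plain python for illustration and not how lightgbm is implemented; `fit_stump`, `gradient_boost` and `mse` are hypothetical names and the data is made up:

```python
def fit_stump(xs, residuals):
    """Find the threshold split on x that minimizes squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        error = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean


def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    """Fit an additive model: a constant plus `rounds` shrunken stumps."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(rounds):
        # For squared loss the negative gradient is simply the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 5.0]
model = gradient_boost(xs, ys)
baseline = lambda x: sum(ys) / len(ys)   # predict the mean everywhere
print(mse(baseline, xs, ys), mse(model, xs, ys))
```

each round fits a stump to what the current ensemble still gets wrong, so the training error shrinks as rounds are added.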
Light Gradient Boosting Machine
it has official command-line tools. installation on macos:
brew install lightgbm
install python package on macos:
brew install cmake
if you want to enable jax sampling, install numpyro
or blackjax
via pip
differences between pymc3 (old) and pymc (pymc4):
pymc is optimized and faster than pymc3
pymc3 uses theano as its backend, while pymc uses aesara (a fork of theano)
docs with live demo of pymc
PyMC is a probabilistic programming library for Python that allows users to build Bayesian models with a simple Python API and fit them using Markov chain Monte Carlo (MCMC) methods.
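to show what "fitting with MCMC" means, here is a minimal random-walk Metropolis sampler in plain python for a coin-bias model. it does not use the PyMC API, and PyMC's own samplers are more sophisticated; the model (uniform prior, 7 heads in 10 flips) and the names `log_posterior` / `metropolis` are assumptions for illustration:

```python
import math
import random


def log_posterior(theta, heads, flips):
    """Log of (uniform prior on (0, 1)) x (binomial likelihood), up to a constant."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)


def metropolis(heads, flips, steps=20000, step_size=0.2, seed=0):
    """Random-walk Metropolis sampler for the coin bias theta."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, heads, flips)
    samples = []
    for _ in range(steps):
        proposal = theta + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, heads, flips)
        # Accept with probability min(1, posterior ratio); exp(-inf) is 0, so
        # proposals outside (0, 1) are always rejected.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples[steps // 10:]   # drop the first 10% as burn-in


samples = metropolis(heads=7, flips=10)
posterior_mean = sum(samples) / len(samples)
print(round(posterior_mean, 2))   # the exact posterior is Beta(8, 4), with mean 2/3
```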
a high level torch wrapper including “out of the box” support for vision, text, tabular, and collab (collaborative filtering) models.
fastai appears on the opennlp-related twitter list shown on opennlp's official website.
does fastai support macos? unclear. fastai is built on top of pytorch. initial support starts with 2.7.8, and the current version is 2.7.9
searching github for 'samoyed' like this turns up imagewoof, a pets classification dataset from the fastai 2020 tutorial series. more image classes, such as subcategories of cats, may be found in imagenet.
design a model that accepts a fixed-length word-type sequence and outputs word-order tokens. the tokens are then used to decode the final word sequence, somewhat like a convolution but not the same.
the input can be either misordered sentences or correct sentences
looking for an english word-order corrector (grammar).
https://github.com/MaartenGr/BERTopic
new word discovery (can be used to mine trending topics, hot words, and blue-ocean keywords)
https://github.com/zhanzecheng/Chinese_segment_augment
https://github.com/bojone/word-discovery
https://github.com/blmoistawinde/HarvestText
text classification, text matching, text retrieval
https://github.com/lining0806/Naive-Bayes-Classifier
https://github.com/649453932/Bert-Chinese-Text-Classification-Pytorch
https://github.com/gaussic/text-classification-cnn-rnn
https://github.com/yongzhuo/Keras-TextClassification
https://github.com/youthpasses/bayes_classifier
https://github.com/Roshanson/TextInfoExp
https://github.com/aceimnorstuvwxz/toutiao-multilevel-text-classfication-dataset
https://github.com/CementMaker/cnn_lstm_for_text_classify
https://github.com/hellonlp/classifier_multi_label_textcnn
https://github.com/cjymz886/text-cnn
https://github.com/terrifyzhao/bert-utils
https://github.com/649453932/Chinese-Text-Classification-Pytorch
https://github.com/HappyShadowWalker/ChineseTextClassify
https://github.com/XqFeng-Josie/TextCNN
questgen.ai:
generate question from essay, imitate interaction
increase audience interaction by generating questions
question answering question generator
jiagu (甲骨) nlp package provided by ownthink:
https://github.com/ownthink/Jiagu
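chinese word segmentation
part-of-speech tagging
named entity recognition
knowledge graph relation extraction
keyword extraction
text summarization
new word discovery
sentiment analysis
text clustering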
haystack:
nlp framework
neural search neural text search
semantic search
summarization
question answering
snownlp:
chinese segmentation, pinyin, sentiment analysis, word tags, keywords, summary, tf-idf similarity, classification, traditional-to-simplified chinese conversion
GAN Journey:
https://github.com/nutllwhy/gan-journey
NLPGNN:
https://github.com/kyzhouhzau/NLPGNN
Examples (See tests for more details):
BERT-NER (Chinese and English Version)
BERT-CRF-NER (Chinese and English Version)
BERT-CLS (Chinese and English Version)
ALBERT-NER (Chinese and English Version)
ALBERT-CLS (Chinese and English Version)
GPT2-generation (English Version)
Bilstm+Attention (Chinese and English Version)
TextCNN (Chinese and English Version)
GCN, GAN, GIN, GraphSAGE (Based on message passing)
TextGCN and TextSAGE for text classification
using python:
https://github.com/R0uter/LoginputEngine
pinyin2hanzi:
https://github.com/letiantian/Pinyin2Hanzi
Python chinese to pinyin:
https://github.com/mozillazg/python-pinyin
pyim tsinghua dict(for emacs):
https://github.com/redguardtoo/pyim-tsinghua-dict
chinese input method dict converter: