2023-02-14

Text-Processing.Com, Free Text Mining And Natural Language Processing

functions:

Sentiment Analysis
Stemming
Part-of-Speech Tagging and Chunking
Phrase Extraction & Named Entity Recognition

each method is throttled to 1000 calls per day per IP.

2022-12-13

Turing-Project And His Works On Ai And Nlp

he has recently been interacting with racketeers on wechat; find out how to add new friends (and groups, if any) on wechat.

video transfer based on DCT-Net, used for video content spinning (pseudo-original videos)

AntiFraudChatBot is a wechaty bot that chats with racketeers using Yuan 1.0, a very large megatron-based model that is freely available for only three months (30k api calls). another application: AI murder mystery game (AI剧本杀)

megatron deepspeed enables training large models on cheap hardware

essaykillerbrain is another project he has been involved in, which contains EssayKiller_V2, EssayKiller_V1, EssayTopicPredict, and WrittenBrainBase

alphafold in mindspore

language models

allennlp-models

bert lang street

recommendation

deepmatch

fuzzy search

fuzzywuzzy or thefuzz

fzf a commandline fuzzy matcher

iterfzf as a fzf python binding and its related projects

rapidfuzz
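As a rough illustration of what these fuzzy matchers do, here is a minimal sketch using only the standard library's `difflib`. It is a stand-in, not the API of fuzzywuzzy, thefuzz, or rapidfuzz (those use Levenshtein-style scores and are much faster); the `suggest` helper and the candidate list are made up for this example.

```python
import difflib


def suggest(query, candidates, limit=3, cutoff=0.6):
    """Return up to `limit` candidates ranked by similarity to `query`.

    Hypothetical helper: wraps difflib.get_close_matches, which scores
    candidates with SequenceMatcher.ratio() and drops those below `cutoff`.
    """
    return difflib.get_close_matches(query, candidates, n=limit, cutoff=cutoff)


if __name__ == "__main__":
    names = ["rapidfuzz", "thefuzz", "fuzzywuzzy", "iterfzf", "fzf"]
    # A misspelled query still finds the closest library name first.
    print(suggest("rapidfuz", names))
```

The dedicated libraries are the better choice for large candidate lists; the sketch only shows the "rank candidates by string similarity" idea.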

stopwords

from nltk.corpus import stopwords

stopwordsiso in python

summarization

sumy Simple library and command line utility for extracting summary from HTML pages or plain texts

pytextrank Python implementation of TextRank as a spaCy pipeline extension, for graph-based natural language work plus related knowledge graph practices; used for phrase extraction and lightweight extractive summarization of text documents

summa TextRank implementation for text summarization and keyword extraction in Python 3, with optimizations on the similarity function.
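The TextRank idea behind pytextrank and summa can be sketched in plain Python: build a graph whose nodes are sentences, weight edges by sentence similarity, and rank nodes with a PageRank-style iteration. This is a simplified sketch, not the code of either library; it assumes a naive regex sentence splitter and uses Jaccard word overlap as the similarity, which differs from the original TextRank similarity function.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Jaccard overlap between two sets of words (an assumption of this sketch)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def textrank_summary(text, top_n=1, damping=0.85, iterations=50):
    """Return the `top_n` highest-ranked sentences in document order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]

    # Symmetric similarity graph between sentences.
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = similarity(words[i], words[j])
    out_sum = [sum(row) for row in weights]

    # Weighted PageRank by power iteration.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]

    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]


if __name__ == "__main__":
    text = "Cats chase mice. Cats and dogs chase mice and birds. Dogs chase birds."
    print(textrank_summary(text, top_n=1))
```

The sentence that shares the most vocabulary with the others ends up with the highest score, which is why extractive summaries from this family of methods favor "central" sentences.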

keyword extraction

rake-nltk RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm which tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text.

multi-rake Multilingual Rapid Automatic Keyword Extraction (RAKE) for Python

yake Unsupervised Approach for Automatic Keyword Extraction using Text Features

tutorial and libraries

keybert uses sentence transformer to do the job

kwx

pke Python Keyphrase Extraction module

import jieba.analyse as ana
# methods under ana:
# ['analyzer', 'default_textrank', 'default_tfidf', 'extract_tags', 'set_idf_path', 'set_stop_words', 'textrank', 'tfidf']
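The RAKE description earlier in this list (word frequency plus co-occurrence) can be made concrete with a small pure-Python sketch. It is not the rake-nltk or multi-rake implementation: the tiny `STOPWORDS` set and the function names are assumptions for illustration. Candidate phrases are runs of words between stopwords and punctuation; each word is scored as degree / frequency, and a phrase scores the sum of its words.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real use needs a full list).
STOPWORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "it", "its",
             "of", "on", "or", "that", "the", "to", "which", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of non-stopwords, breaking at punctuation."""
    phrases = []
    for chunk in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in chunk.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                    current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text, top_n=5):
    """Return the `top_n` (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


if __name__ == "__main__":
    sample = ("Keyword extraction is the task of finding key phrases in text. "
              "Rapid automatic keyword extraction is fast.")
    for phrase, score in rake(sample):
        print(score, phrase)
```

Long phrases made of words that tend to appear in long phrases score highest, which is the main bias of RAKE-style scoring.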

jax

docs

autograd and xla (Accelerated Linear Algebra)

With its updated version of Autograd, JAX can automatically differentiate native Python and NumPy functions. It can differentiate through loops, branches, recursion, and closures, and it can take derivatives of derivatives of derivatives. It supports reverse-mode differentiation (a.k.a. backpropagation) via grad as well as forward-mode differentiation, and the two can be composed arbitrarily to any order.
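To make the forward-mode idea above concrete without depending on JAX, here is a minimal dual-number sketch in plain Python. It is not how JAX is implemented and does not use its API; `Dual`, `derivative`, and `poly` are hypothetical names for this example. It only shows that a derivative can be carried through ordinary Python control flow such as a loop.

```python
class Dual:
    """A value paired with its derivative (a + b*eps, with eps**2 == 0)."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        """Wrap plain numbers as constants (derivative 0)."""
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def derivative(f):
    """Return a function computing df/dx by forward-mode differentiation."""
    def df(x):
        return f(Dual(x, 1.0)).deriv
    return df


def poly(x, n=3):
    """x + x**2 + ... + x**n, written with an ordinary Python loop."""
    total = 0.0
    term = 1.0
    for _ in range(n):
        term = term * x
        total = total + term
    return total


if __name__ == "__main__":
    # d/dx (x + x^2 + x^3) = 1 + 2x + 3x^2
    print(derivative(poly)(2.0))
```

Reverse mode (what `grad` uses) records the computation and propagates derivatives backwards instead, which is cheaper when a function has many inputs and one output.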

XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes.

pyro

probabilistic programming

getting started

examples

sample code

numpyro

getting started

pyro implementation on a numpy backend (powered by jax), alpha stage

scikit-learn

machine learning in python

libsvm

install official python bindings:

pip install -U libsvm-official

third-party python libsvm package installed by:

pip install libsvm

opennlp

hands-on docs

model zoo

opennlp uses onnx runtime (maybe?), so it may support inference on m1.

opennlp is written in java. after installing openjdk on macos with homebrew, run this to ensure openjdk is detected:

sudo ln -sfn $(brew --prefix)/opt/openjdk/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk.jdk

opennlp has a language detector for 103 languages, including chinese. opennlp has a sentence detector (separator) which could be trained on chinese (maybe?)

in order to use opennlp with less code written, here’s how to invoke java from kotlin

dl4j

found in a manning article about better search engine suggestions. in this example it is used with lucene, which has image retrieval (LIRE) capability. lucene is also available as lucene.net in dotnet/c#.

to install lucene.net:

dotnet add package Lucene.Net --prerelease

deep learning library for java

xgboost

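gradient boosting trains an ensemble of decision trees one after another, each new tree fitted to the errors of the trees before it. it is used for classification, regression, and ranking models.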
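A minimal sketch of the boosting loop, in plain Python rather than with the xgboost API: one-split "stumps" on a single numeric feature are fitted to the residuals of the current ensemble under squared loss. The function names, the learning rate, and the toy data are assumptions; xgboost itself adds regularization, second-order gradients, and many optimizations not shown here.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree on a 1-D feature (least squares).

    Assumes `xs` contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, rounds=50, learning_rate=0.1):
    """Return a predict(x) function built from `rounds` boosted stumps."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # For squared loss, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6]
    ys = [1, 1, 1, 5, 5, 5]
    model = fit_gradient_boosting(xs, ys)
    print(model(2), model(5))
```

The small learning rate means each stump only corrects part of the remaining error, so the fit improves gradually over the rounds.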

lightgbm

Light Gradient Boosting Machine

has an official commandline tool. installation on macos:

brew install lightgbm

install python package on macos:

brew install cmake
pip3 install lightgbm

pymc

examples

to enable jax sampling, install numpyro or blackjax via pip

difference between pymc3 (old) and pymc (pymc4):

pymc is optimized and faster than pymc3

pymc3 uses theano as its backend, while pymc uses aesara (a fork of theano)

docs with live demo of pymc

PyMC is a probabilistic programming library for Python that allows users to build Bayesian models with a simple Python API and fit them using Markov chain Monte Carlo (MCMC) methods.
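To show what "fitting with MCMC" means, here is a minimal random-walk Metropolis sampler in plain Python for the bias of a coin under a uniform prior. It is a sketch, not PyMC's API or its default sampler (PyMC uses more efficient samplers such as NUTS); the function names, step size, and data are assumptions for this example.

```python
import math
import random


def log_posterior(theta, heads, tails):
    """Unnormalized log posterior of a coin bias with a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def metropolis(heads, tails, draws=20000, burn_in=2000, step=0.2, seed=0):
    """Draw posterior samples with a Gaussian random-walk proposal."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, heads, tails)
    samples = []
    for i in range(draws + burn_in):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, heads, tails)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= burn_in:
            samples.append(theta)
    return samples


if __name__ == "__main__":
    samples = metropolis(heads=7, tails=3)
    # The exact posterior is Beta(8, 4), whose mean is 8 / 12.
    print(sum(samples) / len(samples))
```

A library like PyMC hides this loop: the model is declared, and the sampler, tuning, and diagnostics are handled for you.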

fastai

a high level torch wrapper including “out of the box” support for vision, text, tabular, and collab (collaborative filtering) models.

docs

courses

fastai appears on the opennlp-related twitter list shown on the opennlp official website.

fastai may not support macos, or does it? fastai is built on top of pytorch. initial support started with 2.7.8, and the current version is 2.7.9

searching github for ‘samoyed’ like this, we get imagewoof, a dataset for pet classification from the fastai 2020 tutorial series. more image classes, such as subcategories of cats, may be found in imagenet.

2022-07-18

Sentence Word Order Corrector

design a model that accepts a fixed-length sequence of word types and outputs word order tokens. the tokens are used to decode the final word sequence, somewhat like a convolution but different.

input can be either misplaced sentences or correct sentences

looking for an english word order corrector (grammar).
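One possible way to set up training data for such a model, sketched in plain Python under several assumptions: the "order tokens" are taken to be indices into the shuffled input, sentences are split on whitespace, and padding to a fixed length is left out. `make_example` and `decode` are hypothetical names; the model itself is not shown.

```python
import random


def make_example(sentence, rng):
    """Shuffle a sentence; return (shuffled_words, order_tokens).

    order_tokens[i] is the index, within the shuffled input, of the word
    that belongs at position i of the correct sentence.
    """
    words = sentence.split()
    order = list(range(len(words)))
    rng.shuffle(order)
    shuffled = [words[i] for i in order]
    tokens = [0] * len(words)
    for shuffled_index, original_index in enumerate(order):
        tokens[original_index] = shuffled_index
    return shuffled, tokens


def decode(shuffled, tokens):
    """Rebuild the sentence from shuffled words and predicted order tokens."""
    return " ".join(shuffled[k] for k in tokens)


if __name__ == "__main__":
    rng = random.Random(0)
    shuffled, tokens = make_example("the quick brown fox jumps", rng)
    print(shuffled, tokens)
    print(decode(shuffled, tokens))
    # An already correct sentence maps to the identity order.
    words = "the quick brown fox jumps".split()
    print(decode(words, list(range(len(words)))))
```

Generating both shuffled and untouched sentences this way covers the case where the input is already correct, since the target is then simply the identity order.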