2023-02-14
Text-Processing.Com, Free Text Mining And Natural Language Processing

functions:

  • Sentiment Analysis

  • Stemming

  • Part-of-Speech Tagging and Chunking

  • Phrase Extraction & Named Entity Recognition

each method is throttled to 1000 calls per day per IP.
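since the limit is enforced per IP, a client can keep its own count to avoid being cut off mid-batch. a minimal sketch, assuming the quota resets once per calendar day on the client's clock (the server's actual reset time is not stated here); `DailyQuota` is a hypothetical helper, not part of any text-processing.com client:

```python
import datetime


class DailyQuota:
    """Client-side counter for a per-day call limit (hypothetical helper)."""

    def __init__(self, limit=1000):
        self.limit = limit   # 1000 calls per day per IP, per the note above
        self.day = None      # date the current count belongs to
        self.used = 0

    def try_acquire(self, today=None):
        """Return True and count the call if quota remains for `today`."""
        today = today or datetime.date.today()
        if today != self.day:    # new day: reset the counter
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

call `try_acquire()` before each request and skip or queue the request when it returns False.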


2022-12-13
Turing-Project And His Works On Ai And Nlp

he has recently been interacting with racketeers on wechat. todo: find out how to add new friends (and groups, if any) on wechat.

the bilibili user and his repo

video transfer based on DCT-Net, for video content spinning (pseudo-original content)

AntiFraudChatBot is a wechaty bot for chatting with racketeers. it uses Yuan 1.0, a very large model based on megatron, which is freely available for only three months (30k api calls). another application: AI murder mystery role-play games (剧本杀)

megatron-deepspeed enables training large models on cheap hardware

essaykillerbrain is another project he has been involved in; it contains EssayKiller_V2, EssayKiller_V1, EssayTopicPredict and WrittenBrainBase

alphafold in mindspore


2022-10-29
Keyword Extraction, Topic Modeling, Sentence Embedding

language models

allennlp-models

bert lang street

recommendation

deepmatch

fuzzywuzzy or thefuzz

fzf a commandline fuzzy matcher

iterfzf as a fzf python binding and its related projects

rapidfuzz
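what these fuzzy matchers do can be sketched with the standard library. this uses `difflib.SequenceMatcher` as a stand-in, so scores will not match fuzzywuzzy, thefuzz or rapidfuzz exactly; `fuzzy_ratio` and `best_matches` are hypothetical names chosen for illustration:

```python
import difflib


def fuzzy_ratio(a, b):
    """Similarity in [0, 100], case-insensitive."""
    matcher = difflib.SequenceMatcher(None, a.lower(), b.lower())
    return round(100 * matcher.ratio())


def best_matches(query, choices, limit=3, cutoff=60):
    """Return (choice, score) pairs scoring at least `cutoff`, best first."""
    scored = [(c, fuzzy_ratio(query, c)) for c in choices]
    scored = [pair for pair in scored if pair[1] >= cutoff]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]
```

for example, `best_matches("aple", ["banana", "apple", "grape"])` ranks "apple" first.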

stopwords

from nltk.corpus import stopwords

stopwordsiso in python
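how a stopword list is typically applied, as a minimal sketch. the tiny `STOPWORDS` set is an illustrative assumption; in practice it would come from nltk or stopwordsiso:

```python
import re

# Tiny illustrative stopword set; real lists (nltk, stopwordsiso) are far larger.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase, tokenize on letters/apostrophes, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]
```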

summarization

sumy Simple library and command line utility for extracting summary from HTML pages or plain texts

pytextrank Python implementation of TextRank as a spaCy pipeline extension, for graph-based natural language work plus related knowledge graph practices; used for phrase extraction and lightweight extractive summarization of text documents

summa TextRank implementation for text summarization and keyword extraction in Python 3, with optimizations on the similarity function.
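the TextRank idea behind pytextrank and summa, sketched in plain python: sentences are graph nodes, edge weights are word overlap, and a PageRank-style iteration ranks them. this is a simplified illustration (set-based overlap, fixed iteration count, naive sentence splitting), not the implementation of either library; all function names are hypothetical:

```python
import math
import re


def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Shared words, normalized by the log of each sentence's word count."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if len(wa) < 2 or len(wb) < 2:   # log(1) == 0 would divide by zero
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_summary(text, top_n=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # each sentence receives score from its neighbours, weighted by similarity
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(ranked)]   # keep original order
```

a sentence that shares no words with the rest of the text gets only the baseline score and drops out of the summary.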

keyword extraction

rake-nltk RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm which tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text.

multi-rake Multilingual Rapid Automatic Keyword Extraction (RAKE) for Python
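the RAKE scoring described above can be sketched in a few lines: split text into candidate phrases at stopwords and punctuation, score each word as degree / frequency, and sum word scores per phrase. a minimal sketch with a tiny assumed stopword set; it is not the code of rake-nltk or multi-rake:

```python
import re
from collections import defaultdict

# Tiny illustrative stopword set; rake-nltk uses a full stopword list.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text, stopwords=STOPWORDS):
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)     # how often each word appears
    degree = defaultdict(int)   # co-occurrence count, including the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

longer phrases of rarely repeated words score highest, which is why RAKE tends to favor multi-word key phrases.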

yake Unsupervised Approach for Automatic Keyword Extraction using Text Features

tutorial and libraries

keybert uses sentence transformer to do the job

kwx

pke Python Keyphrase Extraction module

import jieba.analyse as ana
# methods under ana:
# ['analyzer', 'default_textrank', 'default_tfidf', 'extract_tags', 'set_idf_path', 'set_stop_words', 'textrank', 'tfidf']


2022-08-07
Opennlp, Fastai And Other Machine Learning Platforms

jax

docs

autograd and xla (Accelerated Linear Algebra)

With its updated version of Autograd, JAX can automatically differentiate native Python and NumPy functions. It can differentiate through loops, branches, recursion, and closures, and it can take derivatives of derivatives of derivatives. It supports reverse-mode differentiation (a.k.a. backpropagation) via grad as well as forward-mode differentiation, and the two can be composed arbitrarily to any order.

XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes.
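to make the forward-mode differentiation mentioned above concrete, here is a plain-python sketch using dual numbers. this is not JAX and not how JAX is implemented; it only illustrates differentiating through a python loop and nesting to get a second derivative. `Dual`, `derivative` and `cube_via_loop` are hypothetical names:

```python
class Dual:
    """Number a + b*eps with eps**2 == 0; `deriv` carries the derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def derivative(f):
    """Return a function computing df/dx by forward-mode differentiation."""
    def df(x):
        return f(Dual(x, 1.0)).deriv
    return df


def cube_via_loop(x):
    result = 1.0
    for _ in range(3):   # differentiation passes through the python loop
        result = result * x
    return result
```

`derivative(cube_via_loop)(3.0)` gives 27.0 (3x² at x = 3), and `derivative(derivative(cube_via_loop))(3.0)` gives 18.0 (6x at x = 3).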

pyro

probabilistic programming

getting started

examples

sample code

numpyro

getting started

pyro implementation in numpy, alpha stage

scikit-learn

machine learning in python

libsvm

install official python bindings:

pip install -U libsvm-official

third-party python libsvm package installed by:

pip install libsvm

opennlp

hands-on docs

model zoo

opennlp uses onnx runtime (maybe?), which may support m1 inference.

opennlp is written in java. after installing openjdk on macos with homebrew, run this to ensure openjdk is detected:

sudo ln -sfn $(brew --prefix)/opt/openjdk/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk.jdk

opennlp has a language detector for 103 languages, including chinese. opennlp has a sentence detector (separator) which could be trained on chinese (maybe?)

to use opennlp with less code, here’s how to invoke java from kotlin

dl4j

found in a manning article about better search engine suggestions. in this example it is used with lucene, which has image retrieval (LIRE) capability. lucene is also available as lucene.net in dotnet/c#.

to install lucene.net:

dotnet add package Lucene.Net --prerelease

deep learning library for java

xgboost

gradient boosting trains an ensemble of decision trees, each fitted to the errors of the previous ones; it is used for classification and regression models.
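the boosting idea in miniature: fit a tiny tree (a one-split stump) to the current residuals, add a fraction of it to the prediction, repeat. a sketch for 1-D regression with squared loss, assuming at least two distinct input values; it is not xgboost's algorithm, which adds regularization and second-order gradient information:

```python
def fit_stump(xs, residuals):
    """Find the threshold split on 1-D inputs minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def gradient_boost(xs, ys, rounds=50, learning_rate=0.1):
    """Fit stumps to residuals (the negative gradient of squared loss)."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps)
```

the small learning rate means each stump corrects only part of the remaining error, so many rounds are needed but no single stump dominates.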

lightgbm

Light Gradient Boosting Machine

has official commandline tools. installation on macos:

brew install lightgbm

install python package on macos:

brew install cmake
pip3 install lightgbm

pymc

examples

to enable jax sampling, install numpyro or blackjax via pip

difference between pymc3 (old) and pymc (pymc4):

pymc is optimized and faster than pymc3

pymc3 uses theano as its backend, while pymc uses aesara (a fork of theano)

docs with live demo of pymc

PyMC is a probabilistic programming library for Python that allows users to build Bayesian models with a simple Python API and fit them using Markov chain Monte Carlo (MCMC) methods.
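what "fit with MCMC" means, in miniature: a random-walk Metropolis sampler for the bias of a coin under a uniform prior. this is a plain-python illustration only; it does not use PyMC, and PyMC's own samplers are more sophisticated. the function names and default settings are assumptions:

```python
import math
import random


def log_posterior(p, heads, tosses):
    """Log of (uniform prior on p) * (binomial likelihood), up to a constant."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return heads * math.log(p) + (tosses - heads) * math.log(1.0 - p)


def metropolis(heads, tosses, n_samples=20000, step=0.1, seed=0):
    rng = random.Random(seed)
    p = 0.5
    current = log_posterior(p, heads, tosses)
    samples = []
    for _ in range(n_samples):
        proposal = p + rng.gauss(0.0, step)   # symmetric random-walk proposal
        candidate = log_posterior(proposal, heads, tosses)
        # accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            p, current = proposal, candidate
        samples.append(p)
    return samples
```

with 7 heads in 10 tosses the exact posterior is Beta(8, 4), so the sample mean after discarding a burn-in should land near 8/12.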

fastai

a high level torch wrapper including “out of the box” support for vision, text, tabular, and collab (collaborative filtering) models.

docs

courses

fastai appears on the opennlp-related twitter list shown on the opennlp official website.

fastai does not support macos, or does it? fastai is built on top of pytorch. initial support starts with 2.7.8, and the current version is 2.7.9

searching ‘samoyed’ like this in github we get a dataset for pets classification called imagewoof from fastai 2020 tutorial series. more image classes like subcategories of cats may be found in imagenet.


2022-07-18
Sentence Word Order Corrector

design a model that accepts a fixed-length sequence of word types and outputs word order tokens. the tokens are used to decode the final word sequence, similar to convolution but not the same.

input can be either misordered sentences or correct sentences
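the decoding step could look like the sketch below, assuming each word order token is simply the target position of the corresponding input word (that interpretation is an assumption; the model itself is not sketched). `decode_order` is a hypothetical name:

```python
def decode_order(words, order_tokens):
    """Place words[i] at position order_tokens[i] of the output sentence."""
    if sorted(order_tokens) != list(range(len(words))):
        raise ValueError("order tokens must be a permutation of 0..n-1")
    output = [None] * len(words)
    for word, position in zip(words, order_tokens):
        output[position] = word
    return output
```

for an already correct sentence the model would output the identity permutation, so the words come back unchanged.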

looking for an english word order corrector (grammar).


2022-07-13
Topic Generation: Topic Discovery, Trend Discovery, Hot Topic Discovery


2022-06-09
Nlp Packages

NLP NLG Packages

questgen.ai:

generate questions from an essay to imitate interaction

increase audience engagement by generating questions

question answering question generator

jiagu (甲骨) nlp package provided by ownthink:

https://github.com/ownthink/Jiagu

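  • chinese word segmentation

  • part-of-speech tagging

  • named entity recognition

  • knowledge graph relation extraction

  • keyword extraction

  • text summarization

  • new word discovery

  • sentiment analysis

  • text clustering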

haystack:

nlp framework

neural search neural text search

semantic search

summarization

question answering

snownlp:

chinese segmentation, pinyin, sentiment analysis, word tags, keywords, summary, tf-idf similarity, classification, traditional-to-simplified chinese conversion


2022-05-29
Mastering Text Classification: Exploring Nlp Techniques With Bert-Ner, Albert-Ner, Gpt2, And More

GAN for NLP text generation

GAN Journey:

https://github.com/nutllwhy/gan-journey

NLPGNN:

https://github.com/kyzhouhzau/NLPGNN

Examples (See tests for more details):

BERT-NER (Chinese and English Version)

BERT-CRF-NER (Chinese and English Version)

BERT-CLS (Chinese and English Version)

ALBERT-NER (Chinese and English Version)

ALBERT-CLS (Chinese and English Version)

GPT2-generation (English Version)

Bilstm+Attention (Chinese and English Version)

TextCNN (Chinese and English Version)

GCN, GAN, GIN, GraphSAGE (Based on message passing)

TextGCN and TextSAGE for text classification


2022-05-29
Chinese Input Method Or Engine

Chinese Input Method/Engine

using python:

https://github.com/R0uter/LoginputEngine

pinyin2hanzi:

https://github.com/letiantian/Pinyin2Hanzi

Python chinese to pinyin:

https://github.com/mozillazg/python-pinyin

pyim tsinghua dict (for emacs):

https://github.com/redguardtoo/pyim-tsinghua-dict

chinese input method dict converter:

https://github.com/studyzy/imewlconverter
