Add training script for language models #344

Almost all the models we use now (see the list in #298) were trained privately by different contributors, with code in notebooks or scripts that may be private, or open source but difficult to follow.

To make PyThaiNLP more transparent and more customizable for users, we should try to put training scripts or instructions (pointers are fine) somewhere in the repo.

## Known scripts/notebooks and data

| Model | Filename | Training Script | Training Data |
|-------|-----------|---------------|---------------|
| CRF-Cut | sentenceseg-ted.model | https://colab.research.google.com/drive/12nszk-N5LwpHzitlYvhNWVUDSBj30Z1Y | https://github.com/vistec-AI/ted_crawler |
| Enhanced Thai Character Cluster (ETCC) | etcc.txt | https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ | https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ |
| Language model (Thai Wikipedia) | thwiki_lm.pth | ? | ? |
| Thai Grapheme-to-Phoneme (Thai G2P) | thaig2p-0.1.tar | https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb | https://github.com/wannaphong/thai-g2p/blob/master/wiktionary-11-2-2020.tsv |
| Thai word vector | thai2vec.bin | https://github.com/cstorm125/thai2fit |  ? |
| Sentence segmentation (TED) | sentenceseg-ted.model | https://github.com/vistec-AI/ted_crawler | TED Thai subtitles |
| Named-Entity Recognition | data.model | https://github.com/wannaphongcom/thai-ner | ? |
| Thai Wikipedia (for?) | thwiki_itos.pkl | ? | ? |
| POS Tagger | ud_thai-pud_pt_tagger.dill | https://github.com/PyThaiNLP/pythainlp_notebook/tree/master/postag | ? |
| Thai Romanization | thai2rom-pytorch-attn-v0.1.tar | https://github.com/artificiala/thai-romanization/blob/master/notebook/thai_romanize_pytorch_seq2seq_attention.ipynb | https://github.com/wannaphong/thai-romanization |
| Thai Romanization v2 | thai2rom-v2.hdf5 | ? | ? |
| Thai Romanization | thai2rom-pytorch.tar | https://github.com/artificiala/thai-romanization | https://github.com/wannaphongcom/thai-romanization/ |
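
If it helps with tracking, the table could also be kept in a machine-readable form so the gaps are easy to list. The following is only a rough sketch: `ModelRecord`, `missing_fields`, and `report` are hypothetical names, not part of PyThaiNLP, and the records are a few rows copied from the table above, with `None` standing in for a `?` cell.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModelRecord:
    """One row of the table; None stands for an unknown ("?") cell."""
    name: str
    filename: str
    training_script: Optional[str] = None
    training_data: Optional[str] = None


# A few rows copied from the table above (not the full list).
RECORDS = [
    ModelRecord(
        "Thai Grapheme-to-Phoneme (Thai G2P)",
        "thaig2p-0.1.tar",
        "https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb",
        "https://github.com/wannaphong/thai-g2p/blob/master/wiktionary-11-2-2020.tsv",
    ),
    ModelRecord(
        "Thai word vector",
        "thai2vec.bin",
        "https://github.com/cstorm125/thai2fit",
    ),
    ModelRecord("Language model (Thai Wikipedia)", "thwiki_lm.pth"),
    ModelRecord("Thai Romanization v2", "thai2rom-v2.hdf5"),
]


def missing_fields(record: ModelRecord) -> List[str]:
    """Return the names of the pointer fields that are still unknown."""
    missing = []
    if record.training_script is None:
        missing.append("training_script")
    if record.training_data is None:
        missing.append("training_data")
    return missing


def report(records: List[ModelRecord]) -> Dict[str, List[str]]:
    """Map each model filename to its unknown fields, skipping complete rows."""
    result = {}
    for record in records:
        missing = missing_fields(record)
        if missing:
            result[record.filename] = missing
    return result


if __name__ == "__main__":
    for filename, missing in report(RECORDS).items():
        print(f"{filename}: missing {', '.join(missing)}")
```

Running the script prints one line per model that still lacks a training script or training data pointer, which could serve as a to-do list for this issue.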


