Add training script for language models #344

Closed
@bact

Description

Almost all the models we use now (see the list in #298) were trained privately by different contributors, with code in notebooks or scripts that may be private, or may be open source but difficult to follow.

To make PyThaiNLP more transparent and more customizable by users, we should try to put training scripts or instructions (pointers are fine) somewhere in the repo.

Known scripts/notebooks and data

| Model | Filename | Training Script | Training Data |
| --- | --- | --- | --- |
| CRF-Cut | sentenceseg-ted.model | https://colab.research.google.com/drive/12nszk-N5LwpHzitlYvhNWVUDSBj30Z1Y | https://github.com/vistec-AI/ted_crawler |
| Enhanced Thai Character Cluster (ETCC) | etcc.txt | https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ | https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ |
| Language model (Thai Wikipedia) | thwiki_lm.pth | ? | ? |
| Thai Grapheme-to-Phoneme (Thai G2P) | thaig2p-0.1.tar | https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb | https://github.com/wannaphong/thai-g2p/blob/master/wiktionary-11-2-2020.tsv |
| Thai word vector | thai2vec.bin | https://github.com/cstorm125/thai2fit | ? |
| Sentence segmentation (TED) | sentenceseg-ted.model | https://github.com/vistec-AI/ted_crawler | TED Thai subtitles |
| Named-Entity Recognition | data.model | https://github.com/wannaphongcom/thai-ner | ? |
| Thai Wikipedia (for?) | thwiki_itos.pkl | ? | ? |
| POS Tagger | ud_thai-pud_pt_tagger.dill | https://github.com/PyThaiNLP/pythainlp_notebook/tree/master/postag | ? |
| Thai Romanization | thai2rom-pytorch-attn-v0.1.tar | https://github.com/artificiala/thai-romanization/blob/master/notebook/thai_romanize_pytorch_seq2seq_attention.ipynb | https://github.com/wannaphong/thai-romanization |
| Thai Romanization v2 | thai2rom-v2.hdf5 | ? | ? |
| Thai Romanization | thai2rom-pytorch.tar | https://github.com/artificiala/thai-romanization | https://github.com/wannaphongcom/thai-romanization/ |
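
If these pointers end up in the repo, one possible shape is a small machine-readable list that can be checked for gaps. The sketch below is only an illustration: `ModelRecord`, `RECORDS`, and `missing_provenance` are hypothetical names, not part of PyThaiNLP, and only a few entries from the list above are copied in, with `None` standing for a "?".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical record type for one model entry; not a PyThaiNLP API."""

    model: str
    filename: str
    training_script: Optional[str]  # None where the list shows "?"
    training_data: Optional[str]  # None where the list shows "?"


# A few entries copied from the list above (not the complete list).
RECORDS: List[ModelRecord] = [
    ModelRecord("Language model (Thai Wikipedia)", "thwiki_lm.pth", None, None),
    ModelRecord(
        "Thai word vector",
        "thai2vec.bin",
        "https://github.com/cstorm125/thai2fit",
        None,
    ),
    ModelRecord(
        "Thai Grapheme-to-Phoneme (Thai G2P)",
        "thaig2p-0.1.tar",
        "https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb",
        "https://github.com/wannaphong/thai-g2p/blob/master/wiktionary-11-2-2020.tsv",
    ),
    ModelRecord("Thai Romanization v2", "thai2rom-v2.hdf5", None, None),
]


def missing_provenance(records: List[ModelRecord]) -> List[str]:
    """Return filenames whose training script or training data is unknown."""
    return [
        r.filename
        for r in records
        if r.training_script is None or r.training_data is None
    ]


if __name__ == "__main__":
    # Print the model files that still need a script or data pointer.
    for filename in missing_provenance(RECORDS):
        print(filename)
```

A check like this could run in CI so that a newly added model file without a training pointer gets flagged, though whether that is worth the overhead is open to discussion.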

Metadata

Labels: corpus (corpus/dataset-related issues), documentation (improve documentation and test cases), enhancement (enhance functionalities)
