ckip-transformers 0.3.4


CKIP Transformers

This project provides traditional Chinese transformers models (including ALBERT, BERT, GPT2) and NLP tools (including word segmentation, part-of-speech tagging, and named entity recognition).


Git: https://github.com/ckiplab/ckip-transformers
PyPI: https://pypi.org/project/ckip-transformers
Documentation: https://ckip-transformers.readthedocs.io
Demo: https://ckip.iis.sinica.edu.tw/service/transformers

Contributors

Mu Yang at CKIP (Author & Maintainer).
Wei-Yun Ma at CKIP (Maintainer).



Related Packages

CkipTagger: An alternative Chinese NLP library using BiLSTM.
CKIP CoreNLP Toolkit: A Chinese NLP library with more NLP tasks and utilities.




Models

You may also use our pretrained models directly with the HuggingFace transformers library: https://huggingface.co/ckiplab/.



Language Models

ALBERT Tiny: ckiplab/albert-tiny-chinese
ALBERT Base: ckiplab/albert-base-chinese
BERT Tiny: ckiplab/bert-tiny-chinese
BERT Base: ckiplab/bert-base-chinese
GPT2 Tiny: ckiplab/gpt2-tiny-chinese
GPT2 Base: ckiplab/gpt2-base-chinese





NLP Task Models

ALBERT Tiny — Word Segmentation: ckiplab/albert-tiny-chinese-ws
ALBERT Tiny — Part-of-Speech Tagging: ckiplab/albert-tiny-chinese-pos
ALBERT Tiny — Named-Entity Recognition: ckiplab/albert-tiny-chinese-ner
ALBERT Base — Word Segmentation: ckiplab/albert-base-chinese-ws
ALBERT Base — Part-of-Speech Tagging: ckiplab/albert-base-chinese-pos
ALBERT Base — Named-Entity Recognition: ckiplab/albert-base-chinese-ner
BERT Tiny — Word Segmentation: ckiplab/bert-tiny-chinese-ws
BERT Tiny — Part-of-Speech Tagging: ckiplab/bert-tiny-chinese-pos
BERT Tiny — Named-Entity Recognition: ckiplab/bert-tiny-chinese-ner
BERT Base — Word Segmentation: ckiplab/bert-base-chinese-ws
BERT Base — Part-of-Speech Tagging: ckiplab/bert-base-chinese-pos
BERT Base — Named-Entity Recognition: ckiplab/bert-base-chinese-ner






Model Usage

You may use our models directly through the HuggingFace transformers library.

pip install -U transformers

Please use BertTokenizerFast as the tokenizer, and replace ckiplab/albert-tiny-chinese and ckiplab/albert-tiny-chinese-ws in the following example with whichever models you need.

from transformers import (
    BertTokenizerFast,
    AutoModelForMaskedLM,
    AutoModelForCausalLM,
    AutoModelForTokenClassification,
)

# masked language model (ALBERT, BERT)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese')  # or other models above

# causal language model (GPT2)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese')  # or other models above

# NLP task model (WS, POS, NER)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws')  # or other models above
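
The snippets above only load the models. A minimal inference sketch, not part of the official examples, might look like the following; it uses the fill-mask and token-classification pipelines from transformers, the sample sentences are illustrative, and the exact labels printed for the -ws model depend on that model's configuration.

from transformers import (
    BertTokenizerFast,
    AutoModelForMaskedLM,
    AutoModelForTokenClassification,
    pipeline,
)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')

# Masked language model: predict the [MASK] character
mlm = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese')
fill_mask = pipeline('fill-mask', model=mlm, tokenizer=tokenizer)
for prediction in fill_mask('台北是台灣的[MASK]都。'):  # illustrative sentence
    print(prediction['token_str'], round(prediction['score'], 4))

# Word segmentation model: one tag per character (label names come from the model config)
ws_model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws')
ws_tagger = pipeline('token-classification', model=ws_model, tokenizer=tokenizer)
for token in ws_tagger('傅達仁今將執行安樂死。'):  # illustrative sentence
    print(token['word'], token['entity'])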


Model Fine-Tuning

To fine-tune our models on your own datasets, please refer to the following examples from HuggingFace transformers.


https://github.com/huggingface/transformers/tree/master/examples
https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling
https://github.com/huggingface/transformers/tree/master/examples/pytorch/token-classification


Remember to set --tokenizer_name bert-base-chinese in order to use the Chinese tokenizer correctly.

python run_mlm.py \
--model_name_or_path ckiplab/albert-tiny-chinese \ # or other models above
--tokenizer_name bert-base-chinese \
...

python run_ner.py \
--model_name_or_path ckiplab/albert-tiny-chinese-ws \ # or other models above
--tokenizer_name bert-base-chinese \
...
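
For concreteness, a fuller (hypothetical) run_mlm.py invocation could look like the one below; the data files, batch size, and output directory are placeholders, and the flags are the standard ones accepted by the HuggingFace example scripts.

# my_corpus_train.txt and my_corpus_dev.txt are hypothetical line-by-line text files
python run_mlm.py \
--model_name_or_path ckiplab/albert-tiny-chinese \
--tokenizer_name bert-base-chinese \
--train_file my_corpus_train.txt \
--validation_file my_corpus_dev.txt \
--line_by_line \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--output_dir ./albert-tiny-chinese-mlm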


Model Performance

The following is a performance comparison between our models and other models.
Each task is evaluated on a traditional Chinese test set.



Model                          #Parameters   Perplexity†   WS (F1)‡   POS (Acc)‡   NER (F1)‡
ckiplab/albert-tiny-chinese    4M            4.80          96.66%     94.48%       71.17%
ckiplab/albert-base-chinese    11M           2.65          97.33%     95.30%       79.47%
ckiplab/bert-tiny-chinese      12M           8.07          96.98%     95.11%       74.21%
ckiplab/bert-base-chinese      102M          1.88          97.60%     95.67%       81.18%
ckiplab/gpt2-tiny-chinese      4M            16.94         -          -            -
ckiplab/gpt2-base-chinese      102M          8.36          -          -            -
voidful/albert_chinese_tiny    4M            74.93         -          -            -
voidful/albert_chinese_base    11M           22.34         -          -            -
bert-base-chinese              102M          2.53          -          -            -

† Perplexity; the smaller the better.
‡ WS: word segmentation; POS: part-of-speech tagging; NER: named-entity recognition; the larger the better.



Training Corpus

The language models are trained on the ZhWiki and CNA datasets; the WS and POS models are trained on the ASBC dataset; the NER models are trained on the OntoNotes dataset.



ZhWiki: https://dumps.wikimedia.org/zhwiki/
Chinese Wikipedia text (20200801 dump), converted to Traditional Chinese using OpenCC.

CNA: https://catalog.ldc.upenn.edu/LDC2011T13
The CNA (Central News Agency) portion of the Chinese Gigaword Fifth Edition.

ASBC: http://asbc.iis.sinica.edu.tw
Academia Sinica Balanced Corpus of Modern Chinese, release 4.0.

OntoNotes: https://catalog.ldc.upenn.edu/LDC2013T19
OntoNotes release 5.0, Chinese portion, converted to Traditional Chinese using OpenCC.


Here is a summary of each corpus.



Dataset     #Documents   #Lines       #Characters      Line Type
CNA         2,559,520    13,532,445   1,219,029,974    Paragraph
ZhWiki      1,106,783    5,918,975    495,446,829      Paragraph
ASBC        19,247       1,395,949    17,572,374       Clause
OntoNotes   1,911        48,067       1,568,491        Sentence




Here is the dataset split used for the language models.



CNA+ZhWiki   #Documents   #Lines       #Characters
Train        3,606,303    18,986,238   4,347,517,682
Dev          30,000       148,077      32,888,978
Test         30,000       151,241      35,216,818




Here is the dataset split used for the word segmentation and part-of-speech tagging models.



ASBC    #Documents   #Lines      #Words      #Characters
Train   15,247       1,183,260   9,480,899   14,724,250
Dev     2,000        52,677      448,964     741,323
Test    2,000        160,012     1,315,129   2,106,799




Here is the dataset split used for the named-entity recognition models.



OntoNotes   #Documents   #Lines   #Characters   #Named-Entities
Train       1,511        43,362   1,367,658     68,947
Dev         200          2,304    93,535        7,186
Test        200          2,401    107,298       6,977






NLP Tools

The package also provides the following NLP tools.

(WS) Word Segmentation
(POS) Part-of-Speech Tagging
(NER) Named-Entity Recognition


Installation

pip install -U ckip-transformers

Requirements:

Python 3.6+
PyTorch 1.5+
HuggingFace Transformers 3.5+
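
As a quick sanity check after installation, the installed versions can be inspected with importlib.metadata (Python 3.8+); this snippet is only an illustration and is not part of the package.

# Print installed versions and check for a usable GPU
from importlib.metadata import version

import torch

for package in ('ckip-transformers', 'transformers', 'torch'):
    print(package, version(package))
print('CUDA available:', torch.cuda.is_available())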



NLP Tools Usage

See the documentation (https://ckip-transformers.readthedocs.io) for API details.


The complete script for this example is available at https://github.com/ckiplab/ckip-transformers/blob/master/example/example.py.


1. Import module
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker


2. Load models

We provide several pretrained models for the NLP tools.

# Initialize drivers
ws_driver = CkipWordSegmenter(model="bert-base")
pos_driver = CkipPosTagger(model="bert-base")
ner_driver = CkipNerChunker(model="bert-base")

You may also load your own checkpoints with our drivers.

# Initialize drivers with custom checkpoints
ws_driver = CkipWordSegmenter(model_name="path_to_your_model")
pos_driver = CkipPosTagger(model_name="path_to_your_model")
ner_driver = CkipNerChunker(model_name="path_to_your_model")

To use a GPU, specify the device ID when initializing the drivers. Set it to -1 (the default) to disable GPU usage.

# Use CPU
ws_driver = CkipWordSegmenter(device=-1)

# Use GPU:0
ws_driver = CkipWordSegmenter(device=0)


3. Run pipeline

The input for word segmentation and named-entity recognition must be a list of sentences.
The input for part-of-speech tagging must be a list of lists of words (the output of word segmentation).

# Input text
text = [
"傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。",
"美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。",
"空白 也是可以的~",
]

# Run pipeline
ws = ws_driver(text)
pos = pos_driver(ws)
ner = ner_driver(text)

The POS driver automatically splits each sentence on the characters ',,。::;;!!??' before running the model (the output sentences are concatenated back afterwards). You may set delim_set to any characters you want.
You may set use_delim=False to disable this feature, or set use_delim=True in the WS and NER drivers to enable it.

# Enable sentence segmentation
ws = ws_driver(text, use_delim=True)
ner = ner_driver(text, use_delim=True)

# Disable sentence segmentation
pos = pos_driver(ws, use_delim=False)

# Use new line characters and tabs for sentence segmentation
pos = pos_driver(ws, delim_set='\n\t')

You may specify batch_size and max_length to better utilize your machine's resources.

# Sets the batch size and maximum sentence length
ws = ws_driver(text, batch_size=256, max_length=128)


4. Show results
# Pack word segmentation and part-of-speech results
def pack_ws_pos_sentence(sentence_ws, sentence_pos):
    assert len(sentence_ws) == len(sentence_pos)
    res = []
    for word_ws, word_pos in zip(sentence_ws, sentence_pos):
        res.append(f"{word_ws}({word_pos})")
    return "\u3000".join(res)

# Show results
for sentence, sentence_ws, sentence_pos, sentence_ner in zip(text, ws, pos, ner):
    print(sentence)
    print(pack_ws_pos_sentence(sentence_ws, sentence_pos))
    for entity in sentence_ner:
        print(entity)
    print()
傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。
傅達仁(Nb) 今(Nd) 將(D) 執行(VC) 安樂死(Na) ,(COMMACATEGORY) 卻(D) 突然(D) 爆出(VJ) 自己(Nh) 20(Neu) 年(Nd) 前(Ng) 遭(P) 緯來(Nb) 體育台(Na) 封殺(VC) ,(COMMACATEGORY) 他(Nh) 不(D) 懂(VK) 自己(Nh) 哪裡(Ncd) 得罪到(VC) 電視台(Nc) 。(PERIODCATEGORY)
NerToken(word='傅達仁', ner='PERSON', idx=(0, 3))
NerToken(word='20年', ner='DATE', idx=(18, 21))
NerToken(word='緯來體育台', ner='ORG', idx=(23, 28))

美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。
美國(Nc) 參議院(Nc) 針對(P) 今天(Nd) 總統(Na) 布什(Nb) 所(D) 提名(VC) 的(DE) 勞工部長(Na) 趙小蘭(Nb) 展開(VC) 認可(VC) 聽證會(Na) ,(COMMACATEGORY) 預料(VE) 她(Nh) 將(D) 會(D) 很(Dfa) 順利(VH) 通過(VC) 參議院(Nc) 支持(VC) ,(COMMACATEGORY) 成為(VG) 該(Nes) 國(Nc) 有史以來(D) 第一(Neu) 位(Nf) 的(DE) 華裔(Na) 女性(Na) 內閣(Na) 成員(Na) 。(PERIODCATEGORY)
NerToken(word='美國參議院', ner='ORG', idx=(0, 5))
NerToken(word='今天', ner='LOC', idx=(7, 9))
NerToken(word='布什', ner='PERSON', idx=(11, 13))
NerToken(word='勞工部長', ner='ORG', idx=(17, 21))
NerToken(word='趙小蘭', ner='PERSON', idx=(21, 24))
NerToken(word='認可聽證會', ner='EVENT', idx=(26, 31))
NerToken(word='參議院', ner='ORG', idx=(42, 45))
NerToken(word='第一', ner='ORDINAL', idx=(56, 58))
NerToken(word='華裔', ner='NORP', idx=(60, 62))

空白 也是可以的~
空白(VH)  (WHITESPACE) 也(D) 是(SHI) 可以(VH) 的(T) ~(FW)
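
As the output above suggests, each NerToken records character offsets into its source sentence. A small sketch like the following (assuming text and ner from step 3 are still in scope) can be used to recover or double-check the entity spans.

# Verify that NerToken offsets index back into the original sentences
for sentence, sentence_ner in zip(text, ner):
    for entity in sentence_ner:
        start, end = entity.idx
        # entity.word should match the corresponding slice of the input sentence
        assert sentence[start:end] == entity.word
        print(entity.ner, sentence[start:end])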



NLP Tools Performance

The following is a performance comparison between our tools and other tools.


CKIP Transformers vs. Monpa & Jieba


Tool               WS (F1)   POS (Acc)   WS+POS (F1)   NER (F1)
CKIP BERT Base     97.60%    95.67%      94.19%        81.18%
CKIP ALBERT Base   97.33%    95.30%      93.52%        79.47%
CKIP BERT Tiny     96.98%    95.08%      93.13%        74.20%
CKIP ALBERT Tiny   96.66%    94.48%      92.25%        71.17%
Monpa†             92.58%    -           83.88%        -
Jieba              81.18%    -           -             -

† Monpa provides only 3 types of NER tags.



CKIP Transformers vs. CkipTagger

The following results are tested on a different dataset.†



Tool             WS (F1)   POS (Acc)   WS+POS (F1)   NER (F1)
CKIP BERT Base   97.84%    96.46%      94.91%        79.20%
CkipTagger       97.33%    97.20%      94.75%        77.87%




† Here we retrained and tested our BERT model on the same dataset as CkipTagger.





License

Copyright (c) 2023 CKIP Lab under the GPL-3.0 License.
