kmeseg 0.1.1

Creator: bradpython12

Description:

KME Word Segmentation
An AI model for tokenizing chemical IUPAC names, built with TensorFlow and Keras.
Prepare training dataset
What you need


CHAR_INDICES: a dictionary whose keys are characters [string] and whose values are numbers [int] (used to convert text to numbers during preprocessing)


Dict_cut: input text with the cut positions marked by '|' (used to create the labels)


Cyclo|prop|ane| |Non|a|-|1|,|8|-|di|yne|
|1|,|3|-|di|chlorocyclo|hex|ane| |Hept|a|-|1|,|5|-|di|ene


Dict: raw input text (used to train the model)

Cyclopropane Nona-1,8-diyne 1,3-dichlorocyclohexane Hepta-1,5-diene
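
The three inputs above are closely related, and a minimal sketch may make that concrete. It assumes that Dict is simply Dict_cut with the '|' markers removed and that index 0 is reserved for padding; the package itself may build CHAR_INDICES differently.

```python
# Sample taken from the Dict_cut example above.
dict_cut = "Cyclo|prop|ane| |Non|a|-|1|,|8|-|di|yne|"

# Dict (raw text) is Dict_cut with every '|' marker removed.
raw_text = dict_cut.replace("|", "")

# Assumption: index 0 is reserved for padding, so characters start at 1.
CHAR_INDICES = {ch: i for i, ch in enumerate(sorted(set(raw_text)), start=1)}

# Preprocess the raw text into a list of integers.
encoded = [CHAR_INDICES[ch] for ch in raw_text]

print(raw_text)
print(encoded[:5])
```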

Create dataset

Make each JSON value an array of chemical names (dataset, dataset_cut)
Split the arrays into a training dataset (90%) (dataset_train, dataset_cut_train) and a validation dataset (10%) (dataset_val, dataset_cut_val)
Join the items of each array into a single text
Create the dataset with the create_dataset function, which takes dataset_cut and returns X_train (size: [text_length, look_back]) (dataset_cut with the '|' markers removed) and label (where to cut: 1 = cut, 0 = not cut)
Use tf.data.Dataset.from_tensor_slices((X, y)).batch(128) to batch the data for training
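
The steps above can be sketched in plain Python. This is not the package's own create_dataset, only an illustration under stated assumptions: a label of 1 marks the first character of a token (inferred from the sample output later in this document), each row of X is a window of look_back character indices ending at the current character, and index 0 is used for padding and unknown characters. The helper split_train_val is hypothetical.

```python
def split_train_val(items, train_fraction=0.9):
    """Split a list into training and validation parts (hypothetical helper)."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


def create_dataset(text_cut, char_indices, look_back=5):
    """Return (X, y) for a text whose cut positions are marked with '|'."""
    chars, labels = [], []
    starts_token = True  # assumption: the first character starts a token
    for ch in text_cut:
        if ch == "|":
            starts_token = True  # the next real character starts a token
            continue
        chars.append(char_indices.get(ch, 0))  # unknown characters map to 0
        labels.append(1 if starts_token else 0)
        starts_token = False

    X = []
    for i in range(len(chars)):
        window = chars[max(0, i - look_back + 1): i + 1]
        X.append([0] * (look_back - len(window)) + window)  # left-pad with 0
    return X, labels


X, y = create_dataset("di|yne", {"d": 1, "i": 2, "y": 3, "n": 4, "e": 5}, look_back=3)
print(X)
print(y)
```

The resulting lists have the shape [text_length, look_back] and [text_length], so they can be passed to tf.data.Dataset.from_tensor_slices.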

Create Model

We use one Embedding layer, one Bidirectional LSTM layer, and one Dense layer
The model is compiled with optimizer = Adam, loss_function = Categorical Crossentropy (because we classify into 2 output labels: 1 = cut, 0 = not cut), and callbacks = [EarlyStopping, ModelCheckpoint]
EarlyStopping: stops training when validation_loss starts to increase
ModelCheckpoint: saves the model with the lowest validation_loss
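
A minimal Keras sketch of the architecture described above follows. The layer sizes, the patience value, the softmax activation, and the checkpoint file name are all assumptions chosen for illustration; the package's actual hyperparameters are not stated here.

```python
import tensorflow as tf


def build_model(vocab_size, embedding_dim=32, lstm_units=64):
    """Embedding -> Bidirectional LSTM -> Dense(2), sizes are illustrative."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_units)),
        # Two outputs: probability of "not cut" (0) and "cut" (1).
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",  # expects one-hot labels
        metrics=["accuracy"],
    )
    return model


callbacks = [
    # Stop when val_loss has not improved for 3 epochs (patience is assumed).
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
    # Keep only the model with the lowest val_loss (file name is assumed).
    tf.keras.callbacks.ModelCheckpoint(
        "best_model.keras", monitor="val_loss", save_best_only=True
    ),
]
```

Because Categorical Crossentropy expects one-hot labels, integer labels would first be converted, for example with tf.keras.utils.to_categorical(y, num_classes=2), before calling model.fit with validation data and these callbacks.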

After Train Model

The output of the model is an array (size: [batch_size, 2]) that determines where to cut. Taking the predicted class for each position gives a label sequence (value = 0 -> not cut; 1 -> cut):

[1 1 1 1 1 0 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0 0 0 1 0 0]


Tokenize the dataset (text with no cut positions marked) with the labels (output from the model) using the word_tokenize function, which returns an array of the text segments

['1', ',', '2', '-', 'di', 'h', 'ydrox', 'y', '-', '2', '-', 'meth', 'yl', 'prop', 'ane']
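
The splitting step can be sketched as follows. This is an illustrative reimplementation, not the package's own word_tokenize; it assumes that a label of 1 marks the first character of a new token, which is consistent with the sample labels and tokens shown above.

```python
def word_tokenize(text, labels):
    """Split text into tokens; a label of 1 starts a new token."""
    tokens = []
    for ch, label in zip(text, labels):
        if label == 1 or not tokens:
            tokens.append(ch)   # start a new token
        else:
            tokens[-1] += ch    # extend the current token
    return tokens


text = "1,2-dihydroxy-2-methylpropane"
labels = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1,
          1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0]
print(word_tokenize(text, labels))
# ['1', ',', '2', '-', 'di', 'h', 'ydrox', 'y', '-', '2', '-', 'meth', 'yl', 'prop', 'ane']
```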

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.
