model package

Submodules

model.crf module

class model.crf.CRFDecode_vb(tagset_size, start_tag, end_tag, average_batch=True)[source]

Bases: object

Batch-mode Viterbi decoder

Parameters:
  • tagset_size – size of the tag set
  • start_tag – index of the <start> tag
  • end_tag – index of the <pad> tag
  • average_batch – whether to average the loss over the batch
decode(scores, mask)[source]

Find the optimal path with Viterbi decoding

Parameters:
  • scores (size seq_len, bat_size, target_size_from, target_size_to) – CRF scores
  • mask (seq_len, bat_size) – mask for padding
Returns:

decoded sequence (size seq_len, bat_size)
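
A minimal usage sketch, assuming the module layout documented above; the tensors are random placeholders and the choice of start/end indices is illustrative:

    import torch
    from model.crf import CRFDecode_vb

    tagset_size, seq_len, batch_size = 5, 10, 4
    decoder = CRFDecode_vb(tagset_size, start_tag=3, end_tag=4)

    # CRF scores: one (from_tag, to_tag) score matrix per token and per instance
    scores = torch.randn(seq_len, batch_size, tagset_size, tagset_size)
    # padding mask, per the parameter description above
    mask = torch.ones(seq_len, batch_size, dtype=torch.uint8)

    best_paths = decoder.decode(scores, mask)  # (seq_len, batch_size) tag indices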

class model.crf.CRFLoss_gd(tagset_size, start_tag, end_tag, average_batch=True)[source]

Bases: torch.nn.modules.module.Module

loss for greedy decoding; i.e., although it is for the CRF layer, the loss is calculated as

\[\sum_{j=1}^n \log (p(\hat{y}_{j+1}|z_{j+1}, \hat{y}_{j}))\]

instead of

\[\sum_{j=1}^n \log (\phi(\hat{y}_{j-1}, \hat{y}_j, \mathbf{z}_j)) - \log (\sum_{\mathbf{y}' \in \mathbf{Y}(\mathbf{Z})} \prod_{j=1}^n \phi(y'_{j-1}, y'_j, \mathbf{z}_j) )\]
Parameters:
  • tagset_size – size of the tag set
  • start_tag – index of the <start> tag
  • end_tag – index of the <pad> tag
  • average_batch – whether to average the loss over the batch
forward(scores, target, current)[source]
Parameters:
  • scores (Word_Seq_len, Batch_size, target_size_from, target_size_to) – CRF scores
  • target (Word_Seq_len, Batch_size) – gold-standard labels
  • current (Word_Seq_len, Batch_size) – current state
Returns:

greedy CRF loss

class model.crf.CRFLoss_vb(tagset_size, start_tag, end_tag, average_batch=True)[source]

Bases: torch.nn.modules.module.Module

loss for Viterbi decoding

\[\sum_{j=1}^n \log (\phi(\hat{y}_{j-1}, \hat{y}_j, \mathbf{z}_j)) - \log (\sum_{\mathbf{y}' \in \mathbf{Y}(\mathbf{Z})} \prod_{j=1}^n \phi(y'_{j-1}, y'_j, \mathbf{z}_j) )\]
Parameters:
  • tagset_size – size of the tag set
  • start_tag – index of the <start> tag
  • end_tag – index of the <pad> tag
  • average_batch – whether to average the loss over the batch
forward(scores, target, mask)[source]
Parameters:
  • scores (seq_len, bat_size, target_size_from, target_size_to) – CRF scores
  • target (seq_len, bat_size, 1) – gold-standard labels
  • mask (size seq_len, bat_size) – mask for padding
Returns:

loss
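
A shape-level sketch of computing this loss, assuming the layout documented above; the target tensor below holds dummy indices purely to illustrate shapes, whereas real targets come from CRFRepack.repack_vb:

    import torch
    from model.crf import CRFLoss_vb

    tagset_size, seq_len, batch_size = 5, 10, 4
    crit = CRFLoss_vb(tagset_size, start_tag=3, end_tag=4)

    # scores would normally come from CRF_L/CRF_S on top of the word-level BLSTM
    scores = torch.randn(seq_len, batch_size, tagset_size, tagset_size, requires_grad=True)
    target = torch.randint(0, tagset_size, (seq_len, batch_size, 1))  # dummy gold indices
    mask = torch.ones(seq_len, batch_size, dtype=torch.uint8)

    loss = crit(scores, target, mask)  # scalar loss, averaged over the batch by default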

class model.crf.CRFRepack(tagset_size, if_cuda)[source]

Bases: object

Packer for the word-level model

Parameters:
  • tagset_size – size of the tag set
  • if_cuda – whether to use the GPU
convert_for_eval(target)[source]

convert targets back to the original label encoding for evaluation

Parameters: target – input labels used in training
Returns: output labels used in testing
repack_gd(feature, target, current)[source]

packer for greedy loss

Parameters:
  • feature (Seq_len, Batch_size) – input feature
  • target (Seq_len, Batch_size) – output target
  • current (Seq_len, Batch_size) – current state
Returns:

feature (Seq_len, Batch_size), target (Seq_len * Batch_size), current (Seq_len * Batch_size, 1, 1)

repack_vb(feature, target, mask)[source]

packer for viterbi loss

Parameters:
  • feature (Seq_len, Batch_size) – input feature
  • target (Seq_len, Batch_size) – output target
  • mask (Seq_len, Batch_size) – padding mask
Returns:

feature (Seq_len, Batch_size), target (Seq_len, Batch_size), mask (Seq_len, Batch_size)
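
A hypothetical sketch of using the packer right before the Viterbi loss; the tensor contents are dummies and the shapes follow the parameter descriptions above:

    import torch
    from model.crf import CRFRepack

    tagset_size, seq_len, batch_size = 6, 12, 4
    packer = CRFRepack(tagset_size, if_cuda=False)

    feature = torch.randint(0, 100, (seq_len, batch_size))         # word indices
    target = torch.randint(0, tagset_size, (seq_len, batch_size))  # label indices
    mask = torch.ones(seq_len, batch_size, dtype=torch.uint8)

    f_v, t_v, m_v = packer.repack_vb(feature, target, mask)        # ready for CRFLoss_vb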

class model.crf.CRFRepack_WC(tagset_size, if_cuda)[source]

Bases: object

Packer for the model with both char-level and word-level features

Parameters:
  • tagset_size – size of the tag set
  • if_cuda – whether to use the GPU
convert_for_eval(target)[source]

convert targets back to the original label encoding for evaluation

Parameters: target – input labels used in training
Returns: output labels used in testing
repack_vb(fc_feature, fc_position, bc_feature, bc_position, word_feature, target, mask, batch_len)[source]

packer for viterbi loss

Parameters:
  • fc_feature (Char_Seq_len, Batch_size) – forward_char input feature
  • fc_position (Word_Seq_len, Batch_size) – forward_char input position
  • bc_feature (Char_Seq_len, Batch_size) – backward_char input feature
  • bc_position (Word_Seq_len, Batch_size) – backward_char input position
  • word_feature (Word_Seq_len, Batch_size) – input word feature
  • target (Seq_len, Batch_size) – output target
  • mask (Word_Seq_len, Batch_size) – padding mask
  • batch_len (Batch_size, 2) – length of instances in one batch
Returns:

f_f (Char_Reduced_Seq_len, Batch_size), f_p (Word_Reduced_Seq_len, Batch_size), b_f (Char_Reduced_Seq_len, Batch_size), b_p (Word_Reduced_Seq_len, Batch_size), w_f (size Word_Seq_Len, Batch_size), target (Reduced_Seq_len, Batch_size), mask (Word_Reduced_Seq_len, Batch_size)

class model.crf.CRF_L(hidden_dim, tagset_size, if_bias=True)[source]

Bases: torch.nn.modules.module.Module

Conditional Random Field (CRF) layer. This version is used in Ma et al. (2016) and has more parameters than CRF_S.

Parameters:
  • hidden_dim – input dimension size
  • tagset_size – size of the tag set
  • if_bias – whether to allow a bias term in the linear transformation
forward(feats)[source]
Parameters: feats (batch_size, seq_len, hidden_dim) – input scores from previous layers
Returns: output from the CRF layer (batch_size, seq_len, tag_size, tag_size)
rand_init()[source]

random initialization

class model.crf.CRF_S(hidden_dim, tagset_size, if_bias=True)[source]

Bases: torch.nn.modules.module.Module

Conditional Random Field (CRF) layer. This version is used in Lample et al. (2016) and has fewer parameters than CRF_L.

Parameters:
  • hidden_dim – input dimension size
  • tagset_size – size of the tag set
  • if_bias – whether to allow a bias term in the linear transformation
forward(feats)[source]
Parameters: feats (batch_size, seq_len, hidden_dim) – input scores from previous layers
Returns: output from the CRF layer ((batch_size * seq_len), tag_size, tag_size)
rand_init()[source]

random initialization
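
A sketch of how the CRF layer feeds the Viterbi loss and decoder; the input shape follows the forward() description above, and the reshape from ((batch_size * seq_len), tag_size, tag_size) back to a sequence-first layout is an assumption:

    import torch
    from model.crf import CRF_S

    hidden_dim, tagset_size, seq_len, batch_size = 32, 5, 10, 4
    crf = CRF_S(hidden_dim, tagset_size)
    crf.rand_init()

    feats = torch.randn(batch_size, seq_len, hidden_dim)    # e.g. output of a word-level BLSTM
    crf_scores = crf(feats)                                  # ((batch_size * seq_len), tag_size, tag_size)
    crf_scores = crf_scores.view(batch_size, seq_len, tagset_size, tagset_size)
    crf_scores = crf_scores.transpose(0, 1).contiguous()     # (seq_len, batch_size, ...) for CRFLoss_vb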

model.evaluator module

class model.evaluator.eval_batch(packer, l_map)[source]

Bases: object

Base class for evaluation; provides methods to calculate F1 score and accuracy

Parameters:
  • packer – provides a method to convert targets back into the original label space [TODO: need to improve]
  • l_map – dictionary for labels
acc_score()[source]

calculate the accuracy score from the accumulated statistics

calc_acc_batch(decoded_data, target_data)[source]

update accuracy statistics

Parameters:
  • decoded_data (batch_size, seq_len) – prediction sequence
  • target_data (batch_size, seq_len) – ground-truth
calc_f1_batch(decoded_data, target_data)[source]

update F1-score statistics

Parameters:
  • decoded_data (batch_size, seq_len) – prediction sequence
  • target_data (batch_size, seq_len) – ground-truth
eval_instance(best_path, gold)[source]

update statistics for one instance

Parameters:
  • best_path (seq_len) – predicted
  • gold (seq_len) – ground-truth
f1_score()[source]

calculate the F1 score from the accumulated statistics

reset()[source]

reset all statistics

class model.evaluator.eval_w(packer, l_map, score_type)[source]

Bases: model.evaluator.eval_batch

evaluation class for the word-level model (LSTM-CRF)

Parameters:
  • packer – provides a method to convert targets back into the original label space [TODO: need to improve]
  • l_map – dictionary for labels
  • score_type – use the F1 score when set to ‘f’
calc_score(ner_model, dataset_loader)[source]

calculate the score for the pre-selected metric

Parameters:
  • ner_model – LSTM-CRF model
  • dataset_loader – loader class for test set
class model.evaluator.eval_wc(packer, l_map, score_type)[source]

Bases: model.evaluator.eval_batch

evaluation class for LM-LSTM-CRF

Parameters:
  • packer – provides a method to convert targets back into the original label space [TODO: need to improve]
  • l_map – dictionary for labels
  • score_type – use the F1 score when set to ‘f’
calc_score(ner_model, dataset_loader)[source]

calculate the score for the pre-selected metric

Parameters:
  • ner_model – LM-LSTM-CRF model
  • dataset_loader – loader class for test set

model.highway module

class model.highway.hw(size, num_layers=1, dropout_ratio=0.5)[source]

Bases: torch.nn.modules.module.Module

Highway layers

Parameters:
  • size – input and output dimension
  • dropout_ratio – dropout ratio
forward(x)[source]

forward pass of the highway layers

Parameters: x (ins_num, hidden_dim) – input tensor
Returns: output tensor (ins_num, hidden_dim)
rand_init()[source]

random initialization
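
For reference, a self-contained sketch of a single highway transform in the standard formulation y = t * relu(W_h x + b_h) + (1 - t) * x with gate t = sigmoid(W_t x + b_t); model.highway.hw may differ in details such as the activation or the number of stacked layers:

    import torch
    import torch.nn as nn

    class HighwaySketch(nn.Module):
        def __init__(self, size):
            super().__init__()
            self.trans = nn.Linear(size, size)  # H(x), the transform branch
            self.gate = nn.Linear(size, size)   # t(x), the gating branch

        def forward(self, x):
            t = torch.sigmoid(self.gate(x))
            h = torch.relu(self.trans(x))
            return t * h + (1 - t) * x          # carry the input where the gate is closed

    layer = HighwaySketch(16)
    out = layer(torch.randn(8, 16))             # (ins_num, hidden_dim) -> same shape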

model.lm_lstm_crf module

class model.lm_lstm_crf.LM_LSTM_CRF(tagset_size, char_size, char_dim, char_hidden_dim, char_rnn_layers, embedding_dim, word_hidden_dim, word_rnn_layers, vocab_size, dropout_ratio, large_CRF=True, if_highway=False, in_doc_words=2, highway_layers=1)[source]

Bases: torch.nn.modules.module.Module

LM_LSTM_CRF model

Parameters:
  • tagset_size – size of label set
  • char_size – size of char dictionary
  • char_dim – size of char embedding
  • char_hidden_dim – size of char-level lstm hidden dim
  • char_rnn_layers – number of char-level lstm layers
  • embedding_dim – size of word embedding
  • word_hidden_dim – size of word-level blstm hidden dim
  • word_rnn_layers – number of word-level lstm layers
  • vocab_size – size of word dictionary
  • dropout_ratio – dropout ratio
  • large_CRF – use CRF_L or not; refer to model.crf.CRF_L and model.crf.CRF_S for more details
  • if_highway – use highway layers or not
  • in_doc_words – number of words that occurred in the corpus (used for language model prediction)
  • highway_layers – number of highway layers
forward(forw_sentence, forw_position, back_sentence, back_position, word_seq, hidden=None)[source]
Parameters:
  • forw_sentence (char_seq_len, batch_size) – char-level representation of sentence
  • forw_position (word_seq_len, batch_size) – position of blank space in char-level representation of sentence
  • back_sentence (char_seq_len, batch_size) – char-level representation of sentence (reverse order)
  • back_position (word_seq_len, batch_size) – position of blank space in reversed char-level representation of sentence
  • word_seq (word_seq_len, batch_size) – word-level representation of sentence
  • hidden – initial hidden state
Returns:

CRF output (word_seq_len, batch_size, tag_size, tag_size), hidden

load_pretrained_word_embedding(pre_word_embeddings)[source]

load pre-trained word embedding

Parameters: pre_word_embeddings (self.word_size, self.word_dim) – pre-trained embedding
rand_init(init_char_embedding=True, init_word_embedding=False)[source]

random initialization

Parameters:
  • init_char_embedding – whether to randomly initialize the char embedding
  • init_word_embedding – whether to randomly initialize the word embedding
rand_init_embedding()[source]

randomly initialize the char-level embedding

set_batch_seq_size(sentence)[source]

set batch size and sequence length

set_batch_size(bsize)[source]

set batch size

word_pre_train_backward(sentence, position, hidden=None)[source]

output of backward language model

Parameters:
  • sentence (char_seq_len, batch_size) – char-level representation of sentence (reverse order)
  • position (word_seq_len, batch_size) – position of blank space in reversed char-level representation of sentence
  • hidden – initial hidden state
Returns:

language model output (word_seq_len, in_doc_word), hidden

word_pre_train_forward(sentence, position, hidden=None)[source]

output of forward language model

Parameters:
  • sentence (char_seq_len, batch_size) – char-level representation of sentence
  • position (word_seq_len, batch_size) – position of blank space in char-level representation of sentence
  • hidden – initial hidden state
Returns:

language model output (word_seq_len, in_doc_word), hidden
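
A construction-only sketch that ties this class together; the hyper-parameter values below are illustrative assumptions, not recommended settings:

    from model.lm_lstm_crf import LM_LSTM_CRF

    ner_model = LM_LSTM_CRF(
        tagset_size=12, char_size=80, char_dim=30, char_hidden_dim=300,
        char_rnn_layers=1, embedding_dim=100, word_hidden_dim=300,
        word_rnn_layers=1, vocab_size=20000, dropout_ratio=0.5,
        large_CRF=True, if_highway=True, in_doc_words=15000, highway_layers=1)

    ner_model.rand_init(init_char_embedding=True, init_word_embedding=False)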

model.lstm_crf module

class model.lstm_crf.LSTM_CRF(vocab_size, tagset_size, embedding_dim, hidden_dim, rnn_layers, dropout_ratio, large_CRF=True)[source]

Bases: torch.nn.modules.module.Module

LSTM_CRF model

Parameters:
  • vocab_size – size of word dictionary
  • tagset_size – size of label set
  • embedding_dim – size of word embedding
  • hidden_dim – size of word-level blstm hidden dim
  • rnn_layers – number of word-level lstm layers
  • dropout_ratio – dropout ratio
  • large_CRF – use CRF_L or not; refer to model.crf.CRF_L and model.crf.CRF_S for more details
forward(sentence, hidden=None)[source]
Parameters:
  • sentence (word_seq_len, batch_size) – word-level representation of sentence
  • hidden – initial hidden state
Returns:

CRF output (word_seq_len, batch_size, tag_size, tag_size), hidden

load_pretrained_embedding(pre_embeddings)[source]

load pre-trained word embedding

Parameters: pre_embeddings (self.word_size, self.word_dim) – pre-trained embedding
rand_init(init_embedding=False)[source]

random initialization

Parameters: init_embedding – whether to randomly initialize the embedding
rand_init_embedding()[source]
rand_init_hidden()[source]

randomly initialize the hidden state

set_batch_seq_size(sentence)[source]

set batch size and sequence length

set_batch_size(bsize)[source]

set batch size
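
A minimal forward-pass sketch for LSTM_CRF; the tensors are dummies and the shapes follow the forward() description above:

    import torch
    from model.lstm_crf import LSTM_CRF

    seq_len, batch_size, vocab_size = 12, 4, 100
    model = LSTM_CRF(vocab_size=vocab_size, tagset_size=6, embedding_dim=50,
                     hidden_dim=64, rnn_layers=1, dropout_ratio=0.5, large_CRF=False)
    model.rand_init(init_embedding=True)

    sentence = torch.randint(0, vocab_size, (seq_len, batch_size))  # word indices
    crf_out, hidden = model(sentence)  # (seq_len, batch_size, tag_size, tag_size), hidden state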

model.ner_dataset module

class model.ner_dataset.CRFDataset(data_tensor, label_tensor, mask_tensor)[source]

Bases: torch.utils.data.dataset.Dataset

Dataset class for the word-level model

Parameters:
  • data_tensor (ins_num, seq_length) – words
  • label_tensor (ins_num, seq_length) – labels
  • mask_tensor (ins_num, seq_length) – padding masks
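
A sketch of wrapping CRFDataset in a standard PyTorch DataLoader; the tensors are dummies with the shapes documented above:

    import torch
    from torch.utils.data import DataLoader
    from model.ner_dataset import CRFDataset

    ins_num, seq_length = 32, 20
    data = torch.randint(0, 100, (ins_num, seq_length))    # word indices
    labels = torch.randint(0, 6, (ins_num, seq_length))    # label indices
    masks = torch.ones(ins_num, seq_length, dtype=torch.uint8)

    loader = DataLoader(CRFDataset(data, labels, masks), batch_size=8, shuffle=True)
    for feature, target, mask in loader:
        pass  # each element is a (batch_size, seq_length) tensor
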
class model.ner_dataset.CRFDataset_WC(forw_tensor, forw_index, back_tensor, back_index, word_tensor, label_tensor, mask_tensor, len_tensor)[source]

Bases: torch.utils.data.dataset.Dataset

Dataset class for the char-aware model

Parameters:
  • forw_tensor (ins_num, seq_length) – forward chars
  • forw_index (ins_num, seq_length) – index of forward chars
  • back_tensor (ins_num, seq_length) – backward chars
  • back_index (ins_num, seq_length) – index of backward chars
  • word_tensor (ins_num, seq_length) – words
  • label_tensor (ins_num, seq_length) – labels
  • mask_tensor (ins_num, seq_length) – padding masks
  • len_tensor (ins_num, 2) – length of chars (dim0) and words (dim1)

model.utils module

model.utils.adjust_learning_rate(optimizer, lr)[source]

shrink the learning rate of a PyTorch optimizer
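
A self-contained sketch of the standard PyTorch pattern such a helper follows; the utility's exact behaviour should be checked against the source:

    import torch

    def adjust_learning_rate_sketch(optimizer, lr):
        # set the learning rate of every parameter group in place
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

    optimizer = torch.optim.SGD([torch.nn.Parameter(torch.randn(2, 2))], lr=0.015)
    adjust_learning_rate_sketch(optimizer, 0.01)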

model.utils.argmax(vec)[source]

helper function to calculate argmax of input vector at dimension 1

model.utils.calc_threshold_mean(features)[source]

calculate the thresholds for bucketing by mean length

model.utils.concatChar(input_lines, char_dict)[source]

concatenate chars into strings

Parameters:
  • input_lines (list of list of char) – input corpus
  • char_dict (dictionary) – char-level dictionary
Returns:

forw_lines

model.utils.construct_bucket_gd(input_features, input_labels, thresholds, pad_feature, pad_label)[source]

Construct bucket by thresholds for greedy decode, word-level only

model.utils.construct_bucket_mean_gd(input_features, input_label, word_dict, label_dict)[source]

Construct bucket by mean for greedy decode, word-level only

model.utils.construct_bucket_mean_vb(input_features, input_label, word_dict, label_dict, caseless)[source]

Construct bucket by mean for viterbi decode, word-level only

model.utils.construct_bucket_mean_vb_wc(word_features, input_label, label_dict, char_dict, word_dict, caseless)[source]

Construct bucket by mean for viterbi decode, word-level and char-level

model.utils.construct_bucket_vb(input_features, input_labels, thresholds, pad_feature, pad_label, label_size)[source]

Construct bucket by thresholds for viterbi decode, word-level only

model.utils.construct_bucket_vb_wc(word_features, forw_features, fea_len, input_labels, thresholds, pad_word_feature, pad_char_feature, pad_label, label_size)[source]

Construct bucket by thresholds for viterbi decode, word-level and char-level

model.utils.encode(input_lines, word_dict)[source]

encode list of strings into word-level representation

model.utils.encode2Tensor(input_lines, word_dict, unk)[source]

encode list of strings into word-level representation (tensor) with unk

model.utils.encode2char_safe(input_lines, char_dict)[source]

get char representation of lines

Parameters:
  • input_lines (list of strings) – input corpus
  • char_dict (dictionary) – char-level dictionary
Returns:

forw_lines

model.utils.encode_corpus(lines, f_map, l_map, if_lower=False)[source]

encode corpus into features and labels

model.utils.encode_corpus_c(lines, f_map, l_map, c_map)[source]

encode corpus into features (both word-level and char-level) and labels

model.utils.encode_safe(input_lines, word_dict, unk)[source]

encode list of strings into word-level representation with unk

model.utils.fill_y(nc, yidx)[source]

fill y into a dense matrix

model.utils.find_length_from_feats(feats, feat_to_ix)[source]

find the length of unpadded features based on the features

model.utils.find_length_from_labels(labels, label_to_ix)[source]

find the length of unpadded features based on the labels

model.utils.generate_corpus(lines, if_shrink_feature=False, thresholds=1)[source]

generate labels, features, the word dictionary, and the label dictionary

Parameters:
  • lines – corpus
  • if_shrink_feature – whether to shrink the word dictionary
  • thresholds – threshold for shrinking the word dictionary
model.utils.generate_corpus_char(lines, if_shrink_c_feature=False, c_thresholds=1, if_shrink_w_feature=False, w_thresholds=1)[source]

generate labels, features, the word dictionary, the char dictionary, and the label dictionary

Parameters:
  • lines – corpus
  • if_shrink_c_feature – whether to shrink the char dictionary
  • c_thresholds – threshold for shrinking the char dictionary
  • if_shrink_w_feature – whether to shrink the word dictionary
  • w_thresholds – threshold for shrinking the word dictionary
model.utils.init_embedding(input_embedding)[source]

Initialize embedding

model.utils.init_linear(input_linear)[source]

Initialize linear transformation

model.utils.init_lstm(input_lstm)[source]

Initialize lstm

model.utils.iob_to_spans(sequence, lut, strict_iob2=False)[source]

convert IOB tags to spans

model.utils.iobes_to_spans(sequence, lut, strict_iob2=False)[source]

convert IOBES tags to spans

model.utils.load_embedding(emb_file, delimiter, feature_map, caseless, unk, shrink_to_train=False)[source]

load embedding

model.utils.load_embedding_wlm(emb_file, delimiter, feature_map, full_feature_set, caseless, unk, emb_len, shrink_to_train=False, shrink_to_corpus=False)[source]

load embeddings; in-document words are listed before out-of-document words

Parameters:
  • emb_file – path to embedding file
  • delimiter – delimiter of lines
  • feature_map – word dictionary
  • full_feature_set – all words in the corpus
  • caseless – whether to convert words to caseless (lowercased) form
  • unk – string for unknown token
  • emb_len – dimension of embedding vectors
  • shrink_to_train – whether to shrink away words outside the training set
  • shrink_to_corpus – whether to shrink away words outside the corpus
model.utils.log_sum_exp(vec, m_size)[source]

calculate the log of the sum of exponentials (log-sum-exp)

Parameters:
  • vec (batch_size, vanishing_dim, hidden_dim) – input tensor
  • m_size – hidden_dim
Returns:

batch_size, hidden_dim
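
A numerically stable, self-contained sketch of the same computation (the library function's exact implementation may differ):

    import torch

    def log_sum_exp_sketch(vec, m_size):
        # vec: (batch_size, vanishing_dim, hidden_dim) -> (batch_size, hidden_dim)
        # m_size equals hidden_dim per the docs; it is not needed in this sketch
        max_score, _ = vec.max(dim=1, keepdim=True)  # (batch_size, 1, hidden_dim)
        return max_score.squeeze(1) + torch.log(torch.exp(vec - max_score).sum(dim=1))

    out = log_sum_exp_sketch(torch.randn(4, 7, 5), 5)  # -> (4, 5)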

model.utils.read_corpus(lines)[source]

convert corpus into features and labels

model.utils.read_features(lines, multi_docs=True)[source]

convert un-annotated corpus into features

model.utils.revlut(lut)[source]
model.utils.save_checkpoint(state, track_list, filename)[source]

save checkpoint

model.utils.shrink_embedding(feature_map, word_dict, word_embedding, caseless)[source]

shrink embedding dictionary to in-doc words only

model.utils.shrink_features(feature_map, features, thresholds)[source]

filter uncommon features by threshold

model.utils.switch(vec1, vec2, mask)[source]

element-wise switch between two PyTorch tensors, controlled by a 0/1 mask

Parameters:
  • vec1 (any size) – input tensor corresponding to 0
  • vec2 (same size as vec1) – input tensor corresponding to 1
  • mask (same size as vec1) – input tensor whose elements are 0 or 1
Returns:

vec (*)
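
A self-contained sketch of the element-wise behaviour described above (pick from vec1 where the mask is 0 and from vec2 where it is 1); the library function's implementation may differ:

    import torch

    def switch_sketch(vec1, vec2, mask):
        mask = mask.type_as(vec1)
        return vec1 * (1 - mask) + vec2 * mask

    a = torch.zeros(3, 4)
    b = torch.ones(3, 4)
    m = torch.tensor([[0, 1, 0, 1]] * 3)
    print(switch_sketch(a, b, m))  # 1.0 where m == 1, 0.0 elsewhere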

model.utils.to_scalar(var)[source]

return the first element of a tensor as a Python scalar

Module contents