LM-LSTM-CRF documentation
This project provides high-performance character-aware sequence labeling tools and tutorials. The implementation is based on the PyTorch library.
LM-LSTM-CRF achieves an F1 score of 91.71 ± 0.10 on the CoNLL 2003 NER task without using any additional corpus.
model package
Submodules
model.crf module
class model.crf.CRFDecode_vb(tagset_size, start_tag, end_tag, average_batch=True)
Bases: object
Batch-mode Viterbi decoder.
Parameters:
- tagset_size – size of the tag set
- start_tag – index of <start>
- end_tag – index of <pad>
- average_batch – whether to average the loss over the batch
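CRFDecode_vb runs Viterbi decoding in batch mode over per-step CRF scores. The project's own decoder is not reproduced here; the following is a minimal illustrative sketch of batched Viterbi decoding under an assumed score layout of (seq_len, batch_size, tagset_size, tagset_size), with tag index 0 standing in for <start> and no padding. The function name and details are hypothetical.

```python
import torch

def viterbi_decode(scores):
    """Illustrative batched Viterbi decode (not the library's implementation).
    Assumes no padding and that tag index 0 is <start>.

    scores: (seq_len, batch_size, tagset_size, tagset_size), where
            scores[t, b, i, j] scores a transition from tag i to tag j at
            step t, with the emission score folded in.
    Returns a (seq_len, batch_size) LongTensor of best tag indices.
    """
    seq_len, batch_size, tagset_size, _ = scores.size()
    vit = scores[0, :, 0, :]                       # paths leaving the assumed <start> tag
    pointers = []
    for t in range(1, seq_len):
        candidate = vit.unsqueeze(2) + scores[t]   # (batch, from_tag, to_tag)
        vit, back = candidate.max(dim=1)           # best score and best previous tag
        pointers.append(back)
    best_tag = vit.argmax(dim=1)                   # (batch,)
    path = [best_tag]
    for back in reversed(pointers):                # follow back-pointers to the start
        best_tag = back.gather(1, best_tag.unsqueeze(1)).squeeze(1)
        path.append(best_tag)
    path.reverse()
    return torch.stack(path)                       # (seq_len, batch_size)
```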
class model.crf.CRFLoss_gd(tagset_size, start_tag, end_tag, average_batch=True)
Bases: torch.nn.modules.module.Module
Loss for greedy decoding: although this layer belongs to the CRF, the loss is computed as
\[\sum_{j=1}^n \log (p(\hat{y}_{j+1}|z_{j+1}, \hat{y}_{j}))\]
instead of
\[\sum_{j=1}^n \log (\phi(\hat{y}_{j-1}, \hat{y}_j, \mathbf{z}_j)) - \log (\sum_{\mathbf{y}' \in \mathbf{Y}(\mathbf{Z})} \prod_{j=1}^n \phi(y'_{j-1}, y'_j, \mathbf{z}_j) )\]
Parameters:
- tagset_size – size of the tag set
- start_tag – index of <start>
- end_tag – index of <pad>
- average_batch – whether to average the loss over the batch
class model.crf.CRFLoss_vb(tagset_size, start_tag, end_tag, average_batch=True)
Bases: torch.nn.modules.module.Module
Loss for Viterbi decoding:
\[\sum_{j=1}^n \log (\phi(\hat{y}_{j-1}, \hat{y}_j, \mathbf{z}_j)) - \log (\sum_{\mathbf{y}' \in \mathbf{Y}(\mathbf{Z})} \prod_{j=1}^n \phi(y'_{j-1}, y'_j, \mathbf{z}_j) )\]
Parameters:
- tagset_size – size of the tag set
- start_tag – index of <start>
- end_tag – index of <pad>
- average_batch – whether to average the loss over the batch
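CRFLoss_vb corresponds to the sentence-level objective above: the score of the gold path against the log partition function over all paths. The sketch below is an illustrative forward-algorithm computation of the batch-averaged negative log-likelihood under the same assumed score layout; it is not the project's implementation, and the assumption that tag index 0 is <start> is hypothetical.

```python
import torch

def crf_nll(scores, tags):
    """Illustrative CRF negative log-likelihood (not the library's code).
    Assumes no padding and that tag index 0 is <start>.

    scores: (seq_len, batch_size, tagset_size, tagset_size), scores[t, b, i, j]
            combining the transition i -> j and the emission of tag j at step t.
    tags:   (seq_len, batch_size) gold tag indices.
    """
    seq_len, batch_size, tagset_size, _ = scores.size()

    # score of the gold path: sum of the chosen transition/emission scores
    prev = torch.zeros(batch_size, dtype=torch.long)          # assumed <start> tag
    gold = scores.new_zeros(batch_size)
    for t in range(seq_len):
        gold = gold + scores[t, torch.arange(batch_size), prev, tags[t]]
        prev = tags[t]

    # log partition function via the forward algorithm
    alpha = scores[0, :, 0, :]                                 # (batch, tag)
    for t in range(1, seq_len):
        alpha = torch.logsumexp(alpha.unsqueeze(2) + scores[t], dim=1)
    log_z = torch.logsumexp(alpha, dim=1)                      # (batch,)

    return (log_z - gold).mean()                               # averaged over the batch
```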
class model.crf.CRFRepack(tagset_size, if_cuda)
Bases: object
Packer for the word-level model.
Parameters:
- tagset_size – size of the tag set
- if_cuda – whether to use the GPU
convert_for_eval(target)
Convert target labels back to the original decoding space.
Parameters: target – input labels used in training
Returns: output labels used in test
repack_gd(feature, target, current)
Packer for the greedy loss.
Parameters:
- feature (Seq_len, Batch_size) – input feature
- target (Seq_len, Batch_size) – output target
- current (Seq_len, Batch_size) – current state
Returns: feature (Seq_len, Batch_size), target (Seq_len * Batch_size), current (Seq_len * Batch_size, 1, 1)
repack_vb(feature, target, mask)
Packer for the Viterbi loss.
Parameters:
- feature (Seq_len, Batch_size) – input feature
- target (Seq_len, Batch_size) – output target
- mask (Seq_len, Batch_size) – padding mask
Returns: feature (Seq_len, Batch_size), target (Seq_len, Batch_size), mask (Seq_len, Batch_size)
class model.crf.CRFRepack_WC(tagset_size, if_cuda)
Bases: object
Packer for the model with both char-level and word-level inputs.
Parameters:
- tagset_size – size of the tag set
- if_cuda – whether to use the GPU
convert_for_eval(target)
Convert target labels back to the original decoding space.
Parameters: target – input labels used in training
Returns: output labels used in test
repack_vb(f_f, f_p, b_f, b_p, w_f, target, mask, len_b)
Packer for the Viterbi loss.
Parameters:
- f_f (Char_Seq_len, Batch_size) – forward char input feature
- f_p (Word_Seq_len, Batch_size) – forward char input position
- b_f (Char_Seq_len, Batch_size) – backward char input feature
- b_p (Word_Seq_len, Batch_size) – backward char input position
- w_f (Word_Seq_len, Batch_size) – input word feature
- target (Seq_len, Batch_size) – output target
- mask (Word_Seq_len, Batch_size) – padding mask
- len_b (Batch_size, 2) – length of instances in one batch
Returns: f_f (Char_Reduced_Seq_len, Batch_size), f_p (Word_Reduced_Seq_len, Batch_size), b_f (Char_Reduced_Seq_len, Batch_size), b_p (Word_Reduced_Seq_len, Batch_size), w_f (Word_Seq_len, Batch_size), target (Reduced_Seq_len, Batch_size), mask (Word_Reduced_Seq_len, Batch_size)
class model.crf.CRF_L(hidden_dim, tagset_size, if_bias=True)
Bases: torch.nn.modules.module.Module
Conditional Random Field (CRF) layer. This version, used in Ma et al. 2016, has more parameters than CRF_S.
Parameters:
- hidden_dim – input dimension size
- tagset_size – size of the tag set
- if_bias – whether to allow a bias in the linear transformation
class model.crf.CRF_S(hidden_dim, tagset_size, if_bias=True)
Bases: torch.nn.modules.module.Module
Conditional Random Field (CRF) layer. This version, used in Lample et al. 2016, has fewer parameters than CRF_L.
Parameters:
- hidden_dim – input dimension size
- tagset_size – size of the tag set
- if_bias – whether to allow a bias in the linear transformation
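Both CRF layers map LSTM hidden states to per-step scores of shape (seq_len, batch_size, tagset_size, tagset_size); in the CRF_S style this is done with a single emission projection plus one shared transition matrix. The following is a minimal sketch of such a layer under that assumption, not the project's exact module.

```python
import torch
import torch.nn as nn

class SmallCRFLayer(nn.Module):
    """Illustrative CRF scoring layer in the spirit of CRF_S (assumed structure)."""

    def __init__(self, hidden_dim, tagset_size, if_bias=True):
        super().__init__()
        self.tagset_size = tagset_size
        # emission scores from the hidden state, one per tag
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size, bias=if_bias)
        # a single learned transition matrix shared across time steps
        self.transitions = nn.Parameter(torch.zeros(tagset_size, tagset_size))

    def forward(self, feats):
        # feats: (seq_len, batch_size, hidden_dim)
        emissions = self.hidden2tag(feats)                  # (seq_len, batch, tag)
        # scores[..., i, j] = transition i -> j plus the emission of tag j
        scores = emissions.unsqueeze(2) + self.transitions  # (seq_len, batch, tag, tag)
        return scores
```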
model.evaluator module
class model.evaluator.eval_batch(packer, l_map)
Bases: object
Base class for evaluation; provides methods to calculate the F1 score and accuracy.
Parameters:
- packer – provides a method to convert targets back to the original space [TODO: needs improvement]
- l_map – dictionary for labels
calc_acc_batch(decoded_data, target_data)
Update accuracy statistics.
Parameters:
- decoded_data (batch_size, seq_len) – predicted sequences
- target_data (batch_size, seq_len) – ground truth
calc_f1_batch(decoded_data, target_data)
Update F1-score statistics.
Parameters:
- decoded_data (batch_size, seq_len) – predicted sequences
- target_data (batch_size, seq_len) – ground truth
class model.evaluator.eval_w(packer, l_map, score_type)
Bases: model.evaluator.eval_batch
Evaluation class for the word-level model (LSTM-CRF).
Parameters:
- packer – provides a method to convert targets back to the original space [TODO: needs improvement]
- l_map – dictionary for labels
- score_type – set to 'f' to use the F1 score
class model.evaluator.eval_wc(packer, l_map, score_type)
Bases: model.evaluator.eval_batch
Evaluation class for LM-LSTM-CRF.
Parameters:
- packer – provides a method to convert targets back to the original space [TODO: needs improvement]
- l_map – dictionary for labels
- score_type – set to 'f' to use the F1 score
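Both evaluator classes report either token accuracy or entity-level F1, depending on score_type. As a rough illustration of the F1 side (not the project's evaluation code, and simpler than the standard conlleval script), the sketch below extracts entity spans from BIO-tagged label sequences and computes micro-averaged precision, recall, and F1.

```python
def bio_spans(tags):
    """Collect (start, end, type) entity spans from a BIO tag sequence (illustrative)."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ['O']):          # sentinel closes the last span
        inside = tag.startswith('I-') and tag[2:] == etype and start is not None
        if not inside and start is not None:              # the current span ends here
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith('B-') or (tag.startswith('I-') and start is None):
            start, etype = i, tag[2:]                     # open a new span (stray I- treated as B-)
    return set(spans)

def entity_f1(pred_seqs, gold_seqs):
    """Micro-averaged entity-level precision/recall/F1 over a corpus (illustrative)."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_seqs, gold_seqs):
        p, g = bio_spans(pred), bio_spans(gold)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```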
model.highway module
class model.highway.hw(size, num_layers=1, dropout_ratio=0.5)
Bases: torch.nn.modules.module.Module
Highway layers.
Parameters:
- size – input and output dimension
- num_layers – number of highway layers
- dropout_ratio – dropout ratio
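A highway layer mixes a nonlinear transform of its input with the input itself through a learned gate: y = t * g(W_h x + b_h) + (1 - t) * x, with t = sigmoid(W_t x + b_t). The sketch below shows one such layer; it is an assumed structure for illustration, not the project's exact module.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Illustrative single highway layer (assumed structure)."""

    def __init__(self, size, dropout_ratio=0.5):
        super().__init__()
        self.transform = nn.Linear(size, size)   # nonlinear transform g(W_h x + b_h)
        self.gate = nn.Linear(size, size)        # carry/transform gate t = sigmoid(W_t x + b_t)
        self.dropout = nn.Dropout(p=dropout_ratio)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        h = torch.relu(self.transform(x))
        return self.dropout(t * h + (1 - t) * x)
```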
model.lm_lstm_crf module
class model.lm_lstm_crf.LM_LSTM_CRF(tagset_size, char_size, char_dim, char_hidden_dim, char_rnn_layers, embedding_dim, word_hidden_dim, word_rnn_layers, vocab_size, dropout_ratio, large_CRF=True, if_highway=False, in_doc_words=2, highway_layers=1)
Bases: torch.nn.modules.module.Module
LM_LSTM_CRF model.
Parameters:
- tagset_size – size of the label set
- char_size – size of the char dictionary
- char_dim – size of the char embedding
- char_hidden_dim – size of the char-level LSTM hidden dimension
- char_rnn_layers – number of char-level LSTM layers
- embedding_dim – size of the word embedding
- word_hidden_dim – size of the word-level BLSTM hidden dimension
- word_rnn_layers – number of word-level LSTM layers
- vocab_size – size of the word dictionary
- dropout_ratio – dropout ratio
- large_CRF – whether to use CRF_L; refer to model.crf.CRF_L and model.crf.CRF_S for details
- if_highway – whether to use highway layers
- in_doc_words – number of words that occur in the corpus (used for the language model prediction)
- highway_layers – number of highway layers
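The constructor arguments above fully specify the model. As an illustration only, the values in the sketch below are arbitrary placeholders rather than the settings used in the paper or the training scripts.

```python
from model.lm_lstm_crf import LM_LSTM_CRF

# illustrative, arbitrary hyperparameters; see the training scripts for real settings
net = LM_LSTM_CRF(
    tagset_size=20,        # size of the label set
    char_size=80,          # size of the char dictionary
    char_dim=30,           # char embedding size
    char_hidden_dim=300,   # char-level LSTM hidden size
    char_rnn_layers=1,
    embedding_dim=100,     # word embedding size
    word_hidden_dim=300,   # word-level BLSTM hidden size
    word_rnn_layers=1,
    vocab_size=40000,      # size of the word dictionary
    dropout_ratio=0.5,
    large_CRF=False,       # use CRF_S instead of CRF_L
    if_highway=True,
    in_doc_words=30000,    # words occurring in the corpus, for the LM objective
    highway_layers=1,
)
```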
forward(forw_sentence, forw_position, back_sentence, back_position, word_seq, hidden=None)
Parameters:
- forw_sentence (char_seq_len, batch_size) – char-level representation of the sentence
- forw_position (word_seq_len, batch_size) – positions of whitespace in the char-level representation of the sentence
- back_sentence (char_seq_len, batch_size) – char-level representation of the sentence (reversed order)
- back_position (word_seq_len, batch_size) – positions of whitespace in the reversed char-level representation of the sentence
- word_seq (word_seq_len, batch_size) – word-level representation of the sentence
- hidden – initial hidden state
Returns: CRF output (word_seq_len, batch_size, tag_size, tag_size), hidden
load_pretrained_word_embedding(pre_word_embeddings)
Load a pre-trained word embedding.
Parameters: pre_word_embeddings (self.word_size, self.word_dim) – pre-trained embedding
rand_init(init_char_embedding=True, init_word_embedding=False)
Random initialization.
Parameters:
- init_char_embedding – whether to randomly initialize the char embedding
- init_word_embedding – whether to randomly initialize the word embedding
word_pre_train_backward(sentence, position, hidden=None)
Output of the backward language model.
Parameters:
- sentence (char_seq_len, batch_size) – char-level representation of the sentence (reversed order)
- position (word_seq_len, batch_size) – positions of whitespace in the reversed char-level representation of the sentence
- hidden – initial hidden state
Returns: language model output (word_seq_len, in_doc_word), hidden
word_pre_train_forward(sentence, position, hidden=None)
Output of the forward language model.
Parameters:
- sentence (char_seq_len, batch_size) – char-level representation of the sentence
- position (word_seq_len, batch_size) – positions of whitespace in the char-level representation of the sentence
- hidden – initial hidden state
Returns: language model output (word_seq_len, in_doc_word), hidden
model.lstm_crf module
class model.lstm_crf.LSTM_CRF(vocab_size, tagset_size, embedding_dim, hidden_dim, rnn_layers, dropout_ratio, large_CRF=True)
Bases: torch.nn.modules.module.Module
LSTM_CRF model.
Parameters:
- vocab_size – size of the word dictionary
- tagset_size – size of the label set
- embedding_dim – size of the word embedding
- hidden_dim – size of the word-level BLSTM hidden dimension
- rnn_layers – number of word-level LSTM layers
- dropout_ratio – dropout ratio
- large_CRF – whether to use CRF_L; refer to model.crf.CRF_L and model.crf.CRF_S for details
forward(sentence, hidden=None)
Parameters:
- sentence (word_seq_len, batch_size) – word-level representation of the sentence
- hidden – initial hidden state
Returns: CRF output (word_seq_len, batch_size, tag_size, tag_size), hidden
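Given the documented signature and return value, a word-level forward pass might look like the sketch below; the sizes and dummy inputs are arbitrary, and unpacking the result follows the documentation above rather than tested behavior.

```python
import torch
from model.lstm_crf import LSTM_CRF

# arbitrary sizes, for illustration only
vocab_size, tagset_size, seq_len, batch_size = 1000, 12, 25, 4
net = LSTM_CRF(vocab_size=vocab_size, tagset_size=tagset_size,
               embedding_dim=100, hidden_dim=200, rnn_layers=1,
               dropout_ratio=0.5, large_CRF=False)

sentence = torch.randint(0, vocab_size, (seq_len, batch_size))   # word indices
crf_scores, hidden = net(sentence)
print(crf_scores.shape)   # expected per the docs: (seq_len, batch_size, tag_size, tag_size)
```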
load_pretrained_embedding(pre_embeddings)
Load a pre-trained word embedding.
Parameters: pre_embeddings (self.word_size, self.word_dim) – pre-trained embedding
rand_init(init_embedding=False)
Random initialization.
Parameters: init_embedding – whether to randomly initialize the embedding
Also randomly initializes the hidden variables.
model.ner_dataset module
class model.ner_dataset.CRFDataset(data_tensor, label_tensor, mask_tensor)
Bases: torch.utils.data.dataset.Dataset
Dataset class for the word-level model.
Parameters:
- data_tensor (ins_num, seq_length) – words
- label_tensor (ins_num, seq_length) – labels
- mask_tensor (ins_num, seq_length) – padding masks
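CRFDataset wraps pre-padded word, label, and mask tensors so they can be iterated with a standard DataLoader. An illustrative construction with dummy tensors follows; the shapes match the documentation above, and the assumption that each item yields a (word, label, mask) triple is untested.

```python
import torch
from torch.utils.data import DataLoader
from model.ner_dataset import CRFDataset

# dummy data: 100 instances of length 30 (arbitrary numbers, for illustration)
ins_num, seq_length, vocab_size, tagset_size = 100, 30, 1000, 12
words = torch.randint(0, vocab_size, (ins_num, seq_length))
labels = torch.randint(0, tagset_size, (ins_num, seq_length))
masks = torch.ones(ins_num, seq_length, dtype=torch.uint8)

dataset = CRFDataset(words, labels, masks)
loader = DataLoader(dataset, batch_size=16, shuffle=True)
for word_batch, label_batch, mask_batch in loader:
    pass  # feed each batch through the packer and the model
```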
class model.ner_dataset.CRFDataset_WC(forw_tensor, forw_index, back_tensor, back_index, word_tensor, label_tensor, mask_tensor, len_tensor)
Bases: torch.utils.data.dataset.Dataset
Dataset class for the char-aware model.
Parameters:
- forw_tensor (ins_num, seq_length) – forward chars
- forw_index (ins_num, seq_length) – indices of forward chars
- back_tensor (ins_num, seq_length) – backward chars
- back_index (ins_num, seq_length) – indices of backward chars
- word_tensor (ins_num, seq_length) – words
- label_tensor (ins_num, seq_length) – labels
- mask_tensor (ins_num, seq_length) – padding masks
- len_tensor (ins_num, 2) – length of chars (dim 0) and words (dim 1)
model.utils module
model.utils.concatChar(input_lines, char_dict)
Concatenate chars into strings.
Parameters:
- input_lines (list of list of char) – input corpus
- char_dict (dictionary) – char-level dictionary
Returns: forw_lines
model.utils.construct_bucket_gd(input_features, input_labels, thresholds, pad_feature, pad_label)
Construct buckets by thresholds for greedy decoding, word-level only.
model.utils.construct_bucket_mean_gd(input_features, input_label, word_dict, label_dict)
Construct buckets by mean for greedy decoding, word-level only.
model.utils.construct_bucket_mean_vb(input_features, input_label, word_dict, label_dict, caseless)
Construct buckets by mean for Viterbi decoding, word-level only.
model.utils.construct_bucket_mean_vb_wc(word_features, input_label, label_dict, char_dict, word_dict, caseless)
Construct buckets by mean for Viterbi decoding, word-level and char-level.
model.utils.construct_bucket_vb(input_features, input_labels, thresholds, pad_feature, pad_label, label_size)
Construct buckets by thresholds for Viterbi decoding, word-level only.
model.utils.construct_bucket_vb_wc(word_features, forw_features, fea_len, input_labels, thresholds, pad_word_feature, pad_char_feature, pad_label, label_size)
Construct buckets by thresholds for Viterbi decoding, word-level and char-level.
model.utils.encode(input_lines, word_dict)
Encode a list of strings into a word-level representation.
model.utils.encode2Tensor(input_lines, word_dict, unk)
Encode a list of strings into a word-level representation (tensor) with unk.
model.utils.encode2char_safe(input_lines, char_dict)
Get the char representation of lines.
Parameters:
- input_lines (list of strings) – input corpus
- char_dict (dictionary) – char-level dictionary
Returns: forw_lines
model.utils.encode_corpus(lines, f_map, l_map, if_lower=False)
Encode a corpus into features and labels.
model.utils.encode_corpus_c(lines, f_map, l_map, c_map)
Encode a corpus into features (both word-level and char-level) and labels.
model.utils.encode_safe(input_lines, word_dict, unk)
Encode a list of strings into a word-level representation with unk.
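The encode* helpers map tokens to integer indices; the *_safe variants fall back to the unk index for out-of-dictionary words. A minimal sketch of the same idea (a hypothetical helper, not the project's implementation):

```python
def encode_with_unk(input_lines, word_dict, unk):
    """Map each word to its index, using the unk index for unknown words (illustrative)."""
    return [[word_dict.get(word, unk) for word in line] for line in input_lines]

# example
word_dict = {'the': 0, 'cat': 1, 'sat': 2}
print(encode_with_unk([['the', 'dog', 'sat']], word_dict, unk=3))  # [[0, 3, 2]]
```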
model.utils.find_length_from_feats(feats, feat_to_ix)
Find the length of the unpadded sequence based on the features.
model.utils.find_length_from_labels(labels, label_to_ix)
Find the length of the unpadded sequence based on the labels.
model.utils.generate_corpus(lines, if_shrink_feature=False, thresholds=1)
Generate features, labels, the word dictionary, and the label dictionary.
Parameters:
- lines – corpus
- if_shrink_feature – whether to shrink the word dictionary
- thresholds – threshold for shrinking the word dictionary
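generate_corpus reads CoNLL-style annotated lines and builds the feature and label dictionaries, optionally dropping rare words. The sketch below illustrates that bookkeeping under the assumption of "word ... label" lines with blank lines between sentences; the names and threshold semantics are hypothetical, not the project's code.

```python
from collections import Counter

def build_dictionaries(lines, shrink=False, threshold=1):
    """Illustrative word/label dictionary construction from 'word ... label' lines."""
    word_counts = Counter()
    label_dict = {}
    features, labels = [], []
    cur_words, cur_labels = [], []
    for line in lines:
        line = line.strip()
        if not line:                          # a blank line ends a sentence
            if cur_words:
                features.append(cur_words)
                labels.append(cur_labels)
                cur_words, cur_labels = [], []
            continue
        tokens = line.split()
        word, label = tokens[0], tokens[-1]
        word_counts[word] += 1
        label_dict.setdefault(label, len(label_dict))
        cur_words.append(word)
        cur_labels.append(label)
    if cur_words:                             # flush the last sentence
        features.append(cur_words)
        labels.append(cur_labels)
    # optionally drop words seen fewer than `threshold` times (illustrative semantics)
    kept = [w for w, c in word_counts.items() if not shrink or c >= threshold]
    word_dict = {w: i for i, w in enumerate(kept)}
    return features, labels, word_dict, label_dict
```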
model.utils.generate_corpus_char(lines, if_shrink_c_feature=False, c_thresholds=1, if_shrink_w_feature=False, w_thresholds=1)
Generate features, labels, the word dictionary, the char dictionary, and the label dictionary.
Parameters:
- lines – corpus
- if_shrink_c_feature – whether to shrink the char dictionary
- c_thresholds – threshold for shrinking the char dictionary
- if_shrink_w_feature – whether to shrink the word dictionary
- w_thresholds – threshold for shrinking the word dictionary
model.utils.load_embedding(emb_file, delimiter, feature_map, caseless, unk, shrink_to_train=False)
Load an embedding.
model.utils.load_embedding_wlm(emb_file, delimiter, feature_map, full_feature_set, caseless, unk, emb_len, shrink_to_train=False, shrink_to_corpus=False)
Load an embedding; in-document words are listed before out-of-document words.
Parameters:
- emb_file – path to the embedding file
- delimiter – delimiter of lines
- feature_map – word dictionary
- full_feature_set – all words in the corpus
- caseless – whether to convert to caseless form
- unk – string for the unknown token
- emb_len – dimension of the embedding vectors
- shrink_to_train – whether to drop words outside the training set
- shrink_to_corpus – whether to drop words outside the corpus
model.utils.log_sum_exp(vec, m_size)
Calculate the log of the sum of exponentials.
Parameters:
- vec (batch_size, vanishing_dim, hidden_dim) – input tensor
- m_size – hidden_dim
Returns: tensor of size (batch_size, hidden_dim)
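The usual way to make this reduction numerically stable is to shift by the per-slice maximum before exponentiating. The sketch below is an equivalent stand-alone re-implementation (hypothetical name, not necessarily the project's exact code):

```python
import torch

def stable_logsumexp(vec):
    """Stable log-sum-exp over dim 1 of a (batch_size, vanishing_dim, hidden_dim)
    tensor, returning (batch_size, hidden_dim). Illustrative re-implementation."""
    max_score, _ = vec.max(dim=1, keepdim=True)            # (batch, 1, hidden)
    return max_score.squeeze(1) + torch.log(torch.exp(vec - max_score).sum(dim=1))

# equivalently: torch.logsumexp(vec, dim=1)
```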
model.utils.read_features(lines, multi_docs=True)
Convert an unannotated corpus into features.
model.utils.shrink_embedding(feature_map, word_dict, word_embedding, caseless)
Shrink the embedding dictionary to in-document words only.
model.utils.shrink_features(feature_map, features, thresholds)
Filter uncommon features by threshold.