BERT's Masked Language Model

Zorba blog 2022. 6. 8. 10:29

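The Masked Language Model in Google's BERT

1. Masked language model and tokenizer

- With BERT we use a model that someone else has already pretrained, so the model and the tokenizer always come as a matched pair: the tokenizer must be the one the model was trained with.

- The code below loads the masked language model and its tokenizer.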

from transformers import TFBertForMaskedLM
from transformers import AutoTokenizer

# Load the model and the tokenizer from the same checkpoint so they stay matched.
model = TFBertForMaskedLM.from_pretrained('bert-large-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased')
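
To make the "matched pair" point concrete, here is a minimal toy sketch of what goes wrong when the tokenizer and the model disagree. Everything in it is hypothetical: the two vocabularies, the `encode` helper, and the string "embedding rows" are invented for illustration and are not taken from any real BERT checkpoint.

```python
# Toy illustration only: these vocabularies and the embedding table are
# hypothetical and do not come from a real BERT checkpoint.
vocab_a = {"[PAD]": 0, "soccer": 1, "is": 2, "fun": 3}  # vocabulary the "model" was trained with
vocab_b = {"[PAD]": 0, "fun": 1, "soccer": 2, "is": 3}  # vocabulary of some other tokenizer

# The model's embedding table is indexed by vocab_a IDs: row i belongs to token i.
embedding_rows = {i: f"vector for '{tok}'" for tok, i in vocab_a.items()}

def encode(tokens, vocab):
    """Map each token to its integer ID in the given vocabulary."""
    return [vocab[t] for t in tokens]

tokens = ["soccer", "is", "fun"]
matched_ids = encode(tokens, vocab_a)
mismatched_ids = encode(tokens, vocab_b)

# Looking up the rows shows that mismatched IDs select the wrong vectors.
matched_lookup = [embedding_rows[i] for i in matched_ids]
mismatched_lookup = [embedding_rows[i] for i in mismatched_ids]

print(matched_ids, matched_lookup)
print(mismatched_ids, mismatched_lookup)
```

With the mismatched vocabulary, the same words are turned into IDs that point at other words' rows, which is why the model and tokenizer are loaded from the same checkpoint name.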

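2. BERT's input

- If we feed the sentence "Soccer is a really fun [MASK]." into the masked language model, the model predicts the word that belongs at the [MASK] position.

- Use the tokenizer to integer-encode the sentence. The encoding result can be checked through input_ids.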

inputs = tokenizer('Soccer is a really fun [MASK].', return_tensors='tf')

print(inputs['input_ids'])
tf.Tensor([[ 101 4715 2003 1037 2428 4569  103 1012  102]], shape=(1, 9), dtype=int32)
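
The IDs printed above can be read position by position. The sketch below is plain Python and does not load the tokenizer; the token strings are written out by hand from the input sentence, so treat them as an assumption (calling `tokenizer.convert_ids_to_tokens` on the IDs would give the authoritative strings). The special-token IDs 101, 102, and 103 are the [CLS], [SEP], and [MASK] entries of the bert-*-uncased vocabularies.

```python
# IDs copied from the tokenizer output shown above.
input_ids = [101, 4715, 2003, 1037, 2428, 4569, 103, 1012, 102]

# Special-token IDs in the bert-*-uncased vocabularies.
CLS_ID, SEP_ID, MASK_ID = 101, 102, 103

# Token strings written by hand from the input sentence (an assumption based
# on position, not produced by the tokenizer in this sketch).
tokens = ["[CLS]", "soccer", "is", "a", "really", "fun", "[MASK]", ".", "[SEP]"]

# Position of the token the model is asked to predict.
mask_position = input_ids.index(MASK_ID)

for position, (token_id, token) in enumerate(zip(input_ids, tokens)):
    marker = "  <- to be predicted" if position == mask_position else ""
    print(f"{position}: {token_id:>5} {token}{marker}")
```

The tokenizer adds [CLS] at the start and [SEP] at the end, which is why a seven-token sentence becomes nine IDs.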

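- The tokenizer output also contains token_type_ids, the segment encoding that tells one sentence from another.

- There is only one sentence here, so we get a sequence of 0s as long as the tokenized input (9 tokens, counting [CLS] and [SEP]).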

print(inputs['token_type_ids'])
tf.Tensor([[0 0 0 0 0 0 0 0 0]], shape=(1, 9), dtype=int32)
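
For contrast, here is a minimal sketch of how segment IDs look when two sentences are packed into one input. The `segment_ids` helper and the second sentence are hypothetical and written only for this illustration; the helper follows the usual BERT layout, in which [CLS], the first sentence, and its [SEP] get 0, and the second sentence and its [SEP] get 1.

```python
def segment_ids(first, second=None):
    """Build BERT-style token_type_ids for one or two already-tokenized sentences.

    Layout: [CLS] first [SEP] -> 0s, then second [SEP] -> 1s.
    """
    ids = [0] * (len(first) + 2)        # [CLS] + first sentence + [SEP]
    if second is not None:
        ids += [1] * (len(second) + 1)  # second sentence + [SEP]
    return ids

# Single sentence, as in this post: all zeros.
single = segment_ids(["soccer", "is", "a", "really", "fun", "[MASK]", "."])

# Hypothetical sentence pair: zeros for the first segment, ones for the second.
pair = segment_ids(["soccer", "is", "fun"], ["i", "play", "it"])

print(single)
print(pair)
```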

print(inputs['attention_mask'])
tf.Tensor([[1 1 1 1 1 1 1 1 1]], shape=(1, 9), dtype=int32)

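- attention_mask holds the attention mask, which separates real tokens from padding tokens.

- The current input has no padding, so we get a sequence of 1s as long as the input.

- If the input were padded at the end, the mask would be 0 from the position where the padding starts.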
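
The padding case described above can be sketched in plain Python. The `pad_batch` helper and the shorter second input are hypothetical, written only for this illustration; a real tokenizer call does this padding itself when asked to. The pad ID 0 is the [PAD] entry of the bert-*-uncased vocabularies.

```python
PAD_ID = 0  # [PAD] in the bert-*-uncased vocabularies

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every ID sequence to the longest length and build the matching mask."""
    max_len = max(len(seq) for seq in batch)
    padded, masks = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, masks

batch = [
    [101, 4715, 2003, 1037, 2428, 4569, 103, 1012, 102],  # the sentence from this post
    [101, 4715, 2003, 4569, 102],                          # shorter, hypothetical second input
]
input_ids, attention_mask = pad_batch(batch)

for ids, mask in zip(input_ids, attention_mask):
    print(ids)
    print(mask)
```

The longest sequence keeps an all-ones mask, while the shorter one switches to 0 exactly where its padding begins.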

3. Predicting the [MASK] token

- Given a model and a tokenizer, FillMaskPipeline presents the masked language model's predictions in a tidy, easy-to-read form.

- Pass the model and tokenizer loaded earlier to FillMaskPipeline.

from transformers import FillMaskPipeline

# Reuse the model and tokenizer loaded in section 1.
pip = FillMaskPipeline(model=model, tokenizer=tokenizer)

- Use pip to print the top five candidate words for the [MASK] position.

pip('Soccer is a really fun [MASK].')

[{'score': 0.762112021446228,
  'sequence': 'soccer is a really fun sport.',
  'token': 4368,
  'token_str': 'sport'},
 {'score': 0.2034197747707367,
  'sequence': 'soccer is a really fun game.',
  'token': 2208,
  'token_str': 'game'},
 {'score': 0.012208552099764347,
  'sequence': 'soccer is a really fun thing.',
  'token': 2518,
  'token_str': 'thing'},
 {'score': 0.0018630230333656073,
  'sequence': 'soccer is a really fun activity.',
  'token': 4023,
  'token_str': 'activity'},
 {'score': 0.001335485139861703,
  'sequence': 'soccer is a really fun field.',
  'token': 2492,
  'token_str': 'field'}]
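
The scores above are probabilities: roughly speaking, the pipeline takes the model's logits at the [MASK] position, applies a softmax over the vocabulary, and keeps the highest-scoring entries. The sketch below reproduces that idea in plain Python. It is a simplified illustration under stated assumptions, not the pipeline's actual code: `top_k_predictions` is a hypothetical helper, the six-word vocabulary is invented, and the logits are made-up numbers rather than model output.

```python
import math

def top_k_predictions(logits, id_to_token, k=5):
    """Softmax over the vocabulary logits at the [MASK] position, then keep the k best."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k largest probabilities, best first.
    best = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"score": probs[i], "token": i, "token_str": id_to_token[i]} for i in best]

# Hypothetical six-word vocabulary and made-up logits (not real model output).
id_to_token = ["sport", "game", "thing", "activity", "field", "banana"]
logits = [6.0, 4.7, 1.9, 0.0, -0.3, -5.0]

predictions = top_k_predictions(logits, id_to_token, k=3)
for p in predictions:
    print(p)
```

Because the softmax normalizes over the whole vocabulary, the scores of the listed candidates sum to less than 1; the remainder belongs to words that did not make the top k.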

The Masked Language Model in Korean BERT