Chapter 1. Pre-training (1)

Advances in neural sequence models such as the Transformer [1], together with improvements in large-scale self-supervised learning, have opened the door to general-purpose language understanding and generation. This progress has been driven largely by pre-training: the components shared across many neural network-based systems are separated out and then trained on huge amounts of unlabeled data using self-supervision.

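A model trained this way serves as a foundation model that can be easily adapted to a wide range of tasks through fine-tuning or prompting. As a result, the overall paradigm of natural language processing (NLP) has changed substantially. In many cases, large-scale supervised learning for a specific task is no longer required; instead, adapting a pre-trained foundation model is enough to achieve strong performance.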

Although pre-training has attracted particular attention in recent NLP research, its origins go back to the early days of deep learning. Examples include early attempts at unsupervised pre-training for RNNs, deep feedforward networks, autoencoders, and similar models [2].

In the modern deep learning era, pre-training regained attention, prompted in part by large-scale unsupervised learning of various word embedding models [3][4].

Around the same time, pre-training also attracted interest in computer vision. For example, a backbone model is trained on a relatively large labeled dataset such as ImageNet and then applied to a variety of downstream tasks [5][6].

Large-scale pre-training research in NLP began in earnest with the development of language models based on self-supervised learning. This family includes well-known examples such as BERT [7] and GPT [8], which share the following idea:

“Training a model to predict masked words in huge amounts of text yields general language understanding and generation abilities.”

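Although this approach may look simple at first glance, the resulting models turn out to be remarkably capable of modeling linguistic structure, even though they are never explicitly trained to do so. Thanks to the generality of the pre-training task, these models can outperform earlier, carefully engineered supervised systems on a variety of NLP problems.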
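
The masked-word objective quoted above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not BERT's actual preprocessing: the function name `make_masked_example`, the `[MASK]` placeholder string, and the masking rate used here are hypothetical choices for the example, and the sketch works on whitespace-split words rather than subword tokens. It hides a random subset of tokens and keeps the originals as prediction targets, which is why no human labels are needed.

```python
import random

MASK = "[MASK]"  # placeholder string; an assumption for this sketch


def make_masked_example(tokens, mask_prob=0.15, seed=None):
    """Hide a random subset of tokens and return (masked_tokens, targets).

    targets maps each masked position to the original token, so the
    training signal comes from the text itself (self-supervision).
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token so that short inputs still yield a target.
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))

    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model should predict
        masked[pos] = MASK          # what the model actually sees
    return masked, targets


tokens = "the early bird catches the worm".split()
masked, targets = make_masked_example(tokens, mask_prob=0.3, seed=0)
print(masked)
print(targets)
```

A model would then be trained to recover each entry of `targets` from the `masked` sequence. Real pipelines typically add further details that this sketch omits, such as subword tokenization and occasionally leaving a selected token unchanged or replacing it with a random one.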

More recently, pre-trained large language models (pre-trained LLMs) have achieved even greater success, showing the potential for progress toward more general artificial intelligence [9].

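This chapter covers the concept of pre-training in the context of NLP.

It first gives a brief overview of pre-training methods and how they are applied, and then uses BERT as an example to explain how a sequence model is trained through a self-supervised task called masked language modeling.

This is followed by a discussion of how pre-trained sequence models are adapted to a variety of NLP tasks.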

※ Note ※
This chapter focuses on the pre-training paradigm in NLP.
It therefore does not go into the details of generative large language models.
Those models are discussed in detail in later chapters.

📎 References

Authors: Tong Xiao, Jingbo Zhu
Publication date: January 16, 2025
License: CC BY-NC 4.0 (non-commercial use permitted)
Original: https://arxiv.org/abs/2501.09223

This post is based on “Foundations of Large Language Models” (Tong Xiao, Jingbo Zhu, arXiv:2501.09223, submitted January 16, 2025).

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems, volume 30, 2017.
[2] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
[3] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, pages 3111–3119, 2013.
[4] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[5] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking imagenet pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4918–4927, 2019.
[6] Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin Dogus Cubuk, and Quoc Le. Rethinking pre-training and self-training. Advances in neural information processing systems, 33:3833–3845, 2020.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pretraining of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[9] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Túlio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
