AI 멀티모달 임베딩(Multimodal Embedding) 완벽 가이드

업데이트: March 27, 2026

멀티모달 임베딩이란?

멀티모달 임베딩(Multimodal Embedding)은 텍스트, 이미지, 오디오, 비디오 등 여러 종류의 데이터(모달리티)를 하나의 공통된 벡터 공간(벡터 표현)으로 변환하는 딥러닝 기술입니다.

이를 통해 서로 다른 모달리티 간의 의미론적 관계를 수치적으로 표현할 수 있으며, 이미지 텍스트 매칭, 크로스 모달 검색, 시각 질문 응답 등 다양한 작업에 활용됩니다.

멀티모달 학습의 필요성

현실의 데이터는 다양한 형태로 존재합니다:

이미지 캡셔닝: 이미지만으로는 표현하기 어려운 정보를 텍스트로 설명
교차 검색: “강아지”라는 단어로 강아지 이미지를 검색
의미 통일: 같은 의미를 텍스트와 이미지로 다르게 표현해도 동일하게 인식

멀티모달 임베딩은 이러한 이질적인 데이터 간의 의미를 통일된 벡터 공간에서 표현함으로써, 더욱 강력한 AI 모델을 만들 수 있게 합니다.

임베딩(Embedding)의 기초

임베딩이란?

임베딩은 높은 차원의 데이터(예: 이미지, 텍스트)를 더 낮은 차원의 벡터로 변환하는 과정입니다.

\[\text{Embedding: } \mathbb{R}^{\text{high}} \rightarrow \mathbb{R}^{\text{low}}\]

예를 들어:

텍스트 임베딩: 문장 → 512차원 벡터
이미지 임베딩: 이미지(1024×1024×3) → 512차원 벡터
오디오 임베딩: 음성 신호 → 256차원 벡터

임베딩의 특징

의미 보존: 의미적으로 유사한 데이터는 벡터 공간에서도 가까움
차원 축소: 계산 효율성 증대
연속성: 범주형 데이터를 연속 벡터로 표현하여 수학적 연산 가능

코사인 유사도(Cosine Similarity)

임베딩 벡터 간의 유사도는 보통 코사인 유사도로 측정됩니다:

\[\text{similarity}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}|| \cdot ||\mathbf{v}||}\]

값의 범위는 -1 ~ 1이며, 1에 가까울수록 두 벡터가 유사합니다.

멀티모달 정렬(Alignment)

Alignment란?

정렬(Alignment)은 서로 다른 모달리티의 임베딩 벡터들이 같은 의미를 관련 공간에 배치하도록 정렬하는 과정입니다.

예를 들어:

“귀여운 강아지” (텍스트)
강아지 사진 (이미지)

이 두 데이터는 서로 다른 형태이지만, 벡터 공간에서 가까이 배치되어야 합니다.

Alignment 방법

1. 대조 학습(Contrastive Learning)

가장 널리 사용되는 방법으로, 같은 의미의 쌍(positive pair)끼리는 가깝게, 다른 쌍(negative pair)끼리는 멀게 배치합니다.

\[\mathcal{L} = -\log \frac{\exp(\text{sim}(\mathbf{u}_i, \mathbf{v}_i) / \tau)}{\sum_j \exp(\text{sim}(\mathbf{u}_i, \mathbf{v}_j) / \tau)}\]

2. 삼중항 손실(Triplet Loss)

앵커(anchor), 긍정(positive), 부정(negative) 3개의 샘플을 사용:

\[\mathcal{L} = \max(0, d(\mathbf{a}, \mathbf{p}) - d(\mathbf{a}, \mathbf{n}) + \text{margin})\]

이미지를 기반으로 텍스트를 검색하거나, 그 반대로도 효율적으로 학습합니다.

주요 멀티모달 모델

1. CLIP (Contrastive Language-Image Pre-training)

개발: OpenAI (2021)
특징: 대조 학습을 통해 이미지-텍스트 정렬

CLIP의 동작 원리

이미지 인코더 (Vision Transformer): 이미지 → 이미지 임베딩
텍스트 인코더 (Transformer): 텍스트 → 텍스트 임베딩
정렬: 시그모이드 된 코사인 유사도를 이용해 대조 손실 최소화

CLIP 사용 예제

import torch
import clip
from PIL import Image

# 디바이스 선택
device = "cuda" if torch.cuda.is_available() else "cpu"

# 모델 로드
model, preprocess = clip.load("ViT-B/32", device=device)

# 이미지 전처리
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)

# 텍스트 토큰화
text_labels = clip.tokenize(["a dog", "a cat", "a bird"]).to(device)

# 특징 추출
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_labels)
    
    # 정규화
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # 유사도 계산
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    
print(similarity)  # 각 레이블과의 유사도

2. BLIP (Bootstrapping Language-Image Pre-training)

개발: Salesforce Research (2022)
특징:

Vision Transformer와 사전학습된 언어 모델을 결합
이미지 캡셔닝과 이미지-텍스트 매칭을 동시에 학습

BLIP의 장점

더 강한 Cross-Modal 이해: 양방향 정렬 강화
이미지 캡셔닝: 이미지로부터 자동으로 캡션 생성
Visual Question Answering (VQA): 이미지에 대한 질문에 답변

3. ALIGN (Aligned Latent Representations)

개발: Google (2021)
특징: 매우 큰 규모의 약하게 감독되는 이미지-텍스트 쌍 데이터로 학습

웹 규모의 데이터 활용으로 뛰어난 성능
듀얼 인코더 아키텍처

4. ViLBERT (Vision and Language BERT)

개발: Facebook AI Research (2019)
특징:

두 개의 독립적인 BERT 모델 사용 (비전 모달, 언어 모달)
Co-attention layers로 모달리티 간 상호작용

5. LLaVA (Large Language and Vision Assistant)

개발: Microsoft Research (2023)
특징:

Vision Transformer + 대규모 언어 모델(Vicuna) 결합
시각적 질문 응답에 특화
Fine-tuning으로 새로운 작업에 빠르게 적응

임베딩 모델 포맷: ONNX & 최적화

ONNX (Open Neural Network Exchange)

ONNX는 다양한 딥러닝 프레임워크 간의 모델 호환성을 제공하는 개방형 표준입니다.

ONNX의 이점

프레임워크 독립성: PyTorch → TensorFlow 변환 가능
배포 용이성: 브라우저, 모바일, 엣지 디바이스에서 실행
성능 최적화: ONNX Runtime을 통한 빠른 추론

ONNX로 CLIP 모델 내보내기

import torch
import clip
import onnx

# 모델 로드
model, preprocess = clip.load("ViT-B/32", device="cpu")
model.eval()

# 더미 입력
dummy_image = torch.randn(1, 3, 224, 224)
dummy_text = torch.randint(0, 49408, (1, 77))

# ONNX로 내보내기
torch.onnx.export(
    model,
    (dummy_image, dummy_text),
    "clip_model.onnx",
    input_names=["image", "text"],
    output_names=["logits"],
    dynamic_axes={
        "image": {0: "batch_size"},
        "text": {0: "batch_size"}
    },
    opset_version=14
)

print("ONNX 모델 저장 완료")

TensorRT & TITAN

TensorRT (NVIDIA)

GPU 최적화 추론 엔진
ONNX 모델을 TensorRT 포맷으로 변환하여 10배 이상의 성능 향상
양자화(Quantization)를 통해 메모리 사용량 감소

TITAN (텐서 분해 기반 최적 모델)

대규모 멀티모달 모델의 매개변수 효율적 학습
낮은 계산량으로도 높은 성능 유지

활용 사례와 실제 코드

사례 1: 이미지 검색 시스템

import torch
import clip
from PIL import Image
import numpy as np
from pathlib import Path

# 모델 로드
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# 이미지 임베딩 생성
image_dir = "./images"
image_features_list = []
image_names = []

for img_path in Path(image_dir).glob("*.jpg"):
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    
    with torch.no_grad():
        features = model.encode_image(image)
        features /= features.norm(dim=-1, keepdim=True)
        image_features_list.append(features.cpu().numpy())
        image_names.append(img_path.name)

image_features = np.vstack(image_features_list)

# 텍스트 쿼리로 검색
query = "풍경과 산이 있는 사진"
text_input = clip.tokenize([query]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_input)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# 유사도 계산
similarities = (100.0 * text_features.cpu().numpy() @ image_features.T)
top_indices = np.argsort(-similarities[0])[:5]

print("검색 결과:")
for idx in top_indices:
    print(f"  {image_names[idx]} (유사도: {similarities[0][idx]:.2f})")

사례 2: Zero-Shot 분류

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# 이미지 로드
image = preprocess(Image.open("animal.jpg")).unsqueeze(0).to(device)

# 분류 레이블 정의
labels = ["개", "고양이", "새", "말", "양"]
label_prompts = [f"이것은 {label}입니다" for label in labels]

# 텍스트 특징 추출
text_input = clip.tokenize(label_prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_input)
    
    # 정규화
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # 확률 계산
    logits = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# 결과 출력
for i, label in enumerate(labels):
    print(f"{label}: {logits[0][i]*100:.2f}%")

사례 3: 이미지-텍스트 매칭

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# 이미지-캡션 쌍 데이터
image_caption_pairs = [
    ("sunset.jpg", "아름다운 일몰"),
    ("beach.jpg", "모래사장과 바다"),
    ("forest.jpg", "우거진 숲"),
]

# 모든 이미지와 캡션의 임베딩 계산
image_features_list = []
text_features_list = []

for img_path, caption in image_caption_pairs:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    text_input = clip.tokenize([caption]).to(device)
    
    with torch.no_grad():
        img_features = model.encode_image(image)
        txt_features = model.encode_text(text_input)
        
        img_features /= img_features.norm(dim=-1, keepdim=True)
        txt_features /= txt_features.norm(dim=-1, keepdim=True)
        
        image_features_list.append(img_features)
        text_features_list.append(txt_features)

# 유사도 행렬 계산
image_features = torch.vstack(image_features_list)
text_features = torch.vstack(text_features_list)

similarity_matrix = 100.0 * image_features @ text_features.T

print("이미지-캡션 유사도 행렬:")
print(similarity_matrix)

# 매칭된 쌍 확인 (대각선이 높을수록 좋음)
for i in range(len(image_caption_pairs)):
    print(f"쌍 {i}: {similarity_matrix[i][i]:.2f}")

성능 최적화 팁

1. 배치 처리

# 여러 이미지를 한번에 처리
batch_images = [preprocess(Image.open(f"img_{i}.jpg")) for i in range(32)]
batch_images = torch.stack(batch_images).to(device)

with torch.no_grad():
    batch_features = model.encode_image(batch_images)

2. 양자화(Quantization)

# FP32 → INT8로 양자화
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

3. 캐싱

# 텍스트 임베딩 캐싱
text_embedding_cache = {}

def get_text_embedding(text):
    if text not in text_embedding_cache:
        with torch.no_grad():
            tokens = clip.tokenize([text]).to(device)
            embedding = model.encode_text(tokens)
            text_embedding_cache[text] = embedding
    return text_embedding_cache[text]

한계 및 향후 방향

현재의 한계

데이터 의존성: 대규모 고품질 멀티모달 데이터셋 필요
계산 비용: 훈련 및 추론 시 많은 계산 자원 필요
편향 문제: 훈련 데이터의 편향이 모델에 반영될 수 있음
3D 및 시계열 데이터: 이미지-텍스트에 중점, 다른 모달리티 부족

향후 발전 방향

더 많은 모달리티 통합: 비디오, 오디오, 시계열 데이터
효율성 개선: 더 가볍고 빠른 모델 개발
설명 가능성: 의사결정 과정의 투명성 증대
도메인 특화 모델: 의료, 자율주행, 제조 등 특정 분야 최적화

참조

링크	설명
CLIP (OpenAI)	CLIP 모델의 공식 GitHub 저장소
CLIP 논문	“Learning Transferable Models for Composed Image Retrieval” 논문
BLIP	BLIP 모델 공식 GitHub 저장소
Wikipedia - Multimodal Learning	멀티모달 학습에 대한 종합 설명
Vision Language Models	비전 언어 모델 설명
Hugging Face CLIP	Hugging Face에서 제공하는 CLIP 모델
OpenCLIP	독립적으로 훈련된 더 큰 CLIP 모델들
PyTorch ONNX Export	PyTorch에서 ONNX로 모델 변환 가이드
ONNX Runtime	ONNX 모델 실행 엔진
TensorRT Documentation	NVIDIA TensorRT 최적화 가이드
Large Multimodal Models (Unite.ai)	최신 멀티모달 LLM 동향 분석

Kim