LLM setup: vLLM + Qwen2.5-3B-Instruct
2025. 8. 20. 22:58ㆍk8s
YAML file
apiVersion: v1
kind: PersistentVolume
metadata:
  name: models-pv-3080
spec:
  capacity: { storage: 200Gi }
  volumeMode: Filesystem
  accessModes: [ "ReadWriteOnce" ]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ssd-local
  local:
    path: /mnt/ssd/models
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["3080"]
# (Reuses the same PV/PVC) + adding a separate PVC for the HF cache is recommended
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models-pvc
  namespace: llm
spec:
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 100Gi
  storageClassName: ssd-local
  volumeName: models-pv-3080 # shares the same local disk (create a separate PV if you don't want that)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen25
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-qwen25 }
  template:
    metadata:
      labels: { app: vllm-qwen25 }
    spec:
      nodeSelector:
        kubernetes.io/hostname: "3080"
      runtimeClassName: nvidia
      # ▼ initContainer that creates the PVC subpaths (optional but recommended)
      initContainers:
        - name: init-pvc-subpaths
          image: busybox:1.36
          command: ["sh","-c","mkdir -p /mnt/models /mnt/hf"]
          volumeMounts:
            - name: models
              mountPath: /mnt
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.5.4 # ★ pinned tag recommended instead of latest
          args:
            # ❶ Stable on 12 GB VRAM: 3B (unquantized) or 7B-AWQ (quantized)
            # - "--model=Qwen/Qwen2.5-3B-Instruct"
            - "--model=Qwen/Qwen2.5-7B-Instruct-AWQ"
            - "--quantization=awq" # when using the quantized 7B model
            - "--device=cuda"
            - "--dtype=auto"
            - "--max-model-len=4096" # conservative context length to reduce OOM risk
            - "--gpu-memory-utilization=0.85"
            - "--trust-remote-code"
            - "--download-dir=/cache/hf" # HF cache directory
            - "--port=8000"
          env:
            - name: HF_HOME
              value: /cache/hf
            - name: HF_HUB_ENABLE_HF_TRANSFER
              value: "1"
            - name: VLLM_LOGGING_LEVEL
              value: "DEBUG"
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: "compute,utility"
          ports:
            - name: http
              containerPort: 8000
          resources:
            requests:
              cpu: "2"
              memory: 6Gi
              nvidia.com/gpu: "1"
            limits:
              cpu: "6"
              memory: 10Gi
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: models
              mountPath: /models
              subPath: models
            - name: models
              mountPath: /cache/hf
              subPath: hf
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: models-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-openai
  namespace: llm
spec:
  selector: { app: vllm-qwen25 }
  ports:
    - port: 8000
      targetPort: 8000
      name: http
  type: ClusterIP
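The manifest's comment about 12 GB VRAM (unquantized 3B or AWQ-quantized 7B) together with `--gpu-memory-utilization=0.85` can be sanity-checked with rough arithmetic. The sketch below is only an estimate under stated assumptions: nominal parameter counts of 3e9 and 7e9, 2 bytes per weight for a 16-bit model, about 0.5 bytes per weight for 4-bit AWQ, and no accounting for quantization scales, activations, or CUDA overhead. The function names are made up for illustration.

```python
GIB = 1024 ** 3


def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in GiB."""
    return n_params * bytes_per_param / GIB


def vllm_budget_gib(total_vram_gib: float, gpu_memory_utilization: float) -> float:
    """Share of GPU memory vLLM is allowed to use."""
    return total_vram_gib * gpu_memory_utilization


# 12 GiB card, --gpu-memory-utilization=0.85 as in the manifest
budget = vllm_budget_gib(12, 0.85)

# (label, assumed parameter count, assumed bytes per weight)
candidates = [
    ("Qwen2.5-3B-Instruct (16-bit)", 3e9, 2.0),
    ("Qwen2.5-7B-Instruct-AWQ (4-bit)", 7e9, 0.5),
]

for label, n_params, bytes_per_param in candidates:
    weights = weight_gib(n_params, bytes_per_param)
    # Whatever is left is roughly what remains for KV cache and activations.
    print(f"{label}: weights ~{weights:.1f} GiB, remaining ~{budget - weights:.1f} GiB")
```

Under these assumptions both options fit inside the budget, and the quantized 7B model leaves more room for the KV cache than the unquantized 3B model. Real usage will be higher than the weight size alone.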
Verification
# Check the GPU and the mounted paths from inside the container
kubectl -n llm exec deploy/vllm-qwen25 -- bash -lc 'ls -al /models; ls -al /cache/hf; nvidia-smi | head -n 12'
# Check the service name (e.g. vllm-openai)
kubectl -n llm get svc
# Forward local port 8000 to port 8000 of the cluster service
kubectl -n llm port-forward svc/vllm-openai 8000:8000
# If the response lists Qwen/Qwen2.5-7B-Instruct-AWQ (or whichever model you configured), everything is working
curl -s http://127.0.0.1:8000/v1/models | jq .
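The same model check can be scripted. This is a minimal sketch that assumes the OpenAI-style list response (`{"object": "list", "data": [{"id": ...}]}`) and the port-forward from above; `served_model_ids` and `fetch_models` are hypothetical helper names, and the sample response is hand-written so the example also runs offline.

```python
import json
import urllib.request


def served_model_ids(models_response: dict) -> list:
    """Return the `id` of every entry in an OpenAI-style model list."""
    return [m["id"] for m in models_response.get("data", [])]


def fetch_models(base_url: str = "http://127.0.0.1:8000") -> dict:
    """GET /v1/models from the port-forwarded service (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return json.load(resp)


# Hand-written sample response, so no server is needed for this demo.
sample = {
    "object": "list",
    "data": [{"id": "Qwen/Qwen2.5-7B-Instruct-AWQ", "object": "model"}],
}
expected = "Qwen/Qwen2.5-7B-Instruct-AWQ"
print("model is served:", expected in served_model_ids(sample))
```

With the port-forward running, `served_model_ids(fetch_models())` performs the check against the live server.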
# Chat Completions (cURL)
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello! Please introduce yourself briefly."}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }' | jq .
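For scripting the non-streaming request without `jq`, the following sketch uses only the standard library. It assumes the usual Chat Completions response shape, where the reply text sits at `choices[0].message.content`; `build_chat_request`, `extract_reply`, and `chat` are hypothetical helper names, and the sample completion is hand-written.

```python
import json
import urllib.request


def build_chat_request(model: str, user_text: str, max_tokens: int = 256) -> dict:
    """Build the same JSON body as the cURL example."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
        "temperature": 0.7,
        "max_tokens": max_tokens,
    }


def extract_reply(completion: dict) -> str:
    """Pull the assistant text out of a non-streaming response."""
    return completion["choices"][0]["message"]["content"]


def chat(payload: dict, base_url: str = "http://127.0.0.1:8000") -> dict:
    """POST the payload to /v1/chat/completions (needs a running server)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


# Offline demo with a hand-written response of the assumed shape.
payload = build_chat_request("Qwen/Qwen2.5-7B-Instruct-AWQ", "Hello!")
fake_completion = {"choices": [{"message": {"role": "assistant", "content": "Hi there."}}]}
print(extract_reply(fake_completion))
```

Against the live server, `extract_reply(chat(payload))` returns the reply as a plain string.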
# Python
# pip install requests sseclient-py
import json

import requests
from sseclient import SSEClient

url = "http://127.0.0.1:8000/v1/chat/completions"
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me just one one-line philosophical quote!"},
    ],
    "temperature": 0.7,
    "max_tokens": 128,
    "stream": True,  # ask the server to stream the answer as SSE events
}

resp = requests.post(url, json=payload, stream=True)
resp.raise_for_status()  # fail early on HTTP errors
client = SSEClient(resp)
for event in client.events():
    if event.data == "[DONE]":  # end-of-stream sentinel
        break
    chunk = json.loads(event.data)
    # Some chunks carry no text (e.g. only the role), so fall back to "".
    delta = chunk["choices"][0]["delta"].get("content") or ""
    print(delta, end="", flush=True)
print()
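If installing `sseclient-py` is not an option, the stream can also be parsed by hand. This is a minimal sketch, assuming each event arrives as a single `data: ...` line and the stream ends with `data: [DONE]`, as in the loop above; `iter_deltas` is a hypothetical helper and the sample lines are hand-written.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments from OpenAI-style SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Hand-written sample stream, so the parser can be tried offline.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_deltas(sample_lines)))  # prints: Hello
```

With `requests`, the lines could come from `resp.iter_lines(decode_unicode=True)` after setting `resp.encoding = "utf-8"`, so that the iterator yields `str` rather than `bytes`.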