LLM setup: vLLM + Qwen2.5-3B-Instruct

2025. 8. 20. 22:58 | k8s

YAML file

apiVersion: v1
kind: PersistentVolume
metadata:
  name: models-pv-3080
spec:
  capacity: { storage: 200Gi }
  volumeMode: Filesystem
  accessModes: [ "ReadWriteOnce" ]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ssd-local
  local:
    path: /mnt/ssd/models
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["3080"]
# (Reuses the same PV/PVC) Adding a separate PVC for the HF cache is also recommended
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models-pvc
  namespace: llm
spec:
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 100Gi
  storageClassName: ssd-local
  volumeName: models-pv-3080   # shares the same local disk (create a separate PV if that is not wanted)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen25
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-qwen25 }
  template:
    metadata:
      labels: { app: vllm-qwen25 }
    spec:
      nodeSelector:
        kubernetes.io/hostname: "3080"
      runtimeClassName: nvidia
      # ▼ initContainer that creates the PVC subpaths (optional but recommended)
      initContainers:
      - name: init-pvc-subpaths
        image: busybox:1.36
        command: ["sh","-c","mkdir -p /mnt/models /mnt/hf"]
        volumeMounts:
        - name: models
          mountPath: /mnt
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.5.4        # ★ pin a fixed tag instead of latest
        args:
          # ❶ Stable on 12GB VRAM: 3B (unquantized) or 7B-AWQ (quantized)
          # - "--model=Qwen/Qwen2.5-3B-Instruct"
          - "--model=Qwen/Qwen2.5-7B-Instruct-AWQ"
          - "--quantization=awq"              # only when using the quantized 7B model
          - "--device=cuda"
          - "--dtype=auto"
          - "--max-model-len=4096"            # conservative context length to reduce OOM risk
          - "--gpu-memory-utilization=0.85"
          - "--trust-remote-code"
          - "--download-dir=/cache/hf"        # HF cache directory
          - "--port=8000"
        env:
          - name: HF_HOME
            value: /cache/hf
          - name: HF_HUB_ENABLE_HF_TRANSFER
            value: "1"
          - name: VLLM_LOGGING_LEVEL
            value: "DEBUG"
          - name: NVIDIA_VISIBLE_DEVICES
            value: "all"
          - name: NVIDIA_DRIVER_CAPABILITIES
            value: "compute,utility"
        ports:
          - name: http
            containerPort: 8000
        resources:
          requests:
            cpu: "2"
            memory: 6Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: "6"
            memory: 10Gi
            nvidia.com/gpu: "1"
        volumeMounts:
          - name: models
            mountPath: /models
            subPath: models
          - name: models
            mountPath: /cache/hf
            subPath: hf
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: models-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-openai
  namespace: llm
spec:
  selector: { app: vllm-qwen25 }
  ports:
    - port: 8000
      targetPort: 8000
      name: http
  type: ClusterIP
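
The comment in the Deployment args about fitting a 12GB GPU can be sanity-checked with some rough arithmetic. The sketch below is only a back-of-envelope estimate: the parameter counts are assumed round numbers (3e9 and 7e9), the 4-bit figure ignores AWQ scales, zero points, and any layers left unquantized, and the helper names `weight_gib` and `kv_budget_gib` are hypothetical. It compares the approximate weight size against the share of VRAM that `--gpu-memory-utilization=0.85` allows vLLM to use.

```python
# Back-of-envelope VRAM estimate; every figure here is a rough assumption.
GIB = 1024 ** 3


def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


def kv_budget_gib(total_vram_gib: float, utilization: float, weights_gib: float) -> float:
    """Memory left for KV cache and activations inside the vLLM budget."""
    return total_vram_gib * utilization - weights_gib


# Nominal parameter counts (assumed round numbers, not exact model sizes).
candidates = {
    "3B fp16/bf16": weight_gib(3e9, 16),
    "7B AWQ 4-bit": weight_gib(7e9, 4),
    "7B fp16/bf16": weight_gib(7e9, 16),
}

for name, weights in candidates.items():
    remaining = kv_budget_gib(12, 0.85, weights)
    print(f"{name}: weights ~{weights:.1f} GiB, remaining ~{remaining:.1f} GiB")
```

Under these assumptions the unquantized 7B weights alone exceed the budget (the remaining figure goes negative), while the 3B model and the 4-bit 7B model both leave room for the KV cache. Real usage will be somewhat higher than this estimate, which is one reason to keep `--max-model-len` conservative.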

Verification

# Check the GPU and paths from inside the container
kubectl -n llm exec deploy/vllm-qwen25 -- bash -lc 'ls -al /models; ls -al /cache/hf; nvidia-smi | head -n 12'

# Check the service name (e.g. vllm-openai)
kubectl -n llm get svc

# Forward local port 8000 to port 8000 of the cluster service
kubectl -n llm port-forward svc/vllm-openai 8000:8000

# If the response lists Qwen/Qwen2.5-7B-Instruct-AWQ (or whichever model you configured), the server is working
curl http://127.0.0.1:8000/v1/models | jq .

# Chat Completions (cURL)
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user",   "content": "Hello! Please introduce yourself briefly."}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }' | jq .
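
The two curl checks above can also be scripted instead of reading the `jq` output by eye. The following is a minimal sketch, assuming the server returns the usual OpenAI-style JSON bodies; `served_model_ids` and `reply_text` are hypothetical helper names, and the sample bodies are hand-written for illustration rather than captured from a real server.

```python
def served_model_ids(models_response: dict) -> list:
    """Return the model ids listed in a /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def reply_text(chat_response: dict) -> str:
    """Return the assistant text from a non-streaming chat completion body."""
    return chat_response["choices"][0]["message"]["content"]


# Hand-written samples following the OpenAI-style response shape (illustrative only).
sample_models = {
    "object": "list",
    "data": [{"id": "Qwen/Qwen2.5-7B-Instruct-AWQ", "object": "model"}],
}
sample_chat = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Hello!"}}
    ]
}

expected = "Qwen/Qwen2.5-7B-Instruct-AWQ"
print(expected in served_model_ids(sample_models))
print(reply_text(sample_chat))
```

With the port-forward running, the same helpers could be fed `requests.get("http://127.0.0.1:8000/v1/models").json()` and the JSON body of a POST to `/v1/chat/completions`.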
 
 
 
# Python (streaming)
# pip install requests sseclient-py
import json, requests
from sseclient import SSEClient

url = "http://127.0.0.1:8000/v1/chat/completions"
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me just one one-line philosophical quote!"}
    ],
    "temperature": 0.7,
    "max_tokens": 128,
    "stream": True
}
resp = requests.post(url, json=payload, stream=True)
resp.raise_for_status()  # fail early on HTTP errors before parsing the stream
client = SSEClient(resp)
for event in client.events():
    if event.data == "[DONE]":
        break
    chunk = json.loads(event.data)
    delta = chunk["choices"][0]["delta"].get("content", "")
    print(delta, end="", flush=True)
print()
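
If installing `sseclient-py` is not desirable, the stream can also be parsed by hand, since each event arrives as a `data: ...` line. The sketch below is one possible way to do that, not the only one; `iter_deltas` is a hypothetical helper, and the sample lines are hand-written to mimic the chunk shape used in the loop above.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style SSE lines."""
    for line in lines:
        if not line or not line.startswith("data:"):
            continue  # skip blank separators and non-data SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {}).get("content")
        if delta:
            yield delta


# Hand-written sample stream (illustrative, not captured from a server).
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_deltas(sample_lines)))
```

Against the live server, one option is to set `resp.encoding = "utf-8"` on the streaming `requests` response and pass `resp.iter_lines(decode_unicode=True)` to `iter_deltas`, so that the lines arrive as `str` rather than `bytes`.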

 
