LLM setup: vLLM + Qwen2.5-3B-Instruct

2025. 8. 20. 22:58ㆍk8s

YAML manifests

apiVersion: v1
kind: PersistentVolume
metadata:
  name: models-pv-3080
spec:
  capacity: { storage: 200Gi }
  volumeMode: Filesystem
  accessModes: [ "ReadWriteOnce" ]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ssd-local
  local:
    path: /mnt/ssd/models
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["3080"]
# (Reuses the same PV/PVC) Adding a separate PVC for the HF cache is recommended
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models-pvc
  namespace: llm
spec:
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 100Gi
  storageClassName: ssd-local
  volumeName: models-pv-3080   # shares the same local disk (create a separate PV if you do not want that)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen25
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-qwen25 }
  template:
    metadata:
      labels: { app: vllm-qwen25 }
    spec:
      nodeSelector:
        kubernetes.io/hostname: "3080"
      runtimeClassName: nvidia
      # ▼ initContainer that creates the PVC sub-paths (optional but recommended)
      initContainers:
      - name: init-pvc-subpaths
        image: busybox:1.36
        command: ["sh","-c","mkdir -p /mnt/models /mnt/hf"]
        volumeMounts:
        - name: models
          mountPath: /mnt
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.5.4        # ★ pin a fixed tag instead of latest
        args:
          # ❶ Fits in 12GB VRAM: 3B (unquantized) or 7B-AWQ (quantized)
          # - "--model=Qwen/Qwen2.5-3B-Instruct"
          - "--model=Qwen/Qwen2.5-7B-Instruct-AWQ"
          - "--quantization=awq"              # only when using the quantized 7B model
          - "--device=cuda"
          - "--dtype=auto"
          - "--max-model-len=4096"            # conservative context length to reduce OOM risk
          - "--gpu-memory-utilization=0.85"
          - "--trust-remote-code"
          - "--download-dir=/cache/hf"        # HF cache directory
          - "--port=8000"
        env:
          - name: HF_HOME
            value: /cache/hf
          - name: HF_HUB_ENABLE_HF_TRANSFER
            value: "1"
          - name: VLLM_LOGGING_LEVEL
            value: "DEBUG"
          - name: NVIDIA_VISIBLE_DEVICES
            value: "all"
          - name: NVIDIA_DRIVER_CAPABILITIES
            value: "compute,utility"
        ports:
          - name: http
            containerPort: 8000
        resources:
          requests:
            cpu: "2"
            memory: 6Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: "6"
            memory: 10Gi
            nvidia.com/gpu: "1"
        volumeMounts:
          - name: models
            mountPath: /models
            subPath: models
          - name: models
            mountPath: /cache/hf
            subPath: hf
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: models-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-openai
  namespace: llm
spec:
  selector: { app: vllm-qwen25 }
  ports:
    - port: 8000
      targetPort: 8000
      name: http
  type: ClusterIP
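
The comment in the manifest says a 12GB card can hold either the unquantized 3B model or the AWQ-quantized 7B model. The arithmetic behind that can be sketched as below. This is only a rough estimate under stated assumptions: the parameter counts are the nominal 3e9 and 7e9 taken from the model names, FP16 is 16 bits per weight, AWQ is treated as 4 bits per weight, and the card is assumed to have 12 GiB. Real checkpoints are somewhat larger (quantization scales, unquantized embeddings), and the KV cache and activations need the remaining headroom.

```python
# Rough weight-memory estimate. All inputs are assumptions, not measurements.
GIB = 1024 ** 3


def weight_gib(params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return params * bits_per_param / 8 / GIB


def vllm_budget_gib(total_vram_gib: float, utilization: float) -> float:
    """Share of VRAM that --gpu-memory-utilization lets vLLM claim."""
    return total_vram_gib * utilization


budget = vllm_budget_gib(12, 0.85)   # 12 GiB card, value from the manifest
fp16_3b = weight_gib(3e9, 16)        # Qwen2.5-3B-Instruct, unquantized
awq_7b = weight_gib(7e9, 4)          # Qwen2.5-7B-Instruct-AWQ, lower bound
fp16_7b = weight_gib(7e9, 16)        # 7B unquantized, for comparison

for name, size in [("3B fp16", fp16_3b), ("7B AWQ", awq_7b), ("7B fp16", fp16_7b)]:
    verdict = "fits" if size < budget else "does not fit"
    print(f"{name}: ~{size:.1f} GiB of weights, {verdict} in a {budget:.1f} GiB budget")
```

Whatever is left of the budget after the weights goes to the KV cache, which is why the manifest also caps `--max-model-len` at 4096.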

Verification

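# Check the GPU and mount paths from inside the container
kubectl -n llm exec deploy/vllm-qwen25 -- bash -lc 'ls -al /models; ls -al /cache/hf; nvidia-smi | head -n 12'

# Check the service name (e.g. vllm-openai)
kubectl -n llm get svc

# Forward local port 8000 to the cluster service on port 8000
# (this command stays in the foreground, so run it in a separate terminal)
kubectl -n llm port-forward svc/vllm-openai 8000:8000

# If the response lists Qwen/Qwen2.5-7B-Instruct-AWQ (or whichever model you configured), the server is healthy
curl -s http://127.0.0.1:8000/v1/models | jq .

# Chat Completions (cURL)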
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user",   "content": "Hello! Please introduce yourself briefly."}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }' | jq .
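
The `/v1/models` check above can also be scripted. The following is a minimal sketch that assumes the server returns the usual OpenAI-style listing (`{"object": "list", "data": [{"id": ...}, ...]}`). The helper names `served_model_ids` and `fetch_model_listing` are hypothetical, and the sample dictionary is illustrative rather than a captured response.

```python
import json
import urllib.request


def served_model_ids(listing: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in listing.get("data", [])]


def fetch_model_listing(base_url: str = "http://127.0.0.1:8000") -> dict:
    """Fetch /v1/models through the port-forward (requires a running server)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return json.load(resp)


# Illustrative payload in the assumed shape; not real server output.
sample = {"object": "list", "data": [{"id": "Qwen/Qwen2.5-7B-Instruct-AWQ", "object": "model"}]}
expected = "Qwen/Qwen2.5-7B-Instruct-AWQ"
print("model served" if expected in served_model_ids(sample) else "model missing")
```

Against the live deployment, `served_model_ids(fetch_model_listing())` would replace the sample while the port-forward is active.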
 
 
 
# Python
# pip install requests sseclient-py
import json, requests
from sseclient import SSEClient

url = "http://127.0.0.1:8000/v1/chat/completions"
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me just one one-line philosophical quote!"}
    ],
    "temperature": 0.7,
    "max_tokens": 128,
    "stream": True  # ask the server for a Server-Sent Events stream
}
resp = requests.post(url, json=payload, stream=True)
resp.raise_for_status()  # fail early on HTTP errors instead of parsing an error body
client = SSEClient(resp)
for event in client.events():
    # The stream ends with a literal "[DONE]" sentinel rather than JSON
    if event.data == "[DONE]":
        break
    chunk = json.loads(event.data)
    # Each chunk carries an incremental delta; the first one may contain only the role
    delta = chunk["choices"][0]["delta"].get("content", "")
    print(delta, end="", flush=True)
print()
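
If pulling in sseclient-py is not desirable, the same stream can be read line by line. The sketch below assumes each event arrives as a single `data: {...}` line followed by a blank line, with `data: [DONE]` at the end. The helper name `iter_content_deltas` is hypothetical, and the sample lines are hand-written to illustrate the shape, not captured from a server.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text deltas from raw SSE lines of a streamed chat completion."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content


# Hand-written sample lines in the assumed format.
sample_lines = [
    b'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    b"",
    b'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    b"",
    b'data: {"choices":[{"delta":{"content":"lo"}}]}',
    b"",
    b"data: [DONE]",
]
print("".join(iter_content_deltas(sample_lines)))
```

With a live server, `resp.iter_lines()` from a `requests.post(..., stream=True)` response can be passed in place of `sample_lines`, since it yields one bytes object per line.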
