Question about GPU processing speed

   

I'm writing because I have a question about deep-learning GPUs.


  • Machine: DL380a Gen11
  • OS: Ubuntu 22.04
  • Python: 3.11.9


The driver was originally 535.129.03 with CUDA 12.2; after updating via the CUDA Toolkit:

  • NVIDIA driver version: 555.42.02
  • CUDA version: 12.5

  • H100 80GB × 2
  • Llama-3-8B response time: 2.4 s

  • H100 80GB × 1
  • Llama-3-8B response time: 0.5 s

In other words, with two H100s the same Llama-3-8B responds roughly five times slower than with one, and I cannot figure out why. The test code is as follows.


from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, TextStreamer
import torch
from threading import Thread
import gradio as gr
import time
#import accelerate_speedup

torch.manual_seed(42)

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
#model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

#max_memory_mapping = {0: "80GB", 1: "80GB"}
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard the model across all visible GPUs
    #device_map="balanced_low_0",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
    #max_memory=max_memory_mapping
).eval()

# Stop generation at either the EOS token or Llama-3's end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

### base inference
def chat(question):
    messages = [
        #{"role": "system", "content": "You are AI chatbot. You are honest, do not harm others, and help users."},
        {"role": "system", "content": "Please try to provide useful, helpful answers."},
        {"role": "user", "content": question},
    ]

    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device, non_blocking=True)

    outputs = model.generate(
        input_ids,
        max_new_tokens=1024,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.05,
        top_p=0.95,
    )
    response = outputs[0][input_ids.shape[-1]:]
    #print(tokenizer.decode(response, skip_special_tokens=True))
    return tokenizer.decode(response, skip_special_tokens=True)


# Time 100 short chat calls and report the average latency.
response_times = []
for _ in range(100):
    start_time = time.time()
    #tmp = chat('hello.')
    tmp = chat('hello!')
    #tmp = chat('Testing. Please answer in 10,000 characters.')
    end_time = time.time()
    print((end_time - start_time))
    response_times.append(end_time - start_time)

print(f"Average Response Time: {sum(response_times) / len(response_times):.2f} seconds")

Any advice would be appreciated. Thank you.
06-04
First, check the DL380a Gen11 System Diagram.
There is a delay somewhere, and I suspect memory (bandwidth) rather than the GPUs themselves.
Could the PCIe link be the bottleneck?
PCIe 5.0 x16 tops out at about 64 GB/s, so I would start looking there.
With two PCIe 5.0 x16 cards?? That should give 128 GB/s in total...
For now, installing an NVLink bridge looks like the only real fix; the probe below is one way to measure what the link actually delivers.
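
A rough sketch for measuring device-to-device bandwidth between the two cards (the 256 MB buffer and 20 repeats are arbitrary choices; `nvidia-smi topo -m` also shows the link topology):

import time
import torch

# Rough bandwidth probe: copy a buffer from GPU 0 to GPU 1 repeatedly.
print("peer access 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

src = torch.empty(256 * 1024 * 1024, dtype=torch.uint8, device="cuda:0")
dst = torch.empty(256 * 1024 * 1024, dtype=torch.uint8, device="cuda:1")

# Warm up once so allocation/lazy init does not skew the timing.
dst.copy_(src)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

t0 = time.time()
for _ in range(20):
    dst.copy_(src, non_blocking=True)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.time() - t0

moved_gb = 20 * src.numel() / 1e9  # uint8 = 1 byte per element
print(f"~{moved_gb / elapsed:.1f} GB/s between the cards")
# PCIe 5.0 x16 should land under ~64 GB/s; NVLink would be far higher.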


Also, try lowering #max_memory_mapping = {0: "80GB", 1: "80GB"}..
If PCIe bandwidth is the choke point, set it to {0: "30GB", 1: "30GB"} and see what happens (roughly as in the sketch below)..
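
In code, that suggestion would look roughly like this (a sketch; everything except the max_memory cap matches the original loader):

# Sketch of the suggestion above: cap what accelerate may place on each GPU
# so the layer split changes; the 30GB values are the ones from this comment.
max_memory_mapping = {0: "30GB", 1: "30GB"}
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory=max_memory_mapping,
    low_cpu_mem_usage=True,
).eval()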
epowergate 06-04
It looks like a problem with the program, not the hardware.
The source is sequential (seq).
06-04
Have you tried asking HPE?
ikaros7 06-04
I haven't looked at the code, but Llama-3 8B fits on a single GPU, so wouldn't it be better to just run it on one?
Does it even need to be loaded across both?
Using two GPUs is for training, when a single GPU's memory is not enough.
Just use one H100 (see the sketch below).
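
Concretely, that could look like this (a sketch; either approach keeps the whole 8B model on one card):

# Option 1 (sketch): hide the second card before CUDA initializes:
#   CUDA_VISIBLE_DEVICES=0 python test.py
# Option 2 (sketch): pin every module to GPU 0 instead of device_map="auto":
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map={"": 0},  # "" maps the whole model onto cuda:0
).eval()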

