Question about a bottleneck during GPU parallel processing


I have a question about GPUs for deep learning.


  • Server model: DL380a Gen11
  • OS : Ubuntu 22.04
  • Python 3.11.9


The driver was originally version 535.129.03 with CUDA Version 12.2. After updating with the CUDA Toolkit, the versions are:

  • NVIDIA Driver version: 555.42.02
  • CUDA Version: 12.5

  • H100 80G x 2
  • Llama-3-8B processing time: 2.4 (presumably seconds)

  • H100 80G x 1
  • Llama-3-8B processing time: 0.5 seconds

The test code is as follows.


from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, TextStreamer
import torch
from threading import Thread
import gradio as gr
import time
#import accelerate_speedup

torch.manual_seed(42)

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
#model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
#max_memory_mapping = {0: "80GB", 1: "80GB"}
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    #device_map="balanced_low_0",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True
    #max_memory=max_memory_mapping
).eval()

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]


### base inference
def chat(question):
    messages = [
        #{"role": "system", "content": "You are AI chatbot. You are honest, do not harm others, and help users."},
        {"role": "system", "content": "Please try to provide useful, helpful answers."},
        {"role": "user", "content": question},
    ]

    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device, non_blocking=True)

    outputs = model.generate(
        input_ids,
        max_new_tokens=1024,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.05,
        top_p=0.95,
    )
    response = outputs[0][input_ids.shape[-1]:]
    #print(tokenizer.decode(response, skip_special_tokens=True))
    return tokenizer.decode(response, skip_special_tokens=True)


response_times = []
for _ in range(100):
    start_time = time.time()
    #tmp = chat('hello.')
    tmp = chat('hello!')
    #tmp = chat('Testing. Please answer in 10,000 characters.')
    end_time = time.time()
    print((end_time - start_time))
    response_times.append(end_time - start_time)

print(f"Average Response Time: {sum(response_times) / len(response_times):.2f} seconds")
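One thing that may be worth ruling out when comparing averages from a loop like the one above: the first few calls can include one-time setup cost, and a plain mean hides outliers. Below is a minimal sketch of a timing helper with untimed warm-up runs. The name `benchmark` and its parameters are made up for this example, and the optional `sync` hook is an assumption (on a GPU one could pass `torch.cuda.synchronize`).

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=10, sync=None):
    """Time fn() over several runs after a few untimed warm-up calls.

    sync is an optional zero-argument callable invoked around each clock
    read (hypothetical hook; e.g. torch.cuda.synchronize on a GPU).
    """
    for _ in range(warmup):
        fn()  # warm-up calls: not timed, results discarded

    samples = []
    for _ in range(runs):
        if sync is not None:
            sync()
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)

    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Stand-in workload; in the test code this would be lambda: chat('hello!')
    stats = benchmark(lambda: sum(range(100_000)), warmup=2, runs=5)
    print({k: f"{v:.6f} s" for k, v in stats.items()})
```

Reporting the median and the min/max next to the mean makes it easier to see whether the two-GPU run is uniformly slower or only slow on some calls.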

First, check the DL380 Gen 11 system diagram.
That is an abnormal delay.
It doesn't look like a memory problem.
How are the GPUs attached over PCIe?
PCIe 5.0 x16 supports up to 64 GB/s, so for now that looks like the bottleneck.
Can't you give each card two PCIe 5.0 x16 links? That would get you up to 128 GB/s.
For now, the only alternative seems to be installing NVLink.
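To put rough numbers on the bandwidth figures above, here is a back-of-the-envelope sketch in Python. It assumes the standard PCIe 5.0 parameters (32 GT/s per lane, 128b/130b encoding) and, purely as an illustration, that with the model split across two GPUs one bfloat16 hidden-state vector of Llama-3-8B (hidden size 4096) crosses the link per generated token. The function names are made up for this example, and the estimate ignores per-transfer latency and synchronization.

```python
def pcie_bandwidth_gb_per_s(gt_per_s=32.0, lanes=16, encoding=128 / 130):
    """Theoretical one-direction PCIe bandwidth in GB/s.

    Defaults describe PCIe 5.0 x16: 32 GT/s per lane, 128b/130b encoding.
    """
    return gt_per_s * lanes * encoding / 8  # bits -> bytes


def transfer_time_us(num_bytes, bandwidth_gb_per_s):
    """Time in microseconds to move num_bytes at the given bandwidth."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e6


HIDDEN_SIZE = 4096    # Llama-3-8B hidden size
BYTES_PER_VALUE = 2   # bfloat16

one_link = pcie_bandwidth_gb_per_s()
two_links = 2 * one_link  # the "two x16 links per card" idea
per_token_bytes = HIDDEN_SIZE * BYTES_PER_VALUE  # assumed payload per hop

print(f"PCIe 5.0 x16: {one_link:.1f} GB/s, two links: {two_links:.1f} GB/s")
print(f"payload per token: {per_token_bytes} bytes")
print(f"transfer time on one link: {transfer_time_us(per_token_bytes, one_link):.3f} us")
```

Under these assumptions the raw link bandwidth accounts for well under a microsecond per hop, so it may be worth checking latency and topology (for example with `nvidia-smi topo -m`) before concluding that bandwidth alone explains the gap.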


#max_memory_mapping = {0: "80GB", 1: "80GB"}  Try reducing this a bit,
to something like {0: "30GB", 1: "30GB"}, to suit the PCIe bandwidth.
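The suggestion above could be tried along these lines. This is only a sketch: `build_max_memory` is a hypothetical helper, the 30 GB value is the commenter's example rather than a tuned number, and the `from_pretrained` call is left as a comment that mirrors the commented-out `max_memory` line in the test code.

```python
def build_max_memory(num_gpus, per_gpu_gb):
    """Build a per-GPU memory cap in the same shape as the
    max_memory_mapping dict from the test code, e.g. {0: "30GB", 1: "30GB"}."""
    return {gpu_index: f"{per_gpu_gb}GB" for gpu_index in range(num_gpus)}


max_memory_mapping = build_max_memory(num_gpus=2, per_gpu_gb=30)
print(max_memory_mapping)

# Rough weight size in bfloat16 (2 bytes per parameter), to check whether
# the cap leaves room for the model at all. Parameter counts are rounded.
for name, n_params in [("Llama-3-8B", 8e9), ("Llama-3-70B", 70e9)]:
    print(f"{name}: about {n_params * 2 / 1e9:.0f} GB of weights")

# Possible use with the test code (not executed here):
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
#     max_memory=max_memory_mapping,
# )
```

Note that the 70B model set in `model_id` needs roughly 140 GB for bfloat16 weights, which would not fit on the GPUs under a 2 x 30 GB cap, whereas the 8B model would.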
epowergate 06-04
There is no reason for the program to be slower, but no reason for it to be faster either.
The source runs sequentially, so it can't get faster.
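The point about sequential execution can be illustrated with a toy latency model. This is a sketch under stated assumptions: the layers run one after another for each token, only one GPU works at a time when the layers are split across devices, and every number below is a made-up placeholder, not a measurement from this thread.

```python
def sequential_latency_ms(n_layers, t_layer_ms, n_gpus, t_hop_ms):
    """Per-token latency when layers run strictly one after another.

    Splitting layers across GPUs does not shorten the chain; it only adds
    one device-to-device hop per extra GPU (toy model, hypothetical values).
    """
    hops = n_gpus - 1
    return n_layers * t_layer_ms + hops * t_hop_ms


# Placeholder values chosen only to show the shape of the result.
single = sequential_latency_ms(n_layers=32, t_layer_ms=0.5, n_gpus=1, t_hop_ms=0.25)
double = sequential_latency_ms(n_layers=32, t_layer_ms=0.5, n_gpus=2, t_hop_ms=0.25)

print(f"1 GPU: {single} ms per token")
print(f"2 GPUs: {double} ms per token")
```

In this model the second GPU adds capacity but never removes work from the critical path, so the best case is equal latency and any hop cost makes it worse.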
박문형 06-04
Have you asked HPE technical support about this?
ikaros7 06-04
I haven't looked at the code, but if the model fits entirely on one GPU, like Llama3 8B, isn't it faster to just run it on a single GPU?
The bus communication overhead is probably considerable.
People usually use multiple GPUs because a single GPU doesn't have enough memory during training.
By the way, the H100 really is fast. Haha
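A rough way to apply this advice is to estimate whether the weights fit on one card and only split the model when they do not. The sketch below is an illustration, not a rule: `choose_device_map` and its 80% headroom factor are assumptions made up for this example, and the parameter counts are rounded. The returned values are forms that `device_map` in `from_pretrained` accepts: `{"": 0}` places the whole model on GPU 0, and `"auto"` lets it be spread across devices.

```python
def weight_gb(n_params, bytes_per_param=2):
    """Approximate weight size in GB (2 bytes per parameter for bfloat16)."""
    return n_params * bytes_per_param / 1e9


def choose_device_map(n_params, gpu_gb=80, headroom=0.8, bytes_per_param=2):
    """Return a single-GPU device_map if the weights fit, else "auto".

    headroom is a hypothetical safety factor that leaves space for
    activations and the KV cache.
    """
    if weight_gb(n_params, bytes_per_param) <= gpu_gb * headroom:
        return {"": 0}  # whole model on GPU 0
    return "auto"       # let the library spread layers across devices


print("8B :", weight_gb(8e9), "GB ->", choose_device_map(8e9))
print("70B:", weight_gb(70e9), "GB ->", choose_device_map(70e9))
```

With these rounded figures the 8B model stays on one 80 GB card, while the 70B model does not fit and has to be split.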

