I'm writing because I have a question about our deep learning GPUs.
- Server model: DL380a Gen11
- OS : Ubuntu 22.04
- Python 3.11.9
After updating from driver version 535.129.03 / CUDA Version 12.2 using the CUDA Toolkit:
- NVIDIA Driver version: 555.42.02
- CUDA Version: 12.5
- H100 80G * 2EA
- Llama-3-8B processing time: 2.4 seconds
- H100 80G * 1EA
- Llama-3-8B processing time: 0.5 seconds
The test code is as follows.
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, TextStreamer
import torch
from threading import Thread
import gradio as gr
import time
#import accelerate_speedup
torch.manual_seed(42)
model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
#model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
#max_memory_mapping = {0: "80GB", 1: "80GB"}
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    #device_map="balanced_low_0",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True
    #max_memory=max_memory_mapping
).eval()
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
### base inference
def chat(question):
    messages = [
        #{"role": "system", "content": "You are AI chatbot. You are honest, do not harm others, and help users."},
        {"role": "system", "content": "Please try to provide useful, helpful answers."},
        {"role": "user", "content": question},
    ]

    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device, non_blocking=True)

    outputs = model.generate(
        input_ids,
        max_new_tokens=1024,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.05,
        top_p=0.95,
    )
    response = outputs[0][input_ids.shape[-1]:]
    #print(tokenizer.decode(response, skip_special_tokens=True))
    return tokenizer.decode(response, skip_special_tokens=True)
response_times = []
for _ in range(100):
    start_time = time.time()
    #tmp = chat('hello.')
    tmp = chat('hello!')
    #tmp = chat('Testing. Please answer in 10,000 characters.')
    end_time = time.time()
    print(end_time - start_time)
    response_times.append(end_time - start_time)
print(f"Average Response Time: {sum(response_times) / len(response_times):.2f} seconds")
That is an abnormally long delay.
It doesn't look like a memory problem.
How are the GPUs attached over PCIe?
PCIe 5.0 x16 supports up to about 64 GB/s.. my first guess is that the bottleneck is there..
Can't each card get two PCIe 5.0 x16 links? That would give you up to 128 GB/s..
For now, the only real alternative looks like installing NVLink.
Also try shrinking #max_memory_mapping = {0: "80GB", 1: "80GB"} a bit..
Something like {0: "30GB", 1: "30GB"}, to better fit the PCIe bandwidth..
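A minimal sketch of both checks suggested above, assuming the same two-GPU machine and the transformers/accelerate loading path used in the test code; the {0: "30GB", 1: "30GB"} values are just the figures proposed here, not tuned numbers:

import subprocess

import torch
from transformers import AutoModelForCausalLM

# 1) See how the two H100s are physically attached (PCIe switch, host bridge, or NVLink).
print(subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout)

# 2) Cap how much memory accelerate may place on each GPU, per the suggestion above.
max_memory_mapping = {0: "30GB", 1: "30GB"}

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",   # assuming the 8B model being benchmarked
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory=max_memory_mapping,           # reduced per-GPU budget instead of 80GB/80GB
    low_cpu_mem_usage=True,
).eval()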
The code runs sequentially, so there's no way it gets faster with a second GPU.
And the bus communication overhead must be considerable.
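To put a number on that bus overhead, a rough sketch like the one below (my own illustration, not part of the original test; the 1 GiB tensor size is arbitrary) times a GPU0-to-GPU1 copy and reports the effective bandwidth, which can be compared against the ~64 GB/s PCIe 5.0 x16 figure mentioned above:

import time

import torch

assert torch.cuda.device_count() >= 2, "needs both GPUs visible"

# ~1 GiB of float32 data on GPU 0 (arbitrary size, just for measurement)
x = torch.randn(256, 1024, 1024, device="cuda:0")
nbytes = x.numel() * x.element_size()

# Warm-up transfer so one-time allocation/context setup is not measured
_ = x.to("cuda:1")
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

start = time.time()
y = x.to("cuda:1")
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.time() - start

print(f"copied {nbytes / 1e9:.2f} GB in {elapsed * 1e3:.1f} ms "
      f"-> {nbytes / 1e9 / elapsed:.1f} GB/s effective GPU0->GPU1 bandwidth")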
Using multiple GPUs is usually for training, when a single GPU doesn't have enough memory.
Anyway, the H100 really is fast. haha
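For what it's worth, if the 8B model is what is being benchmarked, it fits easily in one 80 GB H100, so a sketch like the following (device_map={"": 0} pins every module to GPU 0; otherwise the same arguments as the test code above) avoids the cross-GPU traffic entirely:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Keep the whole model on GPU 0 so no layer outputs travel over PCIe during generation.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
).eval()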