
Should I choose an RTX 4090 or A100 for local LLM inference?

6 Posts
7 Users
0 Reactions
72 Views
0
Topic starter

I'm stuck between getting a single RTX 4090 or trying to find a used A100 80GB for my local LLM setup, and I can't decide if the extra VRAM is actually worth the massive price jump and the cooling issues. Right now I'm mostly messing around with Llama 3 8B, but I really want to run the bigger 70B models at high precision without it being painfully slow or offloading a bunch of layers to my system RAM.

The 4090 is way easier to buy here in Chicago, I can just go pick it up today at Micro Center, plus it fits in a normal PC case. But then I see people online showing off their 80GB VRAM rigs and it makes me feel like I might regret getting the 4090 in six months when models keep getting bigger. I've got about $6500 saved up for this whole build, so I could technically swing a used A100 from eBay, but that thing is gonna be loud as hell in my tiny apartment office, and I don't even know if my current PSU can handle it, never mind the weird cooling mods it would need.

Is the speed difference for inference actually that big, or is it just about fitting the model in memory? Like, is a 4090 faster for smaller stuff but totally useless for the big stuff compared to a datacenter card? Kinda leaning toward the 4090 because it's cheaper and easier, but those 80 gigs are tempting for future-proofing...
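
For what it's worth, here's the rough back-of-envelope math I've been doing in Python to see what even fits, counting weights only (fp16 = 2 bytes per parameter, 4-bit ≈ 0.5 bytes) and ignoring KV cache and overhead, so treat the numbers as ballpark:

    # Back-of-envelope VRAM estimate: model weights only.
    # Ignores KV cache, activations, and framework overhead, so real usage is higher.
    def weights_vram_gb(params_billion: float, bytes_per_param: float) -> float:
        return params_billion * 1e9 * bytes_per_param / (1024 ** 3)

    for name, params in [("Llama 3 8B", 8), ("Llama 3 70B", 70)]:
        for precision, bpp in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
            print(f"{name} @ {precision}: ~{weights_vram_gb(params, bpp):.0f} GB")

By that math, 70B at fp16 is around 130 GB, so it doesn't fully fit on either card; the A100's 80GB only covers it at 8-bit, and the 4090's 24GB would need 4-bit plus offloading either way.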


6 Answers
10

Saw this while scrolling today and it brought back some bad memories. I actually tried the used enterprise card route last year because I wanted that massive VRAM for my projects. It was a total mess tbh. Datacenter cards like the NVIDIA A100 80GB PCIe aren't built for a normal room. They need massive airflow since they don't have fans of their own. I spent weeks messing with 3D-printed shrouds and loud jet-engine fans, but it still felt like a fire hazard in my office. One night I smelled burning plastic and almost had a heart attack... thought I fried my whole rig. Honestly, the NVIDIA GeForce RTX 4090 24GB GDDR6X might have less memory, but it fits in a case and won't burn your house down. I eventually paired mine with an EVGA SuperNOVA 1000 G6 1000W 80 Plus Gold and it's been rock solid. Sometimes safe and reliable beats risky and huge if you value your sanity.


10

Spot on about noise! I ditched the blower cards for an NVIDIA RTX A6000 48GB.


4

honestly i tried the single card route and it was kinda disappointing for what you want. unfortunately the NVIDIA GeForce RTX 4090 24GB GDDR6X just isn't enough for Llama 3 70B without heavy quantization. you'll end up frustrated with the speeds once you start offloading to your system RAM... trust me, it's slow. the NVIDIA A100 80GB PCIe Tensor Core GPU is a beast, but for a tiny apartment? it's a nightmare. i had issues with the heat and the scream of those blower fans in my last build. sadly it's just not as good as expected for the price. here's what i would actually do with that budget:

  • buy two NVIDIA GeForce RTX 3090 24GB GDDR6X cards used.
  • pick up a beefy EVGA SuperNOVA 1600 G+ 1600W PSU.
  • stick them in a case with massive airflow.

going dual-card gives you 48GB of VRAM, which actually fits 70B models at a decent quantization level. much better than being stuck with 24GB on a 4090. don't even bother with the A100 unless you have a dedicated server room honestly.
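
if it helps, here's a minimal sketch of how i'd load a 70B across the two 3090s, assuming you use hugging face transformers with accelerate and bitsandbytes installed; the model name and prompt are just examples, and device_map="auto" spreads the layers over whatever gpus it finds:

    # Minimal sketch: 4-bit load of a 70B model split across two 24GB GPUs.
    # Assumes transformers, accelerate, and bitsandbytes are installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # example model, gated on Hugging Face

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                    # ~0.5 bytes per weight, roughly 35-40 GB of weights
        bnb_4bit_compute_dtype=torch.float16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",                    # lets accelerate split layers across both cards
    )

    prompt = "Explain the difference between VRAM capacity and memory bandwidth."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

same idea works with llama.cpp and a GGUF quant if you'd rather skip python. the point is just that 2x24GB gives you enough headroom for a 4-bit 70B without touching system RAM.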


3

Like someone mentioned, those big cards are a headache. Honestly tho:

  • Just go with NVIDIA, you can't go wrong.
  • Grab two cheaper cards instead of one.
  • Way better for the budget.


1

Re: "honestly i tried the single card route and..." - in my experience:

  • heat is a nightmare
  • stability beats specs

I chose the safer path after years of fighting loud server gear.


1

Re: "Spot on about noise! I ditched blowers for..." - honestly its such a scam how much we gotta spend just to run models locally. nvidia keeps vram so low on consumer cards it feels totally intentional. i tried the diy route and it was just a massive headache with power and heat. honestly, these days i think paying for a pro service is better because local hardware is a money pit that never works right. sucks that its come to this...

