Description

This is a machine learning server that we will use at our company. It will eventually live in our cooled server room. I wanted to go with the Corsair Carbide Air 540 case, but my boss said it was too big. The GD07B he chose should be rack mountable. Also, the Seagate hard disk drive (HDD) I chose was replaced with a Western Digital (WD) HDD because we have had problems with Seagate.

In theory, this build can support up to four graphics processing units (GPUs) for 4-way Scalable Link Interface (SLI). In theory, all the cables will fit in the case. In practice, getting everything to fit in this particular case will be a hassle. I would recommend the bigger Corsair Carbide Air 540 if you want to do 4-way SLI and you can spare the space.

X299 TAICHI + i9-7900X and X99-M WS + i7-5930K builds were also evaluated. The X99-M WS is great for 2 GPUs if you want to use older hardware. The Z10PE-D8 WS is an alternative to the X99-M WS if you want to use 2 CPUs and 4 GPUs. The X299 TAICHI is a good motherboard if you want 3-way SLI.

External GPU (eGPU) enclosures with GTX 1080 Tis were also evaluated, but my coworkers did not like that option because they use laptops. They would need to unplug the eGPU, and lose a running experiment, every time they took their laptop to a meeting.

The X299 SAGE strikes me as a premium motherboard for gaming, ML or crypto mining. For what it is worth, I would love to see a WS X299 SAGE 4-way SLI RTX 2080 Ti build. =)

References for this Build

Comments

  • 13 months ago
  • 1 point

Well you’ll save your company some money in heating costs!

  • 13 months ago
  • 1 point

How’s that, are they known for that type of service?

[comment deleted]
  • 13 months ago
  • 1 point

Oh, I thought they were talking about Seagate not being reliable.

  • 13 months ago
  • 1 point

One of the big differences between a gaming build and a graphics processing unit (GPU) based machine learning (ML) build is that the ML build can get away with a wimpy central processing unit (CPU) cooler. The CPU's job in this kind of build is simply to get data onto and off of the GPUs. The GPUs do the heavy lifting.
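To make the "CPU just feeds the GPUs" point a bit more concrete, here is a minimal sketch, using only the Python standard library, of the prefetching pattern most training loops rely on: a background thread on the CPU loads the next batches while the current one is being consumed. The names `prefetch`, `load_batch` and `train_step` are hypothetical stand-ins (nothing here touches a real GPU); an actual ML framework's data loader would take the place of all of this.

```python
import queue
import threading


def prefetch(batches, depth=2):
    """Yield items from `batches`, loading up to `depth` items ahead in a background thread."""
    q = queue.Queue(maxsize=depth)  # bounded, so the CPU never runs too far ahead
    done = object()                 # sentinel marking the end of the stream

    def producer():
        try:
            for batch in batches:   # CPU-side work (reading/decoding) happens here
                q.put(batch)
        finally:
            q.put(done)             # always signal the end, even if loading fails

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        batch = q.get()
        if batch is done:
            break
        yield batch
    worker.join()


def load_batch(i):
    # Hypothetical stand-in for reading and preprocessing one batch on the CPU.
    return [i] * 4


def train_step(batch):
    # Hypothetical stand-in for the GPU step; here it just sums the batch.
    return sum(batch)


results = [train_step(b) for b in prefetch(load_batch(i) for i in range(5))]
print(results)  # [0, 4, 8, 12, 16]
```

As long as the loader keeps the queue full, the CPU mostly sits waiting, which is why a modest CPU cooler is usually enough for this kind of workload.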

I would have been comfortable with an even wimpier CPU cooler, but this one was cheap and readily available. Also, the machine will ultimately live in a cooled server room.

Furthermore, my boss is convinced that gamers overspend on cooling. Take that as you will. Either way, this order went through my boss. Your specific situation will certainly be different.

  • 12 months ago
  • 1 point

Hey, why'd you go with such a high-end CPU? It looks kind of like a deep learning build. Are you using it for CPU-based algorithms as well? Does SLI offer any benefit for machine learning? I thought you needed NVLink to scale over multiple GPUs.

Re: Threadripper, numpy hates it :/ Intel MKL FTW!

  • 12 months ago
  • 1 point

Are you using Numpy on Windows? Apparently Windows's scheduler sucks for NUMA systems like Threadripper. Since Numpy is memory-bandwidth-intensive, it would perform poorly on Windows, but on Linux it should be fine.

  • 11 months ago
  • 2 points

Definitely not Windows, and I got the benchmark data from: https://openbenchmarking.org/test/pts/numpy&search

There are a lot of different factors that can influence benchmarks like these, but in general Intel seems to be faster. That makes sense, because the AMD chips use 4 interconnected dies, which introduces some bottlenecks, including ones related to memory bandwidth.

  • 10 months ago
  • 1 point

I was under the impression I needed a CPU with a lot of PCIe lanes to run four cards. I have also read that that is not the case when using SLI. We are set up for two-card SLI, and I am pretty sure the motherboard comes with a bridge for 4-way SLI.
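A quick way to sanity-check the lane question is to add up the electrical width of each GPU slot and compare the total with the lanes the CPU provides. This sketch assumes every GPU is wired straight to the CPU; motherboards with PCIe switch chips can expose more slot lanes than the CPU has, so treat the result as a rough guide only. The function name and the `reserved` parameter (lanes already used by something else, such as an NVMe drive) are hypothetical. The 44-lane figure is Intel's spec for the i9-7900X evaluated above; the 64-lane budget is just an illustrative number.

```python
def pcie_lane_check(slot_widths, cpu_lanes, reserved=0):
    """Return (lanes_needed, fits) for GPUs wired directly to the CPU.

    slot_widths: electrical width of each GPU slot, e.g. [16, 8, 16, 8]
    cpu_lanes:   PCIe lanes the CPU provides
    reserved:    lanes already taken by other devices (hypothetical, 0 by default)
    """
    needed = sum(slot_widths)
    return needed, needed + reserved <= cpu_lanes


print(pcie_lane_check([16, 8, 16, 8], cpu_lanes=44))        # (48, False)
print(pcie_lane_check([16, 8, 16, 8], cpu_lanes=64))        # (48, True)
print(pcie_lane_check([16, 16], cpu_lanes=44, reserved=4))  # (32, True)
```

Note that for deep learning the GPUs talk to the CPU over these PCIe lanes whether or not an SLI bridge is installed.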

This was a deep learning build. After having deployed it, it has been used for CPU based machine learning, so the CPU has been useful. Had I known people would use it for heavy CPU work, I probably would have considered a better CPU cooler.

My supervisor was concerned about optimizations for ML with the AMD chips. I love AMD, but the market tends to be winner-take-all, so Intel it is.

FWIW, we have three people sharing this machine via remote login. It is running Ubuntu.

  • 11 months ago
  • 1 point

Hey, what is the RAM clearance of this cooler? I will use it with a 2600X. :)

(P.S.: I will use an ITX motherboard.)

  • 10 months ago
  • 1 point

There were no real problems, although getting everything in this case was tight.

  • 8 months ago
  • 1 point

How long does this motherboard take to boot? Mine is very slow: it takes 30 seconds to show anything on screen after I press the power button. https://youtu.be/R7etwTbca0E

  • 7 months ago
  • 1 point

Thank you for the YouTube video. It is nice to see a version of this ML build in the Corsair Carbide Air 540 case.

We rarely turn it off, so I have no idea how long it takes to start. Also, I am pretty sure we are running Ubuntu, not Windows. Two or three people are constantly logged in remotely running ML jobs. If we ever did need to restart it, it takes longer than 30 seconds to get from the server room back to assigned seating.

When I asked my coworker, he said "I guess it takes some time to boot but we don’t care much".

  • 13 months ago
  • -1 points

Nice build. I don’t think people are fans of Seagate; I read a lot of people saying Seagate isn’t reliable. But this is a nice build, I like it. +1

  • 13 months ago
  • 2 points

I bought a 1TB HDD and haven't had a problem and it's been a year.

  • 13 months ago
  • 0 points

Great, it turns out I was wrong. Well, I’m glad you had no problems with your HDD.

  • 13 months ago
  • 2 points

I've had a 2TB HDD of theirs for 6 years and have never had a problem.

  • 13 months ago
  • 2 points

Oh, I was wrong then. I’m glad your HDD works perfectly, though.

  • 13 months ago
  • 2 points

I honestly just picked a hard disk drive (HDD) that was lightly optimized for price. I did not think the HDD mattered much. It happened to be a Seagate. My boss did not like the choice and replaced it with a Western Digital (WD) Red drive. I imagine most Seagate drives are fine most of the time. Seagate would be out of business if that were not true.

Furthermore, I suspect that the WD drives are actually more reliable, but that this is not something most home users need to worry about most of the time. We have lots of HDDs that facilitate non-trivial amounts of cash transactions. If hard drives fail in a way that actually causes an outage, that is an expensive problem that needs to be solved quickly. My boss should be in the mindset of not skimping on HDDs.

None of that matters for this machine. You will notice that it has no HDD redundancy. We move data onto it for an experiment and collect the results when the experiment is done. If the HDD fails, we lose that experiment, replace the drive and start over. That works for us.

I'm still not convinced the HDD matters. If you want to replicate a version of this solution where you work, your boss may think like mine.

  • 12 months ago
  • 1 point

I see plenty of people online slagging off the Barracudas.

I've got a 2TB Barracuda that's been running for 2 years and 95 days (as reported by SMART power-on hours) and had no issues. I guess Seagate had some bad batches some years ago.
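SMART reports power-on time in hours (for example, the `Power_On_Hours` attribute in `smartctl -A` output), so a figure like "2 years and 95 days" comes from a unit conversion. A minimal sketch, assuming 365-day years and ignoring leap days; the function name is hypothetical:

```python
def power_on_hours_to_years_days(hours):
    """Convert a SMART power-on-hours count to (years, days).

    Assumes 365-day years; leftover hours that don't make a full day are dropped.
    """
    days = hours // 24
    years, days = divmod(days, 365)
    return years, days


print(power_on_hours_to_years_days(19800))  # (2, 95)
```

Under these assumptions, 2 years and 95 days corresponds to roughly 19,800 power-on hours.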

[comment deleted by staff]
  • 12 months ago
  • 1 point

This works for us. I really like Threadripper. We do not need ECC for our present use case. Anyone thinking about a machine learning server should consider it.

I would love to see a Threadripper build with ECC memory on a motherboard that supports up to four GPUs.

  • 12 months ago
  • 1 point

Here's mine (semi-WIP): https://pcpartpicker.com/b/LW6scf

Currently it only has a wimpy GPU I got for free, but I'm planning on upgrading it later. The ASRock Taichi can support 4 graphics cards in x16/x8/x16/x8, or in principle even more with adapter cables since it supports PCIe bifurcation.

  • 10 months ago
  • 1 point

Thank you for sharing. I remember evaluating the Taichi, but I went with the SAGE instead.
