Step by step: Ubuntu 16.04, AMD-pro driver, rock solid optiminer 1.4.0

Hi tim-olson, this is on two nearly identical 7-GPU rigs. The last x16 slot on the mobo, when populated via a riser (GPU2), will always drop. Placing that GPU directly in the x16 slot stops this and improves stability. However, a GPU will still drop if you ssh into the machine; it just won't be GPU2 any more, I believe it is now GPU0 (I think another x16 slot). The issue is not that you can't ssh into the rig, the issue is that if you do ssh into the rig it will cause a GPU to drop. Once the system is running on all 7 GPUs it will typically remain running on all GPUs unless you ssh into the rig (even for a few seconds). This has been very consistent and repeatable (an ssh connection causes GPUs to drop).

The remaining issue I have is several rigs dropping GPUs or throwing segmentation faults at the same time. This can only be a power issue: the rigs are all on separate 20A lines and there is no way for them to interact except through power (surge, brownout).

I am out of town this week and have had no issues with the rigs except when they all dropped at the same time (happened once this week). Once I reboot, restart the miners, and drop the ssh connection immediately, they just run.

When several rigs drop at the same time, don't they reboot by themselves? If so, don't they restart the miners automatically? Just wondering why you had to ssh in to get things running again.

I don't have auto reboot and miner restart set up on these two rigs, as I am having issues with a soft reboot locking up, and then I lose the whole rig.

When I drop a GPU, the remaining 6 happily mine without issue as long as I don't ssh in. So if I ssh into a rig I am committed to achieving a soft reboot or performing a hard reset. I am working on a remote/automated method to perform a hard reset using an old Raspberry Pi, assuming I can't resolve the issue of rigs dropping GPUs simultaneously.
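
In case it's useful to anyone, here is the rough sketch I'm starting from for the Pi, assuming a relay board wired across the mobo's reset header and driven from GPIO 17 (both the pin number and the wiring are placeholders, untested):

```sh
#!/bin/sh
# Untested sketch: pulse a relay wired across the mobo's reset header.
# GPIO 17 is a placeholder; this uses the Pi's sysfs GPIO interface.
echo 17 > /sys/class/gpio/export
echo out > /sys/class/gpio/gpio17/direction
echo 1 > /sys/class/gpio/gpio17/value   # close the relay (assert reset)
sleep 1
echo 0 > /sys/class/gpio/gpio17/value   # open the relay (release reset)
echo 17 > /sys/class/gpio/unexport
```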

I don't recall - are you running UB 16.x or 14.x or something else? I'm asking because I had a weird issue with 14.04 getting hung on reboot, which I solved in my case.

Edit: Guess I should have looked at the thread title before asking that question :slight_smile:

16.04 Desktop. I plan to give Server a try again over the weekend, but it's really hard to get Server to work with all 7 GPUs.

OK. My problem was UB 14-specific (the Plymouth crap). The reboot failure is likely caused by a process that won't die, but it may not be the miner(s). If you're going to reboot, you may as well issue a hard reset (shown in optiminer's mine.sh script in the install directory, at the bottom). Note: sudo does not give you enough privileges to do a hard reset, you have to be "real" root. I can help you with it if you need a hand.
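
The mechanism, for the curious, is the kernel's magic SysRq trigger. I'd have to double-check that mine.sh does exactly this, but it's along these lines:

```sh
# Must be run from a real root shell; with plain sudo the redirect
# is opened by YOUR (non-root) shell and fails.
echo 1 > /proc/sys/kernel/sysrq    # make sure SysRq is enabled
echo b > /proc/sysrq-trigger       # immediate reboot: no sync, no shutdown
```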

Does anyone have any suggestions for getting the watchdog to work? I'm running mine.sh via rc.local so it's definitely being executed as root, yet optiminer just hangs on resetting the GPU and no commands in watchdog-cmd.sh are executed. I even tried changing the script to just touch a file, but still nothing. Any ideas?
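
For reference, my rc.local entry is roughly this (the install path and log file are mine; yours will differ):

```sh
#!/bin/sh -e
# /etc/rc.local — executed as root at the end of every multi-user boot
/home/miner/optiminer-zcash/mine.sh >> /var/log/optiminer.log 2>&1 &
exit 0
```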

Then it very well may be a hardware issue. Risers would be number 1 on my list. Keep in mind that the watchdog triggers after watchdog-timeout seconds of no solutions from the GPU. By that time, the system may already be buggered (bus hung up, etc.). I have my watchdog-timeout set to 6 seconds to try to detect a non-responsive GPU faster. Don't know if that may help or not.
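
For what it's worth, I pass it on the command line. The flag spellings below are just how I remember the setting names, so verify them against the miner's --help output before copying anything:

```sh
# Flag names assumed from the settings discussed above — verify locally.
./optiminer-zcash --watchdog-cmd ./watchdog-cmd.sh --watchdog-timeout 6 ...
```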

I've seen this with USB wifi modules as well. Depending on which GPU has the issue, I might lose my wifi. I have a watchdog that tests internet connectivity and reboots if it can't connect for over a minute.
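
Mine is nothing fancy, roughly this shape (the ping target and timings are just my choices; run it as root, e.g. from rc.local, so /sbin/reboot works):

```sh
#!/bin/sh
# Connectivity watchdog sketch: reboot after roughly a minute offline.
# 8.8.8.8 and the 10s/6-strike timing are arbitrary choices — adjust to taste.
fails=0
while true; do
    if ping -c 1 -W 5 8.8.8.8 > /dev/null 2>&1; then
        fails=0
    else
        fails=$((fails + 1))
    fi
    if [ "$fails" -ge 6 ]; then    # 6 misses at 10s spacing ≈ one minute
        /sbin/reboot
    fi
    sleep 10
done
```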

Thanks, I am looking into that and may need your help. I thought it was impossible in Ubuntu to run as TRUE root?

I have updated to 1.6.1 and am getting about 320 S/s on my XFX cards and 310 on my Nitro Sapphires. This new version seems more stable, but I will still see a GPU drop with ssh connections. You are using static IPs with no DHCP; I am using static IPs via DHCP on an isolated subnet behind an IPCop firewall. Perhaps your way is better, assuming the DNS issues you uncovered in 1.6.0 can be resolved (has 1.6.1 fixed this?). I can move one rig to the orange network with no DHCP to test.

I know some process is hanging and causing the reboot hang once a GPU has dropped, so I set out to find out what it is (I think it's an AMD process). Instead of using ssh I set up a local terminal so that I can see what hangs on reboot (my ssh connection dumps me when I reboot, so I can't see what is going on). However, now I can't get a GPU to drop. So it's not necessarily the ssh connection itself, but perhaps the added network traffic? The NIC is on the same PCI bus as the GPUs; the miner also uses the NIC, but maybe that traffic is coordinated? I know that for me to get Ubuntu 16.04 Server to recognize all 7 GPUs I have to disable my NIC and reboot into a local terminal at least once, then I can re-enable the NIC, but it's very, very touchy.

I get that risers can cause instability. However, for me it's always the same PCI slot: I can swap risers around and GPUs around and the problem stays with the mobo PCI slot, not with the GPU or riser (except when I put one GPU on the mobo, then the problem moved to a different slot and stayed there).

I'll try a few things to test this network traffic idea and get back to you.

Update:
Sshing into the rig with the miner running on a local terminal precipitated a new failure mode: no single GPU dropped, but every GPU's S/s rate dropped (and kept dropping). Never seen that before. A second try dropped GPU0.

Update:
Ping hung the whole rig and killed the local terminal, so I could not investigate (it took about an hour). Other rigs on the same subnet are fine. So there would seem to be a connection between network traffic and GPU stability, possibly because the NIC is on the same bus? I can try a long-term test running the miner from a local terminal (I have not been able to get any GPUs to drop that way). Unfortunately, managing multiple rigs via a local terminal is not viable.

True root is simple: sudo bash
Or, of course, rc.local or root cronjob.
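
The classic demonstration, and also why plain sudo wasn't enough for the hard reset mentioned earlier:

```sh
# Fails: the redirect is opened by YOUR shell, which is not root.
sudo echo b > /proc/sysrq-trigger

# Works: the whole shell is root, so the redirect is too.
sudo bash
echo b > /proc/sysrq-trigger
```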

The v1.6.1 DNS issue is resolved. Optiminer refers to "proxy.optiminer.pl" in the miner for his devfee. A host name lookup gives the IP address as "88.99.30.25", which I added to my /etc/hosts file, and BAM!, it works fine with no DNS.
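
In other words, one line in /etc/hosts:

```sh
# /etc/hosts — pin the devfee proxy so it resolves without DNS
88.99.30.25    proxy.optiminer.pl
```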

Wait! You say that in Ubuntu a root cronjob runs as TRUE root? I have been chasing an issue for months with a moronic backup routine running from a root cron that corrupts the drive images of PXE-boot single-board computers if the system is shut down overnight (the backup runs at 2AM). That makes sense: the backup runs as root while the system boots, exactly when PXE boot is happening. I have had to disable the backup cron job to keep these systems running. Thanks, I knew that the backup was running with elevated privileges and causing the issue, but it running as true root explains a lot. I owe you!

Does optiminer have a static IP? I know my ISP does not guarantee a static IP, even though it almost never changes. It's a real pain when I start getting over 10K firewall hits a day and need my IP to change. That happens when one of the kids has a friend over with an infected device that connects to the guest WiFi (hits go through the roof).

Really don't know if his IP is static. If all my rigs stop, I guess I'll know :slight_smile: I'll have to look it up again. There's a middle ground, though. /etc/nsswitch.conf defines the lookup order for hosts. The default on Linux in general is hosts file first, then DNS. So, if you can run without the dns entry, just hosts, you know you have all the "normal" IP references covered in /etc/hosts; then you can put dns back in, so if something gets added later and it's not in /etc/hosts, it will be resolved via DNS.
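
The relevant line, for reference:

```sh
# /etc/nsswitch.conf — consult /etc/hosts first, then fall back to DNS
hosts: files dns
```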

Yes, the root cronjob is "real" root, while your user cronjob is you.

So I can't seem to get the miner to drop a GPU using a local terminal instead of ssh. Once I ssh in it will drop eventually, and ping locked up the whole rig in about an hour and even killed the local terminal. So there IS something going on between network traffic on the PCI bus and GPU stability. I am headed out of town again, so I'm going to leave it running like this to see if it ever drops a GPU. My other 7-GPU rig has been running since Saturday, when I installed 1.6.1, with no issues (I started it with an ssh session but disconnected immediately).

MB NIC problem, perhaps?

Edit: Maybe try setting that i/f to 100baseT, full-duplex, instead of 1000baseT, if that's what you're running normally. 100 Mbit/s is plenty for mining.
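
Assuming the interface is eth0 and the driver supports it, something like:

```sh
# force 100 Mbit/s full duplex; replace eth0 with your actual interface
sudo ethtool -s eth0 speed 100 duplex full autoneg off
```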

Well, I have had one 7-GPU rig rock stable for the last couple of weeks (it has GPU2 on the mobo, in x16 slot 3). The other seems to have issues with x16 slots 1 and 3 when on risers (GPUs 0 & 2), but not if the card is plugged directly into the mobo.

It's not the risers (I swapped those and also checked them on my test rig), and it's not the GPUs (I've swapped those between rigs and my test rig). The mobos are 1 serial number apart, but the mobo is the only thing it could be. Sometimes it will work for days, then just quit and start dropping GPU0 immediately when the miner starts. However, it will run all week with 6 GPUs (x16 slot 1 must be empty, and x16 slot 3 must have a GPU plugged in directly).

I am ordering a few x16 powered risers, since I can't plug a GPU into slot 1 without blocking another PCI slot. I am also going to try a different mobo on my next build; I am not very happy with MSI. I like the 7-GPU rigs and think the stability is just a matter of tweaking till it's right. I am thinking ASUS, as I have always had good luck with their hardware.

I am now running optiminer 1.6.2 and AMD 16.60 and all seems fine. The XFX RX 480s are getting ~320 S/s and the Nitro Sapphire RX 480s ~310 S/s, still at about 1130 watts at the wall.

From what I can tell, AMD changed a few things under the hood in 16.60, as I now get RPM values for the fans from sensors. I'm not holding my breath, but I am hoping they start adding some overclock and undervolt capabilities.

I think the ssh/network traffic issue is still present, but since things are now stable I will worry about that another day.

Almost all my rigs are Asrock H97 Anniversary MBs, with the remainder being Asrock H81 BTC v1. Have you compared the MB BIOS settings between the two rigs? I know mine have options for PCI slot detection (Gen1, Gen2, Auto in my case).

I systematically tried all the various PCI slot detection BIOS settings, including latency, but it did not make a difference. Right now both mobos have default BIOS settings, since that seems to work best. I think the issue does have something to do with the BIOS implementation on the MSI mobo. This is my first time using MSI and I was concerned about Linux compatibility. I know Asrock and Gigabyte are very Linux friendly, as I use them in my business. I have also had good luck with ASUS for Linux servers.

My test rig is an Asrock, and I just brought home a Gigabyte 3U system that I may mess around with. I need to order another 7 GPUs, and I may as well get an ASUS Z97 and run some tests on all three. I can always turn these MSI boards into 6-GPU rigs once I find the right mobo for a stable 7-GPU rig. I am also ordering a PCIe splitter to play around with; I just have to try for 8 GPUs, or perhaps use some of these old mobos I have lying around for 6 or 7 GPU rigs.
