Step by step: Ubuntu 16.04, AMD-pro driver, rock solid optiminer 1.4.0

Hi tim-olson, this is on two nearly identical 7-GPU rigs. The last x16 slot on the mobo, when populated via a riser (GPU2), will always drop. Placing that GPU directly in the x16 slot stops this and improves stability. However, a GPU will still drop if you ssh into the machine; it just won't be GPU2 any more, I believe it is now GPU0 (I think another x16 slot). The issue is not that you can't ssh into the rig, the issue is that if you do ssh into the rig it will cause a GPU to drop. Once the system is running on all 7 GPUs it will typically remain running on all GPUs unless you ssh into the rig (even for a few seconds). This has been very consistent and repeatable (an ssh connection causes GPUs to drop).

The remaining issue I have is several rigs dropping GPUs or throwing segmentation faults at the same time. This can only be a power issue: the rigs are all on separate 20A lines and there is no way for them to interact except through power (surge, brownout).

I am out of town this week and have had no issues with the rigs except when they all dropped at the same time (happened once this week). Once I reboot, restart the miners, and drop the ssh connection immediately, they just run.

When several rigs drop at the same time, don't they reboot by themselves? If so, don't they restart the miners automatically? Just wondering why you had to ssh in to get things running again.

I don't have auto reboot and miner restart set up on these two rigs, as I am having issues with a soft reboot locking up, and then I lose the whole rig.

When I drop a GPU, the remaining 6 happily mine without issue as long as I don't ssh in. So if I ssh into a rig I am committed to achieving a soft reboot or performing a hard reset. I am working on a remote/automated method to perform a hard reset using an old Raspberry Pi, assuming I can't resolve the issue of rigs dropping GPUs simultaneously.
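
In case it's useful to anyone, here is the rough sketch I'm starting from for the Pi, assuming a relay board wired across the mobo's reset header and driven from GPIO 17 (both the pin number and the wiring are placeholders, untested):

```sh
#!/bin/sh
# Untested sketch: pulse a relay wired across the mobo's reset header.
# GPIO 17 is a placeholder; this uses the Pi's sysfs GPIO interface.
echo 17 > /sys/class/gpio/export
echo out > /sys/class/gpio/gpio17/direction
echo 1 > /sys/class/gpio/gpio17/value   # close the relay (assert reset)
sleep 1
echo 0 > /sys/class/gpio/gpio17/value   # open the relay (release reset)
echo 17 > /sys/class/gpio/unexport
```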

I don't recall - are you running UB 16.x or 14.x or something else? I'm asking because I had a weird issue with 14.04 getting hung on reboot, which I solved in my case.

Edit: Guess I should have looked at the thread title before asking that question :slight_smile:

16.04 Desktop. I plan to give Server a try again over the weekend, but it's really hard to get Server to work with all 7 GPUs.

OK. My problem was UB 14-specific (the Plymouth crap). The reboot failure is likely caused by a process that won't die, but it may not be the miner(s). If you're going to reboot, you may as well issue a hard reset (shown in optiminer's mine.sh script in the install directory, at the bottom). Note: sudo does not give you enough privileges to do a hard reset, you have to be "real" root. I can help you with it if you need a hand.
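
The mechanism, for the curious, is the kernel's magic SysRq trigger. I'd have to double-check that mine.sh does exactly this, but it's along these lines:

```sh
# Must be run from a real root shell; with plain sudo the redirect
# is opened by YOUR (non-root) shell and fails.
echo 1 > /proc/sys/kernel/sysrq    # make sure SysRq is enabled
echo b > /proc/sysrq-trigger       # immediate reboot: no sync, no shutdown
```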

Does anyone have any suggestions for getting the watchdog to work? I'm running mine.sh via rc.local so it's definitely being executed as root, yet optiminer just hangs on resetting the GPU and no commands in watchdog-cmd.sh are executed. I even tried changing the script to just touch a file, but still nothing. Any ideas?
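
For reference, my rc.local entry is roughly this (the install path and log file are mine; yours will differ):

```sh
#!/bin/sh -e
# /etc/rc.local — executed as root at the end of every multi-user boot
/home/miner/optiminer-zcash/mine.sh >> /var/log/optiminer.log 2>&1 &
exit 0
```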

Then it very well may be a hardware issue. Risers would be number 1 on my list. Keep in mind that the watchdog triggers after watchdog-timeout seconds of no solutions from the GPU. By that time, the system may already be buggered (bus hung up, etc.). I have my watchdog-timeout set to 6 seconds to try to detect a non-responsive GPU faster. Don't know if that may help or not.
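
For what it's worth, I pass it on the command line. The flag spellings below are just how I remember the setting names, so verify them against the miner's --help output before copying anything:

```sh
# Flag names assumed from the settings discussed above — verify locally.
./optiminer-zcash --watchdog-cmd ./watchdog-cmd.sh --watchdog-timeout 6 ...
```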

I've seen this with USB wifi modules as well. Depending on which GPU has the issue, I might lose my wifi. I have a watchdog that tests internet connectivity and reboots if it can't connect for over a minute.
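
Mine is nothing fancy, roughly this shape (the ping target and timings are just my choices; run it as root, e.g. from rc.local, so /sbin/reboot works):

```sh
#!/bin/sh
# Connectivity watchdog sketch: reboot after roughly a minute offline.
# 8.8.8.8 and the 10s/6-strike timing are arbitrary choices — adjust to taste.
fails=0
while true; do
    if ping -c 1 -W 5 8.8.8.8 > /dev/null 2>&1; then
        fails=0
    else
        fails=$((fails + 1))
    fi
    if [ "$fails" -ge 6 ]; then    # 6 misses at 10s spacing ≈ one minute
        /sbin/reboot
    fi
    sleep 10
done
```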

Thanks, I am looking into that and may need your help. I thought it was impossible in Ubuntu to run as TRUE root?

I have updated to 1.6.1 and am getting about 320 S/s on my XFX cards and 310 on my Nitro Sapphires. This new version seems more stable, but I will still see a GPU drop with ssh connections. You are using static IPs with no DHCP; I am using static IPs via DHCP on an isolated subnet behind an IPCop firewall. Perhaps your way is better, assuming the DNS issues you uncovered in 1.6.0 can be resolved (has 1.6.1 fixed this?). I can move one rig to the orange network with no DHCP to test.

I know some process is hanging and causing the reboot hang once a GPU has dropped, so I set out to find out what it is (I think it's an AMD process). Instead of using ssh I set up a local terminal so that I can see what hangs on reboot (my ssh connection dumps me when I reboot, so I can't see what is going on). However, now I can't get a GPU to drop. So it's not necessarily the ssh connection itself, but perhaps the added network traffic? The NIC is on the same PCI bus as the GPUs; the miner also uses the NIC, but maybe that traffic is coordinated? I know that for me to get Ubuntu 16.04 Server to recognize all 7 GPUs I have to disable my NIC and reboot into a local terminal at least once, then I can re-enable the NIC, but it's very, very touchy.

I get that risers can cause instability. However, for me it's always the same PCI slot: I can swap risers around and GPUs around and the problem stays with the mobo PCI slot, not with the GPU or riser (except when I put one GPU on the mobo, then the problem moved to a different slot and stayed there).

I'll try a few things to test this network traffic idea and get back to you.

Update:
Sshing into the rig with the miner running on a local terminal precipitated a new failure mode: no single GPU dropped, but every GPU's S/s rate dropped (and kept dropping). Never seen that before. A second try dropped GPU0.

Update:
Ping hung the whole rig and killed the local terminal, so I could not investigate (it took about an hour). Other rigs on the same subnet are fine. So there would seem to be a connection between network traffic and GPU stability, possibly because the NIC is on the same bus? I can try a long-term test running the miner from a local terminal (I have not been able to get any GPUs to drop that way). Unfortunately, managing multiple rigs via a local terminal is not viable.

True root is simple: sudo bash
Or, of course, rc.local or root cronjob.
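
The classic demonstration, and also why plain sudo wasn't enough for the hard reset mentioned earlier:

```sh
# Fails: the redirect is opened by YOUR shell, which is not root.
sudo echo b > /proc/sysrq-trigger

# Works: the whole shell is root, so the redirect is too.
sudo bash
echo b > /proc/sysrq-trigger
```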

The v1.6.1 DNS issue is resolved. Optiminer refers to "proxy.optiminer.pl" in the miner for his devfee. A host name lookup gives the IP address as "88.99.30.25", which I added to my /etc/hosts file, and BAM!, it works fine with no DNS.
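
In other words, one line in /etc/hosts:

```sh
# /etc/hosts — pin the devfee proxy so it resolves without DNS
88.99.30.25    proxy.optiminer.pl
```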

Wait! You say that in Ubuntu a root cronjob runs as TRUE root? I have been chasing an issue for months with a moronic backup routine running from a root cron that corrupts the drive images of PXE-boot single-board computers if the system is shut down overnight (the backup runs at 2AM). That makes sense: the backup runs as root while the system boots, exactly when PXE boot is happening. I have had to disable the backup cron job to keep these systems running. Thanks, I knew that the backup was running with elevated privileges and causing the issue, but it running as true root explains a lot. I owe you!

Does optiminer have a static IP? I know my ISP does not guarantee a static IP, even though it almost never changes. It's a real pain when I start getting over 10K firewall hits a day and need my IP to change. That happens when one of the kids has a friend over with an infected device that connects to the guest WiFi (hits go through the roof).

Really don't know if his IP is static. If all my rigs stop, I guess I'll know :slight_smile: I'll have to look it up again. There's a middle ground, though. /etc/nsswitch.conf defines the lookup order for hosts. The default on Linux in general is hosts file first, then DNS. So, if you can run without the dns entry, just hosts, you know you have all the "normal" IP references covered in /etc/hosts; then you can put dns back in, so if something gets added later and it's not in /etc/hosts, it will be resolved via DNS.
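
The relevant line, for reference:

```sh
# /etc/nsswitch.conf — consult /etc/hosts first, then fall back to DNS
hosts: files dns
```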

Yes, the root cronjob is "real" root, while your user cronjob is you.

So I can't seem to get the miner to drop a GPU using a local terminal instead of ssh. Once I ssh in it will drop eventually, and ping locked up the whole rig in about an hour and even killed the local terminal. So there IS something going on between network traffic on the PCI bus and GPU stability. I am headed out of town again, so I'm going to leave it running like this to see if it ever drops a GPU. My other 7-GPU rig has been running since Saturday, when I installed 1.6.1, with no issues (I started it with an ssh session but disconnected immediately).

MB NIC problem, perhaps?

Edit: Maybe try setting that i/f to 100baseT, full-duplex, instead of 1000baseT, if that's what you're running normally. 100 Mbit/s is plenty for mining.
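
Assuming the interface is eth0 and the driver supports it, something like:

```sh
# force 100 Mbit/s full duplex; replace eth0 with your actual interface
sudo ethtool -s eth0 speed 100 duplex full autoneg off
```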

Well, I have had one 7-GPU rig rock stable for the last couple of weeks (it has GPU2 on the mobo, in x16 slot 3). The other seems to have issues with x16 slots 1 and 3 when on risers (GPUs 0 & 2), but not if the card is plugged directly into the mobo.

It's not the risers (I swapped those and also checked them on my test rig), and it's not the GPUs (I've swapped those between rigs and my test rig). The mobos are 1 serial number apart, but the mobo is the only thing it could be. Sometimes it will work for days, then just quit and start dropping GPU0 immediately when the miner starts. However, it will run all week with 6 GPUs (x16 slot 1 must be empty, and x16 slot 3 must have a GPU plugged in directly).

I am ordering a few x16 powered risers, since I can't plug a GPU into slot 1 without blocking another PCI slot. I am also going to try a different mobo on my next build; I am not very happy with MSI. I like the 7-GPU rigs and think the stability is just a matter of tweaking till it's right. I am thinking ASUS, as I have always had good luck with their hardware.

I am now running optiminer 1.6.2 and AMD 16.60 and all seems fine. The XFX RX 480s are getting ~320 S/s and the Nitro Sapphire RX 480s ~310 S/s, still at about 1130 watts at the wall.

From what I can tell, AMD changed a few things under the hood in 16.60, as I now get RPM values for the fans from sensors. I'm not holding my breath, but I am hoping they start adding some overclock and undervolt capabilities.

I think the ssh/network traffic issue is still present, but since things are now stable I will worry about that another day.

Almost all my rigs are Asrock H97 Anniversary MBs, with the remainder being Asrock H81 BTC v1. Have you compared the MB BIOS settings between the two rigs? I know mine have options for PCI slot detection (Gen1, Gen2, Auto in my case).

I systematically tried all the various PCI slot detection BIOS settings, including latency, but it did not make a difference. Right now both mobos have default BIOS settings, since that seems to work best. I think the issue does have something to do with the BIOS implementation on the MSI mobo. This is my first time using MSI and I was concerned about Linux compatibility. I know Asrock and Gigabyte are very Linux friendly, as I use them in my business. I have also had good luck with ASUS for Linux servers.

My test rig is an Asrock, and I just brought home a Gigabyte 3U system that I may mess around with. I need to order another 7 GPUs, and I may as well get an ASUS Z97 and run some tests on all three. I can always turn these MSI boards into 6-GPU rigs once I find the right mobo for a stable 7-GPU rig. I am also ordering a PCIe splitter to play around with; I just have to try for 8 GPUs, or perhaps use some of these old mobos I have lying around for 6 or 7 GPU rigs.
