This is a blog post about a problem which literally drove me crazy for a week. After building our mining rig I experienced a bad WiFi connection with high pings, periodically occurring every 30 seconds.
Just scroll down to see my – fairly simple – solution.
Getting into the mining business
A few weeks ago some of my co-workers and I decided to build a simple mining rig to make some Ethereum tokens. The exchange rate for Ethereum has fallen over the last few days, but it is what it is. Anyhow, we bought 12 Nvidia GTX 1070 cards, 12 riser cards, 2 mainboards, 4 PSUs with 600 W each and a wattmeter. We assembled everything into an open metal cabinet, put an access point (DD-WRT firmware, Linksys) on it and connected the mainboards to the access point.
I should mention that the mining rig is located in the study of one of our flats. The access point on top of the cabinet acts as a wireless bridge to our other flat. Both mainboards and my workstation are connected to the access point with Ethernet cables. The other flat contains an additional access point with a cable modem and internet connectivity. Nothing fancy.
We switched from ethminer to Claymore’s Ethereum Dual miner due to some problems handling multiple cards and wallets. In the end the rigs worked like a charm.
Experiencing lags in Overwatch
Two days later I wanted to play an Overwatch match on my workstation, also located in my study room. The ping was unstable, and a simple ping command showed that I had random timeouts and that the ping spiked every 30 seconds from 20 ms to > 1500 ms for a few seconds. This had not happened before the mining rigs were active.
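A pattern like this can be pulled out of raw ping output with a few lines of Python. The following is only a minimal sketch under some assumptions: Linux-style ping output with "time=… ms" fields, one reply per second, and a threshold of 500 ms that is picked arbitrarily. The function names are made up for illustration.

```python
import re

# Matches the round-trip time in Linux-style ping output, e.g. "time=20.1 ms".
RTT_PATTERN = re.compile(r"time=([\d.]+) ms")


def parse_rtts(ping_output):
    """Return the round-trip times (ms) found in ping output, in order."""
    return [float(m.group(1)) for m in RTT_PATTERN.finditer(ping_output)]


def spike_onsets(rtts, threshold_ms=500.0):
    """Return the indices where the RTT first rises above the threshold.

    Consecutive samples above the threshold count as one spike, so only
    the first sample of each run is reported.
    """
    onsets = []
    in_spike = False
    for i, rtt in enumerate(rtts):
        if rtt > threshold_ms and not in_spike:
            onsets.append(i)
        in_spike = rtt > threshold_ms
    return onsets


def spike_intervals(onsets):
    """Return the distances between consecutive spike onsets, in samples."""
    return [b - a for a, b in zip(onsets, onsets[1:])]
```

With one reply per second, an interval of 30 samples corresponds to 30 seconds. Timeouts produce no "time=" field, so lost replies shorten the measured intervals; a more careful version would key on icmp_seq instead of the sample index.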
“This must be a software problem of Claymore’s miners”
My first guess was that it had to be a software problem of Claymore’s miner. One of my co-miners had tested a single mainboard with one GPU at his home beforehand and everything worked flawlessly. I started to analyze the problem:
- Killed each claymore miner process on rig1 and rig2: no lag occurred
- Started a single claymore miner process: lag occurred every 30 seconds with > 600ms when receiving the first Ethereum share. This indicated a problem of the network implementation of Claymore’s miner or some high bandwidth usage. I checked the bandwidth but one claymore miner instance just requires 12 kBit/s.
- Started tcpdump on rig1 to identify any conspicuous network activity or packets. Neither the UDP nor the TCP traffic stood out. I could only correlate the receipt of Ethereum shares with the latency spikes. The network bandwidth in use was still low.
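Such a correlation between share events and latency spikes can be checked roughly in code. This is a minimal sketch with assumed inputs: both lists hold timestamps in seconds on the same clock (for example taken from the miner log and a ping run), and the 2-second window is an arbitrary choice. The function name is hypothetical.

```python
def fraction_near_events(spike_times, event_times, window_s=2.0):
    """Return the share of spikes that start within window_s of any event."""
    if not spike_times:
        return 0.0
    near = sum(
        1
        for spike in spike_times
        if any(abs(spike - event) <= window_s for event in event_times)
    )
    return near / len(spike_times)
```

A value close to 1.0 means almost every spike started near a share event. It says nothing about the cause, only that the two happen together.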
“This must be a network problem with Claymore’s miner”
The last application with which I had slightly similar problems was Subversion. Ten years ago SVN sometimes failed to commit data. It turned out that TortoiseSVN struggled with special packets, the MTU size of our company network and the MTU size of our ADSL connection. Because of this, I changed the MTU size of the rig running the single claymore process. It did not change anything.
Before trying anything else I disabled the network-related services firewalld and chronyd, without success. Running strace on the miner did not show anything special either.
“This must be a problem with Ethereum protocol and DD-WRT”
One interesting observation I made was that the pings along rig -> ap2 (bridge) -> ap1 (router) -> internet and workstation -> ap2 (bridge) -> ap1 (router) -> internet were both bad, but pinging directly from the main access point, ap1 (router) -> internet, showed no problem. What the hell?
I suspected that some TCP settings on ap2 (bridge) led to these hiccups. Luckily I could check the network settings and stats of both access points (bridge and router) as they are running DD-WRT. As you can imagine: there were no suspicious changes in the network stats (TCP/UDP) when a spike occurred.
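The reasoning behind this comparison can be written down as a small set operation: a link is suspicious if it is part of every bad path and of no good path. The sketch below is only an illustration; the function names are made up, and the hop names are the ones used in this post.

```python
def links(path):
    """Split a path like ["rig", "ap2", "ap1"] into its hop-to-hop links."""
    return set(zip(path, path[1:]))


def suspect_links(bad_paths, good_paths):
    """Return links shared by every bad path that appear on no good path."""
    if not bad_paths:
        return set()
    common = set.intersection(*(links(p) for p in bad_paths))
    healthy = set().union(*(links(p) for p in good_paths))
    return common - healthy


bad = [
    ["rig", "ap2", "ap1", "internet"],
    ["workstation", "ap2", "ap1", "internet"],
]
good = [["ap1", "internet"]]
print(suspect_links(bad, good))
```

For the three pings above, the only link left over is ap2 -> ap1, the wireless bridge between the two flats.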
Could this be a hardware problem?
As I could not see any problem in the software or on the network layers (>= L2), it could only be a generic hardware problem or some L1 error.
During my TCP stats investigation on the access points, I noticed that the WiFi rate of the bridge (ap2) was unstable and fluctuated heavily. This was highly unusual, as it had not happened before we built the rigs.
To rule out any directly network-related problems I did the simplest possible thing: I pulled the Ethernet cables of both rigs (running one active miner process each) so they were no longer connected to the access point. To my surprise I still had network lags. WTF?
After killing both miner processes the network lags went away. This obviously had to be a problem with the GPU load the mining process creates.
To give you some insight: Due to some DD-WRT restrictions the bridge between both access points uses 2.4 GHz and not 5 GHz. Could this be some interference on the wireless layer?
After googling for “gpu” and “spike”, two posts caught my eye. After reading both of them
- I changed the WiFi channel from 1 to 11
- I removed the DVI cable from a TFT connected to one rig
- I removed the USB keyboard connected to one rig
Nothing changed. This was about the point where I wanted to give up. The last thing to test was another power connection. ap2 and all 4 PSUs of the rig were connected to the same connector: (psu1, psu2, psu3, psu4) -> wattmeter -> wall socket. Maybe voltage spikes under GPU load were confusing the access point hardware?
Changing the wall socket
I had no free wall socket available behind the cabinet containing both rigs. So I moved the access point from the top of the rig to the floor and a few centimeters toward the other wall. After the access point had power and was connected to ap1 (router) again, the network spikes dropped from 1600 ms to 800 ms. Uhm? I moved ap2 another 20 centimeters away from the cabinet. Spikes went down to 400 ms.
At a distance of 1.50 meters between rig and access point no more spikes occurred. I double-checked whether the different wall socket was the solution. But switching from the wall socket back to the wattmeter-connected connector made no difference.
So simple: just move the access point away. This whole thing drove me crazy for at least 5 afternoons. I felt so stupid.
The high load of the GPU when running the Ethereum mining process produces either a signal at 2.4 GHz (which is less likely) or a signal around 1.2 GHz with a harmonic at 2.4 GHz (which is more likely). I assume that the spikes every 30 seconds occur when both rigs receive the same mining job at almost the same time and start mining. If anybody has more information, just let me know. I am very interested in the technical explanation for this.
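The arithmetic behind the harmonic guess is simple: the second harmonic of a signal around 1.2 GHz lands in the 2.4 GHz WiFi band. The sketch below checks which harmonics of a given fundamental fall into a WiFi channel. The fundamental of 1206 MHz is a made-up value for illustration, not something I measured, and the 20 MHz channel width is an assumption. Only the channel center formula (2407 MHz + 5 MHz per channel, for channels 1 to 13) is standard.

```python
def channel_center_mhz(channel):
    """Center frequency of a 2.4 GHz WiFi channel (channels 1 to 13)."""
    if not 1 <= channel <= 13:
        raise ValueError("only channels 1 to 13 are covered here")
    return 2407 + 5 * channel


def harmonics_in_channel(fundamental_mhz, channel, width_mhz=20.0, max_order=4):
    """Return the harmonic orders of fundamental_mhz that fall inside the channel.

    Order 1 is the fundamental itself, order 2 is twice the frequency, etc.
    """
    center = channel_center_mhz(channel)
    low, high = center - width_mhz / 2, center + width_mhz / 2
    return [
        order
        for order in range(1, max_order + 1)
        if low <= order * fundamental_mhz <= high
    ]


# Hypothetical fundamental of 1206 MHz: 2 * 1206 MHz = 2412 MHz, the center of channel 1.
print(harmonics_in_channel(1206.0, channel=1))
```

A single narrow signal like this would only hit one channel. Since switching from channel 1 to 11 did not help in my case, the real interference is probably broader than this toy example.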