NMI and ehci-hcd dying on a Lenovo ThinkPad T60
While developing my device driver, I stumbled on a problem that was very difficult to solve. When I plugged the USB 802.11 card into my notebook, the EHCI HCD died and I got the following debug messages:
[ 351.781090] Uhhuh. NMI received for unknown reason b1 on CPU 0.
[ 351.781090] You have some hardware problem, likely on the PCI bus.
[ 351.781090] Dazed and confused, but trying to continue
[ 351.781165] ehci_hcd 0000:00:1d.7: fatal error
[ 351.785054] ehci_hcd 0000:00:1d.7: HC died; cleaning up
[ 351.785108] usb 1-2: MESSAGE_REQUEST_BBREG failed, error -108.
[ 351.785139] vt6656: probe of 1-2:1.0 failed with error -108
[ 351.785158] usb 1-2: USB disconnect, address 4
[ 352.096076] usb 2-2: new full speed USB device using uhci_hcd and address 2
[ 352.295147] usb 2-2: not running at top speed; connect to a high speed hub
[ 352.326289] usb 2-2: configuration #1 chosen from 1 choice
[ 352.331143] usb 2-2: current firmware found.
[ 352.331204] uhci_hcd 0000:00:1d.0: host system error, PCI problems?
[ 352.331211] uhci_hcd 0000:00:1d.0: host controller halted, very bad!
[ 352.331236] uhci_hcd 0000:00:1d.0: HC died; cleaning up
[ 352.832036] usb 2-2: MESSAGE_REQUEST_BBREG failed, error -110.
[ 352.832071] vt6656: probe of 2-2:1.0 failed with error -110
[ 352.832380] usb 2-2: USB disconnect, address 2
I was the only one in the class with this problem, even though we were all using exactly the same device driver code and the same kernel.
I spent the long weekend looking for the root cause.
On Monday at 4 am, after tracing the kernel's whole USB communication stack and reading chapter 13 of the book Linux Device Drivers for the third time, I found it.
Basically, the problem is this: we eventually call usb_control_msg() to send a message to the USB device through a control endpoint, and we pass it a pointer to a table declared as static. Internally, usb_control_msg() at some point calls usb_fill_control_urb() to fill a newly allocated urb that is to be sent asynchronously to the USB device. The table pointer is assigned directly to urb->transfer_buffer, without any copy.
Further down the stack, urb->transfer_buffer is used for a streaming DMA mapping to pass the data to the USB device through the PCI controller.
Rereading the description of the URB fields in Linux Device Drivers, I realized that you cannot pass data that is declared static or that lives on the stack through DMA. We were passing a pointer to a static array to usb_control_msg(). You need to dynamically allocate a bounce buffer (with kmalloc()) and pass that to the controller for the DMA transfer.
Some PCI controllers fail in this situation (mine did) and some do not (everyone else's worked). I have a Lenovo ThinkPad T60; I don't know which PCI controller it uses.
From what I found through Google, many Lenovo users had a similar problem, and I could not find a clear solution to it.
An interesting point about usb_control_msg() is that it is supposed to make life easier for device driver developers, so you should not have to know that a DMA transfer occurs internally. We should check that usb_bulk_msg() does not have the same problem (at least for Lenovo users).
Finally, I developed a patch that solves this problem. We now dynamically allocate a bounce buffer, copy the table contents into it, and pass the pointer to the bounce buffer to usb_control_msg().
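The idea behind the patch can be sketched in user space as follows. This is only an illustration under stated assumptions, not the actual patch: malloc() and free() stand in for kmalloc() and kfree(), stub_control_msg() is a hypothetical stand-in for usb_control_msg() with a simplified argument list, and the contents of bb_table are invented.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Static table, like the one the driver used to pass directly.
 * The contents are invented for this sketch. */
static const unsigned char bb_table[] = { 0x10, 0x20, 0x30, 0x40 };

/* What the stubbed transfer observed, kept for inspection. */
static int stub_saw_static_buffer = -1;
static unsigned char stub_received[16];
static size_t stub_received_len;

/* Hypothetical stand-in for usb_control_msg(): records the data it was
 * given and returns the byte count, or a negative errno on failure. */
static int stub_control_msg(void *data, size_t len)
{
    if (len > sizeof(stub_received))
        return -EINVAL;
    /* Note whether the caller handed us the static table itself. */
    stub_saw_static_buffer = ((const void *)data == (const void *)bb_table);
    memcpy(stub_received, data, len);
    stub_received_len = len;
    return (int)len;
}

/* Copy the table into a heap-allocated bounce buffer and pass that
 * buffer, not the static table, to the transfer function. */
static int send_table_bounced(const void *table, size_t len)
{
    unsigned char *bounce;
    int ret;

    if (len == 0)
        return -EINVAL;

    bounce = malloc(len);          /* kmalloc(len, GFP_KERNEL) in a driver */
    if (!bounce)
        return -ENOMEM;

    memcpy(bounce, table, len);
    ret = stub_control_msg(bounce, len);
    free(bounce);                  /* kfree(bounce) in a driver */
    return ret;
}
```

Calling send_table_bounced(bb_table, sizeof(bb_table)) hands the stub a heap copy of the table, never the static array itself. In a real driver, kmemdup() could probably replace the separate allocation and copy.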
I hope this solution is useful for others as well.
October 27, 2009 at 8:39 am
Hi,
I had the same problem. Where can I find the patch you made?
October 27, 2009 at 3:38 pm
The patch was created specifically for this device driver. If you are a kernel programmer, I can give you instructions for making the changes. I still have to see whether a similar change is needed in the kernel for other device drivers.
Do you know the device driver where you had that problem?
December 16, 2009 at 9:46 am
Hi,
I need some info about this. I think I have the same problem with usbserial.ko / sierra.ko. What exactly should the patch do?
Please tell me if you need more info,
Thx
December 17, 2009 at 2:34 am
Do you have the device driver source code?
We should check whether the function usb_control_msg() is being called. This function is tricky because it hides the fact that DMA is used to transfer the data to the USB device.
December 18, 2009 at 9:33 am
Hi, thanks for the response.
As far as I can see, the drivers do not use usb_control_msg().
http://dl.transfer.ro/ehci.tar-Transfer_RO-18Dec-0c1e5c.gz
Here are the sources of the drivers I use.
ehci_hcd, usbserial, and usbserial-generic from kernel 2.6.11.7.
I cannot upgrade to a newer kernel because it runs on specific hardware with a BSP that is no longer updated. I managed to run kernel 2.6.21.1 on it (with some limitations, just for testing) and the result is the same.
I run usbserial-generic with “vendor=0x1199 product=0x683c” (Sierra Wireless modem).
With another Sierra Wireless modem (“vendor=0x1199 product=0x6812”, lower speed), ehci_hcd does not crash.
It may also be important that I have a problem with PCI I/O port mapping on this hardware, so I do not use the UHCI functions of the host controller (VIA VT6212L).
Please let me know if anybody has any debugging ideas.
Thx
April 14, 2010 at 12:02 pm
For some reason I only noticed this response just now; I received an e-mail about it today. Do you still have the same problem? Would you contact me at rodolk@yahoo.com?
Rodolfo