
Wednesday, October 21, 2015

MCA (Machine Check Architecture)

MCA was originally designed to report hardware errors such as system bus errors, ECC errors, parity errors,
cache errors, and TLB errors.

The registers fall into two main groups: Global Control MSRs and Error-Reporting Bank Registers.



# The Global Control MSRs comprise three registers:

1. IA32_MCG_CAP: as the name suggests, this register declares the processor's machine-check capabilities.





  • Count field, bits 7:0: the number of error-reporting banks implemented.
  • MCG_CTL_P (control MSR present) flag, bit 8: whether the IA32_MCG_CTL MSR is implemented.
  • MCG_EXT_P (extended MSRs present) flag, bit 9: whether the extended machine-check state registers are implemented. If they are, the extended MSRs start at MSR address 180H.
  • MCG_CMCI_P (Corrected MC error counting/signaling extension present) flag, bit 10: whether the CMCI mechanism is supported. Even when this flag is set, software must still check each bank individually.
  • MCG_TES_P (threshold-based error status present) flag, bit 11: determines whether bits 54:53 of each bank's MCi_STATUS are meaningful (bits 54:53 report whether corrected (CMCI) errors are above or below the threshold).
  • MCG_EXT_CNT, bits 23:16: valid only when bit 9 is set; gives the number of extended MSRs.
  • MCG_SER_P (software error recovery support present) flag, bit 24: when set, bit 55 (AR; Action Required for UCR) and bit 56 (S; Signaling a UCR error) of MCi_STATUS are valid.
2. IA32_MCG_STATUS 
  • RIPV (restart IP valid) flag, bit 0: Indicates (when set) that program execution can be restarted reliably at the instruction pointed to by the instruction pointer pushed on the stack when the machine-check exception is generated. (In other words, the instruction pointer on the stack is valid when the MCE occurs, so the program can be restarted.)
  • EIPV (error IP valid) flag, bit 1: Indicates (when set) that the instruction pointed to by the instruction pointer pushed onto the stack when the machine-check exception is generated is directly associated with the error. (In other words, the instruction pointer on the stack points at the instruction that caused the problem.)
  • MCIP (machine check in progress) flag, bit 2: Indicates (when set) that a machine-check exception was generated (an MCE raises this bit). Software can set or clear this flag. The occurrence of a second machine-check event while MCIP is set will cause the processor to enter a shutdown state.
3. IA32_MCG_CTL 
  • Only two settings are available: all bits set to 1 (enable the MCE feature) or all bits set to 0 (disable the MCE feature).
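
To make the bit layouts above concrete, here is a minimal sketch that splits raw IA32_MCG_CAP and IA32_MCG_STATUS values into the fields listed. The helper names (`bits`, `decode_mcg_cap`, `decode_mcg_status`) and the sample raw values are made up for illustration; how the raw MSR value is actually read (for example through an OS-specific MSR interface) is outside the scope of this sketch.

```python
def bits(value, hi, lo):
    """Return the inclusive bit field value[hi:lo] as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_mcg_cap(raw):
    """Split a raw IA32_MCG_CAP value into the fields listed above."""
    return {
        "count": bits(raw, 7, 0),                # number of banks
        "mcg_ctl_p": bool(bits(raw, 8, 8)),      # IA32_MCG_CTL present
        "mcg_ext_p": bool(bits(raw, 9, 9)),      # extended MSRs present
        "mcg_cmci_p": bool(bits(raw, 10, 10)),   # CMCI supported
        "mcg_tes_p": bool(bits(raw, 11, 11)),    # MCi_STATUS[54:53] meaningful
        "mcg_ext_cnt": bits(raw, 23, 16),        # only valid if mcg_ext_p
        "mcg_ser_p": bool(bits(raw, 24, 24)),    # AR and S bits valid
    }


def decode_mcg_status(raw):
    """Split a raw IA32_MCG_STATUS value into RIPV, EIPV and MCIP."""
    return {
        "ripv": bool(bits(raw, 0, 0)),
        "eipv": bool(bits(raw, 1, 1)),
        "mcip": bool(bits(raw, 2, 2)),
    }


# Hypothetical raw values, chosen only to exercise the decoders.
cap = decode_mcg_cap(0x01000C14)
status = decode_mcg_status(0x5)
print(cap)
print(status)
```

Note that `mcg_ext_cnt` is decoded unconditionally here; a caller should ignore it unless `mcg_ext_p` is set, as described in the field list.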

# Error-Reporting Bank Registers 

1. IA32_MCi_CTL  
  • Each of the 64 flags (EEj) represents a potential error. Setting an EEj flag enables reporting of the associated error and clearing it disables reporting of the error.

2. IA32_MCi_STATUS
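  • VAL (valid), bit 63: when set, the contents of this register are valid. Software is responsible for clearing this bit.
  • EN (error enabled) flag, bit 60: Indicates (when set) that the error was enabled by the associated EEj bit of the IA32_MCi_CTL register.
  • For the definitions of the other bits, see the figure below (refer to the spec for details).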

3. IA32_MCi_CTL2
  • This register is valid only when IA32_MCG_CAP.MCG_CMCI_P (bit 10) = 1. See the figure below for its layout.

  • CMCI_EN (corrected error interrupt enable/disable/indicator), bit 30: Software sets this bit to enable the generation of corrected machine-check error interrupt (CMCI).
  • Corrected error count threshold, bits 14:0: Software must initialize this field; it is the threshold at which software wants a CMCI to be reported. The value is compared with the IA32_MCi_STATUS corrected error count (bits 52:38). An overflow event is signaled to the CMCI LVT entry (see Table 10-1) in the APIC when the count value equals the threshold value. (See the figure below.)
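
The relationship between IA32_MCi_STATUS and IA32_MCi_CTL2 can be sketched as follows. This is an illustration only: the function names are hypothetical, the raw values are invented, and the corrected error count is taken from bits 52:38 of IA32_MCi_STATUS as laid out in the Intel SDM. The AR, S and 54:53 fields are decoded unconditionally, although they are only meaningful when the matching IA32_MCG_CAP flags are set.

```python
def bits(value, hi, lo):
    """Return the inclusive bit field value[hi:lo] as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_mci_status(raw):
    """Pick out the IA32_MCi_STATUS fields discussed in this post."""
    return {
        "val": bool(bits(raw, 63, 63)),       # register contents valid
        "en": bool(bits(raw, 60, 60)),        # error was enabled in MCi_CTL
        "s": bool(bits(raw, 56, 56)),         # needs MCG_SER_P
        "ar": bool(bits(raw, 55, 55)),        # needs MCG_SER_P
        "tes": bits(raw, 54, 53),             # needs MCG_TES_P
        "corrected_count": bits(raw, 52, 38),
        "mca_error_code": bits(raw, 15, 0),
    }


def decode_mci_ctl2(raw):
    """Pick out CMCI_EN and the corrected error count threshold."""
    return {
        "cmci_en": bool(bits(raw, 30, 30)),
        "threshold": bits(raw, 14, 0),
    }


def cmci_threshold_reached(status_raw, ctl2_raw):
    """True when CMCI is enabled and the count equals the threshold."""
    status = decode_mci_status(status_raw)
    ctl2 = decode_mci_ctl2(ctl2_raw)
    return ctl2["cmci_en"] and status["corrected_count"] == ctl2["threshold"]


# Invented example: valid, enabled error, 5 corrected errors, code 0x0135.
status_raw = (1 << 63) | (1 << 60) | (5 << 38) | 0x0135
ctl2_raw = (1 << 30) | 5          # CMCI enabled, threshold = 5
print(decode_mci_status(status_raw))
print(cmci_threshold_reached(status_raw, ctl2_raw))
```

The comparison uses equality rather than "greater than or equal" because the text above describes the event as being signaled when the count equals the threshold.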
------------------------------------------------------------------------------------------------------------------
<Bank Error Classification>
<UCR Error Overwrite Rules>

In general, the overwrite rules are as follows:
  • UCR errors will overwrite corrected errors.
  • Uncorrected (PCC=1) errors overwrite UCR (PCC=0) errors.
  • UCR errors are not written over previous UCR errors.
  • Corrected errors do not write over previous UCR errors.
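
The four rules above can be written down as a small lookup table. This is only a sketch of the rules as listed: the enum and function names are hypothetical, and any combination the list does not mention (for example a corrected error following a corrected error) deliberately returns `None` instead of guessing.

```python
from enum import Enum


class ErrClass(Enum):
    CORRECTED = "corrected"
    UCR = "ucr"                  # uncorrected recoverable, PCC=0
    UNCORRECTED_PCC = "pcc"      # uncorrected, PCC=1


# Key: (error already logged in the bank, newly arriving error).
# Only the four rules listed above are encoded.
_OVERWRITE_RULES = {
    (ErrClass.CORRECTED, ErrClass.UCR): True,        # UCR overwrites corrected
    (ErrClass.UCR, ErrClass.UNCORRECTED_PCC): True,  # PCC=1 overwrites UCR
    (ErrClass.UCR, ErrClass.UCR): False,             # UCR keeps the first UCR
    (ErrClass.UCR, ErrClass.CORRECTED): False,       # corrected never replaces UCR
}


def overwrites(previous, new):
    """Return True/False per the listed rules, or None if not covered."""
    return _OVERWRITE_RULES.get((previous, new))


print(overwrites(ErrClass.CORRECTED, ErrClass.UCR))
print(overwrites(ErrClass.UCR, ErrClass.CORRECTED))
```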

<INTERPRETING THE MCA ERROR CODES> Decoding the MCA error code, i.e. MCi_STATUS[15:0]
  • Simple Error Code
  • Compound Error Code
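
As a rough illustration of how the two kinds of code can be told apart, the sketch below extracts MCi_STATUS[15:0] and matches it against the compound error code bit patterns as I read them in the Intel SDM (TLB, cache hierarchy, memory controller, bus and interconnect, generic cache hierarchy). The function name is hypothetical and the patterns should be checked against the spec before relying on them; anything that does not match a compound pattern is simply reported as "simple or other".

```python
def classify_mca_code(status_raw):
    """Classify MCi_STATUS[15:0] as a compound error family or 'simple or other'."""
    code = status_raw & 0xFFFF   # MCA error code field
    code &= ~0x1000              # ignore bit 12 (the F bit in compound codes)

    if (code & 0xF800) == 0x0800:    # 0000 1PPT RRRR IILL
        return "bus and interconnect"
    if (code & 0xFF00) == 0x0100:    # 0000 0001 RRRR TTLL
        return "cache hierarchy"
    if (code & 0xFF80) == 0x0080:    # 0000 0000 1MMM CCCC
        return "memory controller"
    if (code & 0xFFF0) == 0x0010:    # 0000 0000 0001 TTLL
        return "TLB"
    if (code & 0xFFFC) == 0x000C:    # 0000 0000 0000 11LL
        return "generic cache hierarchy"
    return "simple or other"


# Invented sample codes, one per branch.
for sample in (0x0135, 0x0014, 0x0090, 0x0800, 0x000C, 0x0001):
    print(hex(sample), classify_mca_code(sample))
```

The checks run from the widest pattern to the narrowest so that a code with a high-order family bit set is never mistaken for one of the smaller families.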


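Glossary:
  • Memory Scrubbing: after ECC corrects an error, the recovered correct value is written back to memory.
  • Explicit Writeback: When a cache decides to evict a dirty line, it generally issues an "explicit writeback" (it identifies which line is affected and moves the data out).

<Advanced MCA>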

MCE features are enabled by default in the Linux kernel build configuration menu, which shows how important they are.


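MCA has now reached its second enhanced generation (Legacy MCA -> EMCA Gen1 -> EMCA Gen2). The main differences between generations are when the BIOS steps in (Firmware First Mode, FFM, entered via SMI) and how much of the error handling it takes on.
eMCA Gen2 is implemented on the Xeon® Processor E7 v3. It defines two new MSRs (CORE_SMI_ERR_SRC and UNCORE_SMI_ERR_SRC) and uses FFM to handle errors in a unified way: depending on the error severity (CE/UCE), a CSMI or MSMI is raised to enter the SMM handler, which hands the error to the BIOS. The differences between the generations are listed below.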


FFM (Firmware First Mode):
