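MCA (Machine Check Architecture) was originally designed to report hardware errors such as system bus errors, ECC errors, parity errors,
cache errors, and TLB errors.
The registers fall into two groups: Global Control MSRs and Error-Reporting Bank Registers.
# Global Control MSRs (three registers):
1. IA32_MCG_CAP : as the name suggests, this register declares the processor's machine-check capabilities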
FFM (Firmware First Mode):
- Count field, bits 7:0: the number of error-reporting banks implemented
- MCG_CTL_P (control MSR present) flag, bit 8: whether the IA32_MCG_CTL MSR is implemented
- MCG_EXT_P (extended MSRs present) flag, bit 9: whether the extended machine-check state registers are implemented. If they are, the extended MSRs start at MSR address 180H
- MCG_CMCI_P (Corrected MC error counting/signaling extension present) flag, bit 10: whether the CMCI mechanism is supported. Even when it is set, each bank still has to be checked individually
- MCG_TES_P (threshold-based error status present) flag, bit 11: determines whether bits 54:53 of each bank's MCi_STATUS are meaningful (bits 54:53 report the threshold-based error status, that is, whether corrected errors are above or below the threshold)
- MCG_EXT_CNT, bits 23:16: valid only when bit 9 is set; gives the number of extended MSRs
- MCG_SER_P (software error recovery support present) flag, bit 24: when set, bit 55 (AR; Action Required for UCR) and bit 56 (S; Signaling a UCR error) of MCi_STATUS are valid
2. IA32_MCG_STATUS
- RIPV (restart IP valid) flag, bit 0: Indicates (when set) that program execution can be restarted reliably at the instruction pointed to by the instruction pointer pushed on the stack when the machine-check exception is generated. (In other words, the instruction pointer saved on the stack when the MCE occurred is valid, so the program can be restarted.)
- EIPV (error IP valid) flag, bit 1: Indicates (when set) that the instruction pointed to by the instruction pointer pushed onto the stack when the machine-check exception is generated is directly associated with the error. (In other words, the saved instruction pointer points at the instruction that caused the problem.)
- MCIP (machine check in progress) flag, bit 2: Indicates (when set) that a machine-check exception was generated. (The bit is raised when an MCE occurs.) Software can set or clear this flag. The occurrence of a second Machine-Check Event while MCIP is set will cause the processor to enter a shutdown state.
3. IA32_MCG_CTL
- Only two settings are available: all bits set to 1 (enable the MCE features) or all bits cleared to 0 (disable the MCE features).
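The global-register fields listed above can be pulled out of a raw 64-bit value with a few shifts and masks. The following is a minimal sketch, not an official decoder: the function names (`bits`, `decode_mcg_cap`, `decode_mcg_status`) and the sample values are made up for illustration, and only the bit positions come from the list above.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Return the bit field value[hi:lo] (both ends inclusive) as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_mcg_cap(raw: int) -> dict:
    """Split a raw IA32_MCG_CAP value into the fields described in the post."""
    ext_p = bool(bits(raw, 9, 9))
    return {
        "count": bits(raw, 7, 0),            # number of banks implemented
        "ctl_p": bool(bits(raw, 8, 8)),      # IA32_MCG_CTL present
        "ext_p": ext_p,                      # extended state MSRs present
        "cmci_p": bool(bits(raw, 10, 10)),   # CMCI supported (still check each bank)
        "tes_p": bool(bits(raw, 11, 11)),    # MCi_STATUS[54:53] meaningful
        # MCG_EXT_CNT is only valid when MCG_EXT_P (bit 9) is set.
        "ext_cnt": bits(raw, 23, 16) if ext_p else None,
        "ser_p": bool(bits(raw, 24, 24)),    # MCi_STATUS AR/S bits valid
    }


def decode_mcg_status(raw: int) -> dict:
    """Split a raw IA32_MCG_STATUS value into RIPV, EIPV and MCIP."""
    return {
        "ripv": bool(bits(raw, 0, 0)),  # restart IP valid
        "eipv": bool(bits(raw, 1, 1)),  # error IP valid
        "mcip": bool(bits(raw, 2, 2)),  # machine check in progress
    }


# Hypothetical sample values, not read from real hardware.
SAMPLE_MCG_CAP = 0x0100_0C14
SAMPLE_MCG_STATUS = 0x5

if __name__ == "__main__":
    print(decode_mcg_cap(SAMPLE_MCG_CAP))
    print(decode_mcg_status(SAMPLE_MCG_STATUS))
```

With the sample value, the decoder reports 20 banks with CMCI, TES and SER support, and no extended MSRs. How the raw value is obtained (for example through an operating system's MSR driver) is deliberately left out of the sketch.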
# Error-Reporting Bank Registers
1. IA32_MCi_CTL
- Each of the 64 flags (EEj) represents a potential error. Setting an EEj flag enables reporting of the associated error and clearing it disables reporting of the error.
2. IA32_MCi_STATUS
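- VAL (valid) flag, bit 63: when set, the contents of this register are valid. Software is responsible for clearing this bit
- EN (error enabled) flag, bit 60: Indicates (when set) that the error was enabled by the associated EEj bit of the IA32_MCi_CTL register.
- For the definitions of the remaining bits, see the figure below (and the spec for the full details)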
3. IA32_MCi_CTL2
- This register is valid only when MCG_CAP.CMCI_P (bit 10) = 1. See the figure of this register below
- CMCI_EN (corrected error interrupt enable/disable/indicator), bit 30: Software sets this bit to enable the generation of corrected machine-check error interrupt (CMCI).
- Corrected error count threshold, bits 14:0: Software must initialize this field. (Software decides the threshold at which a CMCI is reported.) The value is compared with the corrected error count field of IA32_MCi_STATUS (bits 52:38). An overflow event is signaled to the CMCI LVT entry (see Table 10-1) in the APIC when the count value equals the threshold value. (See the figure below.)
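To make the relationship between IA32_MCi_STATUS and IA32_MCi_CTL2 concrete, here is a minimal sketch that decodes both registers and models the threshold comparison. It is an illustration under stated assumptions: the function names and sample values are hypothetical, the UC (bit 61) and PCC (bit 57) positions come from the Intel SDM rather than from this post, and the real comparison is done by hardware at the moment the count reaches the threshold, not by software after the fact.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Return the bit field value[hi:lo] (both ends inclusive) as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_mci_status(raw: int, ser_p: bool = False, tes_p: bool = False) -> dict:
    """Decode a subset of IA32_MCi_STATUS.

    ser_p / tes_p mirror MCG_SER_P and MCG_TES_P from IA32_MCG_CAP: the AR/S
    bits and the threshold status are only reported when the capability exists.
    """
    status = {
        "val": bool(bits(raw, 63, 63)),          # register contents valid
        "uc": bool(bits(raw, 61, 61)),           # uncorrected error (SDM, not in the post)
        "en": bool(bits(raw, 60, 60)),           # error enabled via IA32_MCi_CTL
        "pcc": bool(bits(raw, 57, 57)),          # processor context corrupt (SDM)
        "corrected_count": bits(raw, 52, 38),    # corrected error count
        "mca_error_code": bits(raw, 15, 0),      # MCi_STATUS[15:0]
    }
    if ser_p:
        status["s"] = bool(bits(raw, 56, 56))    # signaling a UCR error
        status["ar"] = bool(bits(raw, 55, 55))   # action required for UCR
    if tes_p:
        status["threshold_status"] = bits(raw, 54, 53)
    return status


def decode_mci_ctl2(raw: int) -> dict:
    """Decode CMCI_EN (bit 30) and the corrected error count threshold (bits 14:0)."""
    return {
        "cmci_en": bool(bits(raw, 30, 30)),
        "threshold": bits(raw, 14, 0),
    }


def cmci_threshold_reached(status_raw: int, ctl2_raw: int) -> bool:
    """Software model of the condition: CMCI enabled and count equals threshold."""
    status = decode_mci_status(status_raw)
    ctl2 = decode_mci_ctl2(ctl2_raw)
    return ctl2["cmci_en"] and status["corrected_count"] == ctl2["threshold"]


# Hypothetical sample: valid, enabled, corrected error, count = 5.
SAMPLE_STATUS = (1 << 63) | (1 << 60) | (5 << 38) | 0x0135
SAMPLE_CTL2 = (1 << 30) | 5  # CMCI enabled, threshold = 5

if __name__ == "__main__":
    print(decode_mci_status(SAMPLE_STATUS, ser_p=True))
    print(cmci_threshold_reached(SAMPLE_STATUS, SAMPLE_CTL2))
```

Passing `ser_p` and `tes_p` explicitly keeps the decoder from reporting bits that the processor does not define, which matches how the capability flags in IA32_MCG_CAP gate those fields.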
------------------------------------------------------------------------------------------------------------------
<Error classification within a bank>
<UCR Error Overwrite Rules>
In general, the overwrite rules are as follows:
- UCR errors will overwrite corrected errors.
- Uncorrected (PCC=1) errors overwrite UCR (PCC=0) errors.
- UCR errors are not written over previous UCR errors.
- Corrected errors do not write over previous UCR errors.
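The four rules above behave like a severity ordering: a new error replaces the logged one only if it is strictly more severe. The sketch below models that reading. The names (`Severity`, `classify`, `overwrites`) are hypothetical, and treating equal-severity cases other than UCR-on-UCR (for example a corrected error arriving on top of a corrected error) as "keep the first record" is an assumption of this sketch, not something the rules above state.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Error classes ordered from least to most severe."""
    CORRECTED = 0     # UC = 0
    UCR = 1           # UC = 1, PCC = 0 (uncorrected recoverable)
    UNCORRECTED = 2   # UC = 1, PCC = 1


def classify(uc: bool, pcc: bool) -> Severity:
    """Map the UC and PCC flags of a logged error to a severity class."""
    if not uc:
        return Severity.CORRECTED
    return Severity.UNCORRECTED if pcc else Severity.UCR


def overwrites(new: Severity, old: Severity) -> bool:
    """Return True if a new error replaces the one already logged in the bank.

    Assumption: ties keep the existing record, which matches the
    "UCR errors are not written over previous UCR errors" rule.
    """
    return new > old


if __name__ == "__main__":
    for old in Severity:
        for new in Severity:
            print(f"{new.name:12} on top of {old.name:12} -> overwrite={overwrites(new, old)}")
```

Encoding the classes as an `IntEnum` lets the rules collapse into a single comparison, which makes it easy to check each rule from the list against the model.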
<INTERPRETING THE MCA ERROR CODES> Interpreting the MCA error code, i.e. MCi_STATUS[15:0]
- Simple Error Code
- Compound Error Code
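Glossary:
- Memory Scrubbing: after ECC corrects an error, the recovered correct value is written back to memory
- Explicit Writeback: When a cache decides to evict a dirty line, it generally issues an "explicit writeback" (it identifies which line is affected and moves the data out)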
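<Advanced MCA>
The MCE features are enabled by default in the configuration menu when building the Linux kernel, which shows how important they are.
MCA has by now reached its second generation (Legacy MCA -> EMCA Gen1 -> EMCA Gen2). The main difference between generations is when the BIOS (Firmware First Mode; FFM) steps in (through SMI) and how much of the error handling it takes on.
eMCA Gen2 is implemented on the Xeon® Processor E7 v3. It defines two new MSRs (CORE_SMI_ERR_SRC and UNCORE_SMI_ERR_SRC) and uses FFM to handle errors in one place: depending on the error severity (CE/UCE), a CSMI or MSMI is raised to enter the SMM handler, which hands the error over to the BIOS. The differences between the generations are listed below.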