Banking Case Study Example 4: IV and WOE

http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/
This is a continuation of our banking case study on scorecard development. In this part, we will discuss information value (IV) and weight of evidence (WOE). These concepts are useful for variable selection when developing credit scorecards. We will also learn how to use weight of evidence in logistic regression modeling. The previous three parts of the series are (Part 1), (Part 2) & (Part 3).
Experts in Expensive Suits

A couple of weeks ago I was watching this show called ‘Brain Games’ on the National Geographic Channel. In one of the segments, they had a comedian dressed up as a television news reporter. He had a whole television camera crew along with him. He was informing the people coming out of a mall in California that Texas has decided to form an independent country, not part of the United States. Additionally, while on camera he was asking for their opinion on the matter. After the initial amusement, people took him seriously and started giving their serious viewpoints. This is the phenomenon psychologists describe as ‘expert fallacy’ or obeying authority, no matter how irrational the authorities seem. Later after learning the truth, the people on this show agreed that they believed this comedian because he was in an expensive suit with a TV crew.
Nate Silver, in his book The Signal and the Noise, described a similar phenomenon. He analyzed the forecasts made by the panel of experts on the TV program The McLaughlin Group. The forecasts turned out to be true in only 50% of cases; you could have forecast the same by tossing a coin. We do take experts in expensive suits seriously, don’t we? These are not one-off examples. Men in suits or uniforms come in all different forms, from army generals to security personnel in malls. We take them all very seriously.
We have just discovered that rather than accepting an expert’s opinion, it would be better to look at the value of the information and make decisions oneself. Let us continue with this theme and explore how to assign a value to information using information value and weight of evidence. Then we will create a simple logistic regression model using WOE (weight of evidence). However, before that, let us recap the case study we are working on.
Case Study Continues
This is a continuation of our case study on CyndiCat bank. The bank had disbursed 60816 auto loans with a bad rate of around 2.5% in the quarter April–June 2012. We did some exploratory data analysis (EDA) using data visualization tools in the first two parts (Part 1) & (Part 2). In the previous article, we developed a simple logistic regression model with just age as the variable (Part 3). This time, we will continue from where we left off in the previous article and use weight of evidence (WOE) for age to develop a new model. Additionally, we will explore the predictive power of the variable (age) through information value.
Information Value (IV) and Weight of Evidence (WOE)
Information value is a very useful concept for variable selection during model building. The roots of information value, I think, are in the information theory proposed by Claude Shannon. The reason for my belief is the similarity information value has with entropy, a widely used concept in information theory. The chi-square value, an extensively used measure in statistics, is a good replacement for IV (information value). However, IV is a popular and widely used measure in the industry. The reason for this is some very convenient rules of thumb for variable selection associated with IV; these are really handy, as you will discover later in this article. The formula for information value is shown below.
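Written out in symbols, with DG_i and DB_i denoting the distribution good and distribution bad of coarse class i (the index notation is an assumption of this sketch; the definition matches the worked example for the 21-30 group later in this article), information value is:

```latex
\mathrm{IV} = \sum_{i} \left( DG_i - DB_i \right) \, \ln\!\left( \frac{DG_i}{DB_i} \right)
```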

What distribution good and distribution bad mean will become clear when we calculate IV for our case study. This is probably an opportune moment to define weight of evidence (WOE), which is the log component of information value.
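With DG_i and DB_i as the distribution good and distribution bad of coarse class i (the class index i is assumed notation), the weight of evidence of that class is the log term, consistent with the calculation WOE = ln(DG/DB) used later in the case study:

```latex
\mathrm{WOE}_i = \ln\!\left( \frac{DG_i}{DB_i} \right)
```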

Hence, IV can further be written as the following.
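Taking the weight of evidence of class i to be ln(DG_i / DB_i), as in the case-study calculation below, information value becomes a sum of per-class components (the index i is assumed notation):

```latex
\mathrm{IV} = \sum_{i} \left( DG_i - DB_i \right) \times \mathrm{WOE}_i
```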

If you examine both information value and weight of evidence carefully, you will notice that both values break down when either the distribution good or the distribution bad goes to zero. A mathematician will hate it. The assumption, a fair one, is that this will never happen during scorecard development because of the reasonable sample size. A word of caution: if you are developing non-standardized scorecards with smaller sample sizes, use IV carefully.
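To make the breakdown concrete, here is a minimal Python sketch of a guard that refuses to compute WOE when either distribution is zero. The helper name `woe_or_none` is hypothetical and only illustrates the caveat; it is not a method from the original article.

```python
import math
from typing import Optional


def woe_or_none(dist_good: float, dist_bad: float) -> Optional[float]:
    """Return ln(dist_good / dist_bad), or None when WOE is undefined."""
    # ln(0) and division by zero are both undefined, so a class with
    # no good loans or no bad loans has no usable weight of evidence.
    if dist_good <= 0 or dist_bad <= 0:
        return None
    return math.log(dist_good / dist_bad)


print(woe_or_none(0.078, 0.135))  # defined: both distributions are positive
print(woe_or_none(0.078, 0.0))    # None: no bad loans in the class
print(woe_or_none(0.0, 0.135))    # None: no good loans in the class
```

Returning None rather than a substitute value keeps the problem visible, so a sparse class can be merged with a neighbor instead of silently distorting the IV.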
Back to the Case Study
In the previous article, we created coarse classes for the variable age in our case study. Now, let us calculate both information value and weight of evidence for these coarse classes.

Let us examine this table. Here, distribution of loans is the ratio of loans for a coarse class to total loans. For the group 21-30, this is 4821/60801 = 0.079. Similarly, distribution bad (DB) = 206/1522 = 0.135 and distribution good (DG) = 4615/59279 = 0.078. Additionally, DG - DB = 0.078 - 0.135 = -0.057. Further, WOE = ln(DG/DB) = -0.553 (computed from the unrounded ratios; the rounded values 0.078 and 0.135 give about -0.549).
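As a quick check on the arithmetic above, here is a minimal Python sketch that recomputes DG, DB, WOE and the IV contribution of the 21-30 class from the counts quoted in the text. The function and constant names are illustrative and do not come from the original article.

```python
import math

# Counts quoted in the article for the 21-30 coarse class and for the whole table.
GROUP_GOOD, GROUP_BAD = 4615, 206
TOTAL_GOOD, TOTAL_BAD = 59279, 1522


def woe_and_iv_component(good, bad, total_good, total_bad):
    """Return (DG, DB, WOE, IV component) for one coarse class."""
    dist_good = good / total_good          # share of all good loans in this class
    dist_bad = bad / total_bad             # share of all bad loans in this class
    woe = math.log(dist_good / dist_bad)   # weight of evidence, ln(DG / DB)
    iv_component = (dist_good - dist_bad) * woe
    return dist_good, dist_bad, woe, iv_component


dg, db, woe, iv_part = woe_and_iv_component(GROUP_GOOD, GROUP_BAD, TOTAL_GOOD, TOTAL_BAD)
print(f"DG={dg:.3f} DB={db:.3f} WOE={woe:.3f} IV component={iv_part:.4f}")
```

Because the function works on unrounded ratios, its results round to the article's figures of -0.553 for WOE and 0.0318 for the IV component. Summing the IV component over every coarse class would give the total IV for age.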

Download the attached Excel file to understand this calculation: Information Value (IV) and Weight of Evidence (WOE)

Finally, the component of IV for this group is (DG - DB) × WOE = (-0.057) × (-0.553) ≈ 0.0318 (computed from the unrounded values). Similarly, calculate the IV components for all the other coarse classes. Adding these components produces an IV of 0.1093 (last column of the table). Now the question is how to interpret this value of IV. The answer is the rule of thumb described below.

Information Value    Predictive Power
< 0.02               useless for prediction
0.02 to 0.1          weak predictor
0.1 to 0.3           medium predictor
0.3 to 0.5           strong predictor
> 0.5                suspicious or too good to be true
Typically, variables with medium and strong predictive power are selected for model development. However, some schools of thought advocate using only the variables with medium IVs for broad-based model development. Notice that the information value for age is 0.1093, so it barely falls in the medium predictor range.
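The rule of thumb can be sketched as a small lookup function. The function name is hypothetical, and the handling of exact boundary values (for example, whether 0.1 counts as weak or medium) is an assumption, since the rule of thumb does not say which band a boundary belongs to.

```python
def iv_predictive_power(iv: float) -> str:
    """Map an information value to its rule-of-thumb label."""
    # Boundary handling is an assumption: each lower bound is treated
    # as inclusive, and 0.5 itself is still counted as "strong".
    if iv < 0.02:
        return "useless for prediction"
    if iv < 0.1:
        return "weak predictor"
    if iv < 0.3:
        return "medium predictor"
    if iv <= 0.5:
        return "strong predictor"
    return "suspicious or too good to be true"


print(iv_predictive_power(0.1093))  # IV of age in this case study
```

For the age variable, with IV = 0.1093, this returns "medium predictor", in line with the text.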
Logistic Regression with Weight of Evidence (WOE)
Finally, let us create a logistic regression model with the weight of evidence of the coarse classes as the value of the independent variable age. The following are the results generated by statistical software.

Let us estimate the bad rate for the age group 21-30 using the above information.

This is precisely the value we obtained the last time (see the previous part), and it is consistent with the bad rate for the group.
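The regression output itself is not reproduced in this text, so the following is a sketch under a stated assumption rather than the article's own numbers. When WOE = ln(DG/DB) is the only predictor, an intercept of ln(total bad / total good) and a slope of -1 reproduce each class's observed bad rate exactly. Both coefficients below are derived from the counts quoted earlier, not read from the software output, and all names are illustrative.

```python
import math

GROUP_GOOD, GROUP_BAD = 4615, 206      # 21-30 class counts quoted in the article
TOTAL_GOOD, TOTAL_BAD = 59279, 1522    # totals quoted in the article

# WOE of the 21-30 class, defined as ln(DG / DB).
woe = math.log((GROUP_GOOD / TOTAL_GOOD) / (GROUP_BAD / TOTAL_BAD))

# Assumed coefficients (NOT taken from the article's software output):
# intercept = log-odds of bad over all loans, slope = -1.
intercept = math.log(TOTAL_BAD / TOTAL_GOOD)
slope = -1.0

# Logistic model: log-odds of bad = intercept + slope * WOE.
log_odds_bad = intercept + slope * woe
bad_rate = 1.0 / (1.0 + math.exp(-log_odds_bad))

observed = GROUP_BAD / (GROUP_GOOD + GROUP_BAD)
print(f"model bad rate={bad_rate:.4f}, observed bad rate={observed:.4f}")
```

Under these assumed coefficients the log-odds simplify to ln(206/4615), so the modeled bad rate equals the observed 206/4821, roughly 4.27%.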
Sign-off note
I wish we had an instrument similar to information value to estimate the value of information coming from so-called experts. However, the next time an expert on a business channel advises you to buy a certain stock, take that advice with a pinch of salt.
Read the remaining parts of the credit scoring series
Part 1: Data visualization for scoring
Part 2: Creating ratio variables for better scoring
Part 3: Logistic regression
Part 5: Reject inference
Part 6: Population stability index for scorecard monitoring
