Capital One TPS

ffeng0312 2019-01-16 03:32

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

LDA is closely related to analysis of variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements.[1][2] However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable (i.e. the class label).[3] Logistic regression and probit regression are more similar to LDA than ANOVA is, as they also explain a categorical variable by the values of continuous independent variables. These other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method.

LDA is also closely related to principal component analysis (PCA) and factor analysis in that they both look for linear combinations of variables which best explain the data.[4] LDA explicitly attempts to model the difference between the classes of data. PCA, on the other hand, does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made.

 

Credit Card Fraud Detection (asked 7 times from 2015 to 2017)

What machine learning model would you use to classify fraudulent transactions on credit cards?

need to read:

https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets

Done:

https://www.kaggle.com/joparga3/in-depth-skewed-data-classif-93-recall-acc-now

https://www.kaggle.com/gargmanish/how-to-handle-imbalance-data-study-in-detail

https://www.kaggle.com/pavansanagapati/anomaly-detection-credit-card-fraud-analysis

Our goal is to detect 100% of the fraud while minimizing incorrect fraud classifications.

• If we have historical data with labels indicating whether a transaction is fraud or not, we can use a classification method. If the fraction of fraud transactions is too small, or there are no labels, we can use an anomaly detection technique.

If we use a classification method, which one is good to use? A common follow-up is which method would be the least useful.

• Logistic regression, SVM, decision tree, random forest

• Logistic regression would be the least useful, because it is a simple model and is not good for complex problems. Logistic regression tends to underperform when there are multiple or non-linear decision boundaries; it is not flexible enough to naturally capture more complex relationships. Logistic regression is also sensitive to outliers (lasso regression is helpful for dealing with outliers), and it is not good when there is perfect multicollinearity (ridge regression is helpful for dealing with multicollinearity).

 

  1. Exploratory data analysis:

Size of the data: how many rows and how many features, using df.shape.

Univariate analysis: for numerical variables, use describe() to check the basic descriptive statistics that summarize the central tendency, dispersion and shape of the distribution;

-       Box plot or histogram to check for outliers

       For categorical variables, check frequencies using a frequency table, bar chart or pie chart.

-       Determine the number of fraud transactions. Most transactions are non-fraud (about 99.83%) and only a small number are fraud, which means the data is highly unbalanced with respect to the target variable.
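A minimal pandas sketch of the checks above, assuming the transactions are loaded into a DataFrame with a binary target column named Class (as in the Kaggle credit card dataset linked earlier; the file and column names are assumptions):

```python
import pandas as pd

df = pd.read_csv("creditcard.csv")  # hypothetical file name

# Size of the data: number of rows and number of features.
print(df.shape)

# Descriptive statistics for the numerical variables.
print(df.describe())

# Frequency of the target classes: how unbalanced is the data?
print(df["Class"].value_counts())
print(df["Class"].value_counts(normalize=True) * 100)  # share of fraud vs non-fraud in percent
```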

• How would you handle unbalanced data?

1)      Change the performance metric to the confusion matrix, precision, recall, F1 score, or AUC

2)      Oversampling – randomly copy instances from the minority class, or use the synthetic minority over-sampling technique (SMOTE), which uses kNN to create new instances for the minority class (see the sketch after this list)

3)      Undersampling when we have a lot of data: keep all of the minority class but randomly sample a proportion of the majority class.

The traditional undersampling technique is to undersample the dataset to a 50/50 ratio. This is done by randomly selecting "x" samples from the majority class, where "x" is the total number of records in the minority class. We can also resample the data at different sizes by taking different proportions of the majority class while keeping all of the minority class.

4)      Penalized model – in SVM we can use the argument class_weight='balanced' to penalize mistakes on the minority class by an amount proportional to how under-represented it is

5)      Try a different perspective – anomaly detection or outlier detection
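As referenced in item 2) above, here is a rough sketch of oversampling, undersampling, and a class-weighted model, assuming the imbalanced-learn package is available and X, y hold the features and target:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression

# Oversampling: SMOTE creates synthetic minority instances from their k nearest neighbours.
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)

# Undersampling: keep the whole minority class, randomly drop majority samples to a 50/50 ratio.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)

# Penalized model: weight classes inversely to their frequency instead of resampling.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
```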

Bivariate analysis: check the correlation between two variables using a correlation matrix

     Categorical variable and numerical variable

-       Use a grouped boxplot to check how the transaction amount differs between transaction classes

     Numerical variable and numerical variable

-       Use a scatterplot to plot fraudulent transaction amounts against time

     Categorical variable and categorical variable

-       Use a stacked bar chart to check the association between two categorical variables
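A possible plotting sketch for these bivariate checks, assuming matplotlib/seaborn and the Kaggle-style column names Class, Amount and Time (the column names are assumptions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numerical variables.
sns.heatmap(df.corr(), cmap="coolwarm")
plt.show()

# Grouped boxplot: transaction amount by class (fraud vs non-fraud).
sns.boxplot(x="Class", y="Amount", data=df)
plt.show()

# Scatterplot: fraudulent transaction amounts against time.
fraud = df[df["Class"] == 1]
plt.scatter(fraud["Time"], fraud["Amount"], s=5)
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
```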

-- Multicollinearity for logistic regression

  • Multicollinearity reduces the precision of the estimated coefficients. You might not be able to trust the p-values to identify independent variables that are statistically significant.

How to detect:

  • Review scatterplot
  • Run a correlation matrix
  • To test for instability of the coefficients, we can run the regression on different combinations of the variables and see how much the estimates change or if the sign changed.
  • Variance inflation factor (VIF) identifies correlation between independent variables and the strength of that correlation.

-- What is VIF (in regression output)?

When do we need to fix it?

  • VIFs start at 1 and a value of 1 indicates that there is no correlation between this independent variable and any others. VIFs between 1 and 5 suggest that there is a moderate correlation, but it is not severe enough to resolve it. VIFs greater than 5 represent critical levels of multicollinearity where the coefficients are poorly estimated, and the p-values are questionable.
  • Multicollinearity affects only the specific independent variables that are correlated. Therefore, if multicollinearity is not present for the independent variables that you are particularly interested in, you may not need to resolve it.
  • Multicollinearity (short of perfect multicollinearity) affects the coefficients and p-values, but it does not influence the predictions or the precision of the predictions. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity.
Python - from statsmodels.stats.outliers_influence import variance_inflation_factor
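A small sketch of computing VIFs with statsmodels; X_num is assumed to be a DataFrame holding only the numeric independent variables:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def vif_table(X_num: pd.DataFrame) -> pd.Series:
    """Return the VIF for each independent variable."""
    X = add_constant(X_num)  # add an intercept so each VIF is measured against a proper model
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return pd.Series(vifs, index=X.columns).drop("const")
```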

How to Deal with Multicollinearity

  • Reduce structural multicollinearity by centering the variables, i.e. subtracting the mean from each variable.
  • Remove some of the highly correlated independent variables
  • Principal component analysis

2. Data preparation: duplicates, missing values, outliers, normalization, resampling the data and splitting the data

- Most of the time, the dataset will contain features that vary widely in magnitude, units and range. Since most machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem. A rule of thumb: any algorithm that computes distances or assumes normality needs normalized features, such as kNN and PCA, while tree-based models and Naive Bayes do not require normalization.
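A short sketch of normalizing the features with scikit-learn; fitting the scaler on the training data only avoids leaking information from the test set (the variable names are assumptions):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same parameters on the test data
```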

• How would you handle missing or garbage data?

• How would you handle outliers?

-- Sampling and splitting the data

We can try both oversampling and undersampling, and also resample the data at different sizes. We then train a model on the resampled data and use that model to predict on the original data.

After getting the resampled dataset, we randomly split it into a training set and a test set. We also randomly split the whole (original) dataset into a training set and a test set. We train the model on the undersampled training set, evaluate it on the undersampled test set, and then apply the model built on the undersampled training data to the test set split from the whole dataset, to see how it performs on a much larger and skewed dataset.
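A rough sketch of that splitting strategy, assuming scikit-learn and imbalanced-learn are available: split the whole skewed dataset, build and split an undersampled version, train on the undersampled training data, then evaluate on both test sets:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.under_sampling import RandomUnderSampler

# Split the whole (skewed) dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Build an undersampled version and split it as well.
X_us, y_us = RandomUnderSampler(random_state=42).fit_resample(X, y)
Xu_train, Xu_test, yu_train, yu_test = train_test_split(
    X_us, y_us, test_size=0.3, stratify=y_us, random_state=42)

# Train on the undersampled training data.
model = LogisticRegression(max_iter=1000).fit(Xu_train, yu_train)

# Evaluate on the undersampled test set, then on the much larger skewed test set.
print(classification_report(yu_test, model.predict(Xu_test)))
print(classification_report(y_test, model.predict(X_test)))
```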

Sometimes oversampling is better than undersampling, because with undersampling we lose a large amount of data (a good amount of information), which is why the precision there was very low.

3. Feature engineering: select features or create new features based on existing variables.

-- How would you use existing features to add new features?

Transform a date into days since the last purchase, or into month or season.
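A small pandas sketch of these date transformations; the DataFrame and its purchase_date / customer_id columns, as well as the snapshot date, are hypothetical:

```python
import pandas as pd

df["purchase_date"] = pd.to_datetime(df["purchase_date"])

# Days since each customer's most recent purchase, relative to a chosen snapshot date.
snapshot = df["purchase_date"].max()
last_purchase = df.groupby("customer_id")["purchase_date"].transform("max")
df["days_since_last_purchase"] = (snapshot - last_purchase).dt.days

# Month and season derived from the date.
df["month"] = df["purchase_date"].dt.month
df["season"] = df["month"].map({12: "winter", 1: "winter", 2: "winter",
                                3: "spring", 4: "spring", 5: "spring",
                                6: "summer", 7: "summer", 8: "summer",
                                9: "fall", 10: "fall", 11: "fall"})
```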

-- A random forest can show the relative feature importance.
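A quick sketch of reading relative feature importance off a random forest, assuming X_train is a DataFrame and y_train the matching labels:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```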

 4. Fit the model

 5. Evaluate the model

-- Use grid search and k-fold cross-validation to tune the hyperparameters and select the best model.
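A hedged sketch of tuning with GridSearchCV and 5-fold cross-validation, scoring on recall since catching frauds matters most here; the parameter grid values are only illustrative, and Xu_train/yu_train refer to the resampled training split from the earlier sketch:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]}
grid = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid,
    scoring="recall",   # optimise for recall on the fraud class
    cv=5,               # 5-fold cross-validation
)
grid.fit(Xu_train, yu_train)  # tune on the (resampled) training data
print(grid.best_params_, grid.best_score_)
best_model = grid.best_estimator_
```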

-- Evaluation metric

We should use the precision-recall curve, because in this case the "positive" class is more interesting than the negative class and positive samples are very rare.

Let’s take an example of a fraud detection problem where there are 100 frauds out of 2 million samples.

Algorithm 1: 90 relevant out of 100 identified

Algorithm 2: 90 relevant out of 1000 identified

Evidently, algorithm 1 is preferable because it produced fewer false positives.

In the context of the ROC curve,

Algorithm 1: TPR=90/100=0.9, FPR= 10/1,999,900=0.00000500025

Algorithm 2: TPR=90/100=0.9, FPR=910/1,999,900=0.00045502275

The FPR difference is 0.0004500225

For the PR curve,

Algorithm 1: precision=0.9, recall=0.9

Algorithm 2: Precision=90/1000=0.09, recall= 0.9

Precision difference= 0.81
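The numbers above can be checked with a few lines of arithmetic; the counts come straight from the example:

```python
total, frauds = 2_000_000, 100
negatives = total - frauds  # 1,999,900 non-fraud samples

# Algorithm 1 flags 100 transactions (90 true positives); algorithm 2 flags 1000 (also 90 true positives).
tpr1, fpr1, prec1 = 90 / frauds, 10 / negatives, 90 / 100
tpr2, fpr2, prec2 = 90 / frauds, 910 / negatives, 90 / 1000

print(fpr2 - fpr1)    # ~0.00045: the tiny FPR gap the ROC curve would show
print(prec1 - prec2)  # 0.81: the large precision gap the PR curve would show
```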

-- False positives vs. false negatives: are false positives or false negatives more important? What is the effect of FP and FN?

For fraud detection, we want to detect all frauds, so we want to maximize true positives and minimize false negatives. FN is more important than FP in this case, but at the same time we want the model to predict correctly overall and not make many errors, so we don't want FP to be very high either.

-- We can tweak the logistic model by changing the classification threshold.

  • When we use the "predict()" method, it decides whether a record should belong to "1" or "0".
  • There is another method "predict_proba()".
  • This method returns the probabilities for each class. The idea is that by changing the threshold to assign a record to class 1, we can control precision and recall.
  • By plotting the precision-recall curve we can see the performance of the model depending on the threshold we choose, so that we can find a sweet spot where recall is high enough whilst keeping a high precision value (see the sketch below).
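A sketch of that threshold tuning with scikit-learn's precision_recall_curve, assuming a fitted classifier best_model (e.g. from the grid-search sketch) and the test split from the earlier sketch:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Probability of the positive (fraud) class for each test record.
proba = best_model.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, proba)

# Among thresholds where recall is still at least 0.9, pick the one with the best precision.
ok = recall[:-1] >= 0.9  # thresholds has one fewer element than precision/recall
best_t = thresholds[ok][np.argmax(precision[:-1][ok])]
y_pred = (proba >= best_t).astype(int)
```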

 

target missing

 

potential issues

Bias-variance trade-off: what does regularization do?

Logistic regression, random forests

Difference between random forest and gradient boosted tree.

Anomaly detection/novelty detection techniques might also be helpful because of the huge data imbalance that normally exists in such scenarios.
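One possible anomaly-detection sketch with scikit-learn's IsolationForest; the contamination value is an assumption roughly matching the 0.17% fraud rate mentioned earlier:

```python
from sklearn.ensemble import IsolationForest

# Fit on the features only; no fraud labels are required.
iso = IsolationForest(contamination=0.0017, random_state=42)
iso.fit(X)

# predict() returns -1 for anomalies (suspected fraud) and 1 for normal points.
flags = iso.predict(X)
suspected_fraud = X[flags == -1]
```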

They asked about a lot of possible problems with the model and how you should deal with them when time is limited.

Couple of things to keep in mind regarding fraud:
1) You're dealing with an imbalanced data set (your fraud cases may be 3-5% of all your data). So consider either oversampling, or giving a higher weight to your fraud cases.
2) Your data may not have all the true fraud cases - in other words, there may be actual fraud cases not captured in your data. So some form of anomaly detection may be needed.

 

Predict whether a customer will close their credit card account (asked 3 times in 2018)

Suppose you are given a pile of datasets, e.g. one year of credit card transaction records and customer personal information. The bank wants to predict whether a customer will close their account within one month; if so, the bank plans to send some cashback rewards to those customers to retain them. You are asked to build a model to predict account closure. The interviewer's questions were:

1.        Which features would you choose? (It seemed like you could say anything reasonable, as long as it is relevant. Follow-up: if you have a bunch of transaction dates, how should you rebuild features from them?)
2.        How would you do data cleaning:
    a.            How would you detect outliers?
    b.            How would you fill in missing data? (I said you could fill in a constant such as the mean; he followed up with when filling in the mean is not appropriate and what would be better.)
    c.            What if the target value is also missing?
3.        Which model would you choose? (I said decision tree; he then asked whether there are other models, what their pros and cons are, and what the target is. The target should be a binary value, whether the customer will close the account within one month; if a regression produces a value between 0 and 1, it represents how likely that is.)
4.        How do you evaluate the model's performance, and with which package?
5.        If the data is very large, say 1 TB, how would you sample it, and with which package?
6.        If the model is inaccurate, what loss would that cause the bank?
7.        If the model produces a set of predicted target values, how should rewards be given out based on them? (I said plot the distribution and send rewards to the few percent of customers most likely to close their accounts. Follow-up: besides this, what other approaches are there? I'm not sure whether this was testing modeling or business sense.)
8.        The last one was exactly the same open question seen on the forum: two people both have a 5000 limit, but one uses 100% of it and the other only 2%. Could both of them close their accounts within a month? The interviewer presumably watches whether your first reaction is to think about the model or about other aspects.

All the steps, from feature engineering to the final model tuning and validation.

How to build the model, which parameters were used, what the results were, and why this model was chosen.

credit card churn model
      1. Feature engineering, e.g. compute tenure from the start date, and so on (see the sketch after this list)
      2. Missing values
      3. Which model to use, and why
      4. Now the data volume grows a lot; what do you do? Spark. If you had to choose, RSpark or PySpark? Why?
      5. Now the model output says a customer with 0% credit limit utilization and one with 95% utilization are both risky and both very likely to close the card soon. How would you handle that? I answered that the churn model is the starting point: the marketing department usually designs a retention program based on the churn model's output, and these two kinds of risky customers need different incentive plans.
             1) Customers with 0% utilization are basically very hard to win back.
             2) Customers with 95% utilization can most likely be retained: lower the interest rate, add more cashback, etc.
             3) Based on test results, we could also build an uplift model to find which high-churn users can actually be won back and focus the treatment on them.
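As referenced in item 1 above, a small pandas sketch of deriving tenure from the account start date; the file name, the start_date column and the snapshot date are assumptions:

```python
import pandas as pd

customers = pd.read_csv("customers.csv", parse_dates=["start_date"])  # hypothetical file and column

# Tenure as of a chosen snapshot date.
snapshot = pd.Timestamp("2018-12-31")
customers["tenure_days"] = (snapshot - customers["start_date"]).dt.days
customers["tenure_months"] = customers["tenure_days"] // 30
```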

  • Tell me some useful packages you use in R/Python.
  • How do you detect multicollinearity?
  • How do you join two data sets?

 

Other questions:

  • Our server running cost is xxx, other fixed costs are xxx, and it can handle xxx TB of traffic. We have roughly xxx customers, and each customer pays us a server usage fee of xxx/month. We allocate xxx GB to each customer, but on average each customer only uses xx% of it, so we can use the remaining space to take on more customers. Question: what is the yearly profit? Now there is another server B, with cost xxx and capacity xxx... Weigh up whether we should replace the existing servers with server B.
  • The problem: a sporting goods retailer asks you to optimize their online ad auction system to improve the response rate. Assume the data you have covers 3,000,000 users' visits, each row has 150+ columns, and the overall response rate is known to be 1/1000. The questions asked were:
    1. What do you choose as the target?
    Response or not.
    2. What metric do you choose?
    AUC-ROC.
    3. How do you handle NAs?
    It depends. If NA is meaningful, leave it there. If NA is missing due to data extraction, use a simple if-else condition / mean (median) / regression to fill it in.
    4. How do you do feature engineering?
    Encode categorical variables; use 'groupby' and 'mean/median/std' to generate some features.
    5. What if the data size is very large?
    MapReduce, but I haven't used it, so I gave local parallel processing as an example: how to distribute the data across threads and how to collect and merge the results.
    6. Which model do you use?
    GBDT, lightGBM/XGB.
    7. How do you evaluate model performance?
    k-fold CV.
    8. What do you do about overfitting/underfitting?
    Discussed each separately: try to get more data, tune the hyper-parameters.
    9. If the model's predictions go wrong, what is the impact?
    Discussed case by case: what changes overall and what the impact on an individual user is.

 

  • Given a dataset, how would you model it to extract particular information? How would you architect the pipeline?

 
