Understanding the Problem

The problem being tackled is fraud detection. Your goal is to help detect fraudulent credit card transactions. You will use a data set shared by a well-known company.

As a strong suggestion, you should focus on the first steps of any data mining project: exploring the data with the goal of obtaining a better understanding of the problem/task. Details concerning the data set, and suggestions on how to tackle this assignment, are available at the Kaggle competition webpage that will be used for the second assignment - here

Objective

The objective of this practical assignment is to provide informed and well-documented advice on which predictive modelling solution to use in order to accurately detect fraudulent credit card transactions (variable Class).

You should deliver a report describing the procedure you employed to decide which model you think is best for predicting the class of the credit card transactions. Your report should describe all the steps you followed to reach that conclusion.

My main interest is not in which model you recommend, but in the evaluation procedure used to reach that advice. In other words, I want to know which procedure you propose for recommending a model given: (i) a set of alternative models you are familiar with; and (ii) a concrete predictive task whose main focus is predicting the under-represented (rare) cases of fraudulent activity.
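To make this concrete, an evaluation loop for a rare positive class could look something like the sketch below. It uses only base R and a synthetic stand-in data set (reusing the `isfraud` column name from the code snippets further down); the stratified fold assignment keeps rare positive cases in every fold, and precision/recall on the positive class replace plain accuracy. The threshold rule standing in for a model is purely illustrative.

```r
set.seed(42)

# synthetic stand-in for the credit card data: roughly 5% positives
n <- 1000
d <- data.frame(x = rnorm(n), isfraud = rbinom(n, 1, 0.05))

# stratified k-fold assignment: sample folds within each class so that
# every fold contains its share of the rare fraudulent cases
k <- 5
folds <- integer(n)
for (cl in unique(d$isfraud)) {
  idx <- which(d$isfraud == cl)
  folds[idx] <- sample(rep(1:k, length.out = length(idx)))
}

prec <- rec <- numeric(k)
for (f in 1:k) {
  ts <- d[folds == f, ]                  # held-out fold
  # (a real model would be fitted on d[folds != f, ] here)
  p  <- as.integer(ts$x > 1.5)           # stand-in "model" prediction
  tp <- sum(p == 1 & ts$isfraud == 1)
  prec[f] <- if (sum(p == 1) > 0) tp / sum(p == 1) else NA
  rec[f]  <- tp / sum(ts$isfraud == 1)   # recall on the rare class
}
c(precision = mean(prec, na.rm = TRUE), recall = mean(rec))
```

The point of the stratification step is that a plain random split of a data set with ~5% positives can easily produce folds with almost no fraudulent cases, which makes per-fold recall estimates meaningless.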

What to deliver?

You need to deliver a dynamic report (an Rmd file) that describes your analysis. You should make sure that anyone who has access to the data set file will be able to run your report (it is acceptable to require the target reader to have R and a set of packages installed). This means that if I have your Rmd file and the R packages you declare as necessary for your report, I should be able to compile the report in RStudio.

In case there are parts of your model selection procedure that take too much time and/or computational resources, it is OK to include the code for these parts with eval=FALSE in the respective code chunk, run the code “outside” of the report, and then load the results into the report. In that case you need to send me not only your report but also any auxiliary files it loads. If that is the case, gather all the files in a ZIP and send it to me.
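As a sketch of that pattern (the file name cv_results.rds and the placeholder result below are illustrative, not prescribed by the assignment):

```r
# --- chunk marked eval=FALSE in the Rmd: run once, by hand, outside knitting ---
cv_results <- list(best_model = "ranger",
                   mean_recall = 0.71)    # placeholder for your real results
saveRDS(cv_results, "cv_results.rds")

# --- ordinary chunk, evaluated when the report is knitted ---
cv_results <- readRDS("cv_results.rds")
cv_results$best_model
```

The .rds file produced by the first chunk is exactly the kind of auxiliary file that should go into the ZIP alongside the Rmd.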

The Rmd file (+ auxiliary files) can be submitted through email to me (nuno.moniz@fc.up.pt).

Deadline (strict)

Code Snippet to load the data

library(data.table) # install.packages("data.table")

set.seed(123) # replace 123 with a random number of your choice

# read the full training set
bulk <- fread("train_v2.csv", sep = ",", header = TRUE)

# draw a 1% random sample of the rows
ids <- sample(1:nrow(bulk), 0.01 * nrow(bulk))
ds <- bulk[ids, ]

# proportion of fraudulent transactions in the sample
nrow(ds[ds$isfraud == 1, ]) / nrow(ds)

# if you prefer to handle the sample as a data.frame:
ds <- as.data.frame(ds)

Dummy example for baseline submission

library(data.table)
library(ranger) # install.packages("ranger")

set.seed(123) # replace 123 with a random number of your choice

train <- fread("train_v2.csv", sep = ",", header = TRUE)
test <- fread("test_v2.csv", sep = ",", header = TRUE)

# work on a 1% random sample of the training data
ids <- sample(1:nrow(train), 0.01 * nrow(train))
ds <- train[ids, ]

# if you prefer to handle the sample as a data.frame:
ds <- as.data.frame(ds)

# keep columns 8 onward (the predictors, plus isfraud in the training part)
ds_tr <- ds[, 8:ncol(train)]
ds_ts <- test[, 8:ncol(test)]

# random forest baseline; isfraud is numeric 0/1, so ranger fits a
# regression forest and the predictions are scores between 0 and 1
m <- ranger(isfraud ~ ., ds_tr)
p <- predict(m, ds_ts)

# build and write the submission file
preds <- data.frame(id = test$id, isfraud = p$predictions)
write.csv(preds, file = "preds1.csv", row.names = FALSE)
head(preds)
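Before submitting a file built this way, a quick format check can catch mistakes; the sketch below (base R only, column names taken from the dummy example, file name illustrative) verifies that the expected columns and row count survive the round trip through CSV:

```r
# stand-in predictions with the same layout as the dummy example's output
preds <- data.frame(id = 1:5, isfraud = c(0.10, 0.00, 0.85, 0.20, 0.05))
write.csv(preds, file = "preds_check.csv", row.names = FALSE)

# re-read and confirm the file has exactly the expected columns and rows
chk <- read.csv("preds_check.csv")
stopifnot(identical(names(chk), c("id", "isfraud")), nrow(chk) == nrow(preds))
```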