Endorsed by Curators:
ACM Data Science Camp 2017, Silicon Valley
Organized through www.SFbayACM.org, the San Francisco Bay Area chapter of the Association for Computing Machinery. We are a 501(c)(3) non-profit, run by unpaid volunteers, running this as a fundraiser.
We are seeking TAs who know R to help the audience. TA applicants should contact the instructor in advance: on the SFbayACM Meetup page for this event, use the [contact] button on the left and send your email, phone, LinkedIn and R experience.
8 HR CLASS - SUMMARY (detailed outline follows): Go through a sprint of a predictive data mining project, introducing R as we go. Review the training process for regression, backpropagation neural nets, decision trees and XGBoost. Introduce R data.tables and the caret interface to 233 predictive algorithms. Focus on strategies to structure a successful project design and data pull. Review a variety of preprocessing and knowledge representation techniques. Provide questions you can take away and apply to the design of your future projects, to describe models to clients (sensitivity analysis code included) and to manage models over their natural lifecycle. Introduce R + Spark integrations, and show an example R Shiny web GUI.
TARGET AUDIENCE would include people who ...
are comfortable programming
may already work on consulting projects or in some technical business problem solving role.
It helps to have tried R, or to have some basic exposure to it, before the class. The focus, however, is much more on "being successful with deploying Data Mining".
COURSE DESIGN: The instructor does not want to repeat "R in a Nutshell" or training that goes "sequential and broad" (i.e. everything about data structure X, then everything about feature Y). That material is great for a larger training time frame. For students to get the most out of a one-day class, the instructor is focusing on a "narrow" path, like a project sprint, going through a complete set of steps in a data mining project. Many pointers will be provided to invite you to broaden your skills after the class.
The instructor likes the Covey quote "If the ladder is not leaning against the right wall, every step we take just gets us to the wrong place faster." A successful data mining project is not just coding and executing a function. Design is crucial. There is a gap that is not covered by Kaggle experience or by starting with a ready-made data set. The instructor focuses on general strategies that you can take away as questions to ask about your upcoming project, such as how to identify projects and how to structure a project for success.
CLASS DETAILED OUTLINE
Part 1: Get started and play with your data
Overview (and Lab 1a) of RStudio, basics of variables, lists, read a CSV file into a data table, find out the ways to look at and manipulate the table. Discuss the HMEQ (Home Equity) data. The problem is to predict whether the person would be a good or bad loan.
Discuss a comparison / contrast for a few data mining algorithms: regression, neural nets, decision trees, XGBoost and ensemble models. Train a first decision tree on an existing training set (Lab 1b). Go to TensorFlow Playground to try setting some neural net parameters and training them on different data sets.
Part 2: Data Science Project Design
Data Mining Project Design and Objectives (accurate, general, understandable)
Designing the training data to represent the production scoring data in the future.
Retraining Frequency (daily or re-evaluate monthly)
Reference Dates (separate analysis past from future)
Target and Weight Variable Variations
Business Metrics to Optimize, lift tables
Big Data Production, Lambda and Kappa Architecture
R data.tables lecture / Lab 2 on the HMEQ data table. Show the analogy with SQL: selecting rows, creating columns, aggregation. Write a small function; R macros get you unstuck and help scale in complexity.
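The "lift tables" item in the list above can be illustrated with a short sketch. The class itself works in R; the Python below is not class material, only a minimal illustration under stated assumptions. The function name `lift_table`, the choice of five equal-size bins, and the scores and outcomes are all made up for the example.

```python
# Illustrative lift table. All data below is invented for the example.

def lift_table(scores, outcomes, n_bins=5):
    """Rank records by model score (highest first), split them into
    equal-size bins, and compare each bin's target rate with the overall rate."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    overall_rate = sum(outcomes) / len(outcomes)
    bin_size = len(ranked) // n_bins
    rows = []
    for b in range(n_bins):
        start = b * bin_size
        # The last bin absorbs any leftover records.
        end = (b + 1) * bin_size if b < n_bins - 1 else len(ranked)
        chunk = ranked[start:end]
        rate = sum(outcome for _, outcome in chunk) / len(chunk)
        rows.append({
            "bin": b + 1,
            "n": len(chunk),
            "target_rate": rate,
            "lift": rate / overall_rate,  # 1.0 means no better than random
        })
    return rows


# Hypothetical model scores and actual outcomes (1 = target event occurred).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
outcomes = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

table = lift_table(scores, outcomes, n_bins=5)
for row in table:
    print(row)
```

A lift well above 1.0 in the top bins means the model concentrates the target cases among its highest-scored records, which is often closer to a business metric than overall accuracy.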
Part 3: Preprocessing Design - Simple to Complex
Review math requirements on input data - by algorithms. Focus on preparing a data set that can get loaded in most any algorithm.
Missing data handling (simple to sophisticated)
Convert rules, queries or functions to detector fields to capture use cases of behavior
Convert observed frequency of normal to rareness detectors - for fraud detection.
Lab 3: preprocessing your HMEQ data
Fit linear models to time series within a record to extrapolate
Time series: detect individual past behavior to adapt future estimate
Don't ignore input variables with 20+ categories, use DBC (Dependent by Category)
Variable interactions: not A*B, DBC tables, clusters
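As one illustration of the list above, the item about fitting linear models to a time series within a record can be sketched as follows. This is not the instructor's code (the class uses R); it is a minimal Python sketch assuming evenly spaced observations, and the record layout, the field name `monthly_balance`, and the helper names are hypothetical.

```python
# Per-record trend features: fit a straight line to the time series stored
# inside one record, then use the slope and an extrapolated value as inputs.

def fit_line(ys):
    """Ordinary least squares fit of y = intercept + slope * t,
    with t = 0, 1, 2, ... (assumes evenly spaced points, at least two)."""
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return slope, intercept


def trend_features(ys, steps_ahead=1):
    """Return the fitted slope and the value extrapolated steps_ahead periods
    past the last observation."""
    slope, intercept = fit_line(ys)
    next_t = len(ys) - 1 + steps_ahead
    return {"slope": slope, "extrapolated": intercept + slope * next_t}


# Hypothetical record with four monthly observations.
record = {"customer_id": "A1", "monthly_balance": [100.0, 110.0, 120.0, 130.0]}
features = trend_features(record["monthly_balance"])
print(features)
```

The resulting slope and extrapolated value become ordinary columns, so an algorithm that has no notion of time can still use each record's individual past behavior.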
Part 4: Modeling Design, Spark
Model Notebook to track, plan design of experiments and to automate
Sensitivity Analysis: describe models or model ensembles overall. Provide record-level reasons. Explain how to detect model drift over time, and describe why.
Lab 4: training models, evaluating, running sensitivity analysis with the provided sensitivity code.
Discuss additional topics: Review available R + Spark combinations (Apache SparkR, RStudio's sparklyr, IBM's R4ML). Time permitting, discuss R web GUIs with Shiny & Shiny Dashboards, RStudio's TensorFlow for R.
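The sensitivity code provided in the class is not reproduced here. As a rough idea of what a sensitivity analysis can look like, the Python sketch below uses a simple one-input-at-a-time perturbation, which is an assumption and may differ from the instructor's method. The scoring function `toy_model`, the field names, and the perturbation sizes are hypothetical.

```python
# One-at-a-time sensitivity: shift a single input, hold the others fixed,
# and measure how much the model output moves on average.

def toy_model(record):
    """Stand-in scoring function (hypothetical). Any fitted model's
    prediction call could take its place."""
    return (0.5 * record["debt_ratio"]
            + 0.1 * record["delinquencies"]
            - 0.02 * record["years_on_job"])


def sensitivity(model, records, deltas):
    """For each input named in deltas, add its delta in every record and
    return the mean absolute change in the model output."""
    result = {}
    for name, delta in deltas.items():
        total = 0.0
        for rec in records:
            bumped = dict(rec)  # copy so the original record is untouched
            bumped[name] = rec[name] + delta
            total += abs(model(bumped) - model(rec))
        result[name] = total / len(records)
    return result


# Hypothetical records and perturbation sizes.
records = [
    {"debt_ratio": 0.30, "delinquencies": 0, "years_on_job": 5},
    {"debt_ratio": 0.45, "delinquencies": 2, "years_on_job": 1},
]
deltas = {"debt_ratio": 0.10, "delinquencies": 1, "years_on_job": 1}

overall = sensitivity(toy_model, records, deltas)
for name, change in sorted(overall.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(change, 4))
```

Because the procedure only calls the model as a black box, the same loop works for a single model or an ensemble. Running it on one record instead of the whole set gives a rough record-level ranking of inputs.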
BEFORE THE CLASS, PREPARATIONS: The class uses RStudio, the IDE you would use for typical R data mining projects at work.
This UCLA R Studio Tutorial link documents the following steps, which will be helpful before you come to the class. It is recommended to go over both the Installation and the short Basic Tutorial (if you don't already have this knowledge).
Install R 3.3.3 or later: https://cran.r-project.org/
Install RStudio Desktop IDE (free)
If you install on Windows, it is strongly recommended that you use this link to enable R to use your available memory, with --max-mem-size=xxxxMB. Install the devtools package.
Install R libraries: data.table, Hmisc, gmodels, e1071, doMC (if you are on a Mac or Unix), doParallel (if on Windows), caret, rpart, randomForest, partykit, pROC, nnet, xgboost, ggplot2, zoo. (Check a week before the class; the list may be updated.)
For fun, play around with some neural nets at the TensorFlow Playground. This will be covered in the class as well.
You are invited to submit a description of your upcoming predictive projects or vertical applications. The instructor will review them and may try to incorporate some ideas in the class. Through the SFbayACM meetup site for this event, on the left margin, use the [contact] button.
8:00 - 8:30 arrive, register, coffee, network
8:30 - 10:30 lecture / lab
15 min break, coffee
10:45 - 12:45 lecture / lab
45 min break for lunch
1:30 - 3:30 lecture / lab
15 min break, coffee, small snacks
3:45 - 6:00 lecture / lab
15 min Q&A
ABOUT THE SPEAKER:
Greg Makowski has been deploying data mining models for 25 years (since before the terms Data Science or Data Mining were in use) as the "neural net guy" at American Express/Epsilon. He likes to "begin with the end": with the business decisions and values to be made by the analytic system, the job function to be complemented, and the deployment constraints. He has developed the analytic internals and automation for 6+ enterprise software or SaaS systems. His first convolutional neural net was trained in 1991, a Time Delay Neural Net for speech recognition. Vertical experience includes financial services (credit card, retail banking, bond pricing, ACH payments, fraud detection), customer relationship management (mail, phone, email, banner) and retail supply chain, among others. He always has something to learn from everybody.