Predictive Data Science in R

Sep 16 8:00AM - 6:00PM

Organized through the San Francisco Bay Area Association for Computing Machinery (SFbayACM). We are a 501(c)(3) non-profit, run by unpaid volunteers, running this class as a fundraiser.

We are seeking TAs who know R to help the audience. TA applicants should contact the instructor in advance. On the SFbayACM Meetup page for this event, use the [contact] button on the left (send email, phone, LinkedIn and R experience).

8 HR CLASS - SUMMARY (detailed outline follows): Go through a sprint of a predictive data mining project, introducing R as we go. Review the training process for regression, backpropagation neural nets, decision trees and XGBoost. Introduce R data.tables and the caret interface to 233 predictive algorithms. Focus on strategies to structure a successful project design and data pull. Review a variety of preprocessing and knowledge representation techniques. Provide questions you can take away and apply to the design of your future projects, to describe models to clients (sensitivity analysis code included) and to manage models over their natural lifecycle. Introduce R + Spark integrations, and show an example R Shiny web GUI.

TARGET AUDIENCE would include people who ...

are comfortable programming

may already work on consulting projects or in some technical business problem solving role.

It is helpful if you have tried R or had some basic exposure to it before the class. The focus is much more on "being successful with deploying Data Mining".

COURSE DESIGN: The instructor does not want to repeat "R in a Nutshell" or training that goes "sequential and broad" (i.e. everything about data structure X, then everything about feature Y). That material is great for a larger training time frame. For students to get the most out of a one day class, the instructor is focusing on a "narrow" path, like a project sprint, going through a complete set of steps in a data mining project. Many pointers will be provided to invite you to broaden your skills after the class.

The instructor likes the Covey quote "If the ladder is not leaning against the right wall, every step we take just gets us to the wrong place faster." A successful data mining project is not just coding and executing a function. Design is crucial. There is a gap that is not covered by Kaggle experience or by starting with a ready-made data set. The instructor focuses on general strategies that you can take away as questions to ask about your upcoming project, such as how to identify projects and how to structure a project for success.


Part 1: Get started and play with your data

Overview (and Lab 1a) of RStudio: basics of variables and lists, reading a CSV file into a data table, and ways to look at and manipulate the table. Discuss the HMEQ (Home Equity) data. The problem is to predict whether a person's loan would be good or bad.

Compare and contrast a few data mining algorithms: regression, neural nets, decision trees, XGBoost and ensemble models. Train a first decision tree on an existing training set (Lab 1b). Go to TensorFlow Playground to try setting some neural net parameters and training them on different data sets.

Part 2: Data Science Project Design

Data Mining Project Design and Objectives (accurate, general, understandable)

Designing the training data to represent the production scoring data in the future.

Retraining Frequency (daily or re-evaluate monthly)

Reference Dates (separate analysis past from future)

Target and Weight Variable Variations

Business Metrics to Optimize, lift tables

Big Data Production, Lambda and Kappa Architecture
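As one illustration of the "lift tables" item above, here is a minimal sketch of how a lift table can be computed. The class itself works in R; this sketch uses plain Python only to stay self-contained. The function name `lift_table`, the bin count, and the scores and outcomes are all made up for illustration and are not taken from the class materials.

```python
def lift_table(scores, targets, n_bins=5):
    """Rank records by model score (highest first), cut them into n_bins
    groups of near-equal size, and compare each group's target rate with
    the overall target rate. Assumes len(scores) >= n_bins and that at
    least one target is 1."""
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    overall_rate = sum(targets) / len(targets)
    rows = []
    for b in range(n_bins):
        start = b * len(ranked) // n_bins
        end = (b + 1) * len(ranked) // n_bins
        group = ranked[start:end]
        rate = sum(t for _, t in group) / len(group)
        rows.append({
            "bin": b + 1,
            "n": len(group),
            "target_rate": rate,
            "lift": rate / overall_rate,  # > 1 means better than random selection
        })
    return rows


# Made-up model scores and outcomes (1 = bad loan), for illustration only.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
targets = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

table = lift_table(scores, targets, n_bins=5)
for row in table:
    print(row)
```

A lift above 1 in the top bins means the model concentrates the target cases there, which is what a business metric built on "work the top N percent of the list" depends on.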

R data.tables lecture / Lab 2 on the HMEQ data table. Show the analogy with SQL: selecting rows, creating columns, aggregation. Write a small function; R macros get you unstuck and help scale in complexity.
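The SQL side of that analogy can be sketched with an in-memory SQLite table. This is only an illustration under stated assumptions: the column names follow the public HMEQ data set, the five rows are made up, and the data.table expressions in the comments are the usual `DT[i, j, by]` equivalents rather than code from the lab.

```python
import sqlite3

# Tiny made-up sample using HMEQ-style column names (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hmeq (BAD INTEGER, LOAN REAL, VALUE REAL, JOB TEXT)")
conn.executemany(
    "INSERT INTO hmeq VALUES (?, ?, ?, ?)",
    [
        (1, 1100, 39025, "Other"),
        (0, 1500, 68400, "Office"),
        (0, 2400, 112000, "Office"),
        (1, 2000, 40320, "Other"),
        (0, 3000, 90000, "Mgr"),
    ],
)

# Selecting rows.  data.table: hmeq[LOAN >= 2000]
big_loans = conn.execute("SELECT * FROM hmeq WHERE LOAN >= 2000").fetchall()

# Creating a column.  data.table: hmeq[, loan_to_value := LOAN / VALUE]
loan_to_value = conn.execute(
    "SELECT LOAN / VALUE AS loan_to_value FROM hmeq"
).fetchall()

# Aggregation.  data.table: hmeq[, .(n = .N, bad_rate = mean(BAD)), by = JOB]
bad_rate_by_job = conn.execute(
    "SELECT JOB, COUNT(*) AS n, AVG(BAD) AS bad_rate "
    "FROM hmeq GROUP BY JOB ORDER BY JOB"
).fetchall()

print(len(big_loans), "rows with LOAN >= 2000")
print(bad_rate_by_job)
```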

Part 3: Preprocessing Design - Simple to Complex

Review math requirements on input data - by algorithms. Focus on preparing a data set that can get loaded in most any algorithm.

Missing data handling (simple to sophisticated)

Convert rules, queries or functions to binary (0/1) detector fields to capture use cases of behavior

Convert observed frequency of normal to rareness detectors - for fraud detection.

Lab 3: preprocessing your HMEQ data
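The first three preprocessing ideas above can be sketched in a few lines. This is a hedged illustration in plain Python, not the lab code: median fill with a missing flag is just one of the simple options, the 0.8 loan-to-value rule is a hypothetical detector, and scoring rareness as the negative log of observed frequency is an assumed choice, not necessarily the instructor's.

```python
import math
from collections import Counter
from statistics import median


def impute_with_flag(values):
    """Simple missing-data handling: fill None with the median of the
    observed values, and keep a 0/1 flag so the model can still see
    which records were missing."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing


def rule_detector(loans, values, threshold=0.8):
    """Turn a rule into a 0/1 detector field. The rule here (loan-to-value
    above a threshold) is hypothetical."""
    return [1 if loan / value > threshold else 0 for loan, value in zip(loans, values)]


def rareness_scores(categories):
    """Score each record by how rare its category is: -log2(frequency).
    Common categories score near 0, rare ones score high."""
    counts = Counter(categories)
    total = len(categories)
    return [-math.log2(counts[c] / total) for c in categories]


# Made-up values, for illustration only.
debtinc = [34.8, None, 29.1, 41.0, None]
filled, was_missing = impute_with_flag(debtinc)

high_ltv = rule_detector([9000, 2000], [10000, 40000])

merchant = ["grocery"] * 6 + ["fuel", "jewelry"]
rare = rareness_scores(merchant)

print(filled, was_missing)
print(high_ltv)
print(rare)
```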

Fit linear models to time series within a record to extrapolate

Time series: detect individual past behavior to adapt future estimate

Don't ignore input variables with 20+ categories; use DBC (Dependent by Category)

Variable interactions: not A*B, DBC tables, clusters
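The outline does not define DBC beyond its name, so the sketch below rests on an assumption: that a "Dependent by Category" table maps each category to the average of the dependent (target) variable for that category, which lets a high-cardinality field enter a model as one numeric column. The function names, the `min_count` fallback, and the data are hypothetical, and plain Python is used only for a self-contained illustration.

```python
from collections import defaultdict


def dbc_table(categories, targets, min_count=2):
    """Build a category -> average target lookup from training data.
    Categories seen fewer than min_count times fall back to the overall
    average, so a single record cannot define its own value."""
    overall = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    table = {
        c: (sums[c] / counts[c] if counts[c] >= min_count else overall)
        for c in counts
    }
    return table, overall


def dbc_encode(categories, table, overall):
    """Replace each category with its table value; categories never seen
    in training get the overall average."""
    return [table.get(c, overall) for c in categories]


# Made-up training data (1 = bad loan), for illustration only.
jobs = ["Office", "Office", "Other", "Other", "Other", "Mgr"]
bad = [0, 0, 1, 1, 0, 1]

table, overall = dbc_table(jobs, bad)
encoded = dbc_encode(["Office", "Mgr", "Sales"], table, overall)
print(table)
print(encoded)
```

The table should be built from training records only and then applied to validation and production records; building it on the records being scored would leak the target into the input.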

Part 4: Modeling Design, Spark

Model Notebook to track, plan design of experiments and to automate

Sensitivity Analysis: describe models or model ensembles overall. Provide record level reasons. Explain how to detect model drift over time, and describe why it occurs.

Lab 4: training models, evaluating them, and running sensitivity analysis with the provided sensitivity code.
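The provided sensitivity code is not reproduced here. As a rough idea of what a sensitivity analysis does, the following is a generic one-variable-at-a-time sketch in plain Python: nudge each input, hold the others fixed, and measure how much the model output moves. The `sensitivity` function, the step sizes, and `toy_model` (a stand-in for any trained model) are all hypothetical.

```python
def sensitivity(model, records, deltas):
    """For each input named in deltas, add its delta to that input in
    every record (others unchanged) and return the mean absolute change
    in the model's output."""
    result = {}
    for name, delta in deltas.items():
        total = 0.0
        for rec in records:
            bumped = dict(rec)
            bumped[name] = rec[name] + delta
            total += abs(model(bumped) - model(rec))
        result[name] = total / len(records)
    return result


def toy_model(rec):
    """Hypothetical scoring function standing in for a trained model.
    It ignores 'yoj' on purpose, so that input should show zero sensitivity."""
    return 0.5 * rec["debtinc"] + 2.0 * rec["delinq"]


# Made-up records, for illustration only.
records = [
    {"debtinc": 30.0, "delinq": 0.0, "yoj": 5.0},
    {"debtinc": 40.0, "delinq": 2.0, "yoj": 10.0},
]
deltas = {"debtinc": 1.0, "delinq": 1.0, "yoj": 1.0}

overall = sensitivity(toy_model, records, deltas)
ranking = sorted(overall, key=overall.get, reverse=True)
print(overall)
print(ranking)
```

Averaging over all records gives an overall description of the model; calling `sensitivity` with a single record in the list gives a record level view of which inputs matter most for that one score.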

Discuss additional topics: review available R + Spark combinations (Apache SparkR, RStudio's sparklyr, IBM's R4ML). Time permitting, discuss R web GUIs with Shiny and Shiny Dashboards, and RStudio's TensorFlow for R.

BEFORE THE CLASS, PREPARATIONS: The class uses RStudio, the IDE which is what you would use for typical R data mining projects at work.

This UCLA R Studio Tutorial link documents the following steps, which will be helpful to complete before you come to the class. It is recommended that you go over both the Installation and the short Basic Tutorial (if you don't already have this knowledge).

Install R 3.3.3 or later

Install RStudio Desktop IDE (free)

If you install on Windows, it is strongly recommended that you use this link to enable R to use your available memory, with --max-mem-size=xxxxMB. Install the devtools package.

Install R libraries: data.table, Hmisc, gmodels, e1071, doMC (if you are on a Mac or Unix), doParallel (if on Windows), caret, rpart, randomForest, partykit, pROC, nnet, xgboost, ggplot2, zoo. (Check a week before the class, the list may get updated).

For fun, play around with some neural nets at the TensorFlow Playground. This will be covered in the class as well.

You are invited to submit a description of your upcoming predictive projects or vertical applications. The instructor will review them and may try to incorporate some ideas in the class. On the SFbayACM Meetup site for this event, use the [contact] button in the left margin.


8:00 - 8:30 arrive, register, coffee, network

8:30 - 10:30 lecture / lab

15 min break, coffee

10:45 - 12:45 lecture / lab

45 min break for lunch

1:30 - 3:30 lecture / lab

15 min break, coffee, small snacks

3:45 - 6:00 lecture / lab

15 min Q&A



Greg Makowski has been deploying data mining models for 25 years (since before the terms Data Science or Data Mining were used) as the "neural net guy" at American Express/Epsilon. He likes to "begin with the end": the business decisions and values to be made by the analytic system, the job function to be complemented, and the deployment constraints. He has developed the analytic internals and automation for 6+ enterprise software or SaaS systems. His first convolutional neural net was trained in 1991, a Time Delay Neural Net for speech recognition. Vertical experience includes financial services (credit card, retail banking, bond pricing, ACH payments, fraud detection), customer relationship management (mail, phone, email, banner), and retail supply chain, among others. He always has something to learn from everybody.
