Avoiding embarrassment by testing data assumptions with expectdata

Expectdata is an R package that makes it easy to test assumptions about a data frame before conducting analyses. Below is a concise tour of some of the data assumptions expectdata can test for you. For example,

Note: assertr is an rOpenSci project that aims to provide similar functionality. I haven’t evaluated the pros and cons yet, but rOpenSci backing is a big pro for assertr.
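To make "testing assumptions about a data frame" concrete, here is a minimal base R sketch. It does not use the expectdata or assertr APIs; the `check_assumptions` helper, the `patients` data frame, and its column names are all hypothetical and exist only for illustration.

```r
# Toy data frame; the column names and values are made up for illustration.
patients <- data.frame(
  id  = c(1, 2, 3, 4),
  age = c(34, 51, 29, 62)
)

# Hypothetical helper (not part of expectdata or assertr):
# returns a character vector describing every assumption that failed.
check_assumptions <- function(df, key, required) {
  problems <- character(0)

  # Assumption 1: every required column is present.
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    problems <- c(problems,
                  paste("missing columns:", paste(missing_cols, collapse = ", ")))
  }

  # Assumption 2: the key column has no duplicated values.
  if (key %in% names(df) && anyDuplicated(df[[key]]) > 0) {
    problems <- c(problems, paste("duplicate values in", key))
  }

  # Assumption 3: the required columns that exist contain no NA.
  present <- intersect(required, names(df))
  if (length(present) > 0 && any(is.na(df[present]))) {
    problems <- c(problems, "NA values in required columns")
  }

  problems
}

problems <- check_assumptions(patients, key = "id", required = c("id", "age"))

# Stop before any analysis runs if an assumption does not hold.
if (length(problems) > 0) stop(paste(problems, collapse = "; "))
```

Collecting all failures before stopping, rather than halting at the first one, is a design choice of this sketch: it reports every problem with the data in one pass.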

Mining Sent Email for Self-Knowledge

How can we use data analytics to increase our self-knowledge? Along with biofeedback from digital devices like FitBit, less structured sources such as sent emails can provide insights.

For example, here it seems my communication took a sudden, more positive turn in 2013. Let’s see what else shakes out of my sent email corpus.

monthly_sentiment

ImageNet needs more Wild Boar Photos

Is your deep convolutional network misclassifying images? You can find out why with a heatmap of class activation overlaid on its misclassified pictures.

A heatmap overlay shows the parts of an image that most strongly activate a neural network’s last convolutional layer. In this African elephant picture, the top-most convolutional layer of the VGG16 architecture turns the photo into a 14x14 grid, highlighting the blocks with the strongest African_elephant activation:

af_el_1

Ghosts of Animals Haunt Portland's Overpriced Apartments

Apartment hunting in an expensive city is driving me to curses and exclamations. The following are some outstanding examples of insanely priced apartments in Portland, OR, run through Google Deep Dream in the hope of understanding why people pay so much for a small box. These listings will be gone in no time (I’m sure), so I’m including some captions for posterity.

Let’s start with this one. Indeed, it appears $1899 for 1 bedroom grants access to this clubhouse haunted by some floating apparition:

clubhousedd

Clustering NHL Goalies

This has been a great Stanley Cup playoffs for Washington Capitals fans such as myself. With so much breath holding, I’ve paid more attention this year than in recent years. As a former goalie myself, I grew curious: who are these goalies, so much better than me that they beat me to the NHL? Who’s a hero, and who’s maybe not a keeper?

What better way to understand how they shake out than clustering their regular season statistics? This is an opportunity to work with tibbleColumns by Hoyt Emerson, a new package that adds some intriguing functionality to dplyr, and dendextend by Tal Galili, which adds options to hierarchical clustering diagrams. The best data I found came from Rob Vollman at http://www.hockeyabstract.com/testimonials.
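As a rough sketch of the clustering idea, the following uses only base R (`scale`, `dist`, `hclust`, `cutree`) rather than tibbleColumns or dendextend. The data are simulated stand-ins, not real NHL statistics, and the column names, the Ward linkage, and the choice of three clusters are assumptions made for illustration.

```r
set.seed(42)

# Simulated stand-in for regular-season goalie statistics; NOT real NHL data.
# Column names are assumptions chosen for illustration.
goalies <- data.frame(
  save_pct      = rnorm(12, mean = 0.915, sd = 0.008),
  goals_against = rnorm(12, mean = 2.7, sd = 0.3),
  games_played  = round(runif(12, min = 20, max = 65))
)
rownames(goalies) <- paste("Goalie", LETTERS[1:12])

# Standardize so no single statistic dominates the distance calculation.
scaled <- scale(goalies)

# Agglomerative hierarchical clustering on Euclidean distances.
tree <- hclust(dist(scaled), method = "ward.D2")

# Cut the tree into three groups (k = 3 is an arbitrary choice here).
clusters <- cutree(tree, k = 3)

# Dendrogram, plus the number of goalies in each cluster.
plot(tree, main = "Simulated goalies, Ward clustering")
table(clusters)
```

Standardizing first matters because games played is on a much larger scale than save percentage; without it, the distances would mostly reflect workload.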

Bottom line up front: unsupervised learning here taught more about my data set, and less about the world it represents. Where it did teach about the world of NHL goalies, it showed that this guy stands out:

Frederick Andersen

Best city for data scientists today according to two variables harvested with rvest

Jobs vs. Cost of Living by City

Some cities are more appealing places for a data scientist to live than others. Several websites list the best cities for data scientists, but their lists don’t agree and their methods are not explained, so it is hard to judge the quality of the analysis or to determine which individuals the results apply to.

So here I set out to develop a reproducible, if not quite complete, measure of data scientist city attractiveness. Number of jobs for data scientists and cost of living may be two important variables. Using R’s rvest package, we can scrape the necessary information from the web to get an idea of how cities look in terms of these two. The result is DSCA index v0.1, shown above with the top 21 cities labeled.

Bottom Line Up Front: Eight large hiring metros stand out, and less expensive cities within commuting distance of them look best by these variables. In particular, Newark, NJ, being cheap and close to lots of data scientist jobs, has the highest value in DSCA index v0.1. These two variables alone and the sources I chose show some interesting initial results, but they don’t seem to capture a complete picture, so more work is needed before getting a reliable, reproducible index.

Twitter Sentiment with R on Azure ML Studio

sentiment_by_day

Downloading data from Twitter in R, running it through Azure ML Studio, and analyzing the output back in R turns out to be rather involved. Here are the steps I’ve taken so far.

ASA Conference on Statistical Practice 2018, Friday 4 of 6, Working with Health Care Data

Visualizing missing data with VIM

ASA Conference on Statistical Practice 2018, Friday 3 of 6, Data Mining Algorithms & Presenting and Storytelling

Curb appeal

ASA Conference on Statistical Practice 2018, Friday 2 of 6, Streamlining Your Work Using (Shiny) Apps

Before Shiny

Highlights from the Conference on Statistical Practice.

ASA Conference on Statistical Practice 2018, Friday 1 of 6, Keynote Address & Working with Messy Data

All lines

Some highlights from the Conference on Statistical Practice sessions: to keep it simple, just a few notes and my favorite slide per session I attended. I work in Population Health Analytics at Legacy Hospital System in Portland, Oregon, so some of these presenters’ work may be interpreted through that lens.

Does less sleep today lead to more calories tomorrow?

Jun-Dec C v. D

Introduction

Over the last few months I’ve gained some weight, so I’m curious whether analytics can give insight and show opportunities to get my BMI into the normal-weight range. As a first step, here is an inquiry into a hypothesis about calories vs. sleep.

Hello World! Here's a Normal Distribution!

This code simulates Normal(0,1), and this visualization shows that smaller samples can deviate from the true distribution much more than large samples. Maybe it’s not a fascinating picture, although there is a deep mystery or two in there. Can we know the truth? Isn’t everything we know based on a sample? Is everything we believe, like these three rnorm() samples, an incomplete story?
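A simulation along these lines might look like the sketch below. The excerpt does not give the sample sizes or the plot type, so the three sizes (10, 100, 10000) and the use of histograms with the true density overlaid are assumptions.

```r
set.seed(1)

# Three sample sizes; these particular values are assumptions for illustration.
sizes <- c(10, 100, 10000)

# One draw from Normal(0, 1) per sample size.
samples <- lapply(sizes, function(n) rnorm(n, mean = 0, sd = 1))

# Plot each sample as a density histogram with the true N(0, 1) curve overlaid.
op <- par(mfrow = c(1, 3))
for (i in seq_along(sizes)) {
  hist(samples[[i]], freq = FALSE, breaks = 20, xlim = c(-4, 4),
       main = paste("n =", sizes[i]), xlab = "x")
  curve(dnorm(x), add = TRUE, lty = 2)
}
par(op)

# Compare sample means and standard deviations with the true values 0 and 1.
data.frame(n = sizes, mean = sapply(samples, mean), sd = sapply(samples, sd))
```

Changing the seed and rerunning shows the point of the post: the small-sample panel changes shape from run to run, while the large-sample panel stays close to the dashed curve.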

random simulations output