I’ve been exploring data through Excel and MySQL databases, which isn’t ideal, so R was a nice breath of fresh air. The dataset I’m most familiar with is my NHL Draft database, so I loaded it into R to see what I could do with it.

The first thing I did was test a null hypothesis with a randomization test. I previously wrote a piece about the importance of handedness in hockey, and I wanted to see where my result landed on that randomized table. After running the R code and plotting a histogram, I found that my results were far from random, and I rejected the null hypothesis. Yay. This was an ideal experiment for an A/B-style test, but I had to go a step further for non-boolean data.
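To make the idea concrete, here is a minimal sketch of a randomization (permutation) test in Python. The handedness numbers below are invented for illustration, not from my actual database; the logic is the same as what R's `sample()`-based approach does: shuffle the group labels many times and see how often the shuffled difference is at least as extreme as the observed one.

```python
import random

# Hypothetical point totals for left- vs right-handed players (made-up data).
left = [42, 55, 38, 61, 47, 50, 58]
right = [30, 35, 41, 28, 33, 39, 36]

observed = sum(left) / len(left) - sum(right) / len(right)

# Pool everyone, then repeatedly shuffle and re-split to build the
# "randomized table": the distribution of differences under the null.
pooled = left + right
n_left = len(left)
trials = 10000
count = 0
random.seed(1)
for _ in range(trials):
    random.shuffle(pooled)
    diff = (sum(pooled[:n_left]) / n_left
            - sum(pooled[n_left:]) / (len(pooled) - n_left))
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / trials
```

If the observed difference sits far out in the tail of the shuffled differences, the p-value is small and the null hypothesis of "handedness doesn't matter" gets rejected.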

So right now I have data for a) when a player was selected and b) how that player has performed. I have a regression line going through that data — a logarithmic function — and it works decently. But the next step is to smooth these lines using local polynomials.
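As a sketch of what local polynomial smoothing does, here is a pure-Python version of a local *linear* fit with tricube weights, which is the simplest member of the family (in R, `loess()` is the usual tool for this). The pick-versus-performance numbers are invented stand-ins for the draft data.

```python
def local_linear_smooth(xs, ys, bandwidth):
    """At each x, fit a weighted straight line using a tricube kernel
    over the neighbors within `bandwidth`, and evaluate it at x."""
    smoothed = []
    for x0 in xs:
        # Tricube weights: nearby points count a lot, distant ones not at all.
        w = []
        for x in xs:
            d = abs(x - x0) / bandwidth
            w.append((1 - d ** 3) ** 3 if d < 1 else 0.0)
        # Closed-form weighted least squares for y = a + b*x.
        sw = sum(w)
        sx = sum(wi * xi for wi, xi in zip(w, xs))
        sy = sum(wi * yi for wi, yi in zip(w, ys))
        sxx = sum(wi * xi * xi for wi, xi in zip(w, xs))
        sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, xs, ys))
        denom = sw * sxx - sx * sx
        if denom == 0:
            smoothed.append(sy / sw)  # degenerate case: weighted mean
        else:
            b = (sw * sxy - sx * sy) / denom
            a = (sy - b * sx) / sw
            smoothed.append(a + b * x0)
    return smoothed

# Toy draft-style data: pick number vs. a noisy, declining performance score.
picks = list(range(1, 21))
scores = [90, 85, 88, 70, 75, 68, 66, 60, 63, 55,
          58, 50, 52, 48, 45, 47, 40, 42, 38, 36]
smooth = local_linear_smooth(picks, scores, bandwidth=6)
```

The bandwidth controls the trade-off: a small one hugs the noise, a large one flattens toward a single global line.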

These strategies are incredibly valuable; I think Mark is right that, at some point, they will prove very useful. I had a few moments where I thought, “Hey, that’s something I’ve been working on for a while, and this strategy would be perfect!” I had other moments where I thought a technique would come in handy in the future, though right now I have no application for it.

That said, the most interesting part of this course was the historical background of probability. I’ve often approached data with the mindset that there’s a ‘right’ way and a ‘wrong’ way to do things, and not really a spectrum of ‘worse’ to ‘better’. And I appreciated how we thought about data as real-world indicators, not reality itself. It strangely made data more real, which I often find tough when looking at a whole bunch of numbers in a table.

Moving forward, I plan to get more adept at R. I understand the concepts we talked about; I just don’t know how to implement them yet. The last class was incredibly helpful in showing me how to take data, run it through something like a Python script, push it into R, and then adjust the data further to plot it in a workable way. I understand that, in itself, could take several weeks. But I think it’s a worthwhile lesson plan to help us get data into a tool like R. At ITP, I’ve learned most by pursuing my own curiosities and then asking for (or already having been taught) tools to work through those questions. I think this class gave me ideas about the tools that exist, which was awesome. But I would’ve liked to walk through an example of, say, fitting local polynomials to given data points. If this were a full-semester class, perhaps the contents of the course could be spread out and some of these examples could be worked through.

Robust regression is an alternative to least-squares regression when the data are contaminated with outliers or influential observations; it can also be used to detect those influential observations.
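A common way to do this is iteratively reweighted least squares (IRLS) with Huber weights, which is roughly what R’s `MASS::rlm()` does. Here is a small pure-Python sketch on made-up data: a clean line y = 2x with one gross outlier, which ordinary least squares would let drag the fit badly.

```python
def huber_weight(r, k=1.345):
    # Full weight for small residuals, shrinking weight for large ones.
    r = abs(r)
    return 1.0 if r <= k else k / r

def robust_line_fit(xs, ys, iterations=20):
    """Fit y = a + b*x by IRLS with Huber weights, so outliers pull
    the line much less than they would under plain least squares."""
    n = len(xs)
    w = [1.0] * n  # start from ordinary least squares
    a = b = 0.0
    for _ in range(iterations):
        sw = sum(w)
        sx = sum(wi * x for wi, x in zip(w, xs))
        sy = sum(wi * y for wi, y in zip(w, ys))
        sxx = sum(wi * x * x for wi, x in zip(w, xs))
        sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
        b = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
        a = (sy - b * sx) / sw
        resid = [y - (a + b * x) for x, y in zip(xs, ys)]
        # Scale residuals by a robust spread estimate (the MAD).
        mad = sorted(abs(r) for r in resid)[n // 2] or 1.0
        scale = mad / 0.6745
        w = [huber_weight(r / scale) for r in resid]
    return a, b

# A clean y = 2x relationship with one wild outlier at the end.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 10, 12, 14, 100]
a, b = robust_line_fit(xs, ys)
```

The outlier’s residual stays huge on every pass, so its weight keeps shrinking and the fitted slope settles near the true value of 2 instead of being yanked upward; a point whose weight ends up near zero is exactly the kind of influential observation the technique flags.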