[Data] Observing data with R

I’ve been observing data via Excel and MySQL databases, which isn’t ideal, so R was a nice breath of fresh air. The dataset I’m most familiar with is my NHL Draft database, so I pushed it into R to see what I could do with it.

The first thing I did was test a null hypothesis with a randomization test. I previously wrote a piece about the importance of handedness in hockey, and I wanted to see where my observed result landed in the randomized distribution. After running the R code and plotting a histogram, I found that my result sat far outside what random shuffling produced, so I could reject the null hypothesis. Yay. This was an ideal experiment for an A-B test, but I had to go a step further for non-boolean data.
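A randomization test of this kind can be sketched in a few lines. This is a minimal Python sketch, not the R code mentioned above: the function name `permutation_test` and the two small samples are made up for illustration and are not from the draft database. It shuffles the pooled values many times and counts how often a random split produces a difference in means at least as large as the observed one.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided randomization test on the difference in group means.

    Returns (observed_difference, p_value).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so any random split of the pooled values is equally likely.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up numbers for illustration only (not real draft data).
group_a = [12, 15, 9, 20, 14, 11, 18, 16]
group_b = [8, 7, 10, 6, 9, 12, 5, 11]

observed, p_value = permutation_test(group_a, group_b)
print(f"observed difference in means: {observed}")
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value means that few random shuffles matched the observed gap, which is the "far from random" result a histogram of the shuffled differences shows visually.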

So right now I have data for a) when a player was selected and b) how that player has performed. I have a regression line going through that data — a logarithmic function — and it works decently. But the next step is to smooth these lines using local polynomials.
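Both steps can be sketched under some loud assumptions. The code below is Python, not R, and it uses synthetic pick/performance numbers rather than the draft database; `fit_log_curve` and `local_linear` are hypothetical helper names. The first fits y = a + b·ln(x) by least squares, and the second is one simple form of local polynomial smoothing (degree 1, Gaussian kernel weights). In R, something like `loess` would usually do the smoothing instead.

```python
import math


def fit_log_curve(xs, ys):
    """Least-squares fit of y = a + b * ln(x). Returns (a, b)."""
    ts = [math.log(x) for x in xs]
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    b = sxy / sxx
    a = y_mean - b * t_mean
    return a, b


def local_linear(xs, ys, x0, bandwidth):
    """Local degree-1 polynomial estimate at x0.

    Points near x0 get more weight (Gaussian kernel); a weighted
    straight line is fitted and evaluated at x0.
    """
    ws = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return y_bar + slope * (x0 - x_bar)


# Synthetic data for illustration only: a log-shaped decline by pick
# number, plus a deterministic wiggle standing in for noise.
picks = list(range(1, 31))
values = [10 - 2 * math.log(p) + 0.3 * math.sin(p) for p in picks]

a, b = fit_log_curve(picks, values)
smoothed = [local_linear(picks, values, p, bandwidth=3.0) for p in picks]

print(f"log fit: y = {a:.2f} + {b:.2f} * ln(x)")
print("first smoothed values:", [round(s, 2) for s in smoothed[:5]])
```

The bandwidth controls the trade-off: a small value follows every wiggle, while a large one approaches a single straight line through all the points.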

These strategies are incredibly valuable; I think Mark is right that, at some point, they will come in very handy. I had a few moments where I thought, “Hey, that’s something I’ve been working on for a while, and this strategy would be perfect!” I had other moments where I thought a technique would be useful in the future, even though I have no application for it right now.

That said, the most interesting part of this course was the historical background of probability. I’ve often approached data with the mindset that there’s a ‘right’ way and a ‘wrong’ way to do things, and not really a spectrum from ‘worse’ to ‘better’. And I appreciate how we thought about data as indicators of the real world, and not reality itself. Strangely, that made data feel more real, which I often find tough when looking at a whole bunch of numbers in a table.

Moving forward, I plan to get more adept in R. I understand the concepts we talked about. I don’t know how to implement them yet. The last class was incredibly useful in helping me understand how to take data, run it through something like a Python script, push it into R, and then adjust the data further to plot it in a workable way. I understand that, in itself, could take several weeks. But I think it’s a worthwhile lesson plan to help us get data into a tool like R. At ITP, I’ve learned most by pursuing my own curiosities and then asking for (or already having been taught) tools to work through those questions. I think this class gave me ideas about the tools that exist, which was awesome. But I would’ve liked to walk through an example of, say, making local polynomials for given data points. If this were a full-semester class, perhaps the contents of the course could be spread out and some of these examples could be worked through.
