In my previous post, I talked about using Wikipedia to expand my taste in music by repeatedly using the random page function until I hit a page about music. A natural question came up: how long does it take to come across a page pointing towards a piece of music?
To test this, I programmatically obtained the first sentence of 1000 Wikipedia articles. I had initially planned to use plain web-scraping (using the requests
python package and a some HTML parsing), but it turns out there's a Wikimedia API that will return random pages in bulk. After some programming, I had my summaries in hand.1
Next, I created a small command line interface to manually classify each of the 1000 pages as being about music. I was purposely broad about this -- I classified any page about a musician, band, single, EP, album, composer, opera, musical, DJ, etc. as being about music. As long as I thought that seeing the Wikipedia page would point me towards a piece of music I hadn't heard before, I classified it as about music. This took around 40 minutes, and I learned a lot about the difficulties of classification problems.
All told, 74 of the 1000 Wikipedia pages were about music. Because this is a random sample,2 we can use a proportion z-test to get a confidence interval on the true proportion of articles that would be classified. Using some statistics, we get a 95% confidence interval of 5.8%-9.0% as the true proportion (calculation in footnote).3
Thus, we're 95% confident that it'll take on average between 1/0.090 ≈ 11.1 and 1/0.058 ≈ 17.2 tries to get to a music page. Not that bad, all things considered!
But what about worst case scenarios? Well, the probability that we look at $n$ pages and they all turn out to be non-music is (1-p)^n, so the probability we'll see a page within n pages is 1-(1-p)^n. The graph of this looks like:
A quick calculation shows that (with 95% confidence), the probability of getting an article within 20 articles is between (1-(1-0.058)^20, 1-(1-0.090)^20) = (0.695, 0.848); i.e. between a 70% change and an 85% chance.4 What about within 50 tries?5 Between 95.0% and 99.1%. After 100 attempts, the probability is between 99.7% and 99.99%.
In other words, even if our luck is bad and the proportion of wikipedia pages about music is at the bottom of our confidence interval, there's a 95% chance it will take 50 random pages or fewer to run into a music page.
Notes
-
Why you can access here! As you can see from the GitHub, I am currently writing a script to do all of this programmatically. I will talk more about the details in a future post. ↩
-
Well ... close enough to a random sample that it doesn't matter. There are two problems. First, selection is not completely random. The algorithm to choose a page generates a random number x between 0 and 1. Every wikipedia article has a distinct, random value between 0 and 1 associated with it, and the algorithm returns the article next up in value from x. The gaps between article numbers aren't uniform, so some articles are more likely to be chosen than others. Source. Second, when the Wikimedia API receives a batched random page request, it randomly chooses the first article and then chains the remainder deterministically. (E.g. if two API calls happened to both choose the page on the Cleveland Browns as the first to be returned, then the whole list of random pages will be identical. I think any given list behaves like a random list however. Source While both these caveats make the data nonrandom, it's biased in sufficiently arbitrary (and non-biased) ways that I decided to ignore it. ↩
-
Given a sample proportion p, the Standard Error is sqrt(p(1-p)/n) = sqrt(0.074(0.926)/1000) = 0.008$. We then note that our 95% confidence interval is 0.074 ± 1.96S.E = (0.058, 0.090). ↩
-
We can convert our previous confidence interval for the true proportion of music articles $p$ into this probability by directly computing our function on the lower and upper bounds. This is because there is a 95% probability that $p$ lies within (0.058, 0.090), and so there is a 95% probability that f(p) = 1-(1-p)^20 lies between (0.695,0.844) because f(p) lies in that interval if and only if p lies within our confidence interval. See this stack exchange post for instance. ↩
-
Which is significant because the Wikimedia API will only batch 50 requests at a time. ↩