Analyzing the Web For the Price of a Sandwich
-
Ben C., Software Engineer
- Mar 20, 2015
I geek out about the Common Crawl. It’s an open source crawl of huge parts of the Internet, accessible for anyone to use. You have full access to the HTML and text of billions of web pages. What’s more, you can scan the entire thing, tens of terabytes, for just a few bucks on Amazon EC2. These days they’re releasing a new dataset every month. It’s awesome. People frequently use mrjob to scan the Common Crawl, so it seems like a fitting tool for us to use. mrjob, if you’re not familiar, is a Python framework written by Yelp to...