Bosses love to hear the word “free.” Everyone wants to get something for nothing. The good news is that there’s a burgeoning collection of free data available for the taking. Some of it might even be useful for your project or your career.
What’s the catch? Sometimes there’s no catch at all. Many of the sources below come from government agencies. Once they’re done collecting the information, it often costs them very little to share it openly with everyone. Technically it’s not free because you’re paying for it on April 15th. But the good news is that your project budget won’t feel the pinch.
Other data collections are a subtle form of advertising. All of the major cloud companies host various collections of open data sets. You don’t need to use their cloud servers, but the performance will be that much better when the bits are stored in the same data centre.
The cloud companies could be purchasing 30-second spots on the Super Bowl, but this form of advertising is a better strategy for everyone.
The one danger with working with cost-free data is that the boss will assume that it’s also trouble-free. Many times the data will require a bit more work on your part. Perhaps the government agency that collected it liked to use its own peculiar format. Perhaps the data needs to be re-aggregated for your needs. There’s a good chance you’re going to need to write a bit of code to get it to work.
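That "bit of code" is usually just a small aggregation pass. As a sketch of what the work can look like, here is a minimal re-aggregation in Python using only the standard library; the CSV layout, column names, and figures are all hypothetical stand-ins for whatever format an agency actually publishes:

```python
import csv
import io
from collections import defaultdict

# Hypothetical example: an agency publishes monthly counts per county,
# but the project needs yearly totals per state. The column names and
# numbers below are invented for illustration.
raw = io.StringIO(
    "state,county,month,count\n"
    "OH,Franklin,2020-01,120\n"
    "OH,Franklin,2020-02,95\n"
    "OH,Cuyahoga,2020-01,210\n"
    "PA,Allegheny,2020-01,180\n"
)

totals = defaultdict(int)
for row in csv.DictReader(raw):
    year = row["month"].split("-")[0]  # roll months up into years
    totals[(row["state"], year)] += int(row["count"])

print(dict(totals))  # {('OH', '2020'): 425, ('PA', '2020'): 180}
```

The same dozen lines, pointed at a real download instead of an inline string, cover a surprising share of the cleanup these free data sets demand.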
Some of the data projects function like open source software and work best when everyone contributes their own small part. I have a weather station in my backyard hooked up to the Personal Weather Station network that gathers data from close to a quarter million different citizen scientists.
Participation is essential, but you’ll be able to leverage the work of everyone else at the same time. If your work is going to help build these projects, be prepared to pull your weight with project management.
The good news is that the barriers to entry are small. You don’t need to ask permission and you don’t need to beg forgiveness. Here are a dozen corners of the web where you can just start downloading and exploring.
Kaggle
Some of the data sources are not much more than a file repository. Kaggle is more of a cult. It started with more than 50,000 different data sets and then added the basic tools (Jupyter notebooks) for making sense of them.
There are already 400,000 public notebooks, shared by other data scientists, that analyse the data underneath. On top of that, Kaggle has added online courses on using everything and mixed in some competitions with real cash prizes.
For instance, Cornell’s Laboratory of Ornithology is offering $25,000 to the best classifiers for birdsong, or what they call “bird vocalisations.” The Open Vaccine initiative will award $25,000 to the best models for predicting the RNA degradation that could affect a Covid-19 vaccine.
There is plenty of serious work to be found among the CSV or JSON files, but if you grow tired you can also have some fun. One data collection, for instance, is filled with lines scraped from all of the Star Trek episodes from the six major series.
FiveThirtyEight
The FiveThirtyEight website is devoted to reporting stories with the support of a rich collection of data. When they can, they also share these data sets so you can do your own research. There are past records of their predictions for the major sports leagues, explorations of social attitudes like surveys of men asking what it means to be a man, and, of course, endless polls about upcoming political votes.
UNICEF
The UN agency responsible for helping raise healthy children around the world shares a wide variety of data sets that are useful to anyone with the same goals. The big picture can be found in marquee data sets like The State of the World’s Children 2019 Statistical Tables for those who want to track the change numerically. A more focused visualisation can be discovered in tables that explore how iodised salt affects disease or the success of primary education.
Financial and economic data
Ohio State’s library keeps a web page current with pointers to some of the biggest collections of economic and financial data. There are historical records of US data sets and also some data collected by the World Bank. Some require an academic account and some are free to the public.
Baseball
America’s sport is blessed by some fans who are adept enough with computers to develop extensive collections of data about the players and the results of their games. Sean Lahman’s database, for instance, contains complete batting and pitching statistics from 1871 through 2019.
There are also tables of other details like fielding statistics, managerial changes, and World Series results that may not be complete, but might as well be for the modern era, which in major league baseball begins with the 20th century.
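The Lahman data ships as flat CSV files, so even basic stats are a few lines of standard-library Python away. This sketch computes batting averages from a Batting.csv-style file; the column names follow the Lahman database's documented layout, but verify them against the copy you download, and the sample rows here are inlined for illustration:

```python
import csv
import io

# A tiny Batting.csv-style sample (hits and at-bats per season).
# Check these column names against the real Lahman download.
sample = io.StringIO(
    "playerID,yearID,AB,H\n"
    "ruthba01,1923,522,205\n"
    "ruthba01,1924,529,200\n"
)

for row in csv.DictReader(sample):
    avg = int(row["H"]) / int(row["AB"])  # batting average = hits / at-bats
    print(row["playerID"], row["yearID"], f"{avg:.3f}")
```

Swap the inline sample for `open("Batting.csv")` and the same loop walks every season back to 1871.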
Project Retrosheet was started to assemble play-by-play summaries of all major league games whenever possible, and it is now complete back to 1974. If you happen to have access to a scorecard from an earlier game, check the “most wanted” list to see if you can fill in a hole. The Chadwick Baseball Bureau maintains a GitHub repo for the data if you prefer.
Google
If you’re just looking for a particular data set, Google Dataset Search lets you search the entire web for data sets using keywords. The results can be filtered by license, data format, and the time since the last update.
Some of the most intriguing data sets are also included in Google’s public data directory, which not only lists the sources but offers some interactive dashboards. The World Bank, for instance, charts fertility versus life expectancy and you can track how this changes over the years with a slider.
Amazon Web Services (AWS)
AWS users who want data stored in S3 buckets can turn to the Registry of Open Data on AWS, or RODA. There’s a wide variety in the thousands of data sets, but the highlights tend to be the data sets from sources with which AWS is openly collaborating, like the Space Telescope Science Institute (stars), NOAA (NEXRAD weather radar imagery), and Common Crawl (more than 25 billion web pages).
Microsoft Azure
Microsoft also hosts a number of data sets on Azure. City planners can look for insight in the records from the New York City taxi board, which tracks all fares. Economists and traders can study price records for commodities for insight into inflation and economic changes. All are ready to be analysed by Microsoft’s machine learning tools.
Facebook
Some of what we store on Facebook is private because we make it so. Some is shared with friends. Some content is completely open.
Facebook supports research on the so-called “Facebook graph” with their Graph API. It’s not the same as downloading the entire data set, but it can be useful for some queries. Just remember that not everyone uses the same privacy settings, so you might not see every person or every post.
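A Graph API query is just an HTTPS request against a node with a field list and a token. As a sketch, here is one way to build such a request URL in Python; the endpoint shape follows Facebook's documented pattern, but the version string, node, fields, and token below are placeholders, not values from this article:

```python
from urllib.parse import urlencode

def graph_url(node, fields, token, version="v12.0"):
    """Build a Facebook Graph API request URL.

    The https://graph.facebook.com/<version>/<node>?fields=...&access_token=...
    shape follows the documented Graph API pattern; all arguments here
    are placeholders.
    """
    query = urlencode({"fields": ",".join(fields), "access_token": token})
    return f"https://graph.facebook.com/{version}/{node}?{query}"

url = graph_url("me", ["id", "name"], "YOUR_ACCESS_TOKEN")
print(url)
```

Fetching that URL (with a real token) returns JSON, subject to the privacy caveats above: fields a user has restricted simply won't be in the response.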
Yelp
The website known for reviews of restaurants, bars, and other public accommodations shares a great deal of the information in a public data set that you can study. There are more than eight million reviews of more than 200,000 establishments just waiting for you or your AI to parse them. They are a good source of training data for natural language processing and machine learning.
Open Data Kit
Open Data Kit is a collection of open source tools for gathering data in the field. The code lets you create a user interface that simplifies data collection by the front-line researchers and then begins the classification and cleaning workflow. The tools are used by a diverse group of organisations supporting field research, including the World Mosquito Project and the Red Cross.
Web scraping
Not all data reside in easily accessible databases with APIs. An enormous volume of information is embedded in web pages and the data needs to be pried out of them with some clever tools. This so-called web scraping is still a pretty good method, but it can have legal limitations.
Some sites ban it in their terms of service and others watch for too many requests from one user and then either cut off the user or slow down the responses.
Tools like Puppeteermake it simpler to spin up one (or many!) headless versions of a web browser, download a web page, extract the right data, and do it again and again. There are now headless versions for most major browsers, thanks to the software testing community that needs to automate the testing process.
Web scraping may not always be appropriate, but when it is it can be the fastest way to get the data you need. Nothing is more open than the open web.
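Puppeteer lives in the JavaScript world, but the extraction step itself needs nothing exotic. As a minimal sketch in Python's standard library, this parser pulls prices out of a page; the HTML is inlined so the example is self-contained, and the `span class="price"` markup is an invented stand-in for whatever structure a real site uses (which you would fetch with `urllib.request`, pausing between requests to avoid the rate limits mentioned above):

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text inside <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

# Inlined stand-in for a downloaded page; the markup is hypothetical.
page = ('<ul><li><span class="price">$4.99</span></li>'
        '<li><span class="price">$12.50</span></li></ul>')
parser = PriceParser()
parser.feed(page)
print(parser.prices)  # ['$4.99', '$12.50']
```

For pages that build their content with JavaScript, a headless browser like Puppeteer has to render the page first; for plain HTML, a parser like this is all the scraper you need.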