30 BEST OPEN DATA SOURCES FOR DATA SCIENTISTS
To become proficient in data science, you need to practice on a variety of projects. Data science is best learned by doing. To practice, build models and visualize data, the first thing you need is datasets; without data, there is no data science. With that in mind, we have scoured the internet and compiled a list of thirty open data sources that cover almost every subject.
The projects you create while learning may be useful when you start your job hunt. Keep every project you complete. These are your portfolio projects, and you can present them to potential employers.
Go through the list of websites below to find datasets you may be interested in.
1) KAGGLE
Kaggle is probably the largest online data science community. The Google-owned platform lets users find and publish datasets, build models and work with other data scientists in a web-based environment. Apart from the thousands of datasets available for practice, Kaggle also runs data science challenges and competitions that can enhance your learning experience. There are also data science tutorials for beginners and Kaggle kernels, a cloud-based workbench that lets you share your projects in Python and R.
If you have mastered some data science concepts and would like to get a job, the Kaggle jobs board is there to help. If you are interested in learning data science, make Kaggle your friend and you will never get lost.
2) UCI MACHINE LEARNING REPOSITORY
This is a great site to get datasets for your machine learning projects. It is widely used by students, researchers and educators all over the world. The data is neatly categorized by data type, attributes and the area the data comes from. There is plenty of data in areas such as the sciences, business and games. All you have to do is search for the datasets you are interested in.
3) WORLD BANK
The World Bank publishes a huge amount of data on different countries and regions. There is data on census results, demographics, health, agriculture, income, GDP and more. It is a great platform where you can search for almost any kind of data you are interested in.
4) IMF
The International Monetary Fund (IMF) has most of the financial data you need. Data on IMF lending, exchange rates and other economic and financial indicators is available for all member countries. If you are doing projects in financial modelling and analysis, you should check out this website.
5) AMAZON REVIEWS
Amazon is the world's largest online marketplace with millions of visitors every month; as a result, a colossal amount of data is generated daily. This dataset consists of close to thirty-five million consumer reviews, with product ratings and user information, spanning a period of 18 years up to 2013. You can go through the various categories and practice while learning.
6) CLIMATE DATA ONLINE
Climate Data Online (CDO) provides free access to NCDC’s archive of global historical weather and climate data in addition to station history information. These data include quality controlled daily, monthly, seasonal, and yearly measurements of temperature, precipitation, wind, and degree days as well as radar data and 30-year Climate Normals. Customers can also order most of these data as certified hard copies for legal use.
7) US CENSUS DATA
The United States Census Bureau provides data about the US population, economy, housing and workforce, along with other facts and figures. You can obtain these and more datasets from the Census Bureau website.
8) DATA.GOV
Managed and hosted by the U.S. General Services Administration, Technology Transformation Service, this is another huge source of open data available for research, data manipulation and visualization. Data on climate, agriculture, local governments, maritime activity and more is available. You can search by the keywords you are interested in. Some datasets are downloadable while others are links to websites or apps that help you access or use the data.
Website: Data.gov
9) BUREAU OF ECONOMIC ANALYSIS
The Bureau of Economic Analysis (BEA) is an agency of the US Department of Commerce. Data on US gross domestic product (GDP), foreign trade and investment, and industry is available for research and analysis on this website.
10) UK DATA SERVICE
The UK Data Service was created to meet the data needs of researchers, students and people from all sectors including academia, central and local governments, charities and foundations, independent research centers, business consultants and commercial sectors. There are UK government-sponsored surveys, UK census data, business data and qualitative data. The data here is available for anyone to use provided you register.
11) BUREAU OF LABOR STATISTICS
The Bureau of Labor Statistics has data on labor market activity, working conditions, price changes, inflation, pay and benefits, and productivity in the US economy.
Website: Bureau of Labor Statistics
12) ENRON EMAIL DATASET
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. You can check the website below for more details.
13) FEDERAL RESERVE
You can access over 500,000 financial and economic data series from more than 85 public and proprietary sources based in the United States. Data on currency, interest rates, inflation etc. is also available.
14) OPEN DATA FOR AFRICA
If you are looking for datasets specific to Africa, there is plenty on this website. Data on energy, infrastructure, monetary statistics, governance, the environment and more can be found here. You can browse the data by country or search for what you are interested in.
15) GROUP LENS
GroupLens Research has made available rating datasets from the MovieLens website. At least 25 million movie ratings are available on this site. If you are interested in movie analysis, check out this site.
16) QUANDL
Quandl has a huge amount of financial and economic data. If your projects are centred around financial analysis, this is where you can find data that may help you.
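A ratings file like this can be summarized with the Python standard library alone. The sketch below assumes a CSV with `userId`, `movieId` and `rating` columns; that layout, the made-up sample rows and the helper name `mean_rating_per_movie` are assumptions for illustration, so check the README of the release you download.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows for illustration. For a real file, pass
# open(path, newline="") instead of the StringIO object.
SAMPLE = """userId,movieId,rating,timestamp
1,10,4.0,1000000000
2,10,3.0,1000000001
1,20,5.0,1000000002
"""

def mean_rating_per_movie(lines):
    """Return {movieId: mean rating} from CSV lines with a header row."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        movie_id = int(row["movieId"])
        totals[movie_id] += float(row["rating"])
        counts[movie_id] += 1
    return {movie_id: totals[movie_id] / counts[movie_id] for movie_id in totals}

means = mean_rating_per_movie(io.StringIO(SAMPLE))
print(means)
```

With the three sample rows, movie 10 averages 3.5 and movie 20 averages 5.0. The function reads the file row by row, so it does not need to hold every rating in memory at once.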
17) YELP OPEN DATASETS
The Yelp dataset is a subset of Yelp's businesses, reviews, and user data for use in personal, educational, and academic purposes. It is available as JSON files, and you can use it to teach students about databases, to learn NLP, or as sample production data while you learn how to make mobile apps. The dataset contains millions of reviews.
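Large JSON datasets are often shipped with one JSON object per line, which lets you process them a record at a time. The sketch below assumes that layout; the field names `business_id`, `stars` and `text`, the sample records and the helper `read_json_lines` are illustrative assumptions, not a guaranteed schema.

```python
import io
import json

# Invented sample records standing in for a real file opened with open(path).
SAMPLE = io.StringIO(
    '{"business_id": "b1", "stars": 5, "text": "Great coffee"}\n'
    '{"business_id": "b2", "stars": 2, "text": "Slow service"}\n'
    '\n'
    '{"business_id": "b1", "stars": 4, "text": "Friendly staff"}\n'
)

def read_json_lines(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)

reviews = list(read_json_lines(SAMPLE))
# Keep the text of reviews rated four stars or higher.
high_rated = [r["text"] for r in reviews if r["stars"] >= 4]
print(len(reviews), high_rated)
```

Because `read_json_lines` is a generator, you can also loop over it directly instead of building a list, which helps when the file is too large to fit in memory.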
18) MOVIE REVIEW DATASETS
This is a dataset for binary sentiment classification. There are 25,000 movie reviews and you can use them in your projects.
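To make "binary sentiment classification" concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The four training reviews are invented for illustration, and the function names are hypothetical; a real project would train on the downloaded reviews and would more likely use a library such as scikit-learn.

```python
import math
from collections import Counter

# Toy training data, invented for illustration: 1 = positive, 0 = negative.
TRAIN = [
    ("a wonderful and moving film", 1),
    ("great acting and a great story", 1),
    ("boring plot and terrible acting", 0),
    ("a dull and terrible film", 0),
]

def train_naive_bayes(examples):
    """Count words per class and documents per class."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab

def predict(text, model):
    """Return the label with the higher log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        score = math.log(doc_counts[label] / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

model = train_naive_bayes(TRAIN)
print(predict("a great and moving story", model))
print(predict("terrible and boring", model))
```

On this toy data the first review is classified as positive (1) and the second as negative (0). Working in log-probabilities avoids numeric underflow when reviews are long.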
19) MICROSOFT COCO
COCO has about 330,000 images, most of them labeled. You can download the dataset and explore it.
20) TWITTER SENTIMENTS
If you are doing sentiment analysis projects, be sure to check out this website, which has a large collection of resources you may find valuable.
21) AIRBNB
The data behind the Inside Airbnb site is sourced from publicly available information from the Airbnb site.
The data has been analyzed, cleansed and aggregated where appropriate to facilitate public discussion. The dataset on this website is available under a Creative Commons license.
22) NIST
The National Institute of Standards and Technology (NIST) has some datasets you can explore.
23) REDDIT
A Reddit community where users find, share and discuss datasets. You can join the community and post a request for the datasets you are looking for; other users will help you.
24) IMAGENET
This is an image database with more than 14 million images available for researchers, educators and students. If you are doing an image classification project, you can check out this site.
25) GOOGLE
Google's Open Images dataset has approximately 9 million URLs to images that have been annotated with labels spanning over 6,000 categories.
26) BELGIAN TRAFFIC SIGNS
The dataset here is for traffic sign recognition.
27) STANFORD DOGS
The Stanford Dogs dataset contains images of 120 breeds of dogs from around the world. You can download them and use them in your projects.
28) BERKELEY DEEPDRIVE
This is probably the largest driving video dataset, with 100,000 videos and 10 tasks to evaluate. You can download it, create models and train algorithms.
29) UCSD LISA
The Laboratory for Intelligent and Safe Automobiles has a large number of datasets on traffic signals, vehicle detection and more.
30) INDOOR SCENE RECOGNITION
This is a database containing 67 indoor categories and a total of 15,620 images. The number of images varies across categories, but there are at least 100 images per category.