The first step in any Machine Learning project is to state clearly the problem you want to solve. Usually, the problem comes from a business or a domain such as marketing, finance, medicine, or engineering. Thus, we must state the problem in terms of that business or domain and then map it to a typical Machine Learning task, such as classification or regression.
Once we are clear about the problem, we need to get the data required to solve it. Data is the feedstock for Machine Learning models, and it often matters more than the algorithms you plan to use.
But building a dataset is not a trivial task. In a business environment, we need to reach out to other departments to get access to databases and documents, and then unify the different data sources. For personal projects, it can be even more difficult because sometimes we don’t have access to the data we need at all.
So if you are working on a project for your portfolio, it is a good idea to work with real, public datasets. Besides saving the time spent collecting and processing data, your work on open datasets can provide useful insights on matters of public interest.
In this post, we list some public repositories from which you can download datasets to use in your projects.
Collaborative Repositories
Universities and companies maintain these repositories, where researchers and practitioners from different areas and domains make their data available.
Government and political organizations
Governments and organizations keep portals of public datasets about the economy, education, health, agriculture, and other areas of public interest. Below are data repositories from some international organizations:
- UNdata (United Nations)
- World Bank Open Data
- Global Health Observatory data repository (WHO)
- UIS Statistics (UNESCO)
- IMF Data
Also, many countries maintain their own open data repositories.
Dataset search
Finally, you can search the Web for the data you need using the Google Dataset Search tool.
Conclusion
Good Machine Learning projects start with quality data. Use the lists and tools above to find the data you need for your project. You can also browse the repositories to find ideas for new projects.
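Once you have downloaded a file from one of these sources, a quick look at its size and missing values can help you judge whether the data is good enough to work with. The snippet below is a minimal sketch, assuming the dataset is a CSV file with a header row. The function name `summarize_csv` and the sample rows are hypothetical and made up for illustration; they do not come from any of the repositories above.

```python
import csv
import io


def summarize_csv(text):
    """Return row count, column names, and missing-value counts for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = row.get(name)
            # Treat absent or blank cells as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
    return {"rows": rows, "columns": columns, "missing": missing}


# Made-up sample standing in for a downloaded file (not real data).
sample = """region,year,value
north,2020,10
south,2020,
east,,30
"""

summary = summarize_csv(sample)
print(summary)
```

For a file on disk, you could read its contents first, for example with `pathlib.Path(path).read_text(encoding="utf-8")`, and pass the result to the function. Real datasets may use other formats or other markers for missing values (such as `NA`), so treat this only as a starting point.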