My Role
Data Scientist
Background
During a Combient Mix Hackathon, I worked with Data Scientists Nina Novak and Rahul Biswas on a solution to discrimination bias within datasets.
Pitch
Any discrimination based on any ground such as sex, race, colour, ethnic or social origin, genetic features, language, religion or belief,
political or any other opinion, membership of a national minority, property, birth, disability, age or sexual orientation shall be prohibited.
When data-driven decision-making first took off, we all hoped it would put an end to biased assumptions. We hoped that data would guide us to a more just future. We have come to realize that it is not that easy: datasets can be just as biased as humans.
If you google Grandma, this is what you will find.
These pictures are not a diverse representation of grandmas on a global scale. One reason is that there is no governance over data input to ensure that data is diverse and accurate. If you use this data to make decisions about grandmas, your results will be biased. In this example the problem is obvious, but what about your data? Hopefully you know, but too often you cannot, because you don't even know what all the columns mean in the first place. Can you be certain it is not biased, and do you have time to test it?
In an infamous 2015 incident, Google Photos' classification model labeled photos of Black people as gorillas. This was not the work of trolls; it was the work of a biased dataset.
Data cannot understand cultural sensitivity or what is offensive to us as humans.
Solution
We built a dashboard to help data scientists become aware of discrimination bias within a dataset: biases that would otherwise be hard to find.
We used two datasets to test our tool:
- Kaggle Heart Failure Data https://www.kaggle.com/andrewmvd/heart-failure-clinical-data
- US Adult Income Data https://archive.ics.uci.edu/ml/datasets/adult
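To give a flavor of the kind of check such a dashboard can surface, here is a minimal sketch of one common group fairness metric, the statistical parity difference. The column names echo the US Adult Income dataset above, but the rows are invented toy data, and this is an illustration of the general idea rather than our actual implementation.

```python
# Toy rows mimicking the Adult dataset's "sex" and income columns.
# These values are made up for illustration only.
records = [
    {"sex": "Female", "income_gt_50k": 0},
    {"sex": "Female", "income_gt_50k": 1},
    {"sex": "Female", "income_gt_50k": 0},
    {"sex": "Female", "income_gt_50k": 0},
    {"sex": "Male", "income_gt_50k": 1},
    {"sex": "Male", "income_gt_50k": 1},
    {"sex": "Male", "income_gt_50k": 0},
    {"sex": "Male", "income_gt_50k": 1},
]

def positive_rate(rows, group):
    """Share of rows in `group` with a positive outcome."""
    members = [r for r in rows if r["sex"] == group]
    return sum(r["income_gt_50k"] for r in members) / len(members)

# Statistical parity difference: P(y=1 | Male) - P(y=1 | Female).
# A value near 0 suggests parity; a large gap flags potential bias.
spd = positive_rate(records, "Male") - positive_rate(records, "Female")
print(round(spd, 2))  # 0.75 - 0.25 = 0.5
```

A dashboard can compute this per protected column and highlight the columns where the gap is largest, so the data scientist knows where to look first.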
Screenshots from Dashboard
Literature
- https://arxiv.org/pdf/1908.00176.pdf
- https://core.ac.uk/download/pdf/81728147.pdf
- https://arxiv.org/pdf/1904.10761.pdf
- https://fairmlbook.org/pdf/fairmlbook.pdf
- https://blog.dataiku.com/explaining-bias-in-your-data
- Proxies: https://papers.nips.cc/paper/2018/file/6cd9313ed34ef58bad3fdd504355e72c-Paper.pdf
- Proxies (2): https://arxiv.org/pdf/1707.08120.pdf
- https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598
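In the spirit of the proxy and predictive-power-score references above: even after dropping a protected column, a seemingly neutral column can still leak it. A crude way to test for this is to check how well the candidate proxy predicts the protected attribute compared with always guessing the majority label. The column names and toy values below are invented for illustration.

```python
# Toy (proxy_candidate, protected_label) pairs; invented data.
rows = [
    ("zip_11", "Female"), ("zip_11", "Female"), ("zip_11", "Male"),
    ("zip_42", "Male"), ("zip_42", "Male"), ("zip_42", "Female"),
]

def majority_baseline(labels):
    """Accuracy of always guessing the most common label."""
    return max(labels.count(l) for l in set(labels)) / len(labels)

def proxy_accuracy(pairs):
    """Accuracy when predicting the protected label from the proxy
    column via a per-category majority vote."""
    by_cat = {}
    for cat, label in pairs:
        by_cat.setdefault(cat, []).append(label)
    correct = sum(max(labels.count(l) for l in set(labels))
                  for labels in by_cat.values())
    return correct / len(pairs)

labels = [label for _, label in rows]
baseline = majority_baseline(labels)  # 0.5 for this balanced toy data
leaked = proxy_accuracy(rows)         # 4/6, i.e. about 0.67

# If the proxy beats the baseline, the column carries information about
# the protected attribute and deserves a closer look.
print(leaked > baseline)  # True
```

Scoring every column this way against every protected attribute gives a simple proxy screen, which is roughly what the predictive power score formalizes.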