9 Positive Steps You Can Take to Keep AI Honest

By Phil Nadeau, March 6, 2018

Reducing bias in AI requires us to take a thoughtful look at how input data influences output in machine learning. The most common criticism of machine learning is that we don’t always understand why it behaves the way it does – and as with all computer programs, “if you put garbage in, you get garbage out.” Sometimes, stale or incomplete data (which contributes to bad statistical models) doesn’t get thrown out as often as it should. Fortunately, there are ways to address this.

In this article we’ll go into why machine learning and AI face certain challenges, and outline how you can plan ahead to prevent or mitigate these challenges. All of us can best serve the public trust by following the guidelines established here. Nonprofits, and especially activists, are the ideal organizations to demand AI that serves everyone, not just the privileged.

The “how did it get that answer?” issue.

Recurrent neural networks are popular because they are powerful, but it is difficult to understand how they arrive at a particular result. Unlike with a human expert, you can’t have a dialogue with a neural net to make it explain itself. This inscrutability introduces particular concerns in critical systems: systems that affect people’s lives, including credit scores, medical gear, industrial controls, and yes, defense systems. All of these problems are made worse when the programs have proprietary components that the public is barred from inspecting.

Consider a judicial aid trained on real-world court decisions. An increasing number of jurisdictions now use programs to assist in determining which defendants receive bail.

The systems vary in particulars, from the closed-source COMPAS system used in Wisconsin, to the open-source Public Safety Assessment Score adopted by courts in New Jersey. Court officers are cautiously optimistic, but a machine trained on our decisions will, of necessity, encode our prejudices and make the same mistakes. Some pre-AI judicial aids could even recommend denying bail based on ZIP code of residence, and I personally can’t see that as something a program should do. See here and here for articles on why bail itself is problematic.

You may wonder:

It’s a program. Can’t we just debug it?

In a word, no.

For most of the history of the computer, programs have been complicated, but data has been relatively sparse. Even so, we could determine a program’s intended behavior from its source code, which means the program in its preferred form for making changes. The complexity of a program is often estimated by counting lines of source. For example, the Linux kernel (which drives most servers on the Internet and every Android phone) contains millions of lines of code. As a human-readable format, the source code isn’t run directly; a compiler changes it into executable form.

The current revolution in AI involves small programs analyzing very large data sets to yield a model, which is the program that really makes predictions and decisions. The program for training and running a neural net is far smaller than the Linux kernel (Google’s TensorFlow is only 176,000 lines) though it may process tens of millions of inputs. For many machine learning algorithms, the training data is the only source code – and it’s damnably hard to predict how even small changes will affect behavior. Some image classifier algorithms can be fooled by a change as small as a single pixel.

Further, unlike a conventional program, training data is incredibly hard to debug. With a conventional program, we can use a debugger to see what is happening, a line at a time, and diagnose improper behavior. It’s not always easy, but it works. There’s simply no comparable tool for a deep learning system; working backwards from conclusion to premise is like unscrambling an egg. As a consequence, data scientists spend enormous effort cleaning, curating, and analyzing training data.

Bad Data Yields Antisocial Results

Okay, so we can’t debug AI. So what? If it’s trained on good data, won’t it give the same results as a human agent anyway?

Well, maybe, but because of bias, that may not be the result that you want. The problem of bias has two aspects:

    1. Statistical bias: your training data may not represent reality. This kind of bias is especially bad when working with rare events or small groups of people – the odds that they’ll be over-selected or under-selected become very high.

    2. Social bias: prejudicial data that paints a group of people in an unfairly negative or undeservedly positive light, such as by race or economic status or even ZIP code.

Algorithms trained on statistically or socially biased data will necessarily give socially biased results. So while statistical bias is just a fact of mathematics, to be controlled for but not feared, uncritical acceptance of the output of an algorithm can encode social bias into the very machinery of our daily lives.
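To make the statistical side concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: a population of 100,000 people, a group that makes up 2% of it, and a training sample of 200. It draws the sample many times and reports how far the group's share in the sample can drift from its true share.

```python
import random

def sampled_share(population, sample_size, rng):
    """Return the fraction of one random sample that belongs to the small group."""
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size

def share_spread(true_share=0.02, population_size=100_000,
                 sample_size=200, trials=1_000, seed=0):
    """Repeat the sampling and return the lowest and highest observed share."""
    rng = random.Random(seed)
    group_size = round(population_size * true_share)
    # 1 marks a member of the small group, 0 marks everyone else.
    population = [1] * group_size + [0] * (population_size - group_size)
    shares = [sampled_share(population, sample_size, rng) for _ in range(trials)]
    return min(shares), max(shares)

low, high = share_spread()
print(f"true share 2.0%, observed between {low:.1%} and {high:.1%}")
```

With a sample this small, some draws will contain several times more members of the group than the population does, and others far fewer. A larger or deliberately stratified sample narrows that spread, which is one reason to keep collecting data.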

How We Solve the Bad Data Problem to Use AI for Good

Community Sprints are an example of collaborating to build technology with the people who use it.

We do not yet have an AI that can detect or correct prejudice in other AI. At this time, the best way to ensure that AI is used responsibly is to insist that its builders commit to principles that never assume their own righteousness. Based on our studies, we suggest these nine principles:

1. Get outside the building and make observations.

Follow your AI out into the field and see how it affects the lives of case workers, volunteers, or “data subjects,” as some call them. Watch what it’s doing. Put boots on the ground. Get your developers and data scientists out of the lab and talking to people.

2. Get input from people your AI will affect.

Find the people who are impacted by your AI, even if they aren’t your customers, and make their input part of your development process. These people are also stakeholders, whether you knew it or not. Respect them and listen carefully to what they have to say. For example, NPSP and HEDA are developed as open source platforms with input from the community at multiple annual events called Community Sprints. Create an “unconference” where you invite a diverse group to come together and let the people who show up share their knowledge.

3. Always be collecting (and labeling) new data.

This includes new examples of existing types, and maybe entirely new feature sets. By analogy to databases, new examples means new rows; new feature sets would correspond to new columns. You can also backfill new columns in existing examples to improve your results.
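The row and column analogy can be sketched in a few lines of plain Python. The field names (`visits`, `renewed`, `years_as_member`) and the values are hypothetical, invented only to show the shape of the data.

```python
# Each training example is a row; each feature is a column.
examples = [
    {"id": 1, "visits": 3, "renewed": True},
    {"id": 2, "visits": 0, "renewed": False},
]

def add_examples(rows, new_rows):
    """New examples: append rows that share the existing columns."""
    return rows + list(new_rows)

def add_feature(rows, name, values_by_id, default=None):
    """New feature: add a column, backfilling existing rows where a value is known."""
    return [{**row, name: values_by_id.get(row["id"], default)} for row in rows]

# A newly collected example (a new row).
rows = add_examples(examples, [{"id": 3, "visits": 7, "renewed": True}])

# A newly collected feature (a new column), so far known for only some rows.
years_by_id = {1: 4, 3: 11}
rows = add_feature(rows, "years_as_member", years_by_id)
```

Rows that have not been backfilled yet carry an explicit `None` here, so a missing value stays visible instead of being silently treated as zero.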

4. Treat your process like a clinical trial.

Conduct proper testing, including separation of training and test data and multifold cross-validation. Do you know the error rate? The number of true positives compared to false positives? The number of false negatives compared to false positives? The consequences to your stakeholders (possibly including the general public) of each type of error? If you can’t put a tight statistical bound on these numbers, then you may want to rethink a public release, or collect more data.
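The questions above map onto a handful of counts. The sketch below is one way to compute them for a binary classifier, along with a simple multifold split; the labels are made up, and the helper names are ours rather than any particular library's.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["tn" if a == 0 else "fn"] += 1
    return counts

def error_rates(counts):
    """Turn the raw counts into the rates a reviewer will ask about."""
    total = sum(counts.values())
    positives = counts["tp"] + counts["fn"]
    negatives = counts["tn"] + counts["fp"]
    return {
        "error_rate": (counts["fp"] + counts["fn"]) / total,
        "false_positive_rate": counts["fp"] / negatives if negatives else 0.0,
        "false_negative_rate": counts["fn"] / positives if positives else 0.0,
    }

def k_fold_indices(n, k):
    """Yield (train, test) index lists; each example is held out exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Made-up ground truth and predictions for eight test examples.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
rates = error_rates(counts)
```

In practice you would shuffle the examples before splitting them into folds, and weigh each kind of error by what it costs the people affected, since a false positive and a false negative rarely do the same harm.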

5. Give your data, and the resulting models, an expiration date.

You throw out stale milk when it smells bad, so why wouldn’t you toss old data when it’s no longer relevant? The real world is changing all the time. You should re-validate every training example against real-world conditions, and remove it when it is no longer relevant. Similarly, have a plan to replace and retire your old models before they start making mistakes. Don’t keep using models that were built from incomplete, obsolete, or flawed data.
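An expiration date can be as simple as a timestamp and a rule. This is a minimal sketch in which the shelf lives (one year for an example, 180 days for a model) and the `validated_on` field are assumptions; the right values depend on how quickly your part of the world changes.

```python
from datetime import date, timedelta

MAX_EXAMPLE_AGE = timedelta(days=365)  # assumed shelf life of a training example
MAX_MODEL_AGE = timedelta(days=180)    # assumed shelf life of a trained model

def fresh_examples(examples, today, max_age=MAX_EXAMPLE_AGE):
    """Keep only examples that were re-validated within max_age of today."""
    return [e for e in examples if today - e["validated_on"] <= max_age]

def model_expired(trained_on, today, max_age=MAX_MODEL_AGE):
    """Return True when a model is past its shelf life and should be retired."""
    return today - trained_on > max_age

today = date(2018, 3, 6)
examples = [
    {"id": 1, "validated_on": date(2018, 1, 15)},
    {"id": 2, "validated_on": date(2016, 6, 1)},  # stale: not re-validated in over a year
]
kept = fresh_examples(examples, today)
needs_retraining = model_expired(date(2017, 6, 1), today)
```

Running a check like this on a schedule, and before every retraining, turns "have a plan" into something that can actually fail a build.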

6. Be respectful of privacy and practice good data hygiene.

You don’t have clothes from the 1980s still at your house, do you? Similarly, if data is no longer needed, delete it! It’s even more important than being fashionable.

If your data is about people, then those people have a stake in your policies and procedures, and you have a moral obligation to them. Salesforce follows a rigorous discipline about personal data, and so should you. According to the SEC, even auditing firms aren’t supposed to keep more than 7 years of client data. The new GDPR regulations in Europe are even more stringent. Organizationally speaking, don’t keep personal data if you don’t need it, or if the people involved object to you using it, or if they didn’t give clear and informed consent. Treat all personal data as need-to-know information. Only grant enough access for the scope of the job, and only for as long as necessary for the task at hand. Store personal data only on hosts you control, ruthlessly erase it when you’re done, and encrypt data at rest.

Be aware that even derived data may not be innocuous: you can de-anonymize a person based on very limited observations of their daily travel habits, and even ad-clicks can be notoriously embarrassing. When in doubt, assume that anything a user typed in (or clicked, or tapped) is sensitive personal data.
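One rough way to see the de-anonymization risk is to count how many records become unique once a few harmless-looking fields are combined. The sketch below does that for made-up travel records; the `home_area` and `work_area` fields are hypothetical, and this is only a simple uniqueness check in the spirit of k-anonymity, not a full privacy analysis.

```python
from collections import Counter

def group_sizes(records, fields):
    """Count how many records share each combination of values for the given fields."""
    return Counter(tuple(r[f] for f in fields) for r in records)

def smallest_group(records, fields):
    """Size of the smallest group; 1 means at least one person stands alone."""
    return min(group_sizes(records, fields).values())

def unique_records(records, fields):
    """Return the records that no other record matches on the given fields."""
    sizes = group_sizes(records, fields)
    return [r for r in records if sizes[tuple(r[f] for f in fields)] == 1]

# Made-up "anonymous" travel records: no names, only coarse areas.
trips = [
    {"home_area": "A", "work_area": "X"},
    {"home_area": "A", "work_area": "Y"},
    {"home_area": "B", "work_area": "X"},
    {"home_area": "B", "work_area": "X"},
]
by_home = smallest_group(trips, ["home_area"])
by_home_and_work = smallest_group(trips, ["home_area", "work_area"])
exposed = unique_records(trips, ["home_area", "work_area"])
```

On this toy data every home area is shared by two people, but adding the work area leaves two of the four records unique, and a unique record is one step away from a name.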

7. Share data that can be shared.

If your data isn’t covered by Rule 6, consider open-sourcing it. As we’ve pointed out, curated and labelled data is as good as source code in machine learning, so contributing data to the community helps everyone. Nonprofits may have mission-related data that doesn’t exist anywhere else, and access to it may help scientists, policymakers, and other nonprofits.

8. Train your people.

Make sure that your personnel are trained on all seven of the foregoing points. Hold each other accountable for following these principles. No single person should have all the power over the system; similarly, all the people acting as “boots on the ground” under Rule 1 should share responsibility for making these rules part of your culture. Make sure your technical people participate in any community events or unconferences suggested under Rule 2.

9. Speak up.

Finally, when you encounter an agency that is using AI, speak up and ask that their system consider the public interest. Ask detailed questions about any of the preceding eight points.

  • Does the agency respect their stakeholders?
  • Are all components and data open to inspection? How big is the “black box”?
  • Is there a transparent feedback and redress process in place?
  • Finally, and most importantly, is everyone involved trained in the proper use of AI and the ethical consequences thereof?

Speculation: Maturing and Insuring AI

Perhaps we should build critical decision AI in the same way that we create pharmaceuticals – with clinical trials, with careful controls, and at every stage, oversight by persons familiar both with medicine and clinical ethics. Just like pharmaceuticals, AI should come with an expiration date. Real-world conditions are constantly changing, and for any problem of consequence, an AI model is going to become less valid over time.

We also need to consider what happens when things go wrong. Who is liable when AI makes a bad decision? The author of the library? The vendor of the device? The owner or operator? Vendors might not want to reveal their trade secrets, but in every other critical mission, a grant of public trust only comes after extensive testing and trials. The FDA doesn’t allow the sale of a drug until there’s credible evidence that the risk of the drug making the problem worse is acceptably small (and, ideally, minimal). Answering the liability question is beyond the scope of this article, but we suggest that AI won’t really be ready for critical missions until insurers are willing to underwrite those risks. The author humbly suggests that the real sign of maturity in the field will only come when “AI Insurance” is both a reality and a mundanity – not a joke or a novelty.

The bottom line is: Even if you don’t develop AI in your organization, demand accountable AI whenever you encounter it!

About the Author
Phil Nadeau is a lead member of the technical staff at Salesforce. In 2017, he was a Technology Fellow in Artificial Intelligence. He started using Linux 25 years ago and has been working in software development for almost as long. Phil has written tens of thousands of lines of code with the LAMP stack (Linux, Apache, MySQL, Perl), Java, C, and a variety of other languages. In 2012 he graduated from Western Washington University with a Master of Science degree. He enjoys helping make sense of the Internet, using Java, Scala, Spark and Python primarily for his work in search engineering. One of the highlights of his career was working as a programmer on a machine vision experiment at Bell Laboratories that controlled video games using a motion capture system made from vintage Silicon Graphics workstations and old analog video cameras. For previous work, see Phil’s blog posts “Why You Should Care about AI” and “Where Did AI Come From.”