Improve the security posture of your Machine Learning solution without compromising on performance

“If you think technology can solve your security problems, then you don’t understand the problems and you don’t understand the technology.” — Bruce Schneier

In the fast-changing world of Machine Learning (ML), security, privacy and compliance are quickly becoming a real challenge. In this article, we will look at some of the best practices that can be adopted to address these challenges without compromising the power of ML. For an ML model to be successful, it needs reliable and accurate training data; that data needs to be real-world, or very close to it, for a better prediction ratio.

There are primarily three major variants of learning algorithm in play, each requiring anywhere from a minimal to a large set of human-annotated/labelled data:

Supervised learning: This requires fully manual data annotation/labelling by a human analyst. It entails a large quantity of manual work and can be expensive and time consuming. For instance, to label the cars in a 10-second video of 30 frames shot on a busy road, a human analyst would need more than 60 minutes on average.

Semi-supervised learning: This is a hybrid between labelled and unlabelled data. The model uses a small amount of labelled data to classify the unlabelled data, and over time the accuracy rate increases. Various methods can be used here, such as generative statistical models (the most common uses cluster-and-label), heuristic self-training, graph-based (regularised) methods and low-density separation.

Active learning: This approach is still evolving and requires far less labelling, as it keeps a human in the loop to label only the samples the model is least certain about (a minimal sketch follows below). However, for it to be effective it requires a noise-free environment, which can sometimes be a challenge.
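
To make the contrast concrete, here is a minimal active-learning sketch in Python using entropy-based uncertainty sampling with scikit-learn. The synthetic dataset, the seed of 20 labels, the batch size of 10 and the number of rounds are illustrative assumptions, not prescriptions:

```python
# Active learning by uncertainty sampling: train on the labelled pool,
# then ask the "human" for the samples the model is least certain about.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
labelled = np.zeros(len(y), dtype=bool)
labelled[:20] = True  # a small seed of human-labelled examples

for round_no in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])
    proba = model.predict_proba(X[~labelled])
    # Entropy of the predicted class distribution: higher = less certain.
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
    query = np.flatnonzero(~labelled)[np.argsort(entropy)[-10:]]
    labelled[query] = True  # in practice these go to a human annotator
    print(f"round {round_no}: {labelled.sum()} labels, "
          f"accuracy={model.score(X, y):.3f}")
```

For the semi-supervised case, scikit-learn also ships ready-made estimators such as SelfTrainingClassifier (heuristic self-training) and LabelPropagation (graph-based), which map onto the methods listed above.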

The purpose of this article is not to get into the deeper layers of ML, but to lay out some best practices that improve the security and compliance posture of an ML solution. This matters in particular where human involvement is required to label data for systems such as sentiment analysis, advanced speech transcription and image processing: for a system that requires domain-specific knowledge, it’s not always possible to completely eliminate manual labelling.

Security landscape

ML is relatively new and its attack vectors are still evolving, which poses a major challenge to both ML engineers and security professionals. Nevertheless, broadly speaking, there are three major known categories of vulnerabilities (excluding any weaknesses in the algorithm itself):

Image: OpenAI

1. Evasion — This is the most successful attack vector so far (demonstrated frequently at security conferences), as it works through obfuscation against signature/rule-based systems. Simply put, the idea is to fool the system by supplying adversarial input that tricks the classifier (a minimal sketch follows after this list). As depicted in the picture above, a standard washing-machine image is modified slightly to make the classifier believe it’s a loudspeaker. Examples of these attacks range from simple spam-filter evasion to serious attacks on healthcare and military systems. It’s one of the most difficult forms of attack to defend against.

2. Poisoning — This attack vector primarily targets the training data. The idea behind the attack is to get the model to classify malicious data as valid input (a toy demonstration follows after this list). It is considered a white-box attack, as the attacker needs knowledge of the system and its annotation/labelling cycles in order to manipulate it.

3. Data leak — Of the three, this can cause the most reputational and financial damage to an organisation. Imagine the organisation’s training data, containing sensitive financial and PII/PHI information, being leaked: the effect on the organisation could be devastating.
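
For the evasion case above, the canonical illustration is the Fast Gradient Sign Method (FGSM). The sketch below is a generic PyTorch version of the technique, not the exact washing-machine example from the image; `model`, the input `x`, its `label` and the epsilon value are placeholders to substitute for your own setup:

```python
# FGSM: nudge every input feature one small step in the direction
# that increases the classifier's loss, producing an adversarial input.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, label, epsilon=0.007):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()   # the adversarial perturbation
    return x_adv.clamp(0, 1).detach()     # keep pixel values in valid range
```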
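
For the poisoning case, a toy label-flipping demonstration shows how little manipulation of the training data it takes to degrade a classifier. The synthetic dataset and the 10% flip rate are illustrative assumptions:

```python
# Label-flipping poisoning: flip 10% of the training labels and
# compare test accuracy against a model trained on clean data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clean = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

rng = np.random.default_rng(0)
flip = rng.choice(len(y_tr), size=len(y_tr) // 10, replace=False)
y_poisoned = y_tr.copy()
y_poisoned[flip] = 1 - y_poisoned[flip]   # the attacker's manipulation

dirty = LogisticRegression(max_iter=1000).fit(X_tr, y_poisoned).score(X_te, y_te)
print(f"clean accuracy: {clean:.3f}, poisoned accuracy: {dirty:.3f}")
```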

Compliance landscape

In this day and age, every organisation must comply with various regulatory and legal requirements, and the rules surrounding privacy in particular are stronger than ever before. With ML being data-thirsty, it’s imperative to understand how your organisation’s training data is being handled. The compliance challenge is most prevalent when data annotation is handled by third-party sub-processors. Data annotation/labelling is a specialised area requiring domain expertise, and it’s not feasible for an organisation to hire a large number of data labellers in-house.

Defence strategy

Even though attack vectors for ML are still evolving and manual data annotation can get an organisation into hot water, there are fortunately some best practices we can follow to overcome the security, privacy and compliance challenges without having to spend a massive amount of time and resources.

Below are some of the best practices that can be adopted:

★ Put a legal agreement in place with the customer, clearly explaining the scope and purpose.

★ Complying with GDPR/CCPA and other privacy requirements is a must for any organisation that uses users’ data to train its models. If your organisation handles any form of call recording, then it has to satisfy one of the following conditions according to GDPR Article 6:

✓ The participants of the call have given consent to the recording

✓ Recording is necessary for the performance of a contract

✓ Recording is necessary for compliance with a legal obligation

✓ Recording is necessary in order to protect the vital interests of the participants

✓ Recording is necessary for the performance of a task carried out in the public interest or in the exercise of official authority

✓ Recording is necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the participant

Compliance engine framework

★ All training data has to be vetted through a DLP/compliance engine for sensitive/PII data redaction. For speech recordings this can be a challenge, as most compliance engines work accurately on text rather than on voice recordings. One simple way to work around this is to transcribe recordings before passing them through the compliance engine for validation; if an audio recording happens to contain sensitive information, it is discarded before it enters the training data for labelling (see the first sketch after this list).

★ Implement strong encryption at rest and in transit by following the NIST SP 800 series, in particular SP 800-111, SP 800-57 and SP 800-52 (see the encryption sketch after this list).

★ Make RBAC with MFA mandatory for everyone who has access to the training data.

★ Keep audit logs, with a regular review process.

★ Any third party/sub-processor should hold compliance certifications and attestations similar to those of the organisation supplying the training data for labelling.

★ For any image processing, utilise DeepPrivacy (https://github.com/hukkelas/DeepPrivacy) where possible to anonymise faces.

★ Disclose the names of all your third parties/sub-processors, with proper data classification, in your privacy data sheets.

★ Perform regular audits on your third-party/sub-processors to assess their operational hygiene

★ Create tooling to perform post-validation of the model against the data for poisoning by measuring the classifier delta (a sketch of this check follows after this list). There are some good open-source tools that can be used for this purpose, such as deep-pwning (it needs some improvement depending on your use-case requirements).

★ Create and maintain a dedicated threat model for ML. For guidance, there is a good paper, “Failure Modes in Machine Learning”, by Ram Shankar Siva Kumar, David O’Brien, Kendra Albert, Salome Viljoen and Jeffrey Snover.

★ Feed your model with potential abuse datasets. Abuse cases can be derived from threat modelling and attack-surface analysis.

★ Monitoring your ML models is an absolute necessity, not just for security but to gain insight into the overall health of your models. MLWatcher, released under the MIT licence, is one such tool (a minimal drift check is sketched after this list).
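
To make a few of the practices above concrete, here are some minimal sketches. First, the transcribe-then-redact gate for audio training data: `transcribe()` is a hypothetical stand-in for your speech-to-text service, and the regexes cover only a couple of obvious PII patterns where a real DLP engine would do far more:

```python
# Gate audio out of the training set if its transcript contains likely PII.
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-like numbers
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),       # card-number-like digit runs
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email addresses
]

def transcribe(audio_path: str) -> str:
    """Hypothetical speech-to-text call; replace with your provider's API."""
    raise NotImplementedError

def safe_for_training(audio_path: str) -> bool:
    text = transcribe(audio_path)
    return not any(p.search(text) for p in PII_PATTERNS)
```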
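
Next, encryption at rest: a sketch using AES-256-GCM via the `cryptography` package. Key storage and rotation (SP 800-57 territory) are out of scope here; in practice the key would live in a KMS or HSM, never in code:

```python
# Authenticated encryption of a training-data record with AES-256-GCM.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # fetch from your KMS in practice
aesgcm = AESGCM(key)

def encrypt_record(plaintext: bytes) -> bytes:
    nonce = os.urandom(12)                 # unique per message, never reused
    return nonce + aesgcm.encrypt(nonce, plaintext, None)

def decrypt_record(blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return aesgcm.decrypt(nonce, ciphertext, None)
```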
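
Third, the classifier-delta check against poisoning: retrain with the new data batch and compare holdout accuracy against a model trained only on trusted data. This sketches the idea rather than deep-pwning itself, and the 2% threshold is an assumption to tune for your use case:

```python
# Flag a new training batch if adding it drops holdout accuracy too far.
import numpy as np
from sklearn.base import clone

def delta_check(model, X_trusted, y_trusted, X_new, y_new,
                X_holdout, y_holdout, max_drop=0.02):
    baseline = clone(model).fit(X_trusted, y_trusted)
    candidate = clone(model).fit(np.vstack([X_trusted, X_new]),
                                 np.concatenate([y_trusted, y_new]))
    delta = (baseline.score(X_holdout, y_holdout)
             - candidate.score(X_holdout, y_holdout))
    if delta > max_drop:
        raise RuntimeError(f"possible poisoning: accuracy dropped {delta:.1%}")
    return candidate
```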
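
Finally, monitoring: a minimal drift check of the kind a tool like MLWatcher surfaces, comparing live prediction scores against a training-time reference with a two-sample Kolmogorov–Smirnov test (scipy). The 0.01 significance threshold is an illustrative assumption:

```python
# Alert when the live score distribution diverges from the reference.
from scipy.stats import ks_2samp

def check_drift(reference_scores, live_scores, alpha=0.01):
    stat, p_value = ks_2samp(reference_scores, live_scores)
    if p_value < alpha:
        print(f"ALERT: score drift detected (KS={stat:.3f}, p={p_value:.4f})")
    return p_value
```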

Conclusion

Active learning is the path many organisations are investing in heavily, owing to the hindrance of manual labelling. On the downside, however, it increases the attack surface unless the human-in-the-loop has a definition-of-done process to validate pre- and post-training data.

To conclude, for an organisation to be successful in adopting and developing ML models, it needs both a security and a compliance programme in place. This stands in contrast to other software verticals, where the bulk of the effort goes into security and minimal effort is dedicated to compliance.

Security Architect @ somewhere over the rainbow. All publications are personal and not related to my employment.