Analyzing data may be the main job of data scientists, but keeping it secure and private is not far behind. This is not easy—data is more complex and shared more widely than ever before, making it exponentially more vulnerable to today’s latest threats.
Of course, this is a significant concern for healthcare organizations and other businesses, especially given the heightened fines imposed by regulations like HIPAA and the GDPR (not to mention the operational and reputational damages, which have been rising, too).
As a result, data scientists and organizations are increasing their focus on security. For organizations, this means implementing stronger data security protections across the board. For data scientists, it means eliminating the risk of exposing the sensitive data within datasets, while still being able to understand and leverage that data.
The key to this can be data masking (closely related to data anonymization and pseudonymization, though those terms carry distinct legal meanings under the GDPR).
What is data masking?
Data masking is defined as “a technology aimed at preventing the abuse of sensitive data by giving users fictitious (yet realistic) data instead of real sensitive data,” according to Gartner.
Data masking applies transformation algorithms and character-swapping techniques to replace sensitive data with similar-looking fake data, so that anybody viewing the dataset wouldn’t know it had been masked. This lets data scientists avoid exposing sensitive values while still being able to interpret and analyze the data.
Unfortunately, data masking also reduces visibility into the data, creating an obvious conundrum for organizations. Data scientists need to find a balance: ensuring sensitive data is protected without masking so much data that the dataset becomes difficult to analyze.
Further complicating matters, data security and privacy regulations are expanding the scope of sensitive data that needs to be concealed, covering broad categories of personal information in addition to traditionally private records. Organizations are now responsible for concealing many forms of sensitive data, including:
- Personally identifiable information (subject to the GDPR and many other regulations)
- Protected health information (subject to HIPAA regulation)
- Payment card information (subject to PCI-DSS regulation)
- Intellectual property (subject to ITAR and EAR regulations)
These regulations also reduce how long personal data can be stored, limit international data sharing and require all-around data minimization, which involves collecting, storing and using the minimum amount of data needed. As a result, data scientists need to be even more thorough and diligent with data masking.
Data masking vs. encryption
To clarify a common misperception, data masking and encryption are not synonymous. Both are aimed at ensuring security and privacy by concealing sensitive data, but while masking converts sensitive data into similar-looking fake data, encryption converts sensitive data into scrambled, often-unreadable data.
The fundamental difference is reversibility. Encryption must be reversible, since encrypted data is useless unless it can be decrypted (with the decryption algorithm and the original encryption key). For data masking, reversibility is a weakness: if the fake data can be converted back into real data, the sensitive data is vulnerable to exposure.
Unlike encryption, masking enables data scientists to interpret and analyze data while it’s concealed, which makes it far more useful for their work. This is because masking changes data values but leaves data formats unchanged.
For example, credit card numbers have a 16-digit format that looks like this: 1234-5678-9123-4567. Masking data changes the numbers, but maintains the same 16-digit format. Using the example above, the masked credit card number could become: 9876-5432-1987-6543. Data masking uses several methods to alter sensitive data, including character or number substitution, character shuffling, or the use of algorithms to generate random data that has the same properties as the original data.
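To make the substitution method concrete, here is a minimal Python sketch of format-preserving masking. The function name `mask_card_number` and the keyed-hash scheme are illustrative assumptions for this article, not a production-grade format-preserving encryption algorithm:

```python
import hashlib

def mask_card_number(card: str, secret: str = "demo-secret") -> str:
    """Replace each digit with a pseudorandom digit derived from a secret,
    keeping the 16-digit, dash-separated format intact.
    Illustrative only; not a vetted format-preserving scheme."""
    masked = []
    for i, ch in enumerate(card):
        if ch.isdigit():
            # Derive a replacement digit deterministically from the
            # secret, the character's position, and its original value.
            digest = hashlib.sha256(f"{secret}:{i}:{ch}".encode()).digest()
            masked.append(str(digest[0] % 10))
        else:
            masked.append(ch)  # keep separators so the format is unchanged
    return "".join(masked)

original = "1234-5678-9123-4567"
print(mask_card_number(original))  # same dash-separated 16-digit format
```

Because the substitution is keyed by a secret, the same input always masks to the same output, while anyone without the secret cannot trivially reverse the mapping.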
Data masking best practices
Data scientists need to follow the same security practices as any IT professionals, such as using strong authentication, avoiding sharing passwords, and only working in protected environments (e.g., only working from personal devices if they are secured).
For data masking, there are a few things that should be kept in mind:
- Only mask sensitive data—this of course varies widely, but it’s often just three or four columns of the dataset. No need to waste resources masking and reducing visibility around non-sensitive data.
- Use a strong masking algorithm—or a variety of complex character substitutions to ensure the data is masked sufficiently. If it’s reversible by others, it’s ineffective.
- Maintain the same data structures—not just the structures, but the relationships between database rows, columns and tables. This is necessary to keep the relationships between values preserved after the data is masked.
- Ensure the range and distribution of values are kept within realistic limits—for example, when masking birthdates, the data shouldn’t indicate that employees are 200 years old.
- Test your results—if you don’t get the desired results in tests, you can restore the data to its pre-masked state and tweak the masking algorithms.
- Create a repeatable process—you should be able to repurpose this for other similar datasets. This also ensures any new data added to the database is masked consistently.
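Several of the practices above (masking only the sensitive columns, preserving relationships between rows, and keeping the process repeatable) can be sketched with a deterministic keyed-hash substitution in Python. Everything here, including `deterministic_mask` and the sample rows, is an illustrative assumption rather than a prescribed tool:

```python
import hashlib

def deterministic_mask(value: str, secret: str = "demo-secret") -> str:
    """Map a value to a stable pseudonym: the same input always yields
    the same output, so joins across rows and tables keep working
    after the data is masked."""
    digest = hashlib.sha256(f"{secret}:{value}".encode()).hexdigest()
    return f"user_{digest[:8]}"

# Hypothetical dataset: only the sensitive 'name' column is masked;
# non-sensitive columns are left untouched.
rows = [
    {"name": "Alice Smith", "visits": 3},
    {"name": "Bob Jones", "visits": 1},
    {"name": "Alice Smith", "visits": 2},  # repeated value
]
masked_rows = [{**r, "name": deterministic_mask(r["name"])} for r in rows]

# The repeated name maps to the same pseudonym, preserving relationships.
assert masked_rows[0]["name"] == masked_rows[2]["name"]
```

Because the mapping is a pure function of the value and the secret, rerunning the process on new data added later produces consistent pseudonyms, which is what makes the process repeatable.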
Data masking will only become more important given the growing threat of cyberattacks, and more challenging given evolving data privacy regulations. Going forward, data scientists will need to balance dedication to their data masking processes with the flexibility to accommodate inevitable changes in how they’re expected to mask data.
Date: May 13, 2019
Source: Health Data Management