Best Practices for Training Data Management

If you develop machine learning models, you probably spend much more time thinking about your code than about how you store the training data that you feed into that code. The way your training data -- meaning raw data that you create and label yourself for algorithm training purposes, or produce through a tool like Watchful -- is stored or accessed may not seem very important relative to what your algorithms do with that data.

The reality, however, is that the way you manage training data has important implications for your ability to build effective machine learning models. To optimize the training process, it’s critical to store data in a way that ensures its reliability, security, and availability, even if the training data changes over time. With those needs in mind, here’s a look at five best practices for managing training data. Although they may not apply in all situations, depending on the type of machine learning algorithms you write and the kinds of data you use to train them, these guidelines will help you think about how to store training data in the most effective way.

#1 - Divvy Up Your Data

The simplest way to store training data may be as one large file. But this is not usually the best way, especially if you have a large volume of data. Instead, break down your data into discrete parts. You could divide up a large dataset by, for example, sectioning it based on which module of your algorithm the data helps to train, or when the data was produced.

Ideally, no one file will be larger than about 1GB. More than that, and it becomes harder to transfer data continuously over the network (because flaky connections might disrupt transfers of large files before they can complete). It may also be more difficult to drill down and access a particular part of your data if it all exists in one huge file. Even simply opening the file in order to read or edit its contents, or to feed it into an algorithm, can take a long time if the file is very large. Of course, some files could end up being quite large when they are initially produced (for example, continuously labeled data from an ETL pipeline will tend to be quite large), so you may need to transform or restructure the data once it is produced.
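As a rough illustration of restructuring an oversized file, here is a minimal Python sketch that splits a line-oriented file (such as CSV or JSON Lines) into parts that stay under a size limit. The function name, the part-naming scheme, and the default limit of roughly 1GB are assumptions made for this example, and the sketch assumes each record fits on one line.

```python
from pathlib import Path


def split_lines_file(source: Path, out_dir: Path, max_bytes: int = 1_000_000_000) -> list[Path]:
    """Split a line-oriented file into parts of at most max_bytes each.

    Lines are never cut in half. A single line longer than max_bytes
    is written whole to its own part, so that part may exceed the limit.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    current = None  # file handle of the part being written
    written = 0     # bytes written to the current part so far
    try:
        with source.open("rb") as src:
            for line in src:
                # Start a new part when the next line would overflow the limit.
                if current is None or (written > 0 and written + len(line) > max_bytes):
                    if current is not None:
                        current.close()
                    # Hypothetical naming scheme: <stem>.part00000<suffix>
                    part_path = out_dir / f"{source.stem}.part{len(parts):05d}{source.suffix}"
                    current = part_path.open("wb")
                    parts.append(part_path)
                    written = 0
                current.write(line)
                written += len(line)
    finally:
        if current is not None:
            current.close()
    return parts
```

Because the parts are written in order, concatenating them reproduces the original file byte for byte. Note that a CSV header row would appear only in the first part, so a real pipeline might need to repeat it in each one.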

In addition to making it easier to move data around, a strategy that avoids storing all data as one large blob also gives you more flexibility in how you store it. Some parts of your dataset might make more sense to store as simple files, for example, if they contain raw, unstructured data. Flat files are also easier to access and to move from one storage environment to another without having to perform data translations or migrate from one structure to another. Other parts that contain structured data or have rich metadata associated with them may be better stored in a database, which allows you to organize the metadata in a logical way. Databases also make it easier to join data with other sources and to “slice and dice” specific parts of the data.

The bottom line: Although the way you organize and structure your training data will depend on which kinds of data you maintain and how it is structured, you should try to avoid storing it as one huge file, which makes it unwieldy and more difficult to access.

#2 - Use Version Control for Your Training Data

Your training data will likely change over time. You may purge it of low-quality entries that you discover during the training process, for example, or you may add new data to help train your models more effectively. In order to ensure that you have a systematic record of how your datasets change over time, and that you can revert to an earlier version if necessary (which you may need to do if, for example, you add new data but then discover that it is ineffective for training your model), you should use some kind of version-control system for your data.

There are three basic ways to go about version-controlling your data. The first is to do it manually by taking periodic snapshots of the data and storing each one as a backup file that you can revert to later if needed. This approach is the simplest, but its main drawback is that it doesn’t give you a continuous record of changes to your datasets. You only get snapshots from particular points in time.
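The manual approach can be sketched in a few lines of Python. This is a minimal example, not a full backup tool: the function names and the timestamped directory naming are assumptions for illustration, and it copies the entire dataset each time, which may be impractical for very large data.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def snapshot_dataset(data_dir: Path, snapshot_root: Path) -> Path:
    """Copy data_dir into a new timestamped directory under snapshot_root."""
    # UTC timestamp with microseconds so successive snapshots get distinct names.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = snapshot_root / f"{data_dir.name}-{stamp}"
    # copytree creates the target (and missing parents) and fails if it already exists.
    shutil.copytree(data_dir, target)
    return target


def restore_snapshot(snapshot: Path, data_dir: Path) -> None:
    """Replace the contents of data_dir with the contents of a snapshot."""
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(snapshot, data_dir)
```

Calling snapshot_dataset before each significant change gives you restore points, but as noted above, anything that happened between two snapshots is not recorded.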

The second approach is to manage your data with an automatic version-control system. You could possibly do this with a tool like Git, which, although it was designed to manage source code rather than data, can nonetheless be used for datasets. (Git is especially effective if your data is text; things are trickier if you have non-text files.) Alternatively, you can store your data in a database, like MySQL or SQL Server, and use a database version control tool to manage it. The potential downside to this approach is that it forces you to store the data in a formal database, which is not ideal in all cases because it can be more difficult to move data from one type of database to another. Depending on how your machine learning model ingests the data, you may need to convert the data to a different format before you can feed it to your algorithms. On the other hand, as noted above, databases work well for situations where data needs to be joined with other data or restructured frequently.

The third approach is to store the data on a cloud-based storage service that offers built-in versioning, such as Amazon S3. This strategy automates the process, but there is a tradeoff: versioning on S3 and similar storage systems is designed more as a safeguard against accidental deletion than as a full-fledged version control system, so it is less fine-grained than it would be with a system like Git.

#3 - Back Up Training Data in Multiple Locations

You may think that if your training data is stored in the cloud, and/or if you manage it with a version-control system that allows you to revert to earlier versions, you don’t need to back it up. In reality, you absolutely should perform regular backups of your training data. Cloud storage services can and do fail. So can your version-control system if the servers or disks that power it go down.

To protect against these risks, keep your data backed up using the 3-2-1 backup rule, which states that you should maintain three distinct copies of your data at all times, using at least two storage media (such as local disks and cloud storage buckets), and with at least one copy stored on offsite storage (such as the cloud). Backing up training data in this way requires some investment in backup tooling and storage, but it’s well worth it if it means you don’t have to worry that the training data you carefully collected and processed will disappear.
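To make the rule concrete, here is a small Python sketch that checks whether a described set of copies satisfies 3-2-1. The BackupCopy class, its field names, and the example labels are hypothetical; the check only reasons about the inventory you give it and does not verify that the backups actually exist or can be restored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str  # hypothetical label for where the copy lives, e.g. "office-nas"
    medium: str    # kind of storage, e.g. "local-disk" or "cloud-bucket"
    offsite: bool  # True if the copy is stored away from the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Return True if the copies meet the 3-2-1 rule.

    Three distinct copies, on at least two storage media,
    with at least one copy offsite.
    """
    distinct_locations = {copy.location for copy in copies}
    media = {copy.medium for copy in copies}
    has_offsite = any(copy.offsite for copy in copies)
    return len(distinct_locations) >= 3 and len(media) >= 2 and has_offsite


plan = [
    BackupCopy("workstation-ssd", "local-disk", offsite=False),
    BackupCopy("office-nas", "local-disk", offsite=False),
    BackupCopy("cloud-archive", "cloud-bucket", offsite=True),
]
print(satisfies_3_2_1(plan))
```

Dropping the cloud copy from this plan would fail the check on all three counts: only two copies, one medium, and nothing offsite.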

#4 - Control Access to Your Data

Preventing third parties from accessing your training data may not seem important, especially if it’s artificially generated synthetic data rather than real-world data based on actual people or things. However, keeping your training data safe by preventing unauthorized access is critical for two reasons, even if it’s synthetic data. The first is that synthetic data is sometimes based in part on real-world data that was in theory anonymized, but could still contain private information. Restricting access to that data is a best practice that will help you avoid potential compliance issues related to the leakage of private data.

Second, training data provides a lot of insight into what your machine learning models do. Unless you’re an open source project, you probably don’t want your competitors to gain an insider’s look at how your code works, which the training data could help them do. So, for the sake of your own intellectual property and privacy, you should make sure the data is stored using methods that restrict access. Storing data using encrypted volumes is an easy way to do this. If your data lives in the cloud, you’ll also want to be sure you lock it down using IAM rules (and whatever you do, never, ever make the mistake of accidentally granting public access to your cloud storage buckets).
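One way to catch accidental public access is to review bucket policies automatically. The Python sketch below scans an S3-style policy document for statements that allow access to a wildcard principal. It is a simplified check under stated assumptions: the function names are made up for this example, it ignores Condition blocks that may narrow a statement, and it does not cover other ways of exposing data, such as ACLs.

```python
import json


def _is_wildcard(principal) -> bool:
    """Return True if a policy Principal value matches everyone."""
    if principal == "*":
        return True
    if isinstance(principal, dict):
        values = principal.get("AWS", [])
        if isinstance(values, str):
            values = [values]
        return "*" in values
    return False


def public_statements(policy_json: str) -> list[dict]:
    """Return the Allow statements in a policy that apply to a wildcard principal."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    # A policy may hold a single statement object instead of a list.
    if isinstance(statements, dict):
        statements = [statements]
    return [
        statement
        for statement in statements
        if statement.get("Effect") == "Allow" and _is_wildcard(statement.get("Principal"))
    ]
```

An empty result does not prove a bucket is private, but a non-empty one is a strong signal that the policy deserves a closer look.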

#5 - Store Training Data in an Accessible, Future-Proof Format

Today, it may be hard to imagine a world where the databases we have known and loved for decades can no longer be opened, or the API you use to pull data from cloud storage is no longer supported. But that’s always a risk, which is why it’s wise to store training data in a format that gives you the greatest chance of being able to access it indefinitely. Even if your data sits dormant for ten years, you may want to reuse it at some point, and it would be disappointing to discover at that time that the software you need to access it is no longer available.

The easiest way to ensure long-term access to your data is to store it in plain text. Failing that, open source databases like MySQL are a safer bet than those from commercial software companies, which could decide to stop supporting them. If you store data in the cloud, sticking with a big-name cloud provider rather than a lesser-known platform may also help ensure the accessibility of your data over the long term, because the large providers seem less likely to go out of business or to stop supporting their current APIs.
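As one possible plain-text option, labeled records can be written as JSON Lines, with one JSON object per line in a UTF-8 text file. The sketch below is an assumption about what such an export could look like, not a prescribed format; the function names and the example fields are hypothetical.

```python
import json
from pathlib import Path


def write_jsonl(records, path: Path) -> None:
    """Write an iterable of dicts as UTF-8 JSON Lines, one record per line."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # sort_keys gives stable output, which keeps text diffs small.
            f.write(json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A file written this way can be opened in any text editor and parsed by virtually any language, and because it is line-oriented text, it also works well with tools like Git.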

Conclusion

With products like Watchful, it has never been easier to produce a variety of rich training data sets. But to get the very most out of the data that Watchful helps you create, you must ensure that you will always be able to access your training data quickly, in whichever version you need. Toward that end, consider strategies such as using version control, backing up the data regularly, and storing data in a format that is likely to be supported long into the future.
