Add hierarchical pretraining algorithm
This PR adds hierarchical pretraining (HPT), where the pipeline looks like: pretrained source_network => additional [self-supervised] pretraining on {source, source+target, source_then_target, target} => domain_adaptation.
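To make the pipeline above concrete, here is a minimal sketch of the stages and the four pretraining-data regimes. All names (`PretrainData`, `run_hpt_pipeline`, the stub stage functions) are illustrative, not the actual LEARN API:

```python
from enum import Enum

class PretrainData(Enum):
    """The four data regimes for the extra self-supervised pretraining stage."""
    SOURCE = "source"
    SOURCE_PLUS_TARGET = "source+target"
    SOURCE_THEN_TARGET = "source_then_target"
    TARGET = "target"

# Stub stages so the sketch is self-contained; real implementations would
# train the network rather than tag a string.
def self_supervised_pretrain(net: str, regime: PretrainData) -> str:
    return f"{net}+hpt[{regime.value}]"

def domain_adapt(net: str) -> str:
    return f"{net}+da"

def run_hpt_pipeline(source_network: str, regime: PretrainData) -> str:
    """pretrained source_network -> [self-supervised] pretraining -> domain adaptation."""
    pretrained = self_supervised_pretrain(source_network, regime)
    return domain_adapt(pretrained)

print(run_hpt_pipeline("resnet50", PretrainData.SOURCE_THEN_TARGET))
# → resnet50+hpt[source_then_target]+da
```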
HPT operates by calling out to a child process (using torch distributed). To do this, it:
- dynamically generates the HPT configs and writes them to disk,
- writes the source network to disk,
- writes the appropriate data lists to disk,
- executes the HPT process, and
- rereads the updated network weights from disk and applies them back to the source network.
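The write-to-disk / child-process / read-back round trip can be sketched with the stdlib alone. This is only an illustration of the orchestration pattern: the real code serializes the network with torch and launches the child via torch distributed, while the file names, JSON "weights", and inline child script below are stand-ins:

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in child process: reads the config, "trains" the weights (here just a
# scale), and writes them back — mimicking the HPT child in this PR.
CHILD = """
import json, sys
cfg = json.load(open(sys.argv[1]))
weights = json.load(open(cfg["weights"]))
weights = {k: v * cfg["scale"] for k, v in weights.items()}
json.dump(weights, open(cfg["weights"], "w"))
"""

def run_hpt(weights: dict, scale: float) -> dict:
    with tempfile.TemporaryDirectory() as d:
        wpath = Path(d) / "weights.json"
        cpath = Path(d) / "hpt_config.json"
        wpath.write_text(json.dumps(weights))          # source network -> disk
        cpath.write_text(json.dumps(                   # generated HPT config -> disk
            {"weights": str(wpath), "scale": scale}))
        subprocess.run([sys.executable, "-c", CHILD, str(cpath)],
                       check=True)                     # execute the child HPT process
        return json.loads(wpath.read_text())           # reread updated weights

print(run_hpt({"w0": 1.0, "w1": 2.0}, 3.0))
# → {'w0': 3.0, 'w1': 6.0}
```

The design point is that the parent never shares memory with the child: everything crosses the process boundary through files, which is why the configs, network, and data lists all get written out first.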
I followed BU-NLP as much as I could, in terms of the "correct" way to do this.
I'm still doing some testing across the various ways to use the LEARN pipeline, but I think the PR is in good enough shape for initial review =)