Multi-head objectives
This issue addresses the following roadmap item:

- Multi-head objectives: Allow multiple non-weight-tied versions of the same head to have different losses
This also touches on the unmerged MR https://gitlab.kitware.com/smart/watch/-/merge_requests/508, which introduces the idea of "predictable_classes".
I'm motivated to support the following case. Imagine you have two datasets with different categories, and each dataset contains unlabeled instances of categories from the other. However, some of the categories are shared, or the domains are similar enough that it would be beneficial for the network to be exposed to the extra data and supervision, e.g. the shitspotter + TACO datasets, or xView + xView2 + SMART + resisc45.
We could support this by having multiple classification heads that target each individual problem. However, when training these heads we don't want the network to be penalized based on what it predicts for a head that is not associated with the underlying data.
There are 5 questions that need to be answered:
- How do we represent this union of disjoint class datasets in KWCOCO?
- What are good default settings such that the user isn't required to specify anything more than "train modelX on datasetY"?
- What could the user want to specify in order to overwrite default behavior?
- How do we construct the network and loss?
- How do we specify multiple variants of heads with the same classes?
For the first question, we have to consider how we would handle `kwcoco union --dst combo.kwcoco.zip -- dataset1 dataset2 ...`, and what we expect in the combo dataset. First, I think the categories are unioned as-is. Categories with the same name are considered duplicates; if they don't contain the same metadata, the user is warned and an arbitrary subset of the metadata is chosen (this could be improved). Each image or video could then be assigned the set of classes that is relevant to it, either by encoding their category ids as a list or perhaps through some other mechanism. The requirement is that each image knows which categories should be labeled.
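A minimal sketch of this union behavior, using plain dicts rather than the real kwcoco API. The per-image `predictable_cids` field is a hypothetical name for "the categories this image is labeled against", not an existing kwcoco key:

```python
def union_categories(cats1, cats2):
    """Union two category lists; entries with the same name are duplicates."""
    merged = {c['name']: dict(c) for c in cats1}
    for cat in cats2:
        if cat['name'] in merged:
            prev = merged[cat['name']]
            # Warn when duplicate names disagree on metadata (ignoring ids)
            if {k: v for k, v in prev.items() if k != 'id'} != \
               {k: v for k, v in cat.items() if k != 'id'}:
                print(f'warning: metadata mismatch for {cat["name"]!r}')
        else:
            merged[cat['name']] = dict(cat)
    # Reassign contiguous ids in the combined dataset
    for new_id, cat in enumerate(merged.values(), start=1):
        cat['id'] = new_id
    return list(merged.values())

cats1 = [{'id': 1, 'name': 'building'}]
cats2 = [{'id': 1, 'name': 'building'}, {'id': 2, 'name': 'river'}]
combo = union_categories(cats1, cats2)

# Each image then records which of the combined categories it labels
name_to_id = {c['name']: c['id'] for c in combo}
image = {
    'file_name': 'img1.png',
    # hypothetical field: the categories labeled for this image
    'predictable_cids': [name_to_id['building']],
}
```

The real implementation would live in `kwcoco union`; the sketch just shows the name-based dedup and the per-image bookkeeping the requirement asks for.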
For default settings, we currently default our network to include a category head based on all of the categories in a kwcoco file. I think this is a fine default behavior, and it would be fine to leave it as is. But it could be improved if the network detected the distinct sets of labeled category ids in the dataset and created a head for each of them.
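The improved default could be detected with a simple grouping pass, assuming each image records its labeled category ids in a per-image field (the `predictable_cids` name is hypothetical):

```python
def distinct_head_specs(images):
    """Group images by their set of labeled category ids; one head per set."""
    seen = set()
    for img in images:
        seen.add(frozenset(img['predictable_cids']))
    # Return a deterministic, sorted representation of each distinct set
    return sorted(sorted(s) for s in seen)

images = [
    {'id': 1, 'predictable_cids': [1, 2]},
    {'id': 2, 'predictable_cids': [2, 1]},   # same set, different order
    {'id': 3, 'predictable_cids': [3, 4, 5]},
]
specs = distinct_head_specs(images)  # two distinct sets -> two default heads
```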
What if the user wants to override the default behavior? I'm thinking that the user can specify `--predictable_classes=<ChannelSpec>`, where the channel spec is the same one we use for inputs, e.g. `--predictable_classes="(Site Preparation|Active Construction),(building),(nodamage|minordamage|majordamage|destroyed),(tennis_court|river|church|circular_farmland)"`. Then for each `FusedChannelSpec` stream we construct a head for those predictable classes.
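For illustration, a toy parser for strings of this shape; the real implementation would reuse the existing `ChannelSpec`/`FusedChannelSpec` machinery rather than this hand-rolled split:

```python
def parse_predictable_classes(spec):
    """Split a '(a|b),(c|d)' style string into per-head class lists."""
    heads = []
    for stream in spec.split(','):
        stream = stream.strip().strip('()')
        heads.append(stream.split('|'))
    return heads

spec = '(Site Preparation|Active Construction),(building)'
heads = parse_predictable_classes(spec)
```

This toy version would break on nested or quoted specs, which is exactly why deferring to the existing channel-spec parser makes sense.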
How do we construct the network? We create a torch `ModuleDict` and map each fused-channel-spec code to its own head; in the forward pass each of them is called. To compute the loss, we must construct the appropriate truth in the `KWCocoVideoDataset`. Because each image knows which categories belong to it, we know which channels are relevant to a sampled target. We compare these to each set of predictable classes producible by the network: if a head's channel spec is a subset of the target's predictable channels, we construct a truth tensor labeled with the channel spec for that head. Thus, given the outputs for each head, we can associate them with the available truth from the `KWCocoVideoDataset` and perform the loss calculation.
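The subset rule that routes truth to heads can be sketched without any torch machinery (head codes and class names here are illustrative, not the real spec strings):

```python
def heads_with_truth(head_classes, target_classes):
    """Return the head codes whose class set is covered by the sampled target.

    Only these heads get a truth tensor and contribute to the loss; the
    others are skipped so the network isn't penalized on unlabeled data.
    """
    target = set(target_classes)
    return [code for code, classes in head_classes.items()
            if set(classes) <= target]

head_classes = {
    'H_smart': ['Site Preparation', 'Active Construction'],
    'H_building': ['building'],
    'H_damage': ['nodamage', 'minordamage'],
}

# A target sampled from xView2-style data only labels the damage classes,
# so only the damage head is supervised on this batch item.
supervised = heads_with_truth(head_classes, ['nodamage', 'minordamage'])
```

In the real model each code would index into the `ModuleDict` of heads; the sketch only shows the truth-association logic.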
To allow multiple versions of each head (e.g. two `(Site Preparation|Active Construction)` heads, one trained with focal loss and the other with dicefocal), we need to differentiate them. We could use the sensorchan spec to do this, although that is somewhat an abuse of the data structure. I'm open to better ways to handle this, but I think it works well enough, and it does make some amount of sense:
`--predictable_classes="H1:(Site Preparation|Active Construction),H2:(Site Preparation|Active Construction)"`

or more concisely,

`--predictable_classes="(H1,H2):(Site Preparation|Active Construction)"`
Then, the user could specify some mapping:

```yaml
head_to_loss:
  "H1:(Site Preparation|Active Construction)": focal
  "H2:(Site Preparation|Active Construction)": dicefocal
```
@connor.greenwell Thoughts?