In answering this question, it's helpful to hold in mind how the rules and heuristics are used. First the rules are applied across all of the inputs to produce a partially-labelled dataset, where many datapoints have more than one label. Then a probabilistic model is used to aggregate the various labels on each datapoint into one most likely label. Finally we train a discriminative model.