Can programmatic labelling really match hand labelled data?

If you have equal volumes of manually annotated data and programmatically labelled data, the manually annotated data will usually be better.
However, manually annotated data is usually much slower and more expensive to obtain. A large programmatically generated dataset can, on its own, achieve results similar to manually annotated data, and when combined with ground-truth labels it will often give a significant lift in performance.
Programmatic labelling improves significantly on a plain rules-based system because the Label Model works out which rules to trust, and when.
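A toy sketch of why learning per-rule trust helps. Here we assume the per-rule accuracies are already known (in practice the Label Model estimates them from how the rules agree and disagree); the log-odds weighting below is the form a naive-Bayes-style label model converges to for conditionally independent rules. All numbers are illustrative:

```python
import math

def weighted_vote(votes, accuracies):
    """Combine binary rule outputs (0/1) with log-odds weights:
    rules believed to be more accurate get a larger say."""
    score = 0.0
    for v, acc in zip(votes, accuracies):
        w = math.log(acc / (1 - acc))  # trust weight for this rule
        score += w if v == 1 else -w
    return 1 if score > 0 else 0

# One accurate rule (90%) outvotes two near-random rules (55%),
# whereas an unweighted majority vote would go the other way.
print(weighted_vote([1, 0, 0], [0.9, 0.55, 0.55]))  # → 1
```

A uniform majority vote over the same three outputs would return 0, which is the whole point: knowing which rules to trust changes the answer.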
A recent benchmark across numerous datasets, covering both classification and NER, shows that purely programmatic labels can come very close to manual annotation and sometimes exceed it. See the table below:
The performance of the best gold method and top weak supervision methods for each dataset, as reported in the WRENCH benchmark. Each metric value is averaged over 5 runs.

If you can create good rules and heuristics, why do you still need to train a machine learning model?

The features that a discriminative model uses don't need to be the same ones that are used as inputs to your heuristics or rules. This means that:
  1.
    The model can generalise beyond the weak rules because it extracts features that are more general.
    e.g. A weak rule for sentiment might fire on a handful of keywords. The discriminative model trained on the labels from that rule never sees the rule itself; it learns broader features, such as word order and surrounding context, that extend beyond those keywords.
  2.
    You can use inputs and features for your rules that may not be available at prediction time.
    e.g. Say you have a dataset of medical images with accompanying doctor's notes, and you want to train a classifier on the images alone. You can write a set of rules that takes the textual notes and produces labels for the images. The rules themselves are useless to the image classifier, but their output is not.
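To make the first point concrete, here is a minimal sketch of keyword rules written as Snorkel-style labelling functions that either vote for a class or abstain. The function names, keyword lists, and example reviews are all hypothetical:

```python
# Sentinel values: each labelling function votes for a class or abstains.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_positive_keywords(text: str) -> int:
    # Fires only when an obviously positive word appears.
    keywords = {"great", "excellent", "loved"}
    return POSITIVE if any(w in keywords for w in text.lower().split()) else ABSTAIN

def lf_negative_keywords(text: str) -> int:
    keywords = {"terrible", "awful", "refund"}
    return NEGATIVE if any(w in keywords for w in text.lower().split()) else ABSTAIN

reviews = ["The food was great", "Awful service, I want a refund", "It was fine"]
lfs = (lf_positive_keywords, lf_negative_keywords)
votes = [[lf(r) for lf in lfs] for r in reviews]
# votes → [[1, -1], [-1, 0], [-1, -1]]  (last review: no rule fires)
```

The discriminative model is then trained on the resulting labels, not on the keyword lists, which is what lets it generalise past them.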
Training the discriminative model on the de-noised labels almost always adds a very significant bump in performance.
In answering this question, it's helpful to hold in mind how the rules and heuristics are used. First, the rules are applied across all of the inputs to produce a partially labelled dataset, where many datapoints have more than one label. Then a probabilistic model is used to aggregate the various labels on each datapoint into a single most likely label. Finally, we train a discriminative model on those aggregated labels.
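The aggregation step above can be sketched as follows. A plain majority vote stands in for the probabilistic label model (a real label model additionally weights each rule by its estimated accuracy); the vote matrix is illustrative:

```python
from collections import Counter

ABSTAIN = -1

def majority_label(votes):
    """Aggregate one datapoint's rule outputs into a single label,
    ignoring abstentions. Simplified stand-in for a label model."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN  # no rule fired; leave this point unlabelled
    return counts.most_common(1)[0][0]

# Each inner list holds the outputs of three hypothetical rules on one datapoint.
vote_matrix = [
    [1, 1, ABSTAIN],   # two rules agree → label 1
    [0, 1, 0],         # conflict resolved in favour of 0
    [ABSTAIN] * 3,     # nothing fired → dropped before training
]
labels = [majority_label(v) for v in vote_matrix]
# labels → [1, 0, -1]
```

The datapoints that end up with a label then form the training set for an ordinary discriminative classifier.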
When viewed this way, you can see that asking how a discriminative model can generalise beyond the rules is entirely equivalent to asking how any model can generalise beyond hand-labelled data. In both cases you provide the model with a (noisily) annotated dataset and hope it learns to generalise to new instances.