Not sure how relevant this is, but in my research I found that active learning (AL) can fail to outperform random sampling if the decision boundary is fractal, which is common in chaotic systems: https://arxiv.org/abs/2311.18010
Both this perspective and the paper were very interesting! I do think that statistics / ML will never provide a full accounting of the value of replicates, however. We treat the data generating process as given, with replicates being samples from that distribution. But in biology, the data generating process is not fixed, but is something to be evaluated, debugged, and optimized. If you mess up your biological assay, your measurements may become biased, or at the very least have higher variance. Using replicates is really just the first step in detecting and addressing such problems, and it's hard to see how the formalisms of active learning can capture this.
Great point. We have less understanding of how to model biological replicates (hard, unknown unknowns) than technical replicates (easier). IMO, putting it all into aleatoric noise is semi-acceptable (as in our preprint), but you'd probably want some way of identifying extreme outliers indicating a failed experiment. WDYT?
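For concreteness, here's a minimal sketch of the kind of outlier screen I have in mind (names and the threshold are illustrative, not from the preprint): flag a replicate as a suspected failed experiment when it sits far from its siblings under a robust median/MAD spread estimate, and treat the remaining spread as aleatoric noise. With only a handful of replicates this is crude, but it at least separates "noisy" from "probably broken."

```python
import numpy as np

def flag_failed_replicates(replicates, z_thresh=3.5):
    """Flag replicates that are extreme outliers relative to their siblings.

    replicates: 1-D array of repeated measurements of the same condition.
    Returns a boolean mask (True = suspected failed experiment) using a
    robust z-score based on the median and MAD, so a single bad run does
    not inflate the spread estimate it is judged against.
    """
    x = np.asarray(replicates, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:  # replicates (nearly) identical; nothing to flag
        return np.zeros_like(x, dtype=bool)
    robust_z = 0.6745 * (x - med) / mad  # 0.6745 rescales MAD to ~sigma for normal data
    return np.abs(robust_z) > z_thresh

# Example: the fourth replicate looks like a failed run
measurements = [0.82, 0.79, 0.85, 0.21]
print(flag_failed_replicates(measurements))  # [False False False  True]
```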
I think modeling technical replicates is still really hard. One example is the case of cell-free protein synthesis for antibody engineering, where in theory everything is a technical replicate because you're basically just measuring chemistry without biological variation. But it's not the case that different preparations (only some of which are legible and can be controlled for) will only differ in terms of the variance of the measurement error. For certain antibody sequences, some preparations will give systematically "wrong" answers. And I put "wrong" in quotation marks, because the wrongness is often not clear-cut, but is instead inferred from looking at a broad set of experimental results.
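To make that distinction concrete, here's a small simulation sketch (all numbers and names hypothetical) contrasting pure measurement noise with a preparation-level bias that hits only certain sequences: averaging more technical replicates within one bad preparation converges on the wrong value, so no amount of within-prep replication reveals the problem. Only comparing across preparations, or against a broader set of experiments, can.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_VALUE = 1.0   # hypothetical "true" assay readout for one antibody sequence
NOISE_SD = 0.05    # within-preparation measurement noise (aleatoric)
PREP_BIAS = -0.4   # systematic shift from a "bad" preparation of this sequence

def simulate_prep(n_replicates, prep_bias):
    """Technical replicates from a single preparation: shared bias + i.i.d. noise."""
    return TRUE_VALUE + prep_bias + rng.normal(0.0, NOISE_SD, size=n_replicates)

for n in (3, 30, 300):
    good = simulate_prep(n, prep_bias=0.0).mean()
    bad = simulate_prep(n, prep_bias=PREP_BIAS).mean()
    print(f"n={n:4d}  good prep mean={good:.3f}  bad prep mean={bad:.3f}")

# The good prep's mean approaches 1.0 as n grows; the bad prep's mean
# approaches 0.6 instead. The error is a bias, not extra variance, so a
# variance-only (aleatoric) noise model never sees it.
```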