Performance barely differs across the variants, so I want to add a simple test on a small toy example to check the implementation, plus a simple test for the evaluation code.
Because we do induction, the cluster labels we produce differ from the cluster labels in the gold standard, so it's important to understand how this affects the measured performance.
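
To make the label issue concrete, here is a minimal toy sketch of what such an evaluation test could look like. The helper names `naive_accuracy` and `rand_index` are hypothetical and are not taken from the existing code; the Rand index is only one example of a measure that ignores label names, and the actual evaluation may use a different metric.

```python
from itertools import combinations


def naive_accuracy(gold, pred):
    """Fraction of items whose predicted label is literally equal to the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def rand_index(gold, pred):
    """Fraction of item pairs on which both clusterings agree
    (both put the pair in the same cluster, or both separate it)."""
    pairs = list(combinations(range(len(gold)), 2))
    agree = sum((gold[i] == gold[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)


# Toy data (made up for illustration): four items, two gold clusters.
gold = [0, 0, 1, 1]
swapped = [1, 1, 0, 0]    # same partition as gold, only the label ids differ
imperfect = [0, 0, 0, 1]  # one item assigned to the wrong cluster

print(naive_accuracy(gold, swapped))  # penalizes the renamed labels
print(rand_index(gold, swapped))      # unaffected by the renaming
print(rand_index(gold, imperfect))    # drops only for a real clustering error
```

If the evaluation compares labels literally, a perfect induced clustering with renamed labels scores as badly as a wrong one, so a test like this can show whether the evaluation is invariant to label names.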