Machine Learning Problems: The Easy Parts
SEPTEMBER 21, 2017
So we implemented de-duplication algorithms to significantly reduce the resources required to process the information in these documents. That meant writing our own version of hierarchical clustering, one that returns a large number of tight clusters of duplicates. To complicate matters, the feature set for de-duplication differs substantially from the one used for tagging, classification, or topic modeling: features that work for those tasks are not useful for de-duplication.
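
To make the idea of "many tight clusters" concrete, here is a minimal sketch in Python. It is not our production code. It assumes character 5-gram shingles as the de-duplication features, Jaccard distance between shingle sets, and complete-linkage merging that stops at a small distance threshold. The function names and the threshold value are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def shingles(text, k=5):
    """Character k-grams of the lowercased, whitespace-normalized text (assumed feature set)."""
    norm = " ".join(text.lower().split())
    if len(norm) <= k:
        return {norm}
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def tight_clusters(docs, threshold=0.3, k=5):
    """Agglomerative clustering with complete linkage, cut at a small threshold.

    Complete linkage uses the largest pairwise distance between two clusters,
    so every pair of documents inside a returned cluster is within `threshold`.
    A low threshold yields many small clusters instead of a few broad ones.
    """
    sets = [shingles(d, k) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        return max(jaccard_distance(sets[i], sets[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters; combinations guarantees a < b.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[a], clusters[b]) > threshold:
            break  # nothing left that is close enough to count as a duplicate
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters
```

Calling `tight_clusters(list_of_texts)` returns lists of document indices, and any list with more than one index is a candidate group of duplicates. This naive loop recomputes distances on every pass, so it only suits small inputs. The point is the stopping rule, not the performance.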