Machine Learning Problems: The Easy Parts
SEPTEMBER 21, 2017
So we implemented de-duplication algorithms to significantly reduce the resources required to process the information in these documents. The features are based on, among other things, the frequency and importance of entities (discussed later in this post). We therefore had to write our own version of hierarchical clustering that returns a large number of tight clusters (that is, groups of duplicates). Inverse document frequency. which are not useful for de-duplication.
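To make the idea concrete, here is a minimal sketch of what "many tight clusters" can look like in code. It is not our production implementation. It assumes each document has already been reduced to a list of entity strings, weights entities by frequency and a smoothed inverse document frequency, and merges clusters with complete linkage only while the cosine distance stays under a threshold. The function names and the `max_distance=0.2` default are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def idf_weights(docs):
    """Map each entity to a smoothed inverse document frequency."""
    n = len(docs)
    df = Counter(e for doc in docs for e in set(doc))
    return {e: math.log((1 + n) / (1 + c)) + 1.0 for e, c in df.items()}


def vectorize(doc, idf):
    """Sparse vector: entity frequency in the document times its IDF."""
    return {e: count * idf[e] for e, count in Counter(doc).items()}


def cosine_distance(a, b):
    dot = sum(w * b.get(e, 0.0) for e, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat empty documents as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def tight_clusters(docs, max_distance=0.2):
    """Agglomerative clustering that stops early, leaving many small clusters.

    Complete linkage is used, so every pair of documents inside a cluster
    is within max_distance of each other (hypothetical threshold).
    """
    idf = idf_weights(docs)
    vecs = [vectorize(d, idf) for d in docs]
    dist = {
        (i, j): cosine_distance(vecs[i], vecs[j])
        for i, j in combinations(range(len(docs)), 2)
    }

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist[min(i, j), max(i, j)] for i in a for j in b)

    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[x], clusters[y]) > max_distance:
            break  # closest pair is too far apart: stop merging
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # x < y, so index x is unaffected
    return clusters
```

Calling `tight_clusters` on a list of entity lists returns lists of document indices; singletons are documents with no near-duplicate. This naive version recomputes linkages on every pass, so it only suits small inputs. A lower threshold gives more and tighter clusters, which is the behaviour you want for de-duplication.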