Duke is a fast and flexible deduplication (entity resolution or record linkage) engine written in Java on top of Lucene. At present it can process 1,000,000 records in 11 minutes on a standard laptop using a single thread.
Duke can find duplicate records within a single table or data source, or it can find records in different tables or sources that most likely represent the same real-world entity.
NOTE: Duke also runs on Mac and Linux platforms.
Requirements:
· Java
What's New in This Release:
Performance improvements:
· Support for multi-threading added
· Using NIOFSDirectory on all platforms except Windows
· New in-memory backend, faster than Lucene (experimental)
Changes to Comparators:
· Geo-coordinate comparator added
· Q-grams comparator added
· Levenshtein implementation is now faster
· Weighted Levenshtein weight estimator now knows position in string (issue 81)
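To give a rough idea of what a q-grams comparator measures, here is a minimal, self-contained sketch. It is not Duke's implementation: the class name QGramSketch, the method names, and the choice of the Dice coefficient over sets of q-grams are all assumptions made for illustration, and Duke's own comparator may tokenize and score differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not part of Duke's API.
final class QGramSketch {

    // Splits a string into its set of overlapping substrings of length q.
    // A non-empty string shorter than q is treated as a single gram.
    static Set<String> qgrams(String s, int q) {
        Set<String> grams = new HashSet<>();
        if (s.length() < q) {
            if (!s.isEmpty()) {
                grams.add(s);
            }
            return grams;
        }
        for (int i = 0; i + q <= s.length(); i++) {
            grams.add(s.substring(i, i + q));
        }
        return grams;
    }

    // Dice coefficient over the two q-gram sets: 1.0 means the sets are
    // identical, 0.0 means they share no q-gram. (Assumed scoring formula.)
    static double similarity(String a, String b, int q) {
        Set<String> ga = qgrams(a, q);
        Set<String> gb = qgrams(b, q);
        if (ga.isEmpty() && gb.isEmpty()) {
            return 1.0; // two empty strings are considered equal
        }
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    public static void main(String[] args) {
        System.out.println(similarity("night", "nacht", 2));
    }
}
```

Compared with an edit distance such as Levenshtein, a set-based score like this is less sensitive to where in the string a difference occurs, which is one reason the two kinds of comparator are often offered side by side.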
Changes to Cleaners:
· Added PhoneNumberCleaner
· Extended and generalized regexp cleaner
· Removed sub-cleaner concept, added support for multiple cleaners
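The idea behind these cleaner changes (a phone number cleaner, a general regexp cleaner, and several cleaners applied to one value) can be sketched as follows. This is an illustrative assumption, not Duke's code: CleanerSketch, cleanPhone, regexCleaner, and applyAll are hypothetical names, and the real PhoneNumberCleaner may apply different normalization rules.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Duke's API.
final class CleanerSketch {

    // Trims, lower-cases, and collapses runs of whitespace to one space.
    static String normalize(String value) {
        return value.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Keeps only the digits of a phone number, preserving a leading '+'.
    static String cleanPhone(String value) {
        String trimmed = value.trim();
        String digits = trimmed.replaceAll("[^0-9]", "");
        return trimmed.startsWith("+") ? "+" + digits : digits;
    }

    // Builds a cleaner that returns the given capture group of the first
    // match, or the value unchanged when the pattern does not match.
    static UnaryOperator<String> regexCleaner(String regex, int group) {
        Pattern pattern = Pattern.compile(regex);
        return value -> {
            Matcher m = pattern.matcher(value);
            return m.find() ? m.group(group) : value;
        };
    }

    // Applies several cleaners in order, feeding each one's output
    // into the next.
    static String applyAll(List<UnaryOperator<String>> cleaners, String value) {
        String result = value;
        for (UnaryOperator<String> cleaner : cleaners) {
            result = cleaner.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> chain =
            Arrays.asList(CleanerSketch::normalize, CleanerSketch::cleanPhone);
        System.out.println(applyAll(chain, "  +47 22 85 50 50 "));
    }
}
```

Treating each cleaner as a plain string-to-string function is what makes it straightforward to apply several of them in sequence, with no cleaner needing to know about the others.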
Other improvements:
· Implemented user control over lookup props
· Upgraded to Lucene 4.0
· Added MatchListener.startProcessing() callback
· Removed some MatchListener callback methods (weren't thread-safe)
· InMemoryLinkDatabase now complete and tested
· LinkDatabaseMatchListener bug fixes
· Better validation of configurations
· JDBCEquivalenceClassDatabase added
· RDBMSLinkDatabase performance improvement
Changes to command-line client:
· Added data debug mode
· Fixed bug with reu...