DataCleaner Changelog

New in DataCleaner 5.2.0 (Jul 7, 2017)
  • Both the open source and commercial editions contain these changes:
  • IBM's ICU library is now used for better Unicode standard compatibility.
  • Hashing component added
  • Fixed Pentaho plugin
  • Extended dictionary matcher (added the option to ignore diacritics)
  • Added a scroll bar to the value distribution result page
  • Made connection lines lighter in the desktop edition's graphical job representation
  • Monitor is no longer available for the community edition
  • Commercial only:
  • Express Data Prep: a brand new wizard on the home screen of DataCleaner desktop. In just a few steps, it allows users to create a complete cleansing job for UK contact data and push it to DataCleaner monitor for repeated execution on data coming in from a hot folder.
  • Monitor: improved handling of Cron expression errors
  • Monitor: improved validation of uploaded files
  • Monitor: the status of the last job execution is now visible on the schedule page
  • Monitor: placing a data file in the hot folder can now trigger a job, which then uses that file as its source datastore
  • Duplicate Detection: updated to the newest version, including better training performance with a low column count.
  • Wizards made pluggable in DC Desktop.
  • Extended license information panel
  • Added Salutation generation transformer.
  • Added Sample select transformer.
  • Packaged new 'Name & Company Correction' component into DC Enterprise edition
  • Fixed Sanction list check component

New in DataCleaner 5.1.5 (Mar 14, 2017)

  • We had a rather bad issue with 5.1.4, so we decided to make a quick release of 5.1.5.
  • Fixes:
  • Running DataCleaner on OSX could fail.
  • Output column names could not be edited in the CSV, Excel and staging writers, which could cause lockups in the desktop edition.
  • We recommend that all users upgrade to 5.1.5, especially if you're using 5.1.4.

New in DataCleaner 5.1.4 (Nov 9, 2016)

  • BUG FIXES:
  • Make requirements and scope buttons update properly when job changes
  • Stability fixes to Union and Coalesce components
  • Stability fix for Excel writer
  • Fix jTDS single-connection-multiple-threads writing exception
  • Save button only enabled when a job is being built
  • Monitor: Metrics from output datastreams available for timelines.
  • Monitor: Bad repository file upload does not delete old repository
  • Monitor: JavaScript transformer now works in monitor
  • OTHER NICE IMPROVEMENTS:
  • Improve job loading and execution startup times
  • Reduce Excel writer validation overhead
  • Improve flexibility of date range filter
  • License overview panel (commercial only)
  • URL Parser results improved
  • Monitor: Improve scheduling page load times
  • Monitor: Old repository backed up when new repository is loaded
  • Monitor: Reference data can now be configured in UI
  • There's a known issue that upgraders should be aware of:
  • Duplicate Detection results saved in previous versions cannot be opened in 5.1.4

New in DataCleaner 5.1.3 (Sep 15, 2016)

  • 5.1.3 is mostly a bugfix release, containing the following fixes:
  • Remove spurious logging from grouper and UI.
  • Remove non-functioning view online regex button.
  • Exported HTML report and monitor graphs should now work properly.
  • Remove error when setting up timeline for pattern finder analyzer with grouping.
  • Fix on-premise name correction (commercial only).
  • Multiple issues in deduplication component (commercial only).
  • We also have a single improvement:
  • Make duplicate detection training more intuitive (commercial only).

New in DataCleaner 5.1.2 (Sep 15, 2016)

  • Now you can trigger jobs in the monitor simply by dumping files into a folder. This is great for processing jobs that get delivered by other services. You can even use a .properties file if you need to change the configuration for the job execution.
  • We've also:
  • Extended our monitor UI to support adding HDFS datastores.
  • Added support for simple fixed-width mainframe/EBCDIC files
  • Made it possible to run a job on another configured datastore from the command line.
  • Extended the monitor REST job trigger to support setting configuration properties.
  • Added a news channel to DataCleaner, so you can stay on top of new releases
  • ... And of course we fixed bugs and performance issues along the way.

New in DataCleaner 4.5 (Nov 5, 2015)

  • Output data streams:
  • At the engine and API level we've added the concept of "output data streams", which means that every component can publish streams of data that can be consumed by other components. Users of our API can utilize this feature by implementing the HasOutputDataStreams interface.
  • If this sounds too technical for you, just appreciate that this capability underlies the following four features/improvements.
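  • To make the concept concrete, here is a minimal conceptual sketch in plain Java (hypothetical names throughout; the actual contract is the HasOutputDataStreams interface mentioned above, so check its documentation for the real signatures):

      import java.util.ArrayList;
      import java.util.List;
      import java.util.function.Consumer;

      public class OutputDataStreamSketch {

          // Hypothetical stand-in for a component that, besides its normal
          // output, publishes a named side stream of derived records.
          static void detectDuplicates(List<Object[]> rows,
                  Consumer<Object[]> duplicatesStream) {
              // (matching logic elided) push a detected duplicate group downstream
              duplicatesStream.accept(new Object[] { "group-1", "record-42", "record-97" });
          }

          public static void main(String[] args) {
              List<Object[]> consumedByNextComponent = new ArrayList<>();
              detectDuplicates(new ArrayList<>(), consumedByNextComponent::add);
              // A downstream component (e.g. a merger) consumes the stream
              System.out.println(consumedByNextComponent.size() + " duplicate row(s) published");
          }
      }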
  • Duplicate detection and merging in the same job:
  • With the major updates we did to the UI in DataCleaner 4.0, it became clear that our users are becoming more and more empowered to do elaborate tasks using DataCleaner. One of the most frequent limitations we encountered in this respect was that it was not possible to combine two complex tasks like duplicate detection and merging in a single job, even though for experienced users it is a very useful scenario. So with the use of a new data stream originating from Duplicate detection, you can now combine this job with duplicate merging or any other duplicate post-processing step you might have.
  • Example job containing standardization, duplicate detection, merging and writing
  • Combine tables and data sources using the Union component:
  • We have added a core transformation function to DataCleaner called 'Union'. The functionality of this transformation is comparable to SQL's UNION operation - appending two or more datasets together as if they were one. In other words: if you have multiple data sources, or just multiple tables, with the same type of content, then you can use the Union component to utilize them as if they were one big table.
  • The Union transformation can be used in conjunction with a Composite datastore. That way you can combine data from different data sources such as CSV files, relational databases, ElasticSearch indices or Salesforce.com (to give a few examples).
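  • As an analogy (plain Java, not DataCleaner code), the Union semantics are like concatenating record streams with the same layout, comparable to SQL's UNION ALL:

      import java.util.Collections;
      import java.util.List;
      import java.util.stream.Stream;

      public class UnionAnalogy {
          public static void main(String[] args) {
              // Two sources with the same record layout: name, email
              List<String[]> crm = Collections.singletonList(
                      new String[] { "Ann", "ann@example.com" });
              List<String[]> webshop = Collections.singletonList(
                      new String[] { "Bob", "bob@example.com" });

              // The Union component lets downstream steps process the rows
              // as if they came from one big table
              Stream.concat(crm.stream(), webshop.stream())
                      .forEach(row -> System.out.println(row[0] + " <" + row[1] + ">"));
          }
      }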
  • Check if your contacts have moved or passed away - and update your source - all in the same job.
  • Via the Neopost family of data quality companies we have integrated several address correction, movers check, deceased check and similar services for specific regions. Currently we cover the United Kingdom, United States of America, Germany and the Netherlands with such functionality. With DataCleaner 4.5, using these functions has become a lot easier: the flexibility of integrating these services via output data streams means that you can perform checks, get reports on the results and do the post-processing of the results in a single job!
  • Example result-screen report from UK movers, deceased and do-not-mail check.
  • Process the complete or incomplete records found by Completeness analyzer
  • Completeness is one of the major dimensions of data quality, and DataCleaner addresses this topic with the Completeness analyzer, as well as filtering techniques. In DataCleaner 4.5 the analysis of completeness no longer necessarily ends with the incomplete records. You can now also use the Completeness analyzer as an intermediate step - feeding e.g. the complete or incomplete records into automated post-processing steps.
  • Connect DataCleaner to its big sister, DataHub:
  • Did you know that DataCleaner is a key architecture piece of the Human Inference/Neopost customer MDM solution, DataHub? DataHub serves the enterprise market for customer MDM and single customer view, and we've been improving the integration a lot in this release of DataCleaner - most notably with the DataHub connector, which allows DataCleaner users to seamlessly consume data from and publish to the DataHub.
  • The processing pipeline in DataHub.
  • Product data components: GTIN, EAN, VIN
  • We have added a new category of Data Quality functions which revolve around Product data.
  • New 'Product data' category.
  • With these functions, and more to come in the future, we are building a suite of ready-to-use components that validate and standardize the use of common industry codes for products in your database.
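  • To illustrate the kind of rule these components apply, here is the public EAN-13/GTIN-13 check digit algorithm in plain Java (a sketch of the standard algorithm, not DataCleaner's own code):

      public class Ean13Check {

          public static boolean isValidEan13(String code) {
              if (code == null || !code.matches("\\d{13}")) {
                  return false;
              }
              int sum = 0;
              for (int i = 0; i < 12; i++) {
                  int digit = code.charAt(i) - '0';
                  // Weights alternate 1, 3, 1, 3, ... from the left
                  sum += (i % 2 == 0) ? digit : digit * 3;
              }
              int expectedCheckDigit = (10 - (sum % 10)) % 10;
              return expectedCheckDigit == code.charAt(12) - '0';
          }

          public static void main(String[] args) {
              System.out.println(isValidEan13("4006381333931")); // true
          }
      }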
  • Component library restructured
  • The component library structure has been revisited and we've designed this so that the menus and search function are optimized for the tasks at hand. As you can also see from the screenshot above, the Improve category has changed a lot - now focusing more on specific domains of data and data quality checks.
  • Secure ElasticSearch connections with Shield
  • We now support ElasticSearch with Shield-based security. The connection you define for an ElasticSearch index can be reused both for a regular read/write style datastore and for searches, matching and other functions that we provide which integrate with ElasticSearch.
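  • For reference, a Shield-secured connection with the ElasticSearch 1.x transport client of that era looked roughly like this (cluster name, credentials and host are placeholders, and the Shield plugin must be on the classpath):

      import org.elasticsearch.client.transport.TransportClient;
      import org.elasticsearch.common.settings.ImmutableSettings;
      import org.elasticsearch.common.settings.Settings;
      import org.elasticsearch.common.transport.InetSocketTransportAddress;

      public class ShieldConnectionSketch {
          public static void main(String[] args) {
              Settings settings = ImmutableSettings.settingsBuilder()
                      .put("cluster.name", "my-cluster")    // placeholder
                      .put("shield.user", "dc_user:secret") // Shield credentials
                      .build();
              TransportClient client = new TransportClient(settings)
                      .addTransportAddress(new InetSocketTransportAddress("es-host", 9300));
              System.out.println(client.connectedNodes());
              client.close();
          }
      }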
  • Easy access to the Hadoop Distributed File System (HDFS)
  • As Hadoop is becoming more and more the system of choice for Big Data projects, we've decided to improve the user experience for analyzing and working with files located on HDFS, the Hadoop Distributed File System.
  • Browsing HDFS to select your DataCleaner source data.
  • Now browsing and finding files on HDFS is just as convenient as it has always been on your local machine. It's no secret that the roadmap for DataCleaner involves tighter and tighter integration with Hadoop, and this is our first step to make the Hadoop-DataCleaner experience both effective and pleasant.
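  • Under the hood, reading a file from HDFS with the standard Hadoop client API looks roughly like this (namenode address and path are placeholders):

      import java.io.BufferedReader;
      import java.io.InputStreamReader;
      import java.net.URI;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsReadSketch {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(
                      URI.create("hdfs://namenode:8020"), new Configuration());
              try (BufferedReader reader = new BufferedReader(
                      new InputStreamReader(fs.open(new Path("/data/customers.csv"))))) {
                  System.out.println(reader.readLine()); // the CSV header line
              }
          }
      }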
  • A new Delete from table component
  • We have added a component in the "Write" category that deletes records from a table in a datastore. Use it in conjunction with filtering functions to e.g. delete dirty records or non-survivors found after merging duplicates.
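  • DataCleaner's datastore access builds on Apache MetaModel, where such a delete looks roughly like this (a sketch against an in-memory H2 database; table and column names are made up):

      import java.sql.Connection;
      import java.sql.DriverManager;

      import org.apache.metamodel.delete.DeleteFrom;
      import org.apache.metamodel.jdbc.JdbcDataContext;
      import org.apache.metamodel.schema.Table;

      public class DeleteFromSketch {
          public static void main(String[] args) throws Exception {
              Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
              conn.createStatement().execute(
                      "CREATE TABLE customers (id INT, is_duplicate BOOLEAN)");

              JdbcDataContext dc = new JdbcDataContext(conn);
              Table customers = dc.getDefaultSchema().getTableByName("CUSTOMERS");

              // Delete the records flagged by an upstream filter or dedupe step
              dc.executeUpdate(new DeleteFrom(customers).where("IS_DUPLICATE").eq(true));
          }
      }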
  • Online component library reference
  • A lot has been done to further improve our reference documentation. In addition to updated chapters etc. we've launched the Component library online which provides a quick way to navigate documentation on an individual component level.
  • We're confident that you will enjoy the improved DataCleaner. Version 4.5 is a major step and we are proud to share it with you!

New in DataCleaner 4.0.10 (Sep 22, 2015)

  • So what is new in this release?
  • We have added support for connecting to Apache Hive via our existing JDBC interface (see the connection sketch after this list). In addition to the existing Apache HBase connectivity, this is a good first step towards having DataCleaner as a Big Data profiling engine for your data in Hadoop.
  • A bug related to version-conflicts in the Apache HBase connector was fixed.
  • We have made it easier to immediately register new datastores when you need them. For instance when configuring a Table lookup or the Insert into table component.
  • The result window has been improved slightly, now prioritizing analysis component reports over e.g. results from transformations.
  • Finally, the reference documentation has been updated extensively, and its index now contains better sub-sectioning.
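  • A minimal sketch of the kind of HiveServer2 JDBC connection the Hive support relies on (host, port, database, credentials and table are placeholders):

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;

      public class HiveJdbcSketch {
          public static void main(String[] args) throws Exception {
              Class.forName("org.apache.hive.jdbc.HiveDriver");
              try (Connection conn = DriverManager.getConnection(
                      "jdbc:hive2://hive-host:10000/default", "user", "");
                      ResultSet rs = conn.createStatement()
                              .executeQuery("SELECT COUNT(*) FROM customers")) {
                  rs.next();
                  System.out.println("rows: " + rs.getLong(1));
              }
          }
      }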

New in DataCleaner 4.0.9 (Sep 22, 2015)

  • Improvements and new features:
  • We've made it possible to create and drop tables via the desktop UI of DataCleaner. Note that the term "table" here actually covers more than just relational database tables. It also includes Sheets in MS Excel datastores, Collections in MongoDB, Document types in CouchDB and ElasticSearch and so on... Basically all datastore types that support write-operations, except single-table datastores such as CSV datastores, support this functionality! The functionality is exposed via:
  • "Create table" enabled via the right-click menu of schemas in the tree on the left side of the application.
  • "Create table" enabled also via table-selection inputs in components such as Insert into table, Table lookup and Update table.
  • "Drop table" enabled via the right-click menu of tables in the tree on the left side of the application.
  • We've added the (optional) capability of specifying your Salesforce.com web service Endpoint URL. This allows you to use DataCleaner to connect to sandbox environments of Salesforce.com as well as to your own custom endpoints.
  • The ElasticSearch support has been improved, allowing custom mappings as well as reusing the ElasticSearch datastore definitions now also for searching and indexing.
  • The sampling of records and selection of potential duplicates in the Duplicate detection function has been improved, leading to faster configuration because the decisions made during the training session are more representative.
  • The Duplicate detection model file format has been updated which has removed the need for a separate 'reference' file in order to save past training decisions. Compatibility with the old format has been retained, but using the new format adds many benefits for the user experience.
  • Bugfixes:
  • A thread starvation issue was fixed in DataCleaner monitor. The impact of this issue was severe, but it occurred only in rare and heavily customized cases. If a custom listener object on the DataCleaner monitor threw an error, a resource would never be freed up, permanently occupying a thread from the Quartz scheduling pool on the server. If this happened many times, the server could eventually run out of threads in that pool.
  • The vertical menu on the result screen is now doing a proper job of displaying the labels of the components that have results. This makes it easier to recognize which menu item points to what result item.

New in DataCleaner 4.0.7 (Sep 22, 2015)

  • The main improvements made in this release pertain to the display of analysis results:
  • We have changed the layout of the screen so that results are organized vertically to the left instead of as tabs above. The left-side menu can be collapsed and expanded to maximize readability.
  • The 'Duplicate detection' function now allows you to export the duplicated records and pairs into any writeable datastore you might have (whereas it used to be just Staging Tables and Excel Spreadsheets). This way the storage needed to perform deduplication can become more consolidated and be fitted to your own liking.
  • The size of the result window is now remembered so that your preferred window size is retained.
  • Here's a screenshot of the new result screen layout and the new export functionality in Duplicate detection.
  • Another important feature we've enabled with this release is component documentation in the application itself. Double-click any component and then click the new 'Documentation' button to display its component reference page. This is very helpful for discovering and learning about the capabilities within DataCleaner.

New in DataCleaner 4.0.5 (Sep 22, 2015)

  • Following our DataCleaner 4.0 release a little over a month ago we have received an impressive amount of feedback. As with any major software release, the feedback certainly sparks many creative ideas and also makes us aware of things to improve. So thank you all for that.
  • It's because of the great feedback that we can today announce the availability of DataCleaner 4.0.5. This version of DataCleaner adds on top of the existing functions and features in DataCleaner 4.0, making them even more powerful. There are obviously also a number of minor bugfixes included in this release. Let's walk you through it:
  • Combined component requirements:
  • It's now possible to combine many component requirements into one. This especially makes sense if you have a graph of validation/correction tasks and you wish to catch all invalid entries in the same "bucket" of rejected records.
  • Search in component library:
  • We've added a search box to the component library of DataCleaner desktop. This makes it a lot easier to locate the component you're thinking of or to find components of relevance to what you have in mind.
  • Results from non-analyzers:
  • Until now, only components of the technical type 'Analyzer' could produce a result. This made a clear distinction between tasks of data correction/transformation and tasks that produced reports/results that could be displayed to the user. We have relaxed this distinction a bit, allowing transformation components to also produce a result. For now we only have a few examples of this ('Table lookup' and 'Country standardizer'), but more will certainly come in the future.
  • Less file-management in Duplicate detection:
  • The configuration of the very popular Duplicate detection component was made a bit simpler by no longer requiring the user to consider the file location of the duplicate matching model. This file location is now based on a default (which can of course be overridden by the user if desired).
  • And much more:
  • More than 10 minor bugfixes were addressed. A helpful "Component description" documentation option was added, and the general reference documentation was improved; it now holds more tutorials and explanations of all the functionality in DataCleaner.

New in DataCleaner 4.0.1 (Apr 15, 2015)

  • Visual graph-based job building:
  • A new visual way of building jobs instead of navigating through component tabs makes all the difference for users and their capabilities. In DataCleaner 4.0 you benefit from a clear picture of how your data is going to be processed.
  • The canvas not only shows the contents of your job. It also provides hints and guidance while you build it. Modifying the job is a matter of interacting with the nodes in the graph.
  • All the components "within a click’s distance":
  • In order to find data quality functions easier, the "Transform-Analyze" menu was made into an easy-to-navigate part of the tree structure on the left side of the application. The categories have been divided into "Transform-Improve-Analyze-Write" which makes for a clearer separation of components based on the type of task they help you with. To add a function - just drag it onto the job graph canvas.
  • Quick Start Wizards:
  • New users will benefit from the welcome screen guiding them through their first steps in the application. Commercial editions (read more) of DataCleaner include Quick Start Wizards that will answer the questions you might have about your data. Instead of manually assembling a job, the wizard asks a couple of questions and generates a job you can start your journey with. Such a job can be tweaked later if needed, or just executed to see the insights.
  • Welcome screen:
  • Along with the wizards, the new welcome screen also changes the way new jobs are built. Click the "New job from scratch" button to make use of drag-and-drop support, or "Manage datastores" to work in the way known from previous versions of DataCleaner.
  • Refreshed look and feel:
  • The visual part of the user interface (icons, colors etc.) has been revisited. The new clean and modern look should make working with DataCleaner more pleasant.
  • UK/US/DE Address Correction and Suppression features:
  • DataCleaner has new components that integrate with UK, US and German address correction and suppression services from our partners. Now, without leaving DataCleaner's job workflow you can consult external databases for information about movers, do-not-mail declarations and verify the accuracy of address details.
  • Improvements to deduplication:
  • Deduplication scenarios have been reconsidered. Try our new "Untrained detection" mode for instant results with minimum configuration. The previous "Training Tool" and "Duplicate Detection" functions have been merged into one component. This eliminates the need to replace the Training Tool with Duplicate Detection along the way in order to complete a customized deduplication process.
  • ElasticSearch and Apache Cassandra connectivity:
  • We continue to expand our portfolio of supported databases. In the DataCleaner 4.0 release we are happy to announce that we now support two new NoSQL databases: ElasticSearch (read+write) as well as Apache Cassandra (read-only).

New in DataCleaner 3.7.2 (Dec 11, 2014)

  • The main concern that was addressed in this release was around loading of extensions/plugins. We fixed several issues pertaining to the loading sequence of and visibility of objects within extensions and the main distribution of DataCleaner. If you use extensions, we advise you to upgrade.
  • Furthermore an improvement to the "Capture Changed Records" filter was introduced - allowing it to work on numerical record version attributes instead of just update timestamp attributes. Lastly the license checking functionality of DataCleaner commercial editions was improved, making it easier to determine what is wrong when a license check is not successful.

New in DataCleaner 3.7 (Oct 31, 2014)

  • Connect:
  • We've added connectivity in DataCleaner to Apache HBase and JSON files. Apache HBase is a popular Hadoop database, a distributed, scalable, big data store. JSON is a data representation format that is becoming increasingly popular for Web technologies, web services and NoSQL databases.
  • Check:
  • The analytical capabilities of DataCleaner have also been improved. We’ve added an efficient Unique Key check feature. This allows you to easily and quickly check for duplicate keys (or other expected unique values) in your datasets.
  • Consolidate:
  • Speaking of duplicates, the Duplicate Detection feature of DataCleaner professional edition has been improved in many ways. We've made several improvements to the user interface, making more options available for advanced users. We've also published an online video tutorial to get people started. On the technical side, the deduplication model is now represented in a more readable XML format and the algorithm for detecting initial duplicates for training has been improved.

New in DataCleaner 3.6.2 (Jul 18, 2014)

  • We've made several improvements to the Duplicate Detection feature. Several minor bugs were fixed and matching quality was improved - both for the initial "potential duplicates" training set generation, and for the final building of matching rules.
  • The progress bar of a running job in the desktop UI has been beautified and made more interactive - it will set colors and update itself while the job is running.
  • In clustered setups, jobs can now be cancelled across the cluster. No more waiting for all the slave instances to finish their jobs - they will cancel within seconds if the master node tells them to.
  • We've added transformations for URL encoding and HTML encoding. For usages of DataCleaner where strings are being prepared for insertion into URLs or web sites, this is a great utility.
  • For DataCleaner enterprise edition, our Hadoop integration is being improved a lot. We have fixed several minor issues here.
  • Datastores configured in the desktop UI are now automatically persisted in the conf.xml file, making it easier to manage datastores also outside of the UI.
  • A bug pertaining to the "Merge Duplicates" feature from EasyDQ was fixed.

New in DataCleaner 3.6.1 (Jul 18, 2014)

  • For users that want to do transformations quickly and simply write the results somewhere, we now allow any job to be executed, even without any analyzers.
  • In the DataCleaner monitor webapp, a critical bug was fixed which caused Linux deployments to treat the example 'DC' tenant's repository with a wrong filename. This has been fixed and the example tenant is now called 'demo'.
  • A new triggering mode has been introduced to the monitoring and scheduling functionality: One-time triggering. Using a single date and time instant, you can now get a job triggered once if needed.
  • The styling and javascript API of the DataCleaner monitor webapp has received several updates.
  • A user role "ROLE_GOD" was introduced, allowing certain users to have control over all tenants in the DataCleaner monitor webapp.
  • A fix was implemented for the clustered execution mode, ensuring that execution chunks are ordered correctly depending on the capabilities and natural ordering of the underlying datastore.
  • Clustered jobs can now be cancelled throughout the cluster. This means that the master will inform all slaves that the job should be ended and resources made free again.

New in DataCleaner 3.6 (May 20, 2014)

  • Duplicate Detection:
  • With DataCleaner 3.6 we are finally launching a new and extensive Duplicate Detection feature. With Duplicate Detection you can apply fuzzy logic to identify the records in your data that are duplicate entries for the same real-life thing. Use it to identify duplicate customers, products or anything else of relevance. It’s a great way to improve data quality and to have better interactions with customers, co-workers etc.
  • Referential Integrity:
  • Another exciting new feature in DataCleaner 3.6 is the Referential Integrity analyzer. With this analyzer you can easily check the integrity between multiple tables in a single step. The analyzer works with tables from the same datastore, and even with tables from different sources. This means that you can effectively cross-check data from disparate sources that may be out of sync and cause data quality issues.
  • More and better Progress Information:
  • We’ve also done a lot to improve the responsiveness of DataCleaner’s screens while processing large jobs. The loading indicators and progress logs are now more responsive, and the layout of results has changed from being table oriented to result-type oriented. All in all it gives a nicer, more smooth experience with more overview of what is going on.
  • Commercial Editions revisited:
  • Finally, the offering of commercial editions of DataCleaner has been changed to fit better with individuals and professionals using DataCleaner. Now you can get support and professional edition features for a very low entry price. This we believe will fit the marketplace well and provide an awesome commercial open source Data Quality solution that is approachable for everyone.
  • The documentation for DataCleaner 3.6 has also been updated quite a lot and applies to both the community and professional editions.

New in DataCleaner 3.5.10 (Mar 15, 2014)

  • You can now compose jobs so that a DataCleaner job actually calls/invokes another "child" job as a single transformation. This is an important feature because it allows users to organize and compose complex data processing flows into smaller chunks of work. The new "Invoke child Analysis Job" transformation inlines the transformation section of the child job at execution time, which means that there is practically no overhead to this approach.
  • As a convenience for the above scenario, it is now allowed to save jobs without any analysis section in them. These jobs will thus be "incomplete", but that might actually be the point when composing and putting jobs together.
  • Another new transformation was added: Coalesce multiple fields. This transformation is useful for scenarios where multiple sets of fields are interchangeable, or when multiple interchangeable transformations produce the same set of fields. The "coalesce" transformation can roughly be translated into "pick the first non-empty values". When there are multiple sets of fields in your data processing stream, for instance multiple address definitions, and you need to select just one, then this is very convenient (see the sketch below).
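  • The idea is easy to express in a few lines of plain Java (an analogy, not the transformation's actual code):

      import java.util.Arrays;
      import java.util.List;

      public class CoalesceSketch {

          // Pick the first non-empty value, like the Coalesce transformation
          static String coalesce(List<String> values) {
              for (String value : values) {
                  if (value != null && !value.isEmpty()) {
                      return value;
                  }
              }
              return null;
          }

          public static void main(String[] args) {
              // Two interchangeable address fields; the first non-empty one wins
              System.out.println(coalesce(Arrays.asList(null, "", "Main St. 1")));
          }
      }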
  • The handling of source columns has been simplified. Previously we tried to limit the source queries to only the source columns that were strictly needed to perform the analysis. But many users gave us the feedback that this caused trouble, because the drill-to-detail information available in the analysis results would then be missing important fields for further exploration. So the power is now in the hands of the users: the fields added in the "Source" section of the job are the fields that will be queried.
  • A change was made to the execution engine in dealing with complex filtering and requirement configurations. Previously, if a component (transformation or analysis) consumed inputs from other components, ALL requirements had to be satisfied, which often just caused the combined requirement to never become true. Now the logic has been changed to be inclusive, so that if any of the direct input sources' requirements are satisfied, then the component's inferred requirement is also satisfied. Most users will not notice this change, but it does mean that it is now possible to merge separate filtered data streams back into a single stream.
  • An issue was fixed in the access to repository files. Read/write locking is now in place, which avoids access conflicts by different processes.
  • The 'requirement' button in DataCleaner has also been reworked. It did not always properly respond to changes in other panels, but now it is consistent.
  • Finally, the 'About' dialog was improved slightly and now contains more licensing information.

New in DataCleaner 3.5.7 (Mar 15, 2014)

  • A flag has been added to the CSV datastore options, making it possible to disable support for values that span multiple lines in CSV files. Disabling this feature in our CSV parser enabled us to increase parsing speed significantly and at the same time handle poorly/inconsistently formatted CSV files much better. Since many CSV files don't contain values that would be allowed to span multiple lines anyway, we think this is a great way to gain extra performance and stability.
  • A change was made to the way we monitor progress log information. This means that we now have a much more effective and performant way to monitor the progress of DataCleaner jobs, which especially speeds up performance on the server side.
  • A minor modification to the progress logs has been implemented: progress information statements now always show the time of the statement.
  • A minor bug was fixed: the CSV datastore dialog of the monitor web application would sometimes show an unexpected error if you did not fill out escape characters, quote characters and so on.

New in DataCleaner 3.5.6 (Nov 22, 2013)

  • The monitoring webapp's CSV datastore dialog now supports TXT files as well as CSV and TSV files.
  • A bug was fixed where the "Max rows" filter's tab in the UI sometimes made tabs for other components uncloseable as well.
  • A bug was fixed that sometimes caused the order of a component's selected input columns not to be retained when saving and loading a job.
  • Various improvements to API and stability of internal utilities.

New in DataCleaner 3.5.5 (Oct 25, 2013)

  • The 'Synonym lookup' transformation now has an option to look up every token of the input. This is useful if you're doing replacement of synonyms within the values of a long text field.
  • Blocking execution of DataCleaner jobs through the monitor's web service could sometimes fail due to a bug in the blocking thread. This issue has been fixed.
  • An improvement was made in the way jobs and the sequence of components are closed / cleaned up after execution.
  • The JNLP / Java WebStart version of DataCleaner was affected by a bug in the Java runtime that caused certain JAR files not to be recognized by the WebStart launcher under certain circumstances. This issue has been fixed by making slight modifications to those JAR files.
  • A few dead links in the documentation were fixed.

New in DataCleaner 3.5.4 (Sep 5, 2013)

  • It is now possible to hide output columns of transformations. Hiding will not affect the processing flow at all, but simply hides the columns from the user interface, potentially making the experience cleaner when interacting with other components.
  • A new web service has been added to the monitoring web application, which provides a way to poll the status of the execution of a particular job.
  • A bug was fixed, causing the HTML report to fail for certain analysis types when no records had been processed.
  • And 6 other minor bugs have been addressed.

New in DataCleaner 3.5.3 (Jul 1, 2013)

  • A bug was fixed which caused the DataCleaner monitor to show a result link for all jobs, even if they didn't produce a result. This only happened rarely though, for instance when building a custom Java job that returns null.
  • An advanced JavaScript transformer was added to the portfolio of built-in transformations. Using this transformer the user can build a stateful JavaScript object which is capable of transforming, aggregating and filtering records.
  • Job and Datastore wizards now have 'Back' buttons.
  • A new dedicated 'extensions' folder is available in the DataCleaner desktop application. Use this folder to dump extension JAR files in, if you want them to be automatically loaded during application startup.
  • A new service was added to DataCleaner monitor, which enables administrators to download and upload (backup and restore) a complete monitoring repository in one go.
  • A bug was fixed which caused the desktop application's "DataCleaner monitor" dialog to crash when using default user preferences.

New in DataCleaner 3.5.1 (Jun 12, 2013)

  • Capture changed records:
  • A new filter was added to enable incremental processing of records that have not been processed before, e.g. for profiling or copying only modified records. The new filter's name is Capture changed records, referring to the concept of Change data capture.
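  • The change-capture idea in a few lines of plain Java (illustrative only; the real filter works on a column you designate and keeps its own state between runs):

      import java.time.Instant;
      import java.util.Arrays;
      import java.util.List;
      import java.util.stream.Collectors;

      public class ChangeCaptureSketch {
          public static void main(String[] args) {
              // Timestamp persisted after the previous run
              Instant lastRun = Instant.parse("2013-06-01T00:00:00Z");

              List<Instant> updatedAt = Arrays.asList(
                      Instant.parse("2013-05-30T10:00:00Z"),
                      Instant.parse("2013-06-02T09:30:00Z"));

              // Only records modified since the last run pass the filter
              List<Instant> changed = updatedAt.stream()
                      .filter(t -> t.isAfter(lastRun))
                      .collect(Collectors.toList());
              System.out.println(changed); // [2013-06-02T09:30:00Z]
          }
      }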
  • Queued execution of jobs:
  • The DataCleaner monitor will now queue the execution of the same job, if it is triggered multiple times. This ensures that you don't accidentally run the same job concurrently which may lead to all sorts of issues, depending on what the job does.
  • Minor bugfixes:
  • Several bugfixes were implemented; see the full list on the 3.5.1 milestone page on our bugtracker.

New in DataCleaner 3.5 (Jun 12, 2013)

  • Connectivity to Salesforce and SugarCRM:
  • One of the most important sources of data is usually a company's CRM system. But it is also one of the more troublesome data sources if you look at the quality. For this reason we've made it easier to get the data out of these CRM systems and into DataCleaner! You can now use your Salesforce.com or your local SugarCRM system as if it was a regular database. Start by profiling the customer data to get an overview. But don't stop there - you can even use DataCleaner to also update your CRM data, once it is cleansed. More details are available in the brand new focus article about CRM data quality.
  • Wizards and other user experience improvements:
  • The DataCleaner monitor is our main user interface going forward. So we want the experience to be at least as pleasant, flexible and rich as the desktop application. To meet this goal, we've made many user interface and user experience improvements, amongst others:
  • Several wizards are now available for registering datastores; including file-upload to the server for CSV files, database connection entry, guided registration of Salesforce.com credentials and more.
  • The job building wizards have also been extended with several enhanced features: selection of value distribution and pattern finding fields in the Quick analysis wizard, a completely new wizard for creating EasyDQ based customer cleansing jobs, and a new job wizard for firing Pentaho Data Integration jobs.
  • You can now ad-hoc query any datastore directly in the web user interface. This makes it easy to get quick or sporadic insights into the data without setting up jobs or other managed approaches of processing the data.
  • Once jobs or datastores are created, the user is guided to take action with the newly built object. For instance, you can very quickly run a job right after it's built, or query a datastore after it is registered.
  • Administrators can now directly upload jobs to the repository, which is especially handy if you want to hand-edit the XML content of the job files.
  • A lot of the technical cruft is now hidden away in favor of showing simple dialogs. For instance, when a job is triggered a large loading indicator is shown, and when finished the result will be shown. The advanced logging screen that was previously there can still be displayed upon clicking a link for additional details.
  • Distributed execution of jobs:
  • To keep up with the massive amounts of data that many organizations are juggling today, we had to take a critical look at how we process data in DataCleaner. Although DataCleaner is among the fastest data processing tools, it was previously limited to running on a single machine. For a long time we've been working on a major architecture change that enables distributing a DataCleaner job's workload over a cluster of machines. With this new approach to data processing, DataCleaner is truly fit for data quality on big data.
  • Data visualization extension:
  • Data profiling and data visualization do share some common interests - both are disciplines that help you understand the story that your data is telling. There are obviously also some differences, mainly that data profiling is more targeted at identifying issues and exceptions rather than deriving or measuring business objectives. But confronted with visualization tools we've realized that sometimes there's a lot of profiling value in progressively visualizing data. For instance, a scatter plot can easily help you identify the numerical outliers of your datasets. This gave fuel to the idea of a visualization extension to DataCleaner. Therefore DataCleaner now also lets you do basic visualization tasks to aid you in your data quality analysis.
  • National identifiers extension:
  • A very common issue in data quality projects is to validate national identifiers, such as social security numbers, EAN codes and more. In our commercial editions of DataCleaner, we now offer a wide range of validation components to check such identifiers.
  • Custom job engines:
  • We've made the ultimate modularization of the DataCleaner monitoring system: the engine itself is a pluggable module. While we do encourage using DataCleaner's engine as the primary vehicle for execution in DataCleaner monitor, it is not obligatory anymore. You can now schedule and monitor (both in terms of metric monitoring and history management) other types of jobs. For instance, you can provide your own piece of Java code and have it scheduled to run in DataCleaner monitor using the regular web user interface.
  • Pentaho job scheduling and execution:
  • One major example of a pluggable job engine was introduced that we think deserves special attention: you can now invoke and monitor execution metrics of Pentaho Data Integration transformations. DataCleaner monitor by default ships with this job engine extension, which connects to the Pentaho DI server ("Carte") and supervises the execution and result gathering of it. After execution you can track your Pentaho transformations in the timeline views of the monitoring dashboard, just like other metrics. For larger deployments of DataCleaner, dedicated ETL-style jobs in your data quality solution may be convenient, and with this extension we provide an integration with a leading open source solution for just that.

New in DataCleaner 3.1.2 (Jan 23, 2013)

  • We've added a web service in the monitoring application for getting a (list of) metric values. This makes the monitoring even more usable as a key infrastructure component, as a way to monitor data (quality) and expose the results to third-party applications.
  • The 'Table lookup' component has been improved by adding join semantics as a configurable property. Using the join semantics you can choose whether the lookup should behave like a LEFT JOIN or an INNER JOIN (see the sketch after this list).
  • The EasyDQ components have been upgraded, adding further configuration options and a richer deduplication result interface.
  • Performance improvements have been a specific focus of this release. Improvements have been made in the engine of DataCleaner to further utilize a streaming processing approach in certain corner cases which were not covered previously.
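  • The difference between the two lookup semantics, sketched in plain Java (illustrative only, not the component's code):

      import java.util.HashMap;
      import java.util.Map;

      public class LookupSemanticsSketch {
          public static void main(String[] args) {
              Map<String, String> countryNames = new HashMap<>();
              countryNames.put("NL", "Netherlands");

              String code = "DE"; // no match in the lookup source
              String name = countryNames.get(code);

              // LEFT JOIN semantics: keep the record, output null for the lookup
              System.out.println(code + " -> " + name); // DE -> null

              // INNER JOIN semantics: drop records without a match
              if (name != null) {
                  System.out.println(code + " -> " + name);
              }
          }
      }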

New in DataCleaner 3.0.3 (Dec 14, 2012)

  • We've added a service for renaming jobs in the monitoring repository. You can access this as a RESTful web service or interactively in the UI.
  • A web service was added for changing the historic date of an analysis result in the monitoring repository. This is convenient if you have historic dumps of data that you wish to include in a timeline.
  • The documentation has been updated with more elaborate descriptions of the web services available for repository navigation, job invocation and more.
  • The login dialog in the desktop application had a low-level version conflict, which caused it to be unusable. This has been fixed.
  • The web application has been made compatible with legacy JSF containers, making the range of applicable Java Webservers wider.
  • Caching of configuration in the web application was greatly improved, leading to faster page load and job initialization times.

New in DataCleaner 3.0.2 (Oct 12, 2012)

  • When triggering a job in the monitoring web application, the panel auto-refreshes every second to get the latest state of the execution.
  • File-based datastores (such as CSV or Excel spreadsheets) with absolute paths are now correctly resolved in the monitoring web application.
  • The "Select from key/value map" transformer now supports nested select expressions like "Address.Street" or "orderlines[0].product.name".
  • The table lookup mechanism has been optimized for performance, using prepared statements when running against JDBC databases.
  • Administrators can now download file-based datastores directly from the "Datastores" page.
  • Exception handling in the monitoring web application has been improved a bit, making the error messages more precise and intuitive.

New in DataCleaner 3.0.1 (Oct 2, 2012)

  • The primary bugfix in this release was about restoring the mapping of columns and specific enumerable categorizations. For instance in the new Completeness analyzer, we found that after reloading a saved job, the mapping was not always correct.
  • Furthermore a few internal improvements have been made, making it easier to deploy the DataCleaner monitor web application in environments using the Spring Framework.
  • Last but not least, the visualization settings in the desktop application have been improved by automatically taking a look at the job being visualized and toggling displayed artifacts based on the screen size and amount of details needed to show it nicely.

New in DataCleaner 3.0 (Sep 20, 2012)

  • A data quality monitoring web application.
  • A multi-tenant repository for data quality artifacts (jobs, profiling results, configurations, datastore definitions etc.)
  • Being able to edit data (in the desktop application).
  • Wizards to guide users through their first-time user experience with DataCleaner.

New in DataCleaner 2.5.2 (May 2, 2012)

  • Apache CouchDB support
  • Update table writer
  • Drill-to-detail information saved in result files
  • Improved EasyDQ error handling
  • Table mapping for NoSQL datastores

New in DataCleaner 2.5.1 (Apr 11, 2012)

  • A bug was fixed in the Table lookup transformation, which caused it to be unable to have multiple output columns.
  • CSV file escape characters have been made configurable.
  • A minor bug pertaining to empty strings in the Concatenator was fixed.
  • Support for the Cubrid database was added.
  • The converter transformations were adapted to be able to work on multiple fields, not just single fields.

New in DataCleaner 2.4.1 (Jan 3, 2012)

  • Feature enhancements:
  • Batch loading features were greatly improved when writing data to database tables. Expect to see many orders of magnitude improvements here (see the sketch after this list).
  • Writing data has been made more conveniently available by adding the options to the window menu.
  • You can now easily rename components of a job by double clicking their tabs.
  • The Javascript transformer now has syntax coloring, so that your Javascripts are easier to inspect and modify.
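  • The classic JDBC batching pattern behind such gains, sketched against an in-memory H2 database (placeholder table and data):

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.PreparedStatement;

      public class BatchInsertSketch {
          public static void main(String[] args) throws Exception {
              Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
              conn.createStatement().execute(
                      "CREATE TABLE t (id INT, name VARCHAR(64))");

              conn.setAutoCommit(false);
              try (PreparedStatement ps = conn.prepareStatement(
                      "INSERT INTO t VALUES (?, ?)")) {
                  for (int i = 0; i < 1000; i++) {
                      ps.setInt(1, i);
                      ps.setString(2, "name-" + i);
                      ps.addBatch(); // queue the row; no round-trip per insert
                  }
                  ps.executeBatch(); // flush all queued rows at once
              }
              conn.commit();
          }
      }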
  • Bugfixes:
  • When reading from and writing to the same datastore (e.g. the DataCleaner staging area) we've made sure that the table cache of that datastore is refreshed. Previously some scenarios allowed you to see an out-of-date view of the tables.
  • A potential deadlock when starting up the application was solved. This deadlock was a consequence of an issue in the JVM, but we worked around it by synchronizing all calls to the particular API in Java.

New in DataCleaner 2.2 (Jun 27, 2011)

  • Extensibility:
  • The main driver for this release has been a story about extensibility. While releasing the application we are simultaneously releasing a new DataCleaner website which features an important new area: the ExtensionSwap. The idea of the ExtensionSwap is to allow sharing of extensions to DataCleaner and installation simply by clicking a button in the browser!
  • The DataCleaner extension API has been improved a lot in this release, making it possible to create your own transformers, analyzers and filters (see the sketch below). If you feel your extensions could be of interest to other users, please share them on the ExtensionSwap, and we provide a channel for you to easily distribute them to thousands of users. The Extension API and the ExtensionSwap are further explained in our new webcast demonstration for developers and other techies with an interest.
  • We are also releasing a set of initial extensions on the ExtensionSwap: the HIquality Contacts for DataCleaner extension, which provides advanced Name, Phone and Email cleansing based on Human Inference's natural language processing DQ web services. We are also shipping a sample extension which will serve as an example for developers wanting to try out extension development themselves. In the coming months we will make sure to post even more extensions originating from our internal portfolio of tools that we use at Human Inference's knowledge gathering teams.
  • In addition to extensibility we are also focusing on embeddability. We want to be able to embed DataCleaner easily into other applications to make profiling and data analysis possible anywhere! We've created a new bootstrapping API which allows applications to bundle DataCleaner and bootstrap it with a dynamic configuration or run it in a "single datastore mode", where the application is tuned towards just inspecting a single datastore (typically defined by the application that embeds DataCleaner). We already have some really interesting cases of embedding DataCleaner in the works - both in other open source applications as well as commercial applications.
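  • A sketch of what a custom transformer extension looked like in the 2.x extension API, written from memory; annotation names, packages and signatures are assumptions and should be verified against the developer documentation:

      import org.eobjects.analyzer.beans.api.Configured;
      import org.eobjects.analyzer.beans.api.OutputColumns;
      import org.eobjects.analyzer.beans.api.Transformer;
      import org.eobjects.analyzer.beans.api.TransformerBean;
      import org.eobjects.analyzer.data.InputColumn;
      import org.eobjects.analyzer.data.InputRow;

      @TransformerBean("Hello world transformer")
      public class HelloWorldTransformer implements Transformer<String> {

          @Configured
          InputColumn<String> nameColumn;

          public OutputColumns getOutputColumns() {
              return new OutputColumns("Greeting");
          }

          public String[] transform(InputRow inputRow) {
              String name = inputRow.getValue(nameColumn);
              return new String[] { "Hello " + name };
          }
      }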
  • Compatibility:
  • We've added support for analyzing SAS data sets. This is something we're quite proud of as we are, to our knowledge, the first major open source application to provide such functionality, ultimately liberating a lot of SAS users. The SAS interoperability part was created as a separate project, SassyReader, so we expect to see adoption in DataCleaner's complementary open source communities soon too!
  • We've also added support for another type of datastore: fixed width files. Fixed width files are text files where each column has a fixed width. There is no separator or quote character like in CSV files; instead each line is equal in length, and each line is tokenized according to a set of value lengths.
  • An option to "fail on inconsistencies" was added to CSV file and fixed width file datastores. These flags add a format integrity check when using these text file based datastores.
  • A bug was fixed, which caused CSV separator settings not to be retained in the user interface, when editing a CSV datastore.
  • Japanese and other characters are now supported in the user interface. This "bug" was a matter of investigating available fonts on the system and selecting a font that can render the particular characters. On most modern systems there will be capable fonts available, but on some Unix/Linux branches there might still be limitations.
  • Other improvements:
  • The documentation section has been updated! Ever since the initial 2.0 release the documentation has been far behind, but we've finally managed to get it up to date. There are still pieces missing in the docs, but it should definitely be useful for basic usage as well as a reference for most topics.
  • Application startup time was improved by parallelizing the configuration loading and by delaying the initialization of those parts of the configuration that are not needed for the initial window display.
  • The phonetic similarity finder analyzer has been removed from the main distribution, as it was quite experimental and serves mostly as a proof of concept and an appetizer to the community to create more advanced matching analyzers. You can now find and install the phonetic similarity finder on the ExtensionSwap.
  • Handling of cancelled or erroneous jobs was improved, and the user interface responds more correctly by disabling buttons and progress indicators when a job has stopped.
  • Fixed a few minor UI issues pertaining to table sizing and use of scrollbars.

New in DataCleaner 2.1 (Apr 19, 2011)

  • There was a lot of work done on the user interface (see media page):
  • We decided to remove the left-hand side window containing environment configuration options.
  • Instead all these options have now been moved to the job building window so the user only has to focus on a single window for all the interactions needed to build a job.
  • The welcome/login dialog has also been removed in favor of a more discrete panel that can be pulled in or hidden from the main window.
  • Datastore selection and management is considered the first activity in the application, which is why it is also the first step to handle in the main window.
  • You can now stop jobs in case you decide to change something before it is done.
  • Bar and line charts were added to a lot of the analysis result screens, including String analyzer, Number analyzer, Date/time analyzer and Weekday distribution (see media page).
  • All "preview data" windows now contain paging controls so you can move backwards and forwards in the data set.
  • Most common database drivers (MySQL, PostgreSQL, Oracle, MS SQL Server and Sybase) have been added to a default set of drivers.
  • Configuration of the Quick analysis function in the Options dialog.
  • Various minor bugfixes.
  • Transformer for extracting date parts (year, month, day etc.) from date columns.