DataCleaner Changelog

New in DataCleaner 5.2.0 (Jul 7, 2017)
  • Both the open source and commercial editions contain these changes:
  • IBM's ICU library is now used for better Unicode standard compatibility.
  • Hashing component added
  • Fixed Pentaho plugin
  • Extended dictionary matcher (added the option to ignore diacritics)
  • Added a scroll bar to the value distribution result page
  • Made connection lines lighter in the desktop edition's graphical job representation
  • Monitor is no longer available for the community edition
  • Commercial only:
  • Express Data Prep: a brand new wizard on the home screen of DataCleaner desktop. In just a few steps, it allows users to create a complete cleansing job for UK contact data and push it to DataCleaner monitor for repeated execution on data coming in from a hot folder.
  • Monitor: improved handling of Cron expression errors
  • Monitor: improved validation of uploaded files
  • Monitor: the status of the last job execution is now visible on the schedule page
  • Monitor: placing a data file in the hot folder can now trigger a job, which then uses that file as its source datastore
  • Duplicate Detection: updated to the newest version, including better training performance with a low column count.
  • Wizards made pluggable in DC Desktop.
  • Extended license information panel
  • Added Salutation generation transformer.
  • Added Sample select transformer.
  • Packaged new 'Name & Company Correction' component into DC Enterprise edition
  • Fixed Sanction list check component

New in DataCleaner 5.1.5 (Mar 14, 2017)

  • We had a rather bad issue with 5.1.4, so we decided to make a quick release of 5.1.5.
  • Fixes:
  • Running DataCleaner on OSX could fail.
  • Output column names could not be edited in the CSV, Excel and staging writers, which could cause lockups in the desktop edition.
  • We recommend that all users upgrade to 5.1.5, especially if you're using 5.1.4.

New in DataCleaner 5.1.4 (Nov 9, 2016)

  • BUG FIXES:
  • Make requirements and scope buttons update properly when job changes
  • Stability fixes to Union and Coalesce components
  • Stability fix for Excel writer
  • Fix jTDS single-connection-multiple-threads writing exception
  • Save button only enabled when a job is being built
  • Monitor: Metrics from output datastreams available for timelines.
  • Monitor: Bad repository file upload does not delete old repository
  • Monitor: JavaScript transformer now works in monitor
  • OTHER NICE IMPROVEMENTS:
  • Improve job loading and execution startup times
  • Reduce Excel writer validation overhead
  • Improve flexibility of date range filter
  • License overview panel (commercial only)
  • URL Parser results improved
  • Monitor: Improve scheduling page load times
  • Monitor: Old repository backed up when new repository is loaded
  • Monitor: Reference data can now be configured in UI
  • There's a known issue that upgraders should be aware of:
  • Duplicate Detection results saved in previous versions cannot be opened in 5.1.4

New in DataCleaner 5.1.3 (Sep 15, 2016)

  • 5.1.3 is mostly a bugfix release, containing the following fixes:
  • Remove spurious logging from grouper and UI.
  • Remove non-functioning view online regex button.
  • Exported HTML report and monitor graphs should now work properly.
  • Remove error when setting up timeline for pattern finder analyzer with grouping.
  • Fix on-premise name correction (commercial only).
  • Multiple issues in deduplication component (commercial only).
  • We also have a single improvement:
  • Make duplicate detection training more intuitive (commercial only).

New in DataCleaner 5.1.2 (Sep 15, 2016)

  • Now you can trigger jobs in the monitor simply by dumping files into a folder. This is great for processing jobs that get delivered by other services. You can even use a .properties file if you need to change the configuration for the job execution.
  • We've also:
  • Extended our monitor UI to support adding HDFS datastores.
  • Added support for simple fixed-width mainframe/EBCDIC files
  • Made it possible to run a job on another configured datastore from the command line.
  • Extended the monitor REST job trigger to support setting configuration properties.
  • Added a news channel to DataCleaner, so you can stay on top of new releases
  • ... And of course we fixed bugs and performance issues along the way.

New in DataCleaner 4.5 (Nov 5, 2015)

  • Output data streams:
  • At the engine and API level we've added the concept of "output data streams", which means that every component can publish streams of data that can be consumed by other components. Users of our API can utilize this feature by implementing the HasOutputDataStreams interface.
  • If this sounds too technical for you, just appreciate that this capability underlies the following four features/improvements.
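  • To make the concept concrete, here is a minimal conceptual sketch in plain Java (hypothetical names throughout; the actual contract is the HasOutputDataStreams interface mentioned above, so check its documentation for the real signatures):

      import java.util.ArrayList;
      import java.util.List;
      import java.util.function.Consumer;

      public class OutputDataStreamSketch {

          // Hypothetical stand-in for a component that, besides its normal
          // output, publishes a named side stream of derived records.
          static void detectDuplicates(List<Object[]> rows,
                  Consumer<Object[]> duplicatesStream) {
              // (matching logic elided) push a detected duplicate group downstream
              duplicatesStream.accept(new Object[] { "group-1", "record-42", "record-97" });
          }

          public static void main(String[] args) {
              List<Object[]> consumedByNextComponent = new ArrayList<>();
              detectDuplicates(new ArrayList<>(), consumedByNextComponent::add);
              // A downstream component (e.g. a merger) consumes the stream
              System.out.println(consumedByNextComponent.size() + " duplicate row(s) published");
          }
      }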
  • Duplicate detection and merging in the same job:
  • With the major updates we did to the UI in DataCleaner 4.0, it became clear that our users are becoming more and more empowered to do elaborate tasks using DataCleaner. One of the most frequent limitations we encountered in this respect was that it was not possible to combine two complex tasks like duplicate detection and merging in a single job, even though for experienced users it is a very useful scenario. So with the use of a new data stream originating from Duplicate detection, you can now combine this job with duplicate merging or any other duplicate post-processing step you might have.
  • Example job containing standardization, duplicate detection, merging and writing
  • Combine tables and data sources using the Union component:
  • We have added a core transformation function to DataCleaner called 'Union'. The functionality of this transformation is comparable to SQL's UNION operation - appending two or more datasets together as if they were one. In other words: if you have multiple data sources, or just multiple tables, with the same type of content, then you can use the Union component to utilize them as if they were one big table.
  • The Union transformation can be used in conjunction with a Composite datastore. That way you can combine data from different data sources such as CSV files, relational databases, ElasticSearch indices or Salesforce.com (to give a few examples).
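  • As an analogy (plain Java, not DataCleaner code), the Union semantics are like concatenating record streams with the same layout, comparable to SQL's UNION ALL:

      import java.util.Collections;
      import java.util.List;
      import java.util.stream.Stream;

      public class UnionAnalogy {
          public static void main(String[] args) {
              // Two sources with the same record layout: name, email
              List<String[]> crm = Collections.singletonList(
                      new String[] { "Ann", "ann@example.com" });
              List<String[]> webshop = Collections.singletonList(
                      new String[] { "Bob", "bob@example.com" });

              // The Union component lets downstream steps process the rows
              // as if they came from one big table
              Stream.concat(crm.stream(), webshop.stream())
                      .forEach(row -> System.out.println(row[0] + " <" + row[1] + ">"));
          }
      }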
  • Check if your contacts have moved or passed away - and update your source - all in the same job.
  • Via the Neopost family of data quality companies we have integrated several address correction, movers check, deceased check and similar services for specific regions. Currently we cover the United Kingdom, United States of America, Germany and the Netherlands with such functionality. With DataCleaner 4.5, using these functions has become a lot easier: the flexibility of integrating these services via output data streams means that you can perform checks, get reports on the results and do the post-processing of the results in a single job!
  • Example result-screen report from UK movers, deceased and do-not-mail check.
  • Process the complete or incomplete records found by Completeness analyzer
  • Completeness is one of the major dimensions of data quality, and DataCleaner addresses this topic with the Completeness analyzer, as well as filtering techniques. In DataCleaner 4.5 the analysis of completeness no longer necessarily ends with the incomplete records. You can now also use the Completeness analyzer as an intermediate step - feeding e.g. the complete or incomplete records into automated post-processing steps.
  • Connect DataCleaner to its big sister, DataHub:
  • Did you know that DataCleaner is a key architecture piece of the Human Inference/Neopost customer MDM solution, DataHub? DataHub serves the enterprise market for customer MDM and single customer view, and we've been improving the integration a lot in this release of DataCleaner - most notably with the DataHub connector, which allows DataCleaner users to seamlessly consume data from and publish to the DataHub.
  • The processing pipeline in DataHub.
  • Product data components: GTIN, EAN, VIN
  • We have added a new category of Data Quality functions which revolve around Product data.
  • New 'Product data' category.
  • With these functions, and more to come in the future, we are building a suite of ready-to-use components that validate and standardize the use of common industry codes for products in your database.
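  • To illustrate the kind of rule these components apply, here is the public EAN-13/GTIN-13 check digit algorithm in plain Java (a sketch of the standard algorithm, not DataCleaner's own code):

      public class Ean13Check {

          public static boolean isValidEan13(String code) {
              if (code == null || !code.matches("\\d{13}")) {
                  return false;
              }
              int sum = 0;
              for (int i = 0; i < 12; i++) {
                  int digit = code.charAt(i) - '0';
                  // Weights alternate 1, 3, 1, 3, ... from the left
                  sum += (i % 2 == 0) ? digit : digit * 3;
              }
              int expectedCheckDigit = (10 - (sum % 10)) % 10;
              return expectedCheckDigit == code.charAt(12) - '0';
          }

          public static void main(String[] args) {
              System.out.println(isValidEan13("4006381333931")); // true
          }
      }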
  • Component library restructured
  • The component library structure has been revisited and we've designed this so that the menus and search function are optimized for the tasks at hand. As you can also see from the screenshot above, the Improve category has changed a lot - now focusing more on specific domains of data and data quality checks.
  • Secure ElasticSearch connections with Shield
  • We now support ElasticSearch with Shield-based security. The connection you define for an ElasticSearch index can be reused both for a regular read/write style datastore and for searches, matching and other functions that we provide which integrate with ElasticSearch.
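  • For reference, a Shield-secured connection with the ElasticSearch 1.x transport client of that era looked roughly like this (cluster name, credentials and host are placeholders, and the Shield plugin must be on the classpath):

      import org.elasticsearch.client.transport.TransportClient;
      import org.elasticsearch.common.settings.ImmutableSettings;
      import org.elasticsearch.common.settings.Settings;
      import org.elasticsearch.common.transport.InetSocketTransportAddress;

      public class ShieldConnectionSketch {
          public static void main(String[] args) {
              Settings settings = ImmutableSettings.settingsBuilder()
                      .put("cluster.name", "my-cluster")    // placeholder
                      .put("shield.user", "dc_user:secret") // Shield credentials
                      .build();
              TransportClient client = new TransportClient(settings)
                      .addTransportAddress(new InetSocketTransportAddress("es-host", 9300));
              System.out.println(client.connectedNodes());
              client.close();
          }
      }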
  • Easy access to the Hadoop Distributed File System (HDFS)
  • As Hadoop is becoming more and more the system of choice for Big Data projects, we've decided to improve the user experience for analyzing and working with files located on HDFS, the Hadoop Distributed File System.
  • Browsing HDFS to select your DataCleaner source data.
  • Now browsing and finding files on HDFS is just as convenient as it has always been on your local machine. It's no secret that the roadmap for DataCleaner involves tighter and tighter integration with Hadoop, and this is our first step to make the Hadoop-DataCleaner experience both effective and pleasant.
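  • Under the hood, reading a file from HDFS with the standard Hadoop client API looks roughly like this (namenode address and path are placeholders):

      import java.io.BufferedReader;
      import java.io.InputStreamReader;
      import java.net.URI;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsReadSketch {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(
                      URI.create("hdfs://namenode:8020"), new Configuration());
              try (BufferedReader reader = new BufferedReader(
                      new InputStreamReader(fs.open(new Path("/data/customers.csv"))))) {
                  System.out.println(reader.readLine()); // the CSV header line
              }
          }
      }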
  • A new Delete from table component
  • We have added a component in the "Write" category that deletes records from a table in a datastore. Use it in conjunction with filtering functions to e.g. delete dirty records or non-survivors found after merging duplicates.
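  • DataCleaner's datastore access builds on Apache MetaModel, where such a delete looks roughly like this (a sketch against an in-memory H2 database; table and column names are made up):

      import java.sql.Connection;
      import java.sql.DriverManager;

      import org.apache.metamodel.delete.DeleteFrom;
      import org.apache.metamodel.jdbc.JdbcDataContext;
      import org.apache.metamodel.schema.Table;

      public class DeleteFromSketch {
          public static void main(String[] args) throws Exception {
              Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
              conn.createStatement().execute(
                      "CREATE TABLE customers (id INT, is_duplicate BOOLEAN)");

              JdbcDataContext dc = new JdbcDataContext(conn);
              Table customers = dc.getDefaultSchema().getTableByName("CUSTOMERS");

              // Delete the records flagged by an upstream filter or dedupe step
              dc.executeUpdate(new DeleteFrom(customers).where("IS_DUPLICATE").eq(true));
          }
      }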
  • Online component library reference
  • A lot has been done to further improve our reference documentation. In addition to updated chapters etc. we've launched the Component library online which provides a quick way to navigate documentation on an individual component level.
  • We're confident that you will enjoy the improved DataCleaner. Version 4.5 is a major step and we are proud to share it with you!

New in DataCleaner 4.0.10 (Sep 22, 2015)

  • So what is new in this release?
  • We have added support for connecting to Apache Hive via our existing JDBC interface (see the connection sketch after this list). In addition to the existing Apache HBase connectivity, this is a good first step towards having DataCleaner as a Big Data profiling engine for your data in Hadoop.
  • A bug related to version-conflicts in the Apache HBase connector was fixed.
  • We have made it easier to immediately register new datastores when you need them. For instance when configuring a Table lookup or the Insert into table component.
  • The result window has been improved slightly, now prioritizing analysis component reports over e.g. results from transformations.
  • Finally, the reference documentation has been updated extensively, and its index now contains better sub-sectioning.
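  • A minimal sketch of the kind of HiveServer2 JDBC connection the Hive support relies on (host, port, database, credentials and table are placeholders):

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;

      public class HiveJdbcSketch {
          public static void main(String[] args) throws Exception {
              Class.forName("org.apache.hive.jdbc.HiveDriver");
              try (Connection conn = DriverManager.getConnection(
                      "jdbc:hive2://hive-host:10000/default", "user", "");
                      ResultSet rs = conn.createStatement()
                              .executeQuery("SELECT COUNT(*) FROM customers")) {
                  rs.next();
                  System.out.println("rows: " + rs.getLong(1));
              }
          }
      }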

New in DataCleaner 4.0.9 (Sep 22, 2015)

  • Improvements and new features:
  • We've made it possible to create and drop tables via the desktop UI of DataCleaner. Note that the term "table" here actually covers more than just relational database tables. It also includes Sheets in MS Excel datastores, Collections in MongoDB, Document types in CouchDB and ElasticSearch and so on... Basically all datastore types that support write-operations, except single-table datastores such as CSV datastores, support this functionality! The functionality is exposed via:
  • "Create table" enabled via the right-click menu of schemas in the tree on the left side of the application.
  • "Create table" enabled also via table-selection inputs in components such as Insert into table, Table lookup and Update table.
  • "Drop table" enabled via the right-click menu of tables in the tree on the left side of the application.
  • We've added the (optional) capability of specifying your Salesforce.com web service Endpoint URL. This allows you to use DataCleaner to connect to sandbox environments of Salesforce.com as well as to your own custom endpoints.
  • The ElasticSearch support has been improved, allowing custom mappings as well as reusing the ElasticSearch datastore definitions now also for searching and indexing.
  • The sampling of records and selection of potential duplicates in the Duplicate detection function has been improved, leading to faster configuration because the decisions made during the training session are more representative.
  • The Duplicate detection model file format has been updated which has removed the need for a separate 'reference' file in order to save past training decisions. Compatibility with the old format has been retained, but using the new format adds many benefits for the user experience.
  • Bugfixes:
  • A thread starvation issue was fixed in DataCleaner monitor. The impact of this issue was severe, but it occurred only in rare and heavily customized cases. If a custom listener object on the DataCleaner monitor threw an error, a resource would never be freed up, permanently occupying a thread from the Quartz scheduling pool on the server. If this happened many times, the server could eventually run out of threads in that pool.
  • The vertical menu on the result screen is now doing a proper job of displaying the labels of the components that have results. This makes it easier to recognize which menu item points to what result item.

New in DataCleaner 4.0.7 (Sep 22, 2015)

  • The main improvements made in this release pertain to the display of analysis results:
  • We have changed the layout of the screen so that results are organized vertically to the left instead of as tabs above. The left-side menu can be collapsed and expanded to maximize readability.
  • The 'Duplicate detection' function now allows you to export the duplicated records and pairs into any writeable datastore you might have (whereas it used to be just Staging Tables and Excel Spreadsheets). This way the storage needed to perform deduplication can become more consolidated and be fitted to your own liking.
  • The size of the result window is now remembered so that your preferred window size is retained.
  • Here's a screenshot of the new result screen layout and the new export functionality in Duplicate detection.
  • Another important feature we've enabled with this release is component documentation in the application itself. Double-click any component and then click the new 'Documentation' button to display its component reference page. This is very helpful for discovering and learning about the capabilities within DataCleaner.

New in DataCleaner 4.0.5 (Sep 22, 2015)

  • Following our DataCleaner 4.0 release a little over a month ago we have received an impressive amount of feedback. As with any major software release, the feedback certainly sparks many creative ideas and also makes us aware of things to improve. So thank you all for that.
  • It's because of the great feedback that we can today announce the availability of DataCleaner 4.0.5. This version of DataCleaner adds on top of the existing functions and features in DataCleaner 4.0, making them even more powerful. There are obviously also a number of minor bugfixes included in this release. Let's walk you through it:
  • Combined component requirements:
  • It's now possible to combine many component requirements into one. This especially makes sense if you have a graph of validation/correction tasks and you wish to catch all invalid entries in the same "bucket" of rejected records.
  • Search in component library:
  • We've added a search box to the component library of DataCleaner desktop. This makes it a lot easier to locate the component you're thinking of or to find components of relevance to what you have in mind.
  • Results from non-analyzers:
  • Until now, only components of the technical type 'Analyzer' could produce a result. This made a clear distinction between tasks of data correction/transformation and tasks that produced reports/results that could be displayed to the user. We have relaxed this distinction a bit, allowing transformation components to also produce a result. For now we only have a few examples of this ('Table lookup' and 'Country standardizer'), but more will certainly come in the future.
  • Less file-management in Duplicate detection:
  • The configuration of the very popular Duplicate detection component was made a bit simpler by no longer requiring the user to consider the file location of the duplicate matching model. This file location is now based on a default (which can of course be overridden by the user if desired).
  • And much more:
  • More than 10 minor bugfixes were addressed. A helpful "Component description" documentation option was added, and the general reference documentation was improved; it now holds more tutorials and explanations of all the functionality in DataCleaner.

New in DataCleaner 4.0.1 (Apr 15, 2015)

  • Visual graph-based job building:
  • A new visual way of building jobs instead of navigating through component tabs makes all the difference for users and their capabilities. In DataCleaner 4.0 you benefit from a clear picture of how your data is going to be processed.
  • The canvas not only shows the contents of your job. It also provides hints and guidance while you build it. Modifying the job is a matter of interacting with the nodes in the graph.
  • All the components "within a click’s distance":
  • In order to find data quality functions easier, the "Transform-Analyze" menu was made into an easy-to-navigate part of the tree structure on the left side of the application. The categories have been divided into "Transform-Improve-Analyze-Write" which makes for a clearer separation of components based on the type of task they help you with. To add a function - just drag it onto the job graph canvas.
  • Quick Start Wizards:
  • New users will benefit from the welcome screen guiding them through their first steps in the application. Commercial editions (read more) of DataCleaner include Quick Start Wizards that will answer the questions you might have about your data. Instead of manually assembling a job, the wizard asks a couple of questions and generates a job you can start your journey with. Such a job can be tweaked later if needed, or just executed to see the insights.
  • Welcome screen:
  • Along with the wizards, the new welcome screen also changes the way new jobs are built. Click the "New job from scratch" button to make use of drag-and-drop support, or "Manage datastores" to work in the way known from previous versions of DataCleaner.
  • Refreshed look and feel:
  • The visual part of the user interface (icons, colors etc.) has been revisited. The new clean and modern look should make working with DataCleaner more pleasant.
  • UK/US/DE Address Correction and Suppression features:
  • DataCleaner has new components that integrate with UK, US and German address correction and suppression services from our partners. Now, without leaving DataCleaner's job workflow you can consult external databases for information about movers, do-not-mail declarations and verify the accuracy of address details.
  • Improvements to deduplication:
  • Deduplication scenarios have been reconsidered. Try our new "Untrained detection" mode for instant results with minimum configuration. The previous "Training Tool" and "Duplicate Detection" functions have been merged into one component. This eliminates the need to replace the Training Tool with Duplicate Detection along the way in order to complete a customized deduplication process.
  • ElasticSearch and Apache Cassandra connectivity:
  • We continue to expand our portfolio of supported databases. In the DataCleaner 4.0 release we are happy to announce that we now support two new NoSQL databases: ElasticSearch (read+write) as well as Apache Cassandra (read-only).

New in DataCleaner 3.7.2 (Dec 11, 2014)

  • The main concern that was addressed in this release was around loading of extensions/plugins. We fixed several issues pertaining to the loading sequence of and visibility of objects within extensions and the main distribution of DataCleaner. If you use extensions, we advise you to upgrade.
  • Furthermore an improvement to the "Capture Changed Records" filter was introduced - allowing it to work on numerical record version attributes instead of just update timestamp attributes. Lastly the license checking functionality of DataCleaner commercial editions was improved, making it easier to determine what is wrong when a license check is not successful.

New in DataCleaner 3.7 (Oct 31, 2014)

  • Connect:
  • We've added connectivity in DataCleaner to Apache HBase and JSON files. Apache HBase is a popular Hadoop database, a distributed, scalable, big data store. JSON is a data representation format that is becoming increasingly popular for Web technologies, web services and NoSQL databases.
  • Check:
  • The analytical capabilities of DataCleaner have also been improved. We’ve added an efficient Unique Key check feature. This allows you to easily and quickly check for duplicate keys (or other expected unique values) in your datasets.
  • Consolidate:
  • Speaking of duplicates, the Duplicate Detection feature of DataCleaner professional edition has been improved in many ways. We've made several improvements to the user interface, making more options available for advanced users. We've also published an online video tutorial to get people started. On the technical side, the deduplication model is now represented in a more readable XML format and the algorithm for detecting initial duplicates for training has been improved.

New in DataCleaner 3.6.2 (Jul 18, 2014)

  • We've made several improvements to the Duplicate Detection feature. Several minor bugs were fixed and matching quality was improved - both for the initial "potential duplicates" training set generation, and for the final building of matching rules.
  • The progress bar of a running job in the desktop UI has been beautified and made more interactive - it will set colors and update itself while the job is running.
  • In clustered setups, jobs can now be cancelled across the cluster. No more waiting for all the slave instances to finish their jobs - they will cancel within seconds if the master node tells them to.
  • We've added transformations for URL encoding and HTML encoding. For usages of DataCleaner where strings are being prepared for insertion into URLs or web sites, this is a great utility.
  • For DataCleaner enterprise edition, our Hadoop integration is being improved a lot. We have fixed several minor issues here.
  • Datastores configured in the desktop UI are now automatically persisted in the conf.xml file, making it easier to manage datastores also outside of the UI.
  • A bug pertaining to the "Merge Duplicates" feature from EasyDQ was fixed.

New in DataCleaner 3.6.1 (Jul 18, 2014)

  • For users that want to do transformations quickly and simply write the results somewhere, we now allow any job to be executed, even without any analyzers.
  • In the DataCleaner monitor webapp, a critical bug was fixed which caused Linux deployments to treat the example 'DC' tenant's repository with a wrong filename. This has been fixed and the example tenant is now called 'demo'.
  • A new triggering mode has been introduced to the monitoring and scheduling functionality: One-time triggering. Using a single date and time instant, you can now get a job triggered once if needed.
  • The styling and javascript API of the DataCleaner monitor webapp has received several updates.
  • A user role "ROLE_GOD" was introduced, allowing certain users to have control over all tenants in the DataCleaner monitor webapp.
  • A fix was implemented for the clustered execution mode, ensuring that execution chunks are ordered correctly depending on the capabilities and natural ordering of the underlying datastore.
  • Clustered jobs can now be cancelled throughout the cluster. This means that the master will inform all slaves that the job should be ended and resources made free again.

New in DataCleaner 3.6 (May 20, 2014)

  • Duplicate Detection:
  • With DataCleaner 3.6 we are finally launching a new and extensive Duplicate Detection feature. With Duplicate Detection you can apply fuzzy logic to identify the records in your data that are duplicate entries for the same real-life thing. Use it to identify duplicate customers, products or anything else of relevance. It’s a great way to improve data quality and to have better interactions with customers, co-workers etc.
  • Referential Integrity:
  • Another exciting new feature in DataCleaner 3.6 is the Referential Integrity analyzer. With this analyzer you can easily check the integrity between multiple tables in a single step. The analyzer works with tables from the same datastore, and even with tables from different sources. This means that you can effectively cross-check data from disparate sources that may be out of sync and cause data quality issues.
  • More and better Progress Information:
  • We’ve also done a lot to improve the responsiveness of DataCleaner’s screens while processing large jobs. The loading indicators and progress logs are now more responsive, and the layout of results has changed from being table oriented to result-type oriented. All in all it gives a nicer, more smooth experience with more overview of what is going on.
  • Commercial Editions revisited:
  • Finally, the offering of commercial editions of DataCleaner has been changed to fit better with individuals and professionals using DataCleaner. Now you can get support and professional edition features for a very low entry price. This we believe will fit the marketplace well and provide an awesome commercial open source Data Quality solution that is approachable for everyone.
  • The documentation for DataCleaner 3.6 has also been updated quite a lot and applies to both the community and professional editions.

New in DataCleaner 3.5.10 (Mar 15, 2014)

  • You can now compose jobs so that a DataCleaner job actually calls/invokes another "child" job as a single transformation. This is an important feature because it allows users to organize and compose complex data processing flows into smaller chunks of work. The new "Invoke child Analysis Job" transformation inlines the transformation section of the child job at execution time, which means that there is practically no overhead to this approach.
  • As a convenience for the above scenario, it is now allowed to save jobs without any analysis section in them. These jobs will thus be "incomplete", but that might actually be the point when composing and putting jobs together.
  • Another new transformation was added: Coalesce multiple fields. This transformation is useful for scenarios where multiple sets of fields are interchangeable, or when multiple interchangeable transformations produce the same set of fields. The "coalesce" transformation can roughly be translated into "pick the first non-empty values". When there are multiple sets of fields in your data processing stream, for instance multiple address definitions, and you need to select just one, then this is very convenient (see the sketch below).
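  • The idea is easy to express in a few lines of plain Java (an analogy, not the transformation's actual code):

      import java.util.Arrays;
      import java.util.List;

      public class CoalesceSketch {

          // Pick the first non-empty value, like the Coalesce transformation
          static String coalesce(List<String> values) {
              for (String value : values) {
                  if (value != null && !value.isEmpty()) {
                      return value;
                  }
              }
              return null;
          }

          public static void main(String[] args) {
              // Two interchangeable address fields; the first non-empty one wins
              System.out.println(coalesce(Arrays.asList(null, "", "Main St. 1")));
          }
      }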
  • The handling of source columns has been simplified. Previously we tried to limit the source queries to only the source columns that were strictly needed to perform the analysis. But many users gave us the feedback that this caused trouble, because the drill-to-detail information available in the analysis results would then be missing important fields for further exploration. So the power is now in the hands of the users: the fields added in the "Source" section of the job are the fields that will be queried.
  • A change was made to the execution engine in dealing with complex filtering and requirement configurations. Previously, if a component (transformation or analysis) consumed inputs from other components, ALL requirements had to be satisfied, which often just caused the combined requirement to never become true. Now the logic has been changed to be inclusive, so that if any of the direct input sources' requirements are satisfied, then the component's inferred requirement is also satisfied. Most users will not notice this change, but it does mean that it is now possible to merge separate filtered data streams back into a single stream.
  • An issue was fixed in the access to repository files. Read/write locking is now in place, which avoids access conflicts by different processes.
  • The 'requirement' button in DataCleaner has also been reworked. It did not always properly respond to changes in other panels, but now it is consistent.
  • Finally, the 'About' dialog was improved slightly and now contains more licensing information.

New in DataCleaner 3.5.7 (Mar 15, 2014)

  • A flag has been added to the CSV datastore options, making it possible to disable support for values that span multiple lines in CSV files. Disabling this feature in our CSV parser enabled us to increase parsing speed significantly and at the same time handle poorly/inconsistently formatted CSV files much better. Since many CSV files don't contain values that would be allowed to span multiple lines anyway, we think this is a great way to gain extra performance and stability.
  • A change was made to the way we monitor progress log information. This means that we now have a much more effective and performant way to monitor the progress of DataCleaner jobs, which especially speeds up performance on the server side.
  • A minor modification to the progress logs has been implemented: progress information statements now always show the time of the statement.
  • A minor bug was fixed: the CSV datastore dialog of the monitor web application would sometimes show an unexpected error if you did not fill out escape characters, quote characters and so on.

New in DataCleaner 3.5.6 (Nov 22, 2013)

  • The monitoring webapp's CSV datastore dialog now supports TXT files as well as CSV and TSV files.
  • A bug was fixed where the "Max rows" filter's tab in the UI sometimes made tabs for other components uncloseable as well.
  • A bug was fixed that sometimes caused the order of a component's selected input columns not to be retained when saving and loading a job.
  • Various improvements to API and stability of internal utilities.

New in DataCleaner 3.5.5 (Oct 25, 2013)

  • The 'Synonym lookup' transformation now has an option to look up every token of the input. This is useful if you're doing replacement of synonyms within the values of a long text field.
  • Blocking execution of DataCleaner jobs through the monitor's web service could sometimes fail due to a bug in the blocking thread. This issue has been fixed.
  • An improvement was made in the way jobs and the sequence of components are closed / cleaned up after execution.
  • The JNLP / Java WebStart version of DataCleaner was affected by a bug in the Java runtime that caused certain JAR files not to be recognized by the WebStart launcher under certain circumstances. This issue has been fixed by making slight modifications to those JAR files.
  • A few dead links in the documentation were fixed.

New in DataCleaner 3.5.4 (Sep 5, 2013)

  • It is now possible to hide output columns of transformations. Hiding will not affect the processing flow at all, but simply hides the columns from the user interface, potentially making the experience cleaner when interacting with other components.
  • A new web service has been added to the monitoring web application, which provides a way to poll the status of the execution of a particular job.
  • A bug was fixed, causing the HTML report to fail for certain analysis types when no records had been processed.
  • And 6 other minor bugs have been addressed.

New in DataCleaner 3.5.3 (Jul 1, 2013)

  • A bug was fixed which caused the DataCleaner monitor to show a result link for all jobs, even if they didn't produce a result. This only happened rarely though, for instance when building a custom Java job that returns null.
  • An advanced JavaScript transformer was added to the portfolio of built-in transformations. Using this transformer the user can build a stateful JavaScript object which is capable of transforming, aggregating and filtering records.
  • Job and Datastore wizards now have 'Back' buttons.
  • A new dedicated 'extensions' folder is available in the DataCleaner desktop application. Use this folder to dump extension JAR files in, if you want them to be automatically loaded during application startup.
  • A new service was added to DataCleaner monitor, which enables administrators to download and upload (backup and restore) a complete monitoring repository in one go.
  • A bug was fixed which caused the desktop application's "DataCleaner monitor" dialog to crash when using default user preferences.

New in DataCleaner 3.5.1 (Jun 12, 2013)

  • Capture changed records:
  • A new filter was added to enable incremental processing of records that have not been processed before, e.g. for profiling or copying only modified records. The new filter's name is Capture changed records, referring to the concept of Change data capture.
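  • The change-capture idea in a few lines of plain Java (illustrative only; the real filter works on a column you designate and keeps its own state between runs):

      import java.time.Instant;
      import java.util.Arrays;
      import java.util.List;
      import java.util.stream.Collectors;

      public class ChangeCaptureSketch {
          public static void main(String[] args) {
              // Timestamp persisted after the previous run
              Instant lastRun = Instant.parse("2013-06-01T00:00:00Z");

              List<Instant> updatedAt = Arrays.asList(
                      Instant.parse("2013-05-30T10:00:00Z"),
                      Instant.parse("2013-06-02T09:30:00Z"));

              // Only records modified since the last run pass the filter
              List<Instant> changed = updatedAt.stream()
                      .filter(t -> t.isAfter(lastRun))
                      .collect(Collectors.toList());
              System.out.println(changed); // [2013-06-02T09:30:00Z]
          }
      }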
  • Queued execution of jobs:
  • The DataCleaner monitor will now queue the execution of the same job, if it is triggered multiple times. This ensures that you don't accidentally run the same job concurrently which may lead to all sorts of issues, depending on what the job does.
  • Minor bugfixes:
  • Several bugfixes were implemented; see the full list on the 3.5.1 milestone page on our bugtracker.

New in DataCleaner 3.5 (Jun 12, 2013)

  • Connectivity to Salesforce and SugarCRM:
  • One of the most important sources of data is usually a company's CRM system. But it is also one of the more troublesome data sources if you look at the quality. For this reason we've made it easier to get the data out of these CRM systems and into DataCleaner! You can now use your Salesforce.com or your local SugarCRM system as if it was a regular database. Start by profiling the customer data to get an overview. But don't stop there - you can even use DataCleaner to also update your CRM data, once it is cleansed. More details are available in the brand new focus article about CRM data quality.
  • Wizards and other user experience improvements:
  • The DataCleaner monitor is our main user interface going forward. So we want the experience to be at least as pleasant, flexible and rich as the desktop application. To meet this goal, we've made many user interface and user experience improvements, amongst others:
  • Several wizards are now available for registering datastores; including file-upload to the server for CSV files, database connection entry, guided registration of Salesforce.com credentials and more.
  • The job building wizards have also been extended with several enhanced features: selection of value distribution and pattern finding fields in the Quick analysis wizard, a completely new wizard for creating EasyDQ based customer cleansing jobs, and a new job wizard for firing Pentaho Data Integration jobs.
  • You can now ad-hoc query any datastore directly in the web user interface. This makes it easy to get quick or sporadic insights into the data without setting up jobs or other managed approaches of processing the data.
  • Once jobs or datastores are created, the user is guided to take action with the newly built object. For instance, you can very quickly run a job right after it's built, or query a datastore after it is registered.
  • Administrators can now directly upload jobs to the repository, which is especially handy if you want to hand-edit the XML content of the job files.
  • A lot of the technical cruft is now hidden away in favor of showing simple dialogs. For instance, when a job is triggered a large loading indicator is shown, and when finished the result will be shown. The advanced logging screen that was previously there can still be displayed upon clicking a link for additional details.
  • Distributed execution of jobs:
  • To keep up with the massive amounts of data that many organizations are juggling today, we had to take a critical look at how we process data in DataCleaner. Although DataCleaner is among the fastest data processing tools, it was previously limited to running on a single machine. For a long time we've been working on a major architecture change that enables distributing a DataCleaner job's workload over a cluster of machines. With this new approach to data processing, DataCleaner is truly fit for data quality on big data.
  • Data visualization extension:
  • Data profiling and data visualization do share some common interests - both are disciplines that help you understand the story that your data is telling. There are obviously also some differences, mainly that data profiling is more targeted at identifying issues and exceptions rather than deriving or measuring business objectives. But confronted with visualization tools we've realized that sometimes there's a lot of profiling value in progressively visualizing data. For instance, a scatter plot can easily help you identify the numerical outliers of your datasets. This gave fuel to the idea of a visualization extension to DataCleaner. Therefore DataCleaner now also lets you do basic visualization tasks to aid you in your data quality analysis.
  • National identifiers extension:
  • A very common issue in data quality projects is to validate national identifiers, such as social security numbers, EAN codes and more. In our commercial editions of DataCleaner, we now offer a wide range of validation components to check such identifiers.
  • Custom job engines:
  • We've made the ultimate modularization of the DataCleaner monitoring system: the engine itself is a pluggable module. While we do encourage using DataCleaner's engine as the primary vehicle for execution in DataCleaner monitor, it is not obligatory anymore. You can now schedule and monitor (both in terms of metric monitoring and history management) other types of jobs. For instance, you can provide your own piece of Java code and have it scheduled to run in DataCleaner monitor using the regular web user interface.
  • Pentaho job scheduling and execution:
  • One major example of a pluggable job engine was introduced that we think deserves special attention: you can now invoke and monitor execution metrics of Pentaho Data Integration transformations. DataCleaner monitor by default ships with this job engine extension, which connects to the Pentaho DI server ("Carte") and supervises the execution and result gathering of it. After execution you can track your Pentaho transformations in the timeline views of the monitoring dashboard, just like other metrics. For larger deployments of DataCleaner, dedicated ETL-style jobs in your data quality solution may be convenient, and with this extension we provide an integration with a leading open source solution for just that.

New in DataCleaner 3.1.2 (Jan 23, 2013)

  • We've added a web service in the monitoring application for getting a (list of) metric values. This makes the monitoring even more usable as a key infrastructure component, as a way to monitor data (quality) and expose the results to third-party applications.
  • The 'Table lookup' component has been improved by adding join semantics as a configurable property. Using the join semantics you can choose whether the lookup should behave like a LEFT JOIN or an INNER JOIN (see the sketch after this list).
  • The EasyDQ components have been upgraded, adding further configuration options and a richer deduplication result interface.
  • Performance improvements have been a specific focus of this release. Improvements have been made in the engine of DataCleaner to further utilize a streaming processing approach in certain corner cases which were not covered previously.
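  • The difference between the two lookup semantics, sketched in plain Java (illustrative only, not the component's code):

      import java.util.HashMap;
      import java.util.Map;

      public class LookupSemanticsSketch {
          public static void main(String[] args) {
              Map<String, String> countryNames = new HashMap<>();
              countryNames.put("NL", "Netherlands");

              String code = "DE"; // no match in the lookup source
              String name = countryNames.get(code);

              // LEFT JOIN semantics: keep the record, output null for the lookup
              System.out.println(code + " -> " + name); // DE -> null

              // INNER JOIN semantics: drop records without a match
              if (name != null) {
                  System.out.println(code + " -> " + name);
              }
          }
      }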

New in DataCleaner 3.0.3 (Dec 14, 2012)

  • We've added a service for renaming jobs in the monitoring repository. You can access this as a RESTful web service or interactively in the UI.
  • A web service was added for changing the historic date of an analysis result in the monitoring repository. This is convenient if you have historic dumps of data that you wish to include in a timeline.
  • The documentation has been updated with more elaborate descriptions of the web services available for repository navigation, job invocation and more.
  • The login dialog in the desktop application had a low-level version conflict, which caused it to be unusable. This has been fixed.
  • The web application has been made compatible with legacy JSF containers, making the range of applicable Java Webservers wider.
  • Caching of configuration in the web application was greatly improved, leading to faster page load and job initialization times.

New in DataCleaner 3.0.2 (Oct 12, 2012)

  • When triggering a job in the monitoring web application, the panel auto-refreshes every second to get the latest state of the execution.
  • File-based datastores (such as CSV or Excel spreadsheets) with absolute paths are now correctly resolved in the monitoring web application.
  • The "Select from key/value map" transformer now supports nested select expressions like "Address.Street" or "orderlines[0].product.name".
  • The table lookup mechanism has been optimized for performance, using prepared statements when running against JDBC databases.
  • Administrators can now download file-based datastores directly from the "Datastores" page.
  • Exception handling in the monitoring web application has been improved a bit, making the error messages more precise and intuitive.

New in DataCleaner 3.0.1 (Oct 2, 2012)

  • The primary bugfix in this release was about restoring the mapping of columns and specific enumerable categorizations. For instance in the new Completeness analyzer, we found that after reloading a saved job, the mapping was not always correct.
  • Furthermore a few internal improvements have been made, making it easier to deploy the DataCleaner monitor web application in environments using the Spring Framework.
  • Last but not least, the visualization settings in the desktop application have been improved by automatically taking a look at the job being visualized and toggling displayed artifacts based on the screen size and amount of details needed to show it nicely.

New in DataCleaner 3.0 (Sep 20, 2012)

  • A data quality monitoring web application.
  • A multi-tenant repository for data quality artifacts (jobs, profiling results, configurations, datastore definitions etc.)
  • Being able to edit data (in the desktop application).
  • Wizards to guide users through their first-time user experience with DataCleaner.

New in DataCleaner 2.5.2 (May 2, 2012)

  • Apache CouchDB support
  • Update table writer
  • Drill-to-detail information saved in result files
  • Improved EasyDQ error handling
  • Table mapping for NoSQL datastores

New in DataCleaner 2.5.1 (Apr 11, 2012)

  • A bug was fixed in the Table lookup transformation, which caused it to be unable to have multiple output columns.
  • CSV file escape characters have been made configurable.
  • A minor bug pertaining to empty strings in the Concatenator was fixed.
  • Support for the Cubrid database was added.
  • The converter transformations were adapted to be able to work on multiple fields, not just single fields.

New in DataCleaner 2.4.1 (Jan 3, 2012)

  • Feature enhancements:
  • Batch loading features were greatly improved when writing data to database tables. Expect to see many orders of magnitude improvements here (see the sketch after this list).
  • Writing data has been made more conveniently available by adding the options to the window menu.
  • You can now easily rename components of a job by double clicking their tabs.
  • The Javascript transformer now has syntax coloring, so that your Javascripts are easier to inspect and modify.
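  • The classic JDBC batching pattern behind such gains, sketched against an in-memory H2 database (placeholder table and data):

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.PreparedStatement;

      public class BatchInsertSketch {
          public static void main(String[] args) throws Exception {
              Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
              conn.createStatement().execute(
                      "CREATE TABLE t (id INT, name VARCHAR(64))");

              conn.setAutoCommit(false);
              try (PreparedStatement ps = conn.prepareStatement(
                      "INSERT INTO t VALUES (?, ?)")) {
                  for (int i = 0; i < 1000; i++) {
                      ps.setInt(1, i);
                      ps.setString(2, "name-" + i);
                      ps.addBatch(); // queue the row; no round-trip per insert
                  }
                  ps.executeBatch(); // flush all queued rows at once
              }
              conn.commit();
          }
      }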
  • Bugfixes:
  • When reading from and writing to the same datastore (e.g. the DataCleaner staging area) we've made sure that the table cache of that datastore is refreshed. Previously some scenarios allowed you to see an out-of-date view of the tables.
  • A potential deadlock when starting up the application was solved. This deadlock was a consequence of an issue in the JVM, but we worked around it by synchronizing all calls to the particular API in Java.

New in DataCleaner 2.2 (Jun 27, 2011)

  • Extensibility:
  • The main driver for this release has been a story about extensibility. While releasing the application we are simultaneously releasing a new DataCleaner website which features an important new area: the ExtensionSwap. The idea of the ExtensionSwap is to allow sharing of extensions to DataCleaner and installation simply by clicking a button in the browser!
  • The DataCleaner extension API has been improved a lot in this release, making it possible to create your own transformers, analyzers and filters (see the sketch below). If you feel your extensions could be of interest to other users, please share them on the ExtensionSwap, and we provide a channel for you to easily distribute them to thousands of users. The Extension API and the ExtensionSwap are further explained in our new webcast demonstration for developers and other techies with an interest.
  • We are also releasing a set of initial extensions on the ExtensionSwap: the HIquality Contacts for DataCleaner extension, which provides advanced Name, Phone and Email cleansing based on Human Inference's natural language processing DQ web services. We are also shipping a sample extension which will serve as an example for developers wanting to try out extension development themselves. In the coming months we will make sure to post even more extensions originating from our internal portfolio of tools that we use at Human Inference's knowledge gathering teams.
  • In addition to extensibility we are also focusing on embeddability. We want to be able to embed DataCleaner easily into other applications to make profiling and data analysis possible anywhere! We've created a new bootstrapping API which allows applications to bundle DataCleaner and bootstrap it with a dynamic configuration or run it in a "single datastore mode", where the application is tuned towards just inspecting a single datastore (typically defined by the application that embeds DataCleaner). We already have some really interesting cases of embedding DataCleaner in the works - both in other open source applications as well as commercial applications.
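  • A sketch of what a custom transformer extension looked like in the 2.x extension API, written from memory; annotation names, packages and signatures are assumptions and should be verified against the developer documentation:

      import org.eobjects.analyzer.beans.api.Configured;
      import org.eobjects.analyzer.beans.api.OutputColumns;
      import org.eobjects.analyzer.beans.api.Transformer;
      import org.eobjects.analyzer.beans.api.TransformerBean;
      import org.eobjects.analyzer.data.InputColumn;
      import org.eobjects.analyzer.data.InputRow;

      @TransformerBean("Hello world transformer")
      public class HelloWorldTransformer implements Transformer<String> {

          @Configured
          InputColumn<String> nameColumn;

          public OutputColumns getOutputColumns() {
              return new OutputColumns("Greeting");
          }

          public String[] transform(InputRow inputRow) {
              String name = inputRow.getValue(nameColumn);
              return new String[] { "Hello " + name };
          }
      }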
  • Compatibility:
  • We've added support for analyzing SAS data sets. This is something we're quite proud of as we are, to our knowledge, the first major open source application to provide such functionality, ultimately liberating a lot of SAS users. The SAS interoperability part was created as a separate project, SassyReader, so we expect to see adoption in DataCleaner's complementary open source communities soon too!
  • We've also added support for another type of datastore: fixed width files. Fixed width files are text files where each column has a fixed width. There is no separator or quote character like in CSV files; instead each line is equal in length, and each line is tokenized according to a set of value lengths.
  • An option to "fail on inconsistencies" was added to CSV file and fixed width file datastores. These flags add a format integrity check when using these text file based datastores.
  • A bug was fixed, which caused CSV separator settings not to be retained in the user interface, when editing a CSV datastore.
  • Japanese and other characters are now supported in the user interface. This "bug" was a matter of investigating available fonts on the system and selecting a font that can render the particular characters. On most modern systems there will be capable fonts available, but on some Unix/Linux branches there might still be limitations.
  • Other improvements:
  • The documentation section has been updated! Ever since the initial 2.0 release the documentation has been far behind, but we've finally managed to get it up to date. There are still pieces missing in the docs, but it should definitely be useful for basic usage as well as a reference for most topics.
  • Application startup time was improved by parallelizing the configuration loading and by delaying the initialization of those parts of the configuration that are not needed for the initial window display.
  • The phonetic similarity finder analyzer has been removed from the main distribution, as it was quite experimental and serves mostly as a proof of concept and an appetizer to the community to create more advanced matching analyzers. You can now find and install the phonetic similarity finder on the ExtensionSwap.
  • Handling of cancelled or erroneous jobs was improved, and the user interface responds more correctly by disabling buttons and progress indicators when a job has stopped.
  • Fixed a few minor UI issues pertaining to table sizing and use of scrollbars.

New in DataCleaner 2.1 (Apr 19, 2011)

  • There was a lot of work done on the user interface (see media page):
  • We decided to remove the left-hand side window containing environment configuration options.
  • Instead all these options have now been moved to the job building window so the user only has to focus on a single window for all the interactions needed to build a job.
  • The welcome/login dialog has also been removed in favor of a more discrete panel that can be pulled in or hidden from the main window.
  • Datastore selection and management is considered the first activity in the application, which is why it is also the first step to handle in the main window.
  • You can now stop jobs in case you decide to change something before it is done.
  • Bar and line charts were added to a lot of the analysis result screens, including String analyzer, Number analyzer, Date/time analyzer and Weekday distribution (see media page).
  • All "preview data" windows now contain paging controls so you can move backwards and forwards in the data set.
  • Most common database drivers (MySQL, PostgreSQL, Oracle, MS SQL Server and Sybase) have been added to a default set of drivers.
  • Configuration of the Quick analysis function in the Options dialog.
  • Various minor bugfixes.
  • Transformer for extracting date parts (year, month, day etc.) from date columns.