What's new in Norconex HTTP Collector 3.0.2

Jul 9, 2023
  • Fixed GenericSitemapResolver NPE when the sitemap content-type could not be detected. #803
  • Updated Maven dependencies: norconex-commons-maven-parent 1.0.2, norconex-collector-core 2.0.2, norconex-importer 3.0.1, Guava 32.0.0-jre, Selenium 4.0.0, Jetty 9.4.51.v20230217.

New in Norconex HTTP Collector 3.0.1 (Aug 30, 2022)

  • New: MDC attributes which can be used in supporting logging frameworks: "ctx:crawler.id", "ctx:crawler.id.safe", "ctx:collector.id", and "ctx:collector.id.safe". #790
  • Fixed: Invalid resolution of relative URLs containing a colon (:) that is not part of the scheme. #788
  • Fixed: Effective top-level domain not always being properly considered in HSTS resolution. #785
  • Fixed: Occasional concurrency issue when crawler terminates. #781
  • Fixed: Crawler sometimes not exiting when done.

New in Norconex HTTP Collector 2.9.1 (Dec 1, 2021)

  • New: "validExitCodes" on PhantomJSDocumentFetcher to support exit codes other than 0 (now default). #729
  • New: StandardSiteMapResolver can now optionally filter entries that are older than a given date. #718
  • New: StandardSiteMapResolver can now optionally throw exception instead of logging errors. #718
  • Updated: Robots.txt redirects are now being followed. #726
  • Updated: Links are no longer extracted/kept by default for documents having reached the max depth. To keep the former behavior, use the new method HttpCrawlerConfig#keepMaxDepthLinks(...). #498
  • Updated: Generic canonical link detector now supports non-standard canonical URL definition using meta "property" instead of "rel". #697
  • Updated: More permissive gzip detection for sitemap.xml files. #687
  • Updated: Can now decide to re-process headers as part of document fetching when metadata fetcher fails. #655
  • Updated: Redirects are now ignored when part of a bad HTTP response when "skipOnBadStatus" is "true". #695
  • Updated: Metadata fetcher exceptions are swallowed and logged when "skipOnBadStatus" is "true". #696
  • Updated: Unsupported content encoding from Content-Type HTTP headers now triggers a detection attempt of both encoding and content type based on document content.
  • Updated: Norconex JEF 4.1.3, Norconex Commons Lang 1.15.2, Norconex Collector Core 1.10.1.
  • Fixed: sitemap.xml URL entries not being extracted when "loc" is from a sitemap extension namespace (e.g., "image:loc"). #761
  • Fixed: sitemap.xml URL entries not being extracted when they also contain images. #758
  • Fixed: RecrawlableResolver settings not always being respected. #741
  • Fixed: Redirect targets sometimes being processed again when already processed normally (not through a redirect). #741
  • Fixed: NullPointerException when resolving sitemaps. #738
  • Fixed: authFormParams XML configuration not being loaded properly. #730
  • Fixed: Occasional FileNotFoundException with PhantomJS when the downloaded file was not yet written even if PhantomJS had already returned.
  • Fixed: REJECTED_REDIRECTED event not being fired for the redirect source URL when the target URL was previously encountered. #678
  • Fixed: Unsupported content encoding from Content-Type HTTP headers no longer rejects a document.

New in Norconex HTTP Collector 2.9.1-20210712 Snapshot (Jul 12, 2021)

  • Maintenance release

New in Norconex HTTP Collector 2.9.1-20210412 Snapshot (Apr 12, 2021)

  • New: "validExitCodes" on PhantomJSDocumentFetcher to support exit codes other than 0 (now default).
  • New: StandardSiteMapResolver can now optionally filter entries that are older than a given date.
  • New: StandardSiteMapResolver can now optionally throw exception instead of logging errors.
  • Updated: Robots.txt redirects are now being followed.
  • Updated: Links are no longer extracted/kept by default for documents having reached the max depth. To keep the former behavior, use the new method HttpCrawlerConfig#keepMaxDepthLinks(...).
  • Updated: Generic canonical link detector now supports non-standard canonical URL definition using meta "property" instead of "rel".
  • Updated: More permissive gzip detection for sitemap.xml files.
  • Updated: Can now decide to re-process headers as part of document fetching when metadata fetcher fails.
  • Updated: Redirects are now ignored when part of a bad HTTP response when "skipOnBadStatus" is "true".
  • Updated: Metadata fetcher exceptions are swallowed and logged when "skipOnBadStatus" is "true".
  • Updated: Unsupported content encoding from Content-Type HTTP headers now triggers a detection attempt of both encoding and content type based on document content.
  • Updated: Norconex JEF 4.1.3, Norconex Commons Lang 1.15.2, Norconex Collector Core 1.10.1.
  • Fixed: RecrawlableResolver settings not always being respected.
  • Fixed: Redirect targets sometimes being processed again when already processed normally (not through a redirect).
  • Fixed: NullPointerException when resolving sitemaps.
  • Fixed: authFormParams XML configuration not being loaded properly.
  • Fixed: Occasional FileNotFoundException with PhantomJS when the downloaded file was not yet written even if PhantomJS had already returned.
  • Fixed: REJECTED_REDIRECTED event not being fired for the redirect source URL when the target URL was previously encountered.
  • Fixed: Unsupported content encoding from Content-Type HTTP headers no longer rejects a document.

New in Norconex HTTP Collector 2.9.1-20201214 Snapshot (Jan 15, 2021)

  • New:
      • "validExitCodes" on PhantomJSDocumentFetcher to support exit codes other than 0 (now default).
      • StandardSiteMapResolver can now optionally filter entries that are older than a given date.
      • StandardSiteMapResolver can now optionally throw exception instead of logging errors.
  • Updated:
      • Robots.txt redirects are now being followed.
      • Links are no longer extracted/kept by default for documents having reached the max depth. To keep the former behavior, use the new method HttpCrawlerConfig#keepMaxDepthLinks(...).
      • Generic canonical link detector now supports non-standard canonical URL definition using meta "property" instead of "rel".
      • More permissive gzip detection for sitemap.xml files.
      • Can now decide to re-process headers as part of document fetching when metadata fetcher fails.
      • Redirects are now ignored when part of a bad HTTP response when "skipOnBadStatus" is "true".
      • Metadata fetcher exceptions are swallowed and logged when "skipOnBadStatus" is "true".
      • Unsupported content encoding from Content-Type HTTP headers now triggers a detection attempt of both encoding and content type based on document content.
      • Norconex JEF 4.1.3, Norconex Commons Lang 1.15.2, Norconex Collector Core 1.10.1.
  • Fixed:
      • Fixed occasional FileNotFoundException with PhantomJS when the downloaded file was not yet written even if PhantomJS had already returned.
      • Fixed REJECTED_REDIRECTED event not being fired for the redirect source URL when the target URL was previously encountered.
      • Unsupported content encoding from Content-Type HTTP headers no longer rejects a document.

New in Norconex HTTP Collector 2.9.1-20201109 Snapshot (Nov 9, 2020)

  • New:
      • StandardSiteMapResolver can now optionally filter entries that are older than a given date. #718
      • StandardSiteMapResolver can now optionally throw exception instead of logging errors. #718
  • Updated:
      • Links are no longer extracted/kept by default for documents having reached the max depth. To keep the former behavior, use the new method HttpCrawlerConfig#keepMaxDepthLinks(...). #498
      • Generic canonical link detector now supports non-standard canonical URL definition using meta "property" instead of "rel". #697
      • More permissive gzip detection for sitemap.xml files. #687
      • Can now decide to re-process headers as part of document fetching when metadata fetcher fails. #655
      • Redirects are now ignored when part of a bad HTTP response when "skipOnBadStatus" is "true". #695
      • Metadata fetcher exceptions are swallowed and logged when "skipOnBadStatus" is "true". #696
      • Unsupported content encoding from Content-Type HTTP headers now triggers a detection attempt of both encoding and content type based on document content.
      • Norconex JEF 4.1.3, Norconex Commons Lang 1.15.2, Norconex Collector Core 1.10.1.
  • Fixed:
      • Fixed REJECTED_REDIRECTED event not being fired for the redirect source URL when the target URL was previously encountered. #678
      • Unsupported content encoding from Content-Type HTTP headers no longer rejects a document.

New in Norconex HTTP Collector 2.9.0 (Oct 26, 2020)

  • From Collector Core update, added "unmanaged" attribute to "logsDir" configuration option to prevent the collector from managing its own file-based logging. #593
  • Added new normalization rules to GenericURLNormalizer: removeQueryString, lowerCase, lowerCasePath, lowerCaseQuery, lowerCaseQueryParameterNames, lowerCaseQueryParameterValues. #479
  • Added "includeSubdomains" to start URL group to consider subdomains as being the same as any URL domain when "stayOnDomain" is true. #563
  • GenericLinkExtractor now supports using "selector-syntax" to identify portions of an HTML document to include or exclude from URL extraction.
  • Updated: Switched default cookie specs from DEFAULT to STANDARD; cookie specs can now be configured. #525
  • Updated: Norconex Collector Core 1.10.0, Norconex Commons Lang 1.15.1.
  • Updated: Stateful support added for auth sessions (e.g., NTLM). #526
  • Fixed GenericCanonicalLinkDetector not properly handling HTTP response "Link" header with multiple values. #646
  • Fixed GenericRecrawlableResolver large frequency units resulting in those minimum frequencies not being respected. #597
  • Fixed handling of quoted media type from content type. #578
  • Fixed exceptions in URLCrawlScopeStrategy#isInScope(...) ending the crawler execution. #540
  • Discovered canonical URLs are now rejected if they are out of scope (not matching the stayOnXXX directives). #527
  • Fixed good redirected URLs being committed twice in same crawl session. #533
  • TikaLinkExtractor now extracts URLs with spaces. It should also support other URLs typically considered "bad".

New in Norconex HTTP Collector 2.9.1 Build 20200727 Snapshot (Jul 27, 2020)

  • Can now decide to re-process headers as part of document fetching when metadata fetcher fails.

New in Norconex HTTP Collector 2.9.1 Build 20200608 Snapshot (Jun 8, 2020)

  • Maintenance release

New in Norconex HTTP Collector 2.9.0 Build 20190724 Snapshot (Jul 25, 2019)

  • From Collector Core update, added "unmanaged" attribute to "logsDir" configuration option to prevent the collector from managing its own file-based logging.
  • Added new normalization rules to GenericURLNormalizer: removeQueryString, lowerCase, lowerCasePath, lowerCaseQuery, lowerCaseQueryParameterNames, lowerCaseQueryParameterValues.
  • Added "includeSubdomains" to start URL group to consider subdomains as being the same as any URL domain when "stayOnDomain" is true.
  • GenericLinkExtractor now supports using "selector-syntax" to identify portions of an HTML document to include or exclude from URL extraction.
  • Switched default cookie specs from DEFAULT to STANDARD; cookie specs can now be configured.
  • Dependency updates: Norconex Collector Core 1.9.2, Norconex Commons Lang 1.15.1.
  • Stateful support added for auth sessions (e.g., NTLM).
  • Fixed GenericRecrawlableResolver large frequency units resulting in those minimum frequencies not being respected.
  • Fixed handling of quoted media type from content type.
  • Fixed exceptions in URLCrawlScopeStrategy#isInScope(...) ending the crawler execution.
  • Discovered canonical URLs are now rejected if they are out of scope (not matching the stayOnXXX directives).
  • Fixed good redirected URLs being committed twice in same crawl session.
  • TikaLinkExtractor now extracts URLs with spaces. It should also support other URLs typically considered "bad".

New in Norconex HTTP Collector 2.9.0 Build 20190519 Snapshot (May 20, 2019)

  • From Collector Core update, added "unmanaged" attribute to "logsDir" configuration option to prevent the collector from managing its own file-based logging. #593
  • Added new normalization rules to GenericURLNormalizer: removeQueryString, lowerCase, lowerCasePath, lowerCaseQuery, lowerCaseQueryParameterNames, lowerCaseQueryParameterValues. #479
  • Added "includeSubdomains" to start URL group to consider subdomains as being the same as any URL domain when "stayOnDomain" is true. #563
  • GenericLinkExtractor now supports using "selector-syntax" to identify portions of an HTML document to include or exclude from URL extraction.
  • Switched default cookie specs from DEFAULT to STANDARD; cookie specs can now be configured. #525
  • Dependency updates: Norconex Collector Core 1.9.2, Norconex Commons Lang 1.15.1.
  • Stateful support added for auth sessions (e.g., NTLM). #526
  • Fixed GenericRecrawlableResolver large frequency units resulting in those minimum frequencies not being respected. #597
  • Fixed handling of quoted media type from content type. #578
  • Fixed exceptions in URLCrawlScopeStrategy#isInScope(...) ending the crawler execution. #540
  • Discovered canonical URLs are now rejected if they are out of scope (not matching the stayOnXXX directives). #527
  • Fixed good redirected URLs being committed twice in same crawl session.

New in Norconex HTTP Collector 2.9.0 Build 20190516 Snapshot (May 16, 2019)

  • From Collector Core update, added "unmanaged" attribute to "logsDir" configuration option to prevent the collector from managing its own file-based logging.
  • Added new normalization rules to GenericURLNormalizer: removeQueryString, lowerCase, lowerCasePath, lowerCaseQuery, lowerCaseQueryParameterNames, lowerCaseQueryParameterValues.
  • Added "includeSubdomains" to start URL group to consider subdomains as being the same as any URL domain when "stayOnDomain" is true.
  • GenericLinkExtractor now supports using "selector-syntax" to identify portions of an HTML document to include or exclude from URL extraction.
  • Switched default cookie specs from DEFAULT to STANDARD; cookie specs can now be configured.
  • Dependency updates: Norconex Collector Core 1.9.2, Norconex Commons Lang 1.15.1.
  • Stateful support added for auth sessions (e.g., NTLM).
  • Fixed GenericRecrawlableResolver large frequency units resulting in those minimum frequencies not being respected.
  • Fixed handling of quoted media type from content type.
  • Fixed exceptions in URLCrawlScopeStrategy#isInScope(...) ending the crawler execution.
  • Discovered canonical URLs are now rejected if they are out of scope (not matching the stayOnXXX directives).
  • Fixed good redirected URLs being committed twice in same crawl session.

New in Norconex HTTP Collector 2.8.2 Build 20180102 Snapshot (Jan 3, 2019)

  • GenericLinkExtractor now supports using "selector-syntax" to identify portions of an HTML document to include or exclude from URL extraction.
  • Switched default cookie specs from DEFAULT to STANDARD; cookie specs can now be configured. #525
  • Dependency updates: Norconex Collector Core 1.9.2-SNAPSHOT, Norconex Commons Lang 1.15.1.
  • Stateful support added for auth sessions (e.g., NTLM). #526
  • Fixed exceptions in URLCrawlScopeStrategy#isInScope(...) ending the crawler execution. #540
  • Discovered canonical URLs are now rejected if they are out of scope (not matching the stayOnXXX directives). #527
  • Fixed good redirected URLs being committed twice in same crawl session.

New in Norconex HTTP Collector 2.8.2 Snapshot (Nov 23, 2018)

  • Switched default cookie specs from DEFAULT to STANDARD; cookie specs can now be configured.
  • Dependency updates: Norconex Collector Core 1.9.2-SNAPSHOT.
  • Discovered canonical URLs are now rejected if they are out of scope (not matching the stayOnXXX directives).

New in Norconex HTTP Collector 2.8.1 (Aug 20, 2018)

  • Dependency updates: Norconex Collector Core 1.9.1, Norconex Commons Lang 1.15.0.
  • GenericDocumentFetcher now logs documents it could not fetch as INFO.
  • Fixed XMLFeedLinkExtractor sometimes splitting URLs.
  • Fixed invalid validation error when XML configuration for PhantomJSDocumentFetcher does not contain "screenshotStorageDiskDir".
  • Fixed NullPointerException when using PhantomJSDocumentFetcher without specifying any "screenshotDimensions".
  • Added validation support for new "authFormParams" option in GenericHttpClientFactory.
  • Fixed PhantomJSDocumentFetcher sending invalid scheme to PhantomJS proxy (when used with HttpClientProxyCollectorListener).

New in Norconex HTTP Collector 2.8.1 Build 20180611 Snapshot (Jul 24, 2018)

  • Dependency updates: Norconex Collector Core 1.9.1, Norconex Commons Lang 1.15.0.
  • Fixed XMLFeedLinkExtractor sometimes splitting URLs.
  • Fixed PhantomJSDocumentFetcher sending invalid scheme to PhantomJS proxy (when used with HttpClientProxyCollectorListener).

New in Norconex HTTP Collector 2.8.1 Build 20180318 Snapshot (Mar 18, 2018)

  • Dependency updates: Norconex Collector Core 1.9.1, Norconex Commons Lang 1.14.1.
  • Fixed invalid validation error when XML configuration for PhantomJSDocumentFetcher does not contain "screenshotStorageDiskDir". #473

New in Norconex HTTP Collector 2.8.1 Build 20171226 Snapshot (Dec 26, 2017)

  • Dependency updates: Norconex Collector Core 1.9.1.
  • Fixed NullPointerException when using PhantomJSDocumentFetcher without specifying any "screenshotDimensions".
  • Added validation support for new "authFormParams" option in GenericHttpClientFactory.

New in Norconex HTTP Collector 2.8.1 Build 20171210 Snapshot (Dec 11, 2017)

  • Fixed PhantomJSDocumentFetcher sending invalid scheme to PhantomJS proxy (when used with HttpClientProxyCollectorListener).

New in Norconex HTTP Collector 2.8.1 Build 20171208 Snapshot (Dec 8, 2017)

  • Fixed NullPointerException when using PhantomJSDocumentFetcher without specifying any "screenshotDimensions".
  • Added validation support for new "authFormParams" option in GenericHttpClientFactory.

New in Norconex HTTP Collector 2.8.0 (Nov 27, 2017)

  • New FeaturedImageProcessor class to extract the "main" image from a web page.
  • New screenshot storage options for PhantomJSDocumentFetcher, like specifying image format, store inline (base64), and exact size.
  • New "authFormParams" option on GenericHttpClientFactory for adding arbitrary parameters to authentication forms.
  • New metadata field "collector.redirect-trail" for storing all redirect source URLs with the target URL metadata. #397
  • New "authPreemptive" configuration option on GenericHttpClientFactory to use preemptive authentication with BASIC authentication. #420
  • GenericLinkExtractor can now limit link extraction to a specified portion of text only.
  • It is now possible to store as metadata URLs that are not "in-scope" with a new flag "keepOutOfScopeLinks". #360
  • New GenericHttpClientFactory#removeRequestHeader(String) method.
  • New "resourceTimeout" option on PhantomJSDocumentFetcher to prevent individual page items from hanging PhantomJS for a long time. #383
  • Now handles circular references between a non-canonical page pointing to a URL redirecting back to that non-canonical page. When this loop is encountered, it will now process the non-canonical page (used to reject both). #162
  • GenericLinkExtractor now extracts unquoted URLs. #423
  • *.http.doc.IHttpDocumentProcessor now deprecated in favor of *.http.processor.IHttpDocumentProcessor.
  • Crawler event REJECTED_CANONICAL now deprecated in favor of REJECTED_NONCANONICAL.
  • RegexLinkExtractor now supports replacement values for regex matches.
  • GenericURLNormalizer XML configuration can now accept an empty normalizations tag to blank default normalizations.
  • Dependency updates: Norconex Collector Core 1.9.0, Norconex Commons Lang 1.14.0.
  • PhantomJSDocumentFetcher will now try to write a document for processing even if its status from PhantomJS is not "success" (for cases where valid content could be extracted nonetheless). #383
  • Blank start URLs are now silently ignored.
  • Fixed blank URLs extracted by RegexLinkExtractor throwing NullPointerException. #422
  • Fixed blank lines in start URL seed files throwing NullPointerException. #396
  • Orphans being reprocessed no longer trigger the "RecrawlableResolverStage". #39
  • Fixed improper saving as XML of GenericDelayResolver.
  • Fixed GenericDelayResolver not taking into account ranges that overlap into the next cycle (e.g. 22:00 to 5:00).
  • Fixed setting proxy realm having no effect on GenericHttpClientFactory.
  • Fixed invalid validation error for "tempDir" element on StandardSitemapResolverFactory.
  • Fixed PhantomJSDocumentFetcher sometimes throwing a NullPointerException when content-type could not be established. #408 
  • Removed these deprecated classes: HtmlLinkExtractor, HttpMetadataChecksummer, and deprecated methods: HttpCrawler#getUrlsFiles, HttpCrawler#setUrlsFiles.
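
The GenericDelayResolver fix above concerns time ranges that wrap past midnight (e.g. 22:00 to 5:00). As a rough illustration only, and not the collector's actual code, the check can be sketched with java.time as below. The class name DelayRangeSketch and the method inRange are hypothetical names introduced for this example.

```java
import java.time.LocalTime;

// Hypothetical sketch: not part of Norconex HTTP Collector.
final class DelayRangeSketch {

    /**
     * Returns true if "time" falls within [start, end).
     * When start is not before end, the range is treated as wrapping
     * past midnight (equal start and end then cover the whole day).
     */
    static boolean inRange(LocalTime start, LocalTime end, LocalTime time) {
        if (start.isBefore(end)) {
            // Same-day range, e.g. 09:00 to 17:00.
            return !time.isBefore(start) && time.isBefore(end);
        }
        // Wrapping range, e.g. 22:00 to 05:00: late evening OR early morning.
        return !time.isBefore(start) || time.isBefore(end);
    }

    public static void main(String[] args) {
        LocalTime start = LocalTime.of(22, 0);
        LocalTime end = LocalTime.of(5, 0);
        System.out.println(inRange(start, end, LocalTime.of(23, 30)));
        System.out.println(inRange(start, end, LocalTime.of(3, 0)));
        System.out.println(inRange(start, end, LocalTime.of(12, 0)));
    }
}
```

A naive "start <= time && time < end" test never matches for such a range, which is one plausible way a wrapping schedule could end up being ignored.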

New in Norconex HTTP Collector 2.8.0 Build 20171112 Snapshot (Nov 13, 2017)

  • GenericLinkExtractor now extracts unquoted URLs.

New in Norconex HTTP Collector 2.8.0 Build 20171108 Snapshot (Nov 8, 2017)

  • RegexLinkExtractor now supports replacement values for regex matches.
  • GenericURLNormalizer XML configuration can now accept an empty normalizations tag to blank default normalizations.
  • Dependency updates: Norconex Collector Core 1.9.0, Norconex Commons Lang 1.14.0.
  • PhantomJSDocumentFetcher will now try to write a document for processing even if its status from PhantomJS is not "success" (for cases where valid content could be extracted nonetheless). #383
  • Blank start URLs are now silently ignored.
  • Fixed blank lines in start URL seed files throwing NullPointerException. #396
  • Orphans being reprocessed no longer trigger the "RecrawlableResolverStage". #390
  • Fixed improper saving as XML of GenericDelayResolver.
  • Fixed GenericDelayResolver not taking into account ranges that overlap into the next cycle (e.g. 22:00 to 5:00).
  • Fixed setting proxy realm having no effect on GenericHttpClientFactory.
  • Fixed invalid validation error for "tempDir" element on StandardSitemapResolverFactory.
  • Fixed PhantomJSDocumentFetcher sometimes throwing a NullPointerException when content-type could not be established. #408 
  • Removed these deprecated classes: HtmlLinkExtractor, HttpMetadataChecksummer, and deprecated methods: HttpCrawler#getUrlsFiles, HttpCrawler#setUrlsFiles.

New in Norconex HTTP Collector 2.8.0 Build 20171105 Snapshot (Nov 6, 2017)

  • New "authFormParams" option for adding arbitrary parameters to authentication forms.
  • Now handles circular references between a non-canonical page pointing to a URL redirecting back to that non-canonical page. When this loop is encountered, it will now process the non-canonical page (used to reject both). #162
  • *.http.doc.IHttpDocumentProcessor now deprecated in favor of *.http.processor.IHttpDocumentProcessor.
  • Crawler event REJECTED_CANONICAL now deprecated in favor of REJECTED_NONCANONICAL.

New in Norconex HTTP Collector 2.8.0 Build 20170929 Snapshot (Sep 29, 2017)

  • New FeaturedImageProcessor class to extract the "main" image from a web page.
  • New metadata field "collector.redirect-trail" for storing all redirect source URLs with the target URL metadata.
  • New "resourceTimeout" option on PhantomJSDocumentFetcher to prevent individual page items from hanging PhantomJS for a long time.
  • *.http.doc.IHttpDocumentProcessor now deprecated in favor of *.http.processor.IHttpDocumentProcessor.
  • PhantomJSDocumentFetcher will now try to write a document for processing even if its status from PhantomJS is not "success" (for cases where valid content could be extracted nonetheless).
  • Blank start URLs are now silently ignored.
  • Fixed blank lines in start URL seed files throwing NullPointerException.
  • Orphans being reprocessed no longer trigger the "RecrawlableResolverStage".
  • Fixed invalid validation error for "tempDir" element on StandardSitemapResolverFactory.
  • Removed these deprecated classes: HtmlLinkExtractor, HttpMetadataChecksummer, and deprecated methods: HttpCrawler#getUrlsFiles, HttpCrawler#setUrlsFiles.

New in Norconex HTTP Collector 2.8.0 Build 20170803 Snapshot (Aug 4, 2017)

  • GenericLinkExtractor can now limit link extraction to a specified portion of text only.
  • It is now possible to store as metadata URLs that are not "in-scope" with a new flag "keepOutOfScopeLinks". #360
  • New GenericHttpClientFactory#removeRequestHeader(String) method.
  • RegexLinkExtractor now supports replacement values for regex matches.
  • GenericURLNormalizer XML configuration can now accept an empty normalizations tag to blank default normalizations.
  • Dependency updates: Norconex Collector Core 1.8.3, Norconex Commons Lang 1.14.0.
  • Fixed improper saving as XML of GenericDelayResolver.
  • Fixed GenericDelayResolver not taking into account ranges that overlap into the next cycle (e.g. 22:00 to 5:00).
  • Fixed setting proxy realm having no effect on GenericHttpClientFactory.

New in Norconex HTTP Collector 2.7.2 Build 20170712 Snapshot (Jul 12, 2017)

  • Maintenance release

New in Norconex HTTP Collector 2.7.2 Build 20170707 Snapshot (Jul 8, 2017)

  • Dependency updates: Norconex Commons Lang 1.14.0.
  • Fixed GenericDelayResolver not taking into account ranges that overlap into the next cycle (e.g. 22:00 to 5:00).

New in Norconex HTTP Collector 2.7.2 Build 20170701 Snapshot (Jul 3, 2017)

  • RegexLinkExtractor now supports replacement values for regex matches.
  • Fixed improper saving as XML of GenericDelayResolver.

New in Norconex HTTP Collector 2.7.2 Build 20170606 Snapshot (Jun 7, 2017)

  • Dependency updates: Norconex Collector Core 1.8.3.

New in Norconex HTTP Collector 2.7.1 Snapshot (May 26, 2017)

  • Dependency updates: Norconex Collector Core 1.8.2.
  • Fixed "caseSensitive" flag sometimes having no effect in GenericRecrawlableResolver.

New in Norconex HTTP Collector 2.7.0 (Apr 27, 2017)

  • Feature release

New in Norconex HTTP Collector 2.7.0.20170418 Snapshot (Apr 20, 2017)

  • New PhantomJSDocumentFetcher which relies on an external installation of PhantomJS to fetch documents. This provides a way to crawl web pages with JavaScript-generated content.
  • New XMLFeedLinkExtractor to extract links out of RSS or Atom feeds.
  • New RegexLinkExtractor to extract links out of any text file using regular expressions.
  • Added schema-based XML configuration validation which can be triggered from the command prompt with this new flag: -k or --checkcfg
  • Added "removeTrailingHash" to GenericURLNormalizer.
  • New "detectContentType" and "detectCharset" options on GenericDocumentFetcher for ignoring the content type and character encoding obtained from the HTTP response headers and detect them instead.
  • New features from dependency updates. Collector Core: ICollectorLifeCycleListener. Importer: MergeTagger, ExternalTransformer.
  • MongoCrawlDataStoreFactory now accepts encrypted passwords.
  • Now distributed with utility scripts.
  • XML configuration entries expecting millisecond durations can now be provided in human-readable format (e.g., "5 minutes and 30 seconds" or "5m30s").
  • Dependency updates: Norconex Collector Core 1.8.0, Norconex Commons Lang 1.13.0, Jetty 9.2.20.
  • Crawler events REJECTED_FILTER, REJECTED_BAD_STATUS, REJECTED_IMPORT, REJECTED_ERROR, REJECTED_ROBOTS_TXT, REJECTED_CANONICAL, and REJECTED_REDIRECTED are now logged at DEBUG level in log4j.properties.
  • Modified Javadoc to include an XML usage example for all XML-configurable classes.
  • Fixed OverlappingFileLockException when loading a sitemap store more than once per JVM instance.
  • Fixed NullPointerException from GenericLinkExtractor when encountering HTML files with an empty href in their base tag.
  • Fixed minor errors in writing IXMLConfigurable classes to XML.
  • Removed MapDB and Apache Derby crawlstore dependencies/implementations which were deprecated in previous versions.
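
The human-readable duration entry above accepts values such as "5 minutes and 30 seconds" or "5m30s" where milliseconds were previously required. The following is a minimal sketch of how such text could be converted to milliseconds. It is an assumption-laden illustration, not the parser shipped with the collector: the class DurationSketch, the method toMillis, and the list of accepted unit names are all hypothetical.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not the parser used by Norconex HTTP Collector.
final class DurationSketch {

    // A number followed by optional spaces and a unit word, e.g. "5m" or "30 seconds".
    private static final Pattern PART = Pattern.compile("(\\d+)\\s*([a-z]+)");

    /** Converts text such as "5m30s" or a plain number of milliseconds to milliseconds. */
    static long toMillis(String text) {
        String s = text.trim().toLowerCase(Locale.ROOT);
        if (s.matches("\\d+")) {
            return Long.parseLong(s); // already expressed in milliseconds
        }
        long total = 0;
        boolean found = false;
        Matcher m = PART.matcher(s);
        while (m.find()) {
            found = true;
            total += Long.parseLong(m.group(1)) * unitMillis(m.group(2));
        }
        if (!found) {
            throw new IllegalArgumentException("Not a duration: " + text);
        }
        return total;
    }

    // Unit names accepted here are an assumption made for this example.
    private static long unitMillis(String unit) {
        switch (unit) {
            case "ms": case "millisecond": case "milliseconds": return 1L;
            case "s": case "second": case "seconds": return 1_000L;
            case "m": case "minute": case "minutes": return 60_000L;
            case "h": case "hour": case "hours": return 3_600_000L;
            case "d": case "day": case "days": return 86_400_000L;
            default: throw new IllegalArgumentException("Unknown unit: " + unit);
        }
    }

    public static void main(String[] args) {
        System.out.println(toMillis("5 minutes and 30 seconds"));
        System.out.println(toMillis("5m30s"));
    }
}
```

Filler words such as "and" are skipped because they are not preceded by a number, so both sample inputs resolve to the same value.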

New in Norconex HTTP Collector 2.6.2 (Feb 10, 2017)

  • New "sslProtocols" configuration option on GenericHttpClientFactory to specify which SSL/TLS protocols are supported (e.g., SSLv3, TLSv1, TLSv1.1, TLSv1.2). Useful to enable TLSv1.2 on Java 7 when encountering SSL handshake problems on sites supporting TLSv1.2 only.
  • Corrected documentation in collector-http-config-reference.xml to indicate default orphan strategy is now PROCESS.

New in Norconex HTTP Collector 2.6.1.20161008 Snapshot (Oct 21, 2016)

  • New "addDomainTrailingSlash" normalization rule for GenericURLNormalizer.
  • Dependency updates: Norconex Commons Lang 1.12.2, JJ2000 5.3, Norconex Collector Core 1.7.0.

New in Norconex HTTP Collector 2.6.1.20160829 Snapshot (Aug 30, 2016)

  • New "addDomainTrailingSlash" normalization rule for GenericURLNormalizer.
  • Dependency updates: Norconex Commons Lang 1.12.0, Norconex Collector Core 1.6.1.

New in Norconex HTTP Collector 2.6.0 (Aug 26, 2016)

  • New normalization rule for GenericURLNormalizer: removeTrailingSlash.
  • Can now specify "notFoundStatusCodes" on GenericMetadataFetcher.
  • Specifying an empty "path" tag in XML config, or setting a null or empty string array on the StandardSitemapResolverFactory#setSitemapPaths(...) method, will now prevent trying to locate sitemaps using default locations and will strictly rely on sitemap URLs specified as start URLs or found in robots.txt (if enabled).
  • GenericLinkExtractor no longer extracts URLs from HTML/XML comments by default. To re-enable this behavior, a new "setCommentsEnabled(boolean)" method has been added.
  • Normalization rule "addTrailingSlash" in GenericURLNormalizer has been renamed "addDirectoryTrailingSlash".
  • SitemapStore now uses MVStore instead of MapDB for storing the list of processed sitemaps.
  • Referrer data is now always stored for GenericLinkExtractor (default) and TikaLinkExtractor.
  • Redirects encountered by IHttpMetadataFetcher implementations are now followed by default.
  • Dependency updates: Norconex Collector Core 1.6.0.
  • Renamed constant HttpMetadata.COLLECTOR_REFERNCED_URLS to HttpMetadata.COLLECTOR_REFERENCED_URLS (was misspelled).
  • API break: method signature changed for IHttpMetadataFetcher from Properties fetchHTTPHeaders(HttpClient httpClient, String url) to HttpFetchResponse fetchHTTPHeaders(HttpClient httpClient, String url, Properties headers).
  • The subject for the crawler event HttpCrawlerEvent.DOCUMENT_METADATA_FETCHED is now an instance of HttpFetchResponse.
  • Extracted canonical URLs now have their referrer reference stored with them.
  • Extracted URLs are now stored in crawl store.
  • Now using MVStore for sitemap store (instead of MapDB).
  • Fixed GenericRecrawlableResolver#MinFrequency() constructor with arguments not setting patterns correctly.
  • Fixed documents wrongfully being considered orphans when referrer was skipped for being unmodified or premature, or was in temporary error, before its URLs could be extracted. This could cause valid documents to be deleted/ignored (depending on orphan strategy used). Now those "child" URLs will be queued for processing as if they were extracted from referrer page.
  • Canonical URLs extracted are now normalized before being compared to their containing page URL, which were already normalized.

New in Norconex HTTP Collector 2.5.1 (Jul 28, 2016)

  • Updated references to test page URLs to now be "https" in "minimum" and "complex" example configurations. #252
  • Maven dependency updates: Joda Time 2.9.4, Apache HttpClient 4.5.2, JSoup 1.8.3.

New in Norconex HTTP Collector 2.5.1-20160628 Snapshot (Jun 30, 2016)

  • Updated references to test page URLs to now be "https" in "minimum" and "complex" example configurations. #252
  • Maven dependency updates: Norconex Collector Core 1.5.1, Joda Time 2.9.4, Apache HttpClient 4.5.2.

New in Norconex HTTP Collector 2.5.0 (Jun 8, 2016)

  • Fixed AbstractDelay wrongfully computing seconds when logging debug statement.

New in Norconex HTTP Collector 2.5.0-20160422 Snapshot (Apr 28, 2016)

  • Now supports sitemap "lastmod" and "changefreq", and also allows for specifying custom minimum recrawl frequency for documents. Default implementation is GenericRecrawlableResolver (implements new IRecrawlableResolver).
  • Document "crawl date" and content type can be added to the crawl data and will be stored in the crawl data store (affects all ICrawlDataStoreFactory implementations).
  • MVStore is now the default URL crawl store.
  • Maven dependency updates: Norconex Collector Core 1.5.0.
  • StandardSitemapResolver now parses sitemaps with content type application/gzip (in addition to application/x-gzip).
  • Fixed "stayOnDomain" being true not being honored for extracted URLs with no scheme (//www.example.com).
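To illustrate the scheme-less URL case mentioned above, here is a minimal sketch using only java.net.URI. It is not the collector's own code: the class and method names are hypothetical, and the same-host comparison is only an assumption about what a domain check might look like.

```java
import java.net.URI;

public class SchemelessUrlSketch {

    // Resolves an extracted link against the page it was found on.
    // A scheme-less link such as "//www.example.com/about" inherits
    // the scheme of the referring page.
    static URI resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link);
    }

    // Hypothetical domain check: same host as the referring page,
    // compared case-insensitively after the link has been resolved.
    static boolean sameHost(String pageUrl, String link) {
        String pageHost = URI.create(pageUrl).getHost();
        String linkHost = resolve(pageUrl, link).getHost();
        return pageHost != null && pageHost.equalsIgnoreCase(linkHost);
    }

    public static void main(String[] args) {
        String page = "https://www.example.com/docs/index.html";
        System.out.println(resolve(page, "//www.example.com/about"));
        System.out.println(sameHost(page, "//www.example.com/about"));
        System.out.println(sameHost(page, "//other.example.org/about"));
    }
}
```

Resolving the link before comparing hosts is the point of the sketch: a check applied to the raw "//host/path" string would have no scheme to work with.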

New in Norconex HTTP Collector 2.4.1-20160306 Snapshot (Mar 10, 2016)

  • Maven dependency updates: Norconex Collector Core 1.4.1.

New in Norconex HTTP Collector 2.4.0 (Mar 5, 2016)

  • New configurable IRedirectURLProvider with default implementation GenericRedirectURLProvider, which better handles redirect character encoding and offers encoding options.
  • HTTP authentication password and proxy password can now be encrypted when using GenericHttpClientFactory.
  • The "log4j.properties" file has been moved from classes to the installation root directory.
  • GenericLinkExtractor now only supports these URI schemes: http, https, ftp. It is possible to override these defaults with #setSchemes(String[]).
  • GenericLinkExtractor now attempts by default to detect the character encoding of documents before extracting links and referrer information. In addition, it is now possible to explicitly set which charset to use via the #setCharset(String) method.
  • Derby unit tests are now much faster after setting its system durability for testing to "test".
  • Added more logging to FORM-based authentication to help troubleshoot authentication issues.
  • Maven dependency updates: Apache HttpClient 4.5.1, Norconex Collector Core 1.4.0, Joda Time 2.9.2.
  • Dots and other characters having special meaning in regex are now escaped before robot rules are converted to regex.
  • GenericLinkExtractor now only strips the inside of script tags in HTML pages before extracting links, so that the src links can be followed if configured to do so.
  • GenericLinkExtractor is now reading characters as opposed to bytes to fix breaking some characters.
  • Robots.txt "Allow" directive is now supported properly as defined by Google bot usage.
  • URLStatusCrawlerEventListener no longer prefixes file names with "null" when no file name prefix is explicitly set.
  • Fixed both GenericLinkExtractor and TikaLinkExtractor which were not able to extract link titles when keepReferrerData was true.
  • Text is now extracted when HTML is found in an anchor body (when keepReferrerData is true).
  • GenericHttpClientFactory#trustAllSSLCertificates now disables hostname verification and algorithm constraints.
  • Fixed "lenient" flag having no effect on "sitemapResolverFactory".
  • Fixed "base href" HTML tag not being recognized when resolving relative URLs in HTML pages.
  • When the "class" attribute is not specified in "sitemapResolverFactory" for XML config, the default resolver instance is now used (as opposed to null, which was causing some configs not to be loaded properly).
  • GenericHttpClientFactory#trustAllSSLCertificates will now disable SSL SNI Extension, and enable unsafe cert algorithms that are disabled by default since Java 7.
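The robots.txt escaping fix listed above can be pictured with the following sketch. It is not the collector's implementation: the class and method names are hypothetical, and the handling of "*" (any sequence) and a trailing "$" (end of path) is assumed from common robots.txt conventions.

```java
import java.util.regex.Pattern;

public class RobotsPatternSketch {

    // Converts a robots.txt path rule to a regex string.
    // "*" becomes ".*", a trailing "$" anchors the end of the path, and
    // every other character is quoted so ".", "?", "+", etc. stay literal.
    static String toRegex(String rule) {
        boolean anchored = rule.endsWith("$");
        String body = anchored ? rule.substring(0, rule.length() - 1) : rule;
        StringBuilder regex = new StringBuilder();
        for (char c : body.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // Without an anchor, a rule matches any path it is a prefix of.
        regex.append(anchored ? "$" : ".*");
        return regex.toString();
    }

    static boolean matches(String rule, String path) {
        return Pattern.matches(toRegex(rule), path);
    }

    public static void main(String[] args) {
        System.out.println(matches("/file.html", "/file.html"));
        System.out.println(matches("/file.html", "/fileXhtml"));
        System.out.println(matches("/*.pdf$", "/docs/a.pdf"));
    }
}
```

Without the quoting step, the dot in "/file.html" would match any character, so "/fileXhtml" would wrongly be covered by the rule.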

New in Norconex HTTP Collector 2.4.0 Snapshot (Nov 17, 2015)

  • HTTP authentication password and proxy password can now be encrypted when using GenericHttpClientFactory.
  • Derby unit tests are now much faster after setting its system durability for testing to "test".
  • GenericHttpClientFactory#trustAllSSLCertificates now disables hostname verification and algorithm constraints. #196
  • Fixed "lenient" flag having no effect on "sitemapResolverFactory". #193
  • Fixed "base href" HTML tag not being recognized when resolving relative URLs in HTML pages. #188
  • When the "class" attribute is not specified in "sitemapResolverFactory" for XML config, the default resolver instance is now used (as opposed to null, which was causing some configs not to be loaded properly). #187
  • GenericHttpClientFactory#trustAllSSLCertificates will now disable SSL SNI Extension, and enable unsafe cert algorithms that are disabled by default since Java 7.

New in Norconex HTTP Collector 2.3.0 (Nov 17, 2015)

  • GenericHttpClientFactory now allows you to set HTTP request headers on every HTTP call a crawler will make.
  • New crawler configuration options: stayOnProtocol, stayOnDomain, and stayOnPort. These new settings can be applied as attributes to the startURLs tag in an XML configuration, or on the object returned by HttpCrawlerConfig#getURLCrawlScopeStrategy(). This addition affects or replaces previous implementations discussed in #138, #135, #131, #17, and possibly others.
  • It is now possible to specify one or more sitemap URLs as "start URLs". New HttpCrawlerConfig#[set|get]StartSitemapURLs(...) methods.
  • GenericURLNormalizer now applies a few normalizations by default, as described in its Javadoc.
  • New StandardSitemapFactory#setPaths(...) method to specify where to look for sitemap files for each URL processed (relative to URL root).
  • The "sitemap" tag used to set the ISitemapResolverFactory implementation has been renamed to "sitemapResolverFactory" to avoid confusion with the "sitemap" tag that can now be set as a start URL.
  • URL normalization now always takes place by default, using GenericURLNormalizer. Can be turned off by either setting it to null in the crawler configuration, or invoking GenericURLNormalizer#setDisabled(true).
  • URLs extracted from a document are now stored in "collector.referenced-urls" after they have been normalized.
  • URL redirects are now logged as REJECTED_REDIRECTED (log level INFO).
  • HtmlLinkExtractor has been deprecated in favor of GenericLinkExtractor.
  • HttpCrawlerConfig#[set|get]UrlsFiles(...) has been deprecated in favor of HttpCrawlerConfig#[set|get]StartUrlsFiles(...)
  • StandardSitemapFactory#setLocations(...) now deprecated in favor of being able to specify sitemaps as start URLs.
  • ISitemapResolver#resolveSitemaps(...) has a new argument to specify whether the sitemap locations provided were defined as "start URLs" or not.
  • Now logs User-Agent upon startup (log level INFO).
  • Maven dependency updates: Norconex Collector Core 1.3.0, Norconex Commons Lang 1.8.0.
  • Added new event types loggers to log4j.properties.
  • Corrected typos and improved documentation.
  • Saved and loaded configuration-related classes are now equal. Methods equals/hashCode/toString for those classes are now implemented uniformly and were added where missing.
  • Fixed some configuration classes not always being saved to XML properly or giving errors.
  • Relative redirect URLs are now converted to absolute.
  • Fixed robots.txt being fetched before reference filters were executed. Robots.txt files are no longer fetched for rejected references.
  • GenericLinkExtractor now unescapes HTML entities in URLs.
  • Fixed ClassCastException in HttpCrawlerRedirectStrategy when using an HTTP Proxy.
  • GenericCanonicalLinkDetector now supports links that are escaped (HTML-entities). They are now unescaped before they are processed.
  • Fixed circular redirect exception.
  • GenericLinkExtractor and TikaLinkExtractor now extract meta http-equiv refresh properly when "refresh" is without quotes or not lowercase.
  • Fixed duplicate commits when multiple URL redirects are pointing to the same target URL.
  • Fixed possible URISyntaxException in GenericURLNormalizer (fixed by updated Norconex Commons Lang URLNormalizer dependency).
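As an illustration of the meta refresh item above (unquoted or mixed-case "refresh"), here is a small regex-based sketch. It does not reproduce how GenericLinkExtractor or TikaLinkExtractor work internally; the class name, method name, and pattern are hypothetical, and the pattern assumes the http-equiv attribute appears before the content attribute.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaRefreshSketch {

    // Case-insensitive pattern accepting quoted or unquoted attribute values.
    // Assumption: "http-equiv" comes before "content" in the tag.
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta\\s[^>]*http-equiv\\s*=\\s*[\"']?refresh[\"']?[^>]*?"
            + "content\\s*=\\s*[\"']?\\s*\\d+\\s*;\\s*url\\s*=\\s*[\"']?"
            + "([^\"'>\\s]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the refresh target URL, or null when no meta refresh is found.
    static String extractRefreshUrl(String html) {
        Matcher m = META_REFRESH.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractRefreshUrl(
                "<meta http-equiv=Refresh content=\"5; URL=http://example.com/next\">"));
        System.out.println(extractRefreshUrl(
                "<META HTTP-EQUIV=\"refresh\" CONTENT=\"0;url=/home\">"));
    }
}
```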

New in Norconex HTTP Collector 2.2.1 Snapshot (Jul 31, 2015)

  • StandardRobotsTxtProvider now handles empty "Disallow: " properly.

New in Norconex HTTP Collector 2.2.0 (Jul 31, 2015)

  • Added support for canonical links defined in both HTTP Headers or as a link tag in an HTML document head tag. Canonical links detection is always performed unless explicitly disabled. #79
  • New URLStatusCrawlerEventListener class for producing reports of fetched URLs and their status. Useful for finding broken links, among other things.
  • Added three new configuration options to GenericHttpClientFactory to better deal with HTTP connectivity issues (like timeouts): "maxConnectionsPerRoute", "maxConnectionIdleTime", and "maxConnectionInactiveTime". #118
  • New LastModifiedMetadataChecksummer that uses Last-Modified HTTP header value for checksum purposes, replacing HttpMetadataChecksummer as the default implementation. For choosing one or more fields of your choice to create a checksum, you can now use the new GenericMetadataChecksummer from the Collector Core dependency.
  • New CurrentDateTagger, DateMetadataFilter, NumericMetadataFilter, TextPatternTagger, GenericSpoiledReferenceStrategizer and more new features introduced by dependency upgrades.
  • New method GenericDocumentFetcher#setNotFoundStatusCodes(int...) to specify one or several custom "Not Found" HTTP codes. Default is 404.
  • GenericHttpClientFactory default maximum connection was increased from 20 to 200 and default maximum connections per route was increased from 2 to 20. #118
  • New HttpFetchResponse class now passed to crawl event listeners after a document fetch instead of the IHttpDocumentFetcher used. This adds the ability to listen for specific HTTP response status code. As a consequence, IHttpDocumentFetcher now returns a HttpFetchResponse.
  • HttpMetadataChecksummer has been deprecated in favor of LastModifiedMetadataChecksummer.
  • HtmlLinkExtractor now supports specifying tags without an attribute for detecting URLs.
  • HtmlLinkExtractor now ignores whatever is found between "script" tags so that JavaScript-generated URLs can no longer cause trouble. #119
  • Maven dependency updates: Norconex Collector Core 1.2.0, Joda Time 2.8.1, Apache HTTP Client 4.5, Jetty Webapp 9.2.11.v20150529, Apache Ant 1.9.5.
  • Jar manifest now includes implementation entries and specifications entries (matching Maven pom.xml).
  • Improved/fixed javadoc.
  • HttpCrawlState#NOT_FOUND was migrated to Norconex Collector Core CrawlState#NOT_FOUND.
  • Fixed HTML documents being skipped when HtmlLinkExtractor found a URL of invalid format. Now a warning is issued for each bad URL instead, the document is processed anyway, and good URLs are extracted. #119
  • Fixed MongoDB stage/depth compound index. #97
  • Fixed MongoCrawlDataSerializer "referrerLinkText" link attribute/metadata having the same value as "referrerLinkTag". #82
  • HtmlLinkExtractor now decodes encoded ampersands present in URLs. #88
  • Both HtmlLinkExtractor and TikaLinkExtractor no longer extract empty href links. Added this use case to corresponding unit test. #87
  • HtmlLinkExtractor now strips leading spaces.
  • Fixed "trustAllSSLCertificates" configuration option on GenericHttpClientFactory not being recognized in XML config. #100
  • Fixed exceptions thrown in StandardRobotsTxtProvider when robots.txt contained rules ending with ? or when the referrer URL was starting with a space.
  • Several invalid characters are now supported in URLs (e.g., commas).
  • Fixed GenericDelayResolver not saving in XML properly and fixed its javadoc as well, which did not mention how to pass the delay in XML config.
  • TargetURLRedirectStrategy no longer throws an exception when redirects are disabled for a request. #124
  • Fixed "sitemapLocations" and "lenient" not being applied to StandardSitemapFactory.

New in Norconex HTTP Collector 2.2.0 Snapshot (Apr 14, 2015)

  • New CurrentDateTagger introduced by dependency upgrades.
  • Library updates: Norconex Collector Core 1.2.0

New in Norconex HTTP Collector 2.1.0 (Apr 9, 2015)

  • Several new features, updates and fixes were added by upgrading Norconex Collector Core (http://www.norconex.com/collectors/collector-core/) and Norconex Importer (http://www.norconex.com/collectors/importer/) dependencies. Those include support for OCR, translation, a title generator, new content type parsing, and more. Refer to dependency release notes for more details.
  • New methods and configuration attribute to disable checksum creation in HttpMetadataChecksummer.
  • Sitemap resolving pipeline stage is now always invoked (but won't do anything if disabled).
  • Library updates: Norconex Collector Core 1.1.0, JUnit 4.12, Joda-Time 2.7, Apache HTTP Components 4.4, Jetty Webapp 9.2.10.v20150310, Fongo 1.6.2.
  • Improved javadoc.
  • Added many unit tests for testing start vs resume vs stop vs deleted vs modified, in different JVM instances.
  • "minimum" and "complex" configuration examples now ignore sitemap.xml files.
  • Robot metadata found in HTML pages will no longer be extracted if found within an HTML comment. Robot metadata detection is also more robust (github #60).
  • Fixed NPE in HttpImporterPipelineUtil#enhanceHTTPHeaders when content type from HTTP header is not defined.
  • Fixed log4j log levels incorrectly ending with a semi-colon.

New in Norconex HTTP Collector 2.1.0 Snapshot (Mar 3, 2015)

  • Fixed link extractor not fetching link text properly when keepReferrerData is true on HtmlLinkExtractor.
  • Upgraded Norconex Collector Core to 1.1.0.
  • Added Sonatype repository to pom.xml for snapshot releases.
  • Improvements on character encoding detection from HTTP headers.
  • log4j.properties from ./classes/ is now properly loaded by collector-http.sh.

New in Norconex HTTP Collector 2.0.2 (Feb 10, 2015)

  • Fixed the collector "stop" action having no effect (github #49).
  • Fixed crawl data wrongfully applied as metadata after the import phase.
  • Fixed NullPointerException when sitemap support is disabled.
  • Fixed incorrect deletion behavior for embedded orphan documents.
  • Improved log4j.properties logging options for crawler events.
  • Upgraded Norconex Collector Core dependency to 1.0.2.

New in Norconex HTTP Collector 2.0.1 (Dec 4, 2014)

  • From collector-core-1.0.1: When keepDownloads is true, saved files and directories are now prefixed with "f." and "d." respectively to avoid collisions.
  • Fixed errors in example configuration files.

New in Norconex HTTP Collector 2.0.0 (Nov 28, 2014)

  • Upgraded Norconex Importer to version 2.0.0, which brings to Norconex HTTP Collector a lot of new features, such as: Document content splitting, splitting of embedded documents into individual documents, new taggers for language detection, changing character case, parsing and formatting dates, providing content statistics, and more. Please read the Norconex Importer release notes for a complete list of changes at: http://www.norconex.com/product/importer/changes-report.html#a2.0.0
  • Can now supply a "pathsFile" as part of the startPaths, acting as a seed list.
  • New fast MVStore database implementation for URL database (from Norconex Collector Core).
  • New H2 database implementation for URL database (crawl data store).
  • Now keeps track of parent references (for embedded/split documents).
  • More unit tests, with the addition of an embedded Jetty Web server rendering test pages for some unit tests.
  • New JMX/MBean support added on crawlers.
  • IUrlExtractor is now ILinkExtractor and both of its implementing classes (HtmlLinkExtractor and TikaLinkExtractor) now also support extracting a link title and text (github #23), as well as the "nofollow" robot rule.
  • It is now possible to configure multiple link extraction classes, each taking effect on particular URLs and/or content-types.
  • IHtmlLinkExtractor can be configured to use specified HTML tags and attributes to find URLs.
  • Now licensed under The Apache License, Version 2.0.
  • Replaced the configuration option "deleteOrphans(true|false)" with "orphansStrategy(DELETE|PROCESS|IGNORE)".
  • The collector now references document content as reusable InputStream with memory caching instead of relying only on files. This saves a great deal of disk I/O and improves performance in most cases.
  • Refactored to use the new Norconex Collector Core library. A significant portion of the Norconex HTTP Collector code has been moved to that core library. Some of the moved classes are (base package being com.norconex.collector.*): http.checksum.IHttpDocumentChecksummer to core.checksum.IDocumentChecksummer, http.checksum.IHttpHeadersChecksummer to core.checksum.IMetadataChecksummer, http.checksum.DefaultHttpDocumentChecksummer to core.checksum.impl.MD5DocumentChecksummer, http.filter.IURLFilter to core.filter.IReferenceFilter, http.filter.IHttpHeadersFilter to core.filter.IMetadataFilter, http.filter.IHttpDocumentFilter to core.filter.IDocumentFilter, http.filter.impl.ExtensionURLFilter to core.filter.impl.ExtensionURLFilter, http.filter.impl.RegexHeaderFilter to core.filter.impl.RegexMetadataFilter, http.filter.impl.RegexURLFilter to core.filter.impl.RegexReferenceFilter.
  • Amongst others, the following classes were renamed (within com.norconex.collector.http.*): checksum.impl.DefaultHttpHeadersChecksummer to checksum.impl.HttpMetadataChecksummer, client.impl.DefaultHttpClientFactory to client.impl.GenericHttpClientFactory, delay.impl.DefaultDelayResolver to delay.impl.GenericDelayResolver, fetch.impl.DefaultDocumentFetcher to fetch.impl.GenericDocumentFetcher, fetch.impl.SimpleHttpHeadersFetcher to fetch.impl.GenericHttpHeadersFetcher, robot.impl.DefaultRobotsMetaProvider to robot.impl.StandardRobotsMetaProvider, robot.impl.DefaultRobotsTxtProvider to robot.impl.StandardRobotsTxtProvider, sitemap.impl.DefaultSitemapResolver to sitemap.impl.StandardSitemapResolver, url.impl.DefaultURLExtractor to url.impl.GenericURLExtractor.
  • Several references to "url" were changed to "reference".
  • New and more scalable crawler event model along with new listeners.
  • Refactored to use JEF 4.0.0 which makes the HTTP Collector easier to monitor.
  • Other library upgrades: Norconex Committer to 2.0.0 and Norconex Commons Lang to 1.5.0.
  • Removed previously deprecated classes.
  • Crawled sitemap details now has its own store (no longer mixed with the crawl data store).
  • ISiteMapResolver now needs an ISiteMapResolverFactory.
  • Sitemap resolution now stops when a stop request is issued.
  • Now checks if crawler is running before attempting to stop it.

New in Norconex HTTP Collector 1.3.4 (Aug 27, 2014)

  • MongoCrawlURLDatabase now supports user authentication.
  • Now requires Java 7 or higher.
  • Fixed DefaultRobotsTxtProvider failing to parse some robots.txt patterns.

New in Norconex HTTP Collector 1.3.3 (Aug 8, 2014)

  • Upgraded JEF to 3.0.1 to fix stop action not working.
  • Fixed NullPointerException in robots.txt resolution under some circumstances.

New in Norconex HTTP Collector 1.3.2 (Jun 18, 2014)

  • DefaultURLExtractor no longer treats an empty href as being a URL ending with a double-quote.
  • Renamed HttpMetadata key "collector.http.dept" to "collector.http.depth" (typo fix).
  • Upgraded Norconex Commons Lang to 1.3.2.
  • GenericURLNormalizer no longer rejects URLs with spaces in them. It now logs a warning instead.

New in Norconex HTTP Collector 1.3.1 (Jun 18, 2014)

  • Header and document checksum values are no longer added by default to prevent the issue described in github ticket #24. Instead, adding checksums is now an optional feature of DefaultHttpDocumentChecksummer and DefaultHttpHeadersChecksummer.

New in Norconex HTTP Collector 1.3.0 (Mar 25, 2014)

  • Now supports NTLM authentication. SPNEGO and Kerberos were also added but are experimental (see DefaultHttpClientFactory).
  • Can now specify character set of HTTP connections and authentication forms.
  • Can now set custom timeout values on HTTP connection-related activities.
  • New option to trust all SSL certificates of sites being crawled (see DefaultHttpClientFactory).
  • Can now specify a maximum number of HTTP connections for each crawler independently of configured number of threads (see DefaultHttpClientFactory).
  • DefaultHttpClientFactory introduces additional configuration options: proxy scheme, 'Expect: 100-continue' handshake, maximum HTTP redirects, local address, stale connection checks.
  • HTTP header checksum and document checksum are now added to the document metadata as HttpMetadata#CHECKSUM_HEADER and HttpMetadata#CHECKSUM_DOC.
  • The empty sub-folders contained under the "download" folder are now periodically deleted. This speeds up directory scanning and increases performance on large crawls.
  • The userAgent is now a crawler configuration option (previously was an option of DefaultHttpClientInitializer).
  • API change: IRobotsTxtProvider#getRobotsTxt(...) method signature has changed to accept the User-Agent.
  • IHttpClientInitializer is now deprecated in favor of IHttpClientFactory, giving you more control over HttpClient creation.
  • API change: Methods previously accepting DefaultHttpClient instances now have their signature accepting parent interface HttpClient instead.
  • More logging to help resolve crawler issues with DEBUG log level.
  • HttpCrawler is now more lenient upon encountering some errors that were previously aborting the entire execution.
  • Library upgrades. Updated default crawl url database (MapDB) to version 0.9.10, Norconex Commons Lang to 1.3.0, Norconex Committer to 1.2.0, Norconex Importer to 1.2.0, and Apache HttpClient to 4.3.2.
  • Now ensures that robots.txt agent matching gives priority to the most specific match (as opposed to the first match). Sitemaps detected in robots.txt are also preserved for sitemap resolving.
  • Removed classes deprecated since 1.1.
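The "most specific match" rule for robots.txt agents mentioned above can be sketched as follows. This is only one possible reading, not the collector's code: the class and method names are hypothetical, and "most specific" is assumed here to mean the longest agent token contained in the crawler's User-Agent, with "*" as a fallback.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class RobotsAgentMatchSketch {

    // Returns the robots.txt agent token that best matches the User-Agent:
    // the longest token contained in it (case-insensitive), else "*" when
    // a wildcard group is declared, else null.
    static String bestMatch(List<String> tokens, String userAgent) {
        String ua = userAgent.toLowerCase(Locale.ROOT);
        String best = null;
        boolean hasWildcard = false;
        for (String token : tokens) {
            String t = token.trim().toLowerCase(Locale.ROOT);
            if (t.equals("*")) {
                hasWildcard = true;
            } else if (!t.isEmpty() && ua.contains(t)
                    && (best == null || t.length() > best.length())) {
                best = t;
            }
        }
        if (best != null) {
            return best;
        }
        return hasWildcard ? "*" : null;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("*", "my", "mycrawler");
        System.out.println(bestMatch(tokens, "MyCrawler/1.0"));
    }
}
```

A first-match strategy would stop at "*" or "my" in this example, whereas the longest-token rule selects the group written specifically for the crawler.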

New in Norconex HTTP Collector 1.2.0 (Feb 5, 2014)

  • New optional Mongo URL Database implementation.
  • New TikaURLExtractor class providing an alternate IURLExtractor implementation based on Apache Tika HTMLParser.
  • New SegmentCountURLFilter class for filtering URLs having a specified number of segments (can check duplicate segments too).
  • New unit tests.
  • MapDB URL Database classes moved to its own "mapdb" package. DefaultCrawlURLDatabaseFactory still exists, but is just a pointer to MapDBCrawlURLDatabaseFactory.
  • Example configurations now point to Norconex test pages to ensure their stability.
  • Upgraded dependent libraries: Norconex Committer 1.1.0, Norconex Commons Lang 1.2.0, MapDB 0.9.8 and other third-party libraries.

New in Norconex HTTP Collector 1.1.1 (Feb 5, 2014)

  • Fixed not being able to extract the "href" attribute when it starts a new line.
  • Fixed HTTP redirects not storing final target URL but the source URL instead.
  • Upgraded dependent libraries to Norconex Importer 1.1.0 and Norconex Commons Lang 1.1.0.

New in Norconex HTTP Collector 1.1.0 (Feb 5, 2014)

  • Crawler now fires additional events. Added "documentRobotsMetaRejected" and "documentImportRejected" methods to IHttpCrawlerEventListener.
  • DefaultCrawlURLDatabase now uses a MapDB-based implementation for faster performance. The derby implementation has been kept for those with a preference for it.
  • Now supports sitemap.xml and sitemap index (plain or gzip). URLs from sitemaps will have the sitemap information as metadata.
  • BASIC and DIGEST authentication now supported.
  • Now supports in-page robot instructions. Via "ROBOTS" meta tag first, or "X-Robots-Tag" tag if present in HTTP header.
  • "ftp" protocol now supported.
  • It is now possible to specify the scope of each delay between URL download with DefaultDelayResolver (per crawler, site, or thread). "crawler" is default.
  • Javadoc Jar and Source Jar are now also deployed to Maven repository.
  • Deprecation of *.handler.* package. Classes have been moved to more intuitive packages.
  • IDelayResolver implementations are no longer systematically synchronized (i.e., accessible by only one thread at a time). This is a decision left up to each implementation.
  • Reduced the number of calls to the crawl database to improve performance (URL filtering must be successfully passed for a document to get queued for processing).
  • Upgraded Norconex Importer from 1.0.0 to 1.0.2.
  • Upgraded Norconex Commons Lang from 1.0.0 to 1.0.1.
  • Connection timeout now 30 seconds in DefaultHttpClientInitializer.
  • When deleting downloaded files, the text file from the importer output is now also deleted as URLs are being processed, greatly reducing the number of files on disk during the crawling process.
  • The application now prints the full path of an error file when an error occurs on startup (error-*.log).
  • When reprocessed, orphan URLs are processed/filtered in case there has been a configuration change. For instance, if the maxDepth was reduced, former URLs that were once good may now be invalid due to their depth being too big. Only "valid" URLs are now considered for detecting orphans (OK or UNMODIFIED statuses).
  • Fixed java.io.NotSerializableException when cloning HttpCrawlerConfig (when is used).

New in Norconex HTTP Collector 1.0.2 (Feb 5, 2014)

  • DefaultURLExtractor now properly handles URLs starting with ? and those prefixed with "URL=" (e.g. meta http-equiv="refresh").

New in Norconex HTTP Collector 1.0.1 (Feb 5, 2014)

  • Relative URLs starting with two slashes (//) are now extracted properly.
  • Replacements now working for GenericURLNormalizer.