Norconex HTTP Collector

4.4/5 11
Certified 100% FREE Apache License 2.0   

A powerful web crawler that can also extract and manipulate documents in order to retrieve the required information on the Internet. #Web spider  #Web crawler  #Internet crawler  #Spider  #Crawler  #Search  

Search engines allow the Internet to be interpreted in a meaningful way, as otherwise one would have to waste a lot of time finding information.

Because these tools are essential when surfing online, developers have constantly worked to improve them.

Norconex HTTP Collector is one such auxiliary tool that can be employed to crawl sites quickly and return results to a local folder or feed them directly to a search engine.

The application supports multi-threaded operation, so results are returned with little time wasted. This ability can be especially useful when dealing with particularly large websites.
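To illustrate why multi-threading helps on large sites, here is a minimal sketch of fetching many pages with a pool of worker threads. It is a generic illustration, not Norconex's own code or API: the `fetch` and `crawl` functions, the thread count, and the example URLs are all assumptions, and `fetch` is a stand-in that performs no real network request.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP download (no network access here)."""
    return f"<html>content of {url}</html>"


def crawl(urls: list[str], num_threads: int = 4) -> dict[str, str]:
    """Fetch every URL with a pool of worker threads and map URL -> body."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves input order, so zip pairs each URL with its body.
        bodies = list(pool.map(fetch, urls))
    return dict(zip(urls, bodies))


if __name__ == "__main__":
    pages = crawl([f"https://example.com/page{i}" for i in range(10)])
    print(len(pages), "pages fetched")
```

With real downloads, most of each worker's time is spent waiting on the network, which is why several threads can finish a large site much sooner than one.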

Once a target has been specified, the program automatically attempts to detect the language of the content. Text can also be extracted from pictures and PDFs, as the library has support for OCR tasks.

Other formats, such as HTML and Office documents, are supported, and the spider can also process canonical URLs.
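For readers unfamiliar with canonical URLs: a page can declare its preferred address with a `<link rel="canonical">` tag, and a crawler can use it to avoid indexing duplicates. The sketch below shows one way to read that tag using only Python's standard library. It is an illustration under assumptions, not how Norconex does it; the `CanonicalFinder` class and `find_canonical` function are hypothetical names.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens; a missing value is None.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical(html: str):
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

A crawler that finds a canonical URL different from the one it fetched could skip the page and queue the canonical address instead.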

Several settings can be customized when starting jobs, such as the crawling speed. The crawler can also be configured to treat embedded documents as distinct files, and hierarchical fields can be built.

Output documents can be filtered by URL or HTTP headers, and metadata can also be used for this purpose.
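As a rough idea of what URL and header filtering involves, the following sketch decides whether to keep a document based on a URL pattern and its `Content-Type` header. The `keep_document` function, its parameters, and the default patterns are assumptions made for illustration; the tool's real filters are set in its own configuration and may behave differently.

```python
import re


def keep_document(url, headers, include_url=r".*", exclude_url=None,
                  allowed_types=("text/html", "application/pdf")):
    """Return True if the document passes the URL and header filters."""
    # The include pattern must match the whole URL.
    if not re.fullmatch(include_url, url):
        return False
    # The exclude pattern rejects the URL if it matches anywhere.
    if exclude_url and re.search(exclude_url, url):
        return False
    # Compare only the media type, ignoring parameters such as charset.
    content_type = headers.get("Content-Type", "").split(";")[0].strip().lower()
    return content_type in allowed_types
```

For example, `keep_document("https://example.com/docs/a.html", {"Content-Type": "text/html; charset=utf-8"}, include_url=r"https://example\.com/docs/.*")` would keep the page, while the same call with an `image/png` header would reject it.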

For ease of use, several samples are available, allowing developers or users to assess the power of the tool accurately.

A concise online manual can be consulted to solve many issues, and the forums are available for further help.

System requirements

  • Java
  • Internet connection

What's new in Norconex HTTP Collector 3.0.2:

  • Fixed GenericSitemapResolver NPE when the sitemap content-type could not be detected. #803
  • Updated Maven dependencies: norconex-commons-maven-parent 1.0.2, norconex-collector-core 2.0.2, norconex-importer 3.0.1, Guava 32.0.0-jre, Selenium 4.0.0, Jetty 9.4.51.v20230217.
Norconex HTTP Collector 3.0.2

PRICE: Free
runs on: Windows 11, Windows 10 32/64 bit, Windows 8 32/64 bit, Windows 7 32/64 bit, Windows Vista 32/64 bit
file size: 102 MB
filename: norconex-collector-http-3.0.2.zip
1 screenshot: Norconex HTTP Collector - The crawler can perform both basic and advanced tasks and an output folder stores all the results
main category: Internet