Apache Lucene Changelog

What's new in Apache Lucene 9.3.0

Jul 29, 2022
  • Merge on full flush is enabled now by default with a timeout of 500ms, giving the merge policy a chance to merge NRT segments together before publishing a new point-in-time view of the IndexReader. This should give queries a small performance boost in the near-realtime case, especially terms-dictionary-intensive queries like fuzzy queries.
  • Add getAllChildren functionality to facets.
  • Added facetsets module for high dimensional (hyper-rectangle) faceting.
  • Top-level two-clause disjunctions sorted by score now use the block-max MAXSCORE algorithm, which introduced a 40%-75% speedup in our benchmarks.
  • BooleanQuery can return quick counts for simple boolean queries.
  • When running KnnVectorQuery with a filter, reuse the cached filter bit set.

New in Apache Lucene 8.11.2 (Jun 21, 2022)

  • Bug fixes:
  • LUCENE-10564: Make sure SparseFixedBitSet#or updates ramBytesUsed.
  • LUCENE-10477: Highlighter: WeightedSpanTermExtractor.extractWeightedSpanTerms now calls Query#rewrite multiple times if necessary.
  • Optimizations:
  • LUCENE-10481: FacetsCollector will not request scores if it does not use them.

New in Apache Lucene 8.11.1 (Dec 16, 2021)

  • Log4j is upgraded to v2.16.0 to mitigate CVE-2021-44228 (for Luke users)

New in Apache Lucene 8.11.0 (Nov 16, 2021)

  • Facets now properly ignore deleted documents when accumulating facet counts for all documents.
  • CheckIndex can run concurrently.

New in Apache Lucene 8.10.1 (Oct 18, 2021)

  • MultiCollector now handles a single leaf collector that wants to skip low-scoring hits when the combined score mode doesn't allow it.
  • Fix for sort optimization with search_after that was wrongly skipping documents whose values are equal to the last value of the previous page.
  • Fix for sort optimization with a chunked bulk scorer that was wrongly skipping documents.

New in Apache Lucene 8.10.0 (Sep 28, 2021)

  • New features:
  • Multi-valued fields are now supported in numeric range facet counting
  • Added new analyzer for Telugu
  • Near-real-time readers opened from an IndexCommit can now sort their leaves
  • SimpleText codec now implements skipping for its postings lists
  • Optimizations:
  • Performance improvements for faceting, including a new protected API to control which fields are counted for drill-down during drill sideways, and optimized drill sideways iterating
  • RegexpQuery's detection of adversarial (ReDoS) regular expressions is improved, catching exotic cases that it missed before, and throwing TooComplexToDeterminizeException
  • Speedup for computing the leading prefix and trailing suffix from an Automaton, and for managing powersets during determinize
  • Speedups for stored fields retrieval with the default codec (BEST_SPEED)
  • IndexWriter uses less RAM when buffering documents, especially in the case of many unique fields
  • forceMerge will now merge any number of segments at once, making it much faster in many cases
  • Compression improvements for docvalues storage

New in Apache Lucene 8.9.0 (Jun 18, 2021)

  • Compression was added to SortedSet DocValues, which significantly reduces their size on disk.
  • BM25FQuery was extended to handle similarities beyond BM25Similarity. It was renamed to CombinedFieldQuery to reflect its more general scope.
  • A new PatternTypingFilter was added to allow setting a type attribute on tokens based on a configured set of regular expressions.
  • An option was added to supply a custom leaf sorter for IndexWriter and DirectoryReader, which can speed up sort queries that use the provided sort criteria.

New in Apache Lucene 8.8.2 (Apr 13, 2021)

  • LUCENE-9870: Fix Circle2D intersectsLine t-value (distance) range clamp
  • LUCENE-9744: NPE on a degenerate query in MinimumShouldMatchIntervalsSource$MinimumMatchesIterator.getSubMatches().
  • LUCENE-9762: DoubleValuesSource.fromQuery (also used by FunctionScoreQuery.boostByQuery) could throw an exception when the query implements TwoPhaseIterator and when the score is requested repeatedly

New in Apache Lucene 8.8.1 (Feb 23, 2021)

  • No changes from 8.8.0

New in Apache Lucene 8.8.0 (Feb 1, 2021)

  • New Features:
  • LUCENE-9552: New LatLonPoint query that accepts an array of LatLonGeometries.
  • (Ignacio Vera)
  • LUCENE-9641: LatLonPoint query support for spatial relationships.
  • (Ignacio Vera)
  • LUCENE-9553: New XYPoint query that accepts an array of XYGeometries.
  • (Ignacio Vera)
  • LUCENE-9594: FeatureField supports newLinearQuery, which scores using the raw indexed values of features without any transformation.
  • (Mayya Sharipova, Adrien Grand)
  • LUCENE-9378: Doc values now allow configuring how to trade compression for retrieval speed.
  • (Adrien Grand)
  • LUCENE-9413: Add CJKWidthCharFilter and its factory
  • (Tomoko Uchida)
  • Improvements:
  • LUCENE-9455: ExitableTermsEnum should sample timeout and interruption check before calling next().
  • (Zach Chen via Bruno Roustant)
  • LUCENE-9023: GlobalOrdinalsWithScore should not compute occurrences when the provided min is 1.
  • (Jim Ferenczi)
  • LUCENE-9675: Binary doc values fields now expose their configured compression mode in the attributes of the field info.
  • (Jim Ferenczi)
  • Optimizations:
  • LUCENE-9536: Reduced memory usage for OrdinalMap when a segment has all values.
  • (Julie Tibshirani via Adrien Grand)
  • LUCENE-9021: QueryParser: re-use the LookaheadSuccess exception.
  • (Przemek Bruski via Mikhail Khludnev)
  • LUCENE-9636: Faster decoding of postings for some numbers of bits per value.
  • (Guo Feng via Adrien Grand)
  • LUCENE-9346: WANDScorer now supports queries that have a `minimumNumberShouldMatch` configured.
  • (Xi Zachary Chen via Adrien Grand)
  • Bug Fixes:
  • LUCENE-9508: DocumentsWriter was only stalling threads for 1 second, allowing documents to be indexed even when the DocumentsWriter wasn't able to keep up with flushing. Unless IW can't make progress due to an ill-behaving DWPT, this issue was barely noticeable.
  • (Simon Willnauer)
  • LUCENE-9581: Japanese tokenizer should discard the compound token instead of disabling the decomposition of long tokens when discardCompoundToken is activated.
  • (Jim Ferenczi)
  • LUCENE-9595: Make Component2D#withinPoint implementations consistent with ShapeQuery logic.
  • (Ignacio Vera)
  • LUCENE-9606: Wrap boolean queries generated by shape fields with a Constant score query.
  • (Ignacio Vera)
  • LUCENE-9635: BM25FQuery - Mask encoded norm long value in array lookup.
  • (Yilun Cui)
  • LUCENE-9617: Fix per-field memory leak in IndexWriter.deleteAll(). Reset next available internal field number to 0 on FieldInfos.clear(), to avoid wasting FieldInfo references.
  • (Michael Froh)
  • LUCENE-9642: When encoding triangles in ShapeField, make sure generated triangles are CCW by rotating triangle points before checking triangle orientation.
  • (Ignacio Vera)
  • LUCENE-9661: Fix deadlock in TermsEnum.EMPTY that occurs when trying to initialize TermsEnum and BaseTermsEnum at the same time
  • (Namgyu Kim)

New in Apache Lucene 8.7.0 (Nov 4, 2020)

  • Better compression of stored fields. Stored fields now use dictionaries in order to improve the compression ratio when there is a lot of redundancy across documents. This works for both the BEST_SPEED and the BEST_COMPRESSION modes.
  • Faster sorting by field. When a doc-value field is also indexed with points, Lucene now takes advantage of this points index in order to skip documents whose sort value is not competitive.
  • Faster flushing of stored fields when index sorting is enabled. This can significantly speed up indexing when a non-negligible amount of data is stored in the index and index sorting is enabled.

New in Apache Lucene 8.6.3 (Oct 8, 2020)

  • This release contains no additional bug fixes over the previous version 8.6.2.

New in Apache Lucene 8.6.2 (Sep 1, 2020)

  • Bug Fixes:
  • LUCENE-9478: Prevent DWPTDeleteQueue from referencing itself and leaking memory. The queue passed an implicit `this` reference to the next queue instance on flush, which leaked about 500 bytes of memory on each full flush, commit or getReader call. (Simon Willnauer)

New in Apache Lucene 8.6.1 (Aug 14, 2020)

  • Lucene 8.6.1 Release Highlights:
  • LUCENE-9443: The UnifiedHighlighter was closing the underlying reader when there were multiple term-vector fields.

New in Apache Lucene 8.6.0 (Jul 15, 2020)

  • API Changes:
  • LUCENE-9265: SimpleFSDirectory is deprecated in favor of NIOFSDirectory.
  • (Yannick Welsch)
  • LUCENE-9304: Removed ability to set DocumentsWriterPerThreadPool on IndexWriterConfig. The DocumentsWriterPerThreadPool is a packaged protected final class which made it impossible to customize.
  • (Simon Willnauer)
  • LUCENE-9339: MergeScheduler#merge no longer accepts a parameter indicating whether a new merge was found.
  • (Simon Willnauer)
  • LUCENE-9330: SortFields are now responsible for writing themselves into index headers if they are used as index sorts.
  • (Alan Woodward, Uwe Schindler, Adrien Grand)
  • LUCENE-9340: Deprecate SimpleBindings#add(SortField).
  • (Alan Woodward)
  • LUCENE-9345: MergeScheduler is now decoupled from IndexWriter. Instead it accepts a MergeSource interface that offers the basic methods to acquire pending merges, run the merge and do accounting around it.
  • (Simon Willnauer)
  • LUCENE-9349: QueryVisitor.consumeTermsMatching() now takes a Supplier<ByteRunAutomaton> to enable queries that build large automata to provide them lazily. TermsInSetQuery switches to using this method to report matching terms.
  • (Alan Woodward)
  • LUCENE-9366: DocValues.emptySortedNumeric() no longer takes a maxDoc parameter
  • (Alan Woodward)
  • LUCENE-7822: CodecUtil#checkFooter(IndexInput, Throwable) now throws a CorruptIndexException if checksums mismatch or if checksums can't be verified.
  • (Martin Amirault, Adrien Grand)
  • New Features:
  • LUCENE-7889: Grouping by range based on values from DoubleValuesSource and LongValuesSource
  • (Alan Woodward)
  • LUCENE-8962: Add IndexWriter merge-on-commit feature to selectively merge small segments on commit, subject to a configurable timeout, to improve search performance by reducing the number of small segments for searching
  • (Michael Froh, Mike Sokolov, Mike McCandless, Simon Willnauer)
  • Improvements:
  • LUCENE-9276: Use same code-path for updateDocuments and updateDocument in IndexWriter and DocumentsWriter.
  • (Simon Willnauer)
  • LUCENE-9279: Update dictionary version for Ukrainian analyzer to 4.9.1
  • (Andriy Rysin via Dawid Weiss)
  • LUCENE-8050: PerFieldDocValuesFormat should not get the DocValuesFormat on a field that has no doc values.
  • (David Smiley, Juan Rodriguez)
  • LUCENE-9304: Removed ThreadState abstraction from DocumentsWriter which allows pooling of DWPT directly and improves the approachability of the IndexWriter code.
  • (Simon Willnauer)
  • LUCENE-9324: Add an ID to SegmentCommitInfo in order to compare commits for equality and make snapshots incremental on generational files.
  • (Simon Willnauer, Mike McCandless, Adrien Grand)
  • LUCENE-9342: TotalHits' relation will be EQUAL_TO when the number of hits is lower than TopDocsCollector's numHits
  • (Tomás Fernández Löbbe)
  • LUCENE-9353: Metadata of the terms dictionary moved to its own file, with the `.tmd` extension. This allows checksums of metadata to be verified when opening indices and helps save seeks when opening an index.
  • (Adrien Grand)
  • LUCENE-9359: SegmentInfos#readCommit now always returns a CorruptIndexException if the content of the file is invalid.
  • (Adrien Grand)
  • LUCENE-9393: Make FunctionScoreQuery use ScoreMode.COMPLETE for creating the inner query weight when ScoreMode.TOP_DOCS is requested.
  • (Tomás Fernández Löbbe)
  • LUCENE-9392: Make FacetsConfig.DELIM_CHAR publicly accessible.
  • (Ankur Goel)
  • LUCENE-9397: UniformSplit supports encodable fields metadata.
  • (Bruno Roustant)
  • LUCENE-9396: Improved truncation detection for points.
  • (Adrien Grand, Robert Muir)
  • LUCENE-9402: Let MultiCollector handle minCompetitiveScore
  • (Tomás Fernández Löbbe, Adrien Grand)
  • Optimizations:
  • LUCENE-9254: UniformSplit keeps FST off-heap.
  • (Bruno Roustant)
  • LUCENE-8103: DoubleValuesSource and QueryValueSource now use a TwoPhaseIterator if one is provided by the Query.
  • (Michele Palmia, David Smiley)
  • LUCENE-9287: UsageTrackingQueryCachingPolicy no longer caches DocValuesFieldExistsQuery.
  • (Ignacio Vera)
  • LUCENE-9286: FST.Arc.BitTable reads directly FST bytes. Arc is lightweight again and FSTEnum traversal faster.
  • (Bruno Roustant)
  • LUCENE-7788: fail precommit on unparameterised log messages and examine for wasted work/objects
  • (Erick Erickson)
  • LUCENE-9273: Speed up geometry queries by specialising Component2D spatial operations. Instead of using a generic relate method for all relations, we use specialized methods for each one. In addition, the type of triangle is computed at deserialization time, therefore we can be more selective when decoding points of a triangle.
  • (Ignacio Vera)
  • LUCENE-9087: Always build trees with full leaves and lower the default value for maxPointsPerLeafNode to 512.
  • (Ignacio Vera)
  • LUCENE-9148: Points now write their index in a separate file.
  • (Adrien Grand)
  • Bug Fixes:
  • LUCENE-9259: Fix wrong NGramFilterFactory argument name for preserveOriginal option
  • (Paul Pazderski)
  • LUCENE-8849: DocValuesRewriteMethod.visit wasn't visiting its embedded query
  • (Michele Palmia, David Smiley)
  • LUCENE-9258: DocTermsIndexDocValues assumed it was operating on a SortedDocValues (single-valued) field when it could be a multi-valued field used with a SortedSetSelector
  • (Michele Palmia)
  • LUCENE-9164: Ensure IW processes all internal events before it closes itself on a rollback.
  • (Simon Willnauer, Nhat Nguyen, Dawid Weiss, Mike McCandless)
  • LUCENE-8908: Return default value from objectVal when doc doesn't match the query in QueryValueSource
  • (Bill Bell, hossman, Munendra S N, Michele Palmia)
  • LUCENE-9133: Fix for potential NPE in TermFilteredPresearcher for empty fields
  • (Marvin Justice via Mike Drob)
  • LUCENE-9309: Wait for #addIndexes merges when aborting merges.
  • (Simon Willnauer)
  • LUCENE-9337: Ensure CMS updates its thread accounting data structures consistently. CMS today releases its lock after finishing a merge before it re-acquires it to update the thread accounting data structures. This causes threading issues where concurrently finishing threads fail to pick up pending merges, causing potential thread starvation on forceMerge calls.
  • (Simon Willnauer)
  • LUCENE-9314: Single-document monitor runs were using the less efficient MultiDocumentBatch implementation.
  • (Pierre-Luc Perron, Alan Woodward)
  • LUCENE-9362: Fix equality check in ExpressionValueSource#rewrite. This fixes rewriting of inner value sources.
  • (Dmitry Emets)
  • LUCENE-9405: IndexWriter incorrectly calls closeMergeReaders twice when the merged segment is 100% deleted.
  • (Michael Froh, Simon Willnauer, Mike McCandless, Mike Sokolov)
  • LUCENE-9400: Tessellator might build illegal polygons when several holes share the same vertex.
  • (Ignacio Vera)
  • LUCENE-9417: Tessellator might build illegal polygons when several holes are connected to the same vertex.
  • (Ignacio Vera)
  • LUCENE-9418: Fix ordered intervals over interleaved terms
  • (Alan Woodward)
  • Other:
  • LUCENE-9257: Always keep FST off-heap. FSTLoadMode, Reader attributes and openedFromWriter removed.
  • (Bruno Roustant)
  • LUCENE-9272: Checksums of the terms index are now verified when LeafReader#checkIntegrity is called rather than when opening the index.
  • (Adrien Grand)
  • LUCENE-9270: Update Javadoc about normalizeEntry in the Kuromoji DictionaryBuilder.
  • (Namgyu Kim)
  • LUCENE-9275: Make TestLatLonMultiPolygonShapeQueries more resilient for CONTAINS queries.
  • (Ignacio Vera)
  • LUCENE-9244: Adjust TestLucene60PointsFormat#testEstimatePointCount2Dims so it does not fail when a point is shared by multiple leaves.
  • (Ignacio Vera)
  • LUCENE-9271: ByteBufferIndexInput was refactored to work on top of the ByteBuffer API.
  • (Adrien Grand)
  • LUCENE-9191: Make LineFileDocs's random seeking more efficient, making tests using LineFileDocs faster
  • (Robert Muir, Mike McCandless)
  • LUCENE-9338: Refactors SimpleBindings to improve type safety and cycle detection
  • (Alan Woodward, Adrien Grand)
  • LUCENE-9358: Change the way the multi-dimensional BKD tree builder generates the intermediate tree representation to be equal to the one dimensional case to avoid unnecessary tree and leaves rotation.
  • (Ignacio Vera)
  • LUCENE-9288: poll_mirrors.py release script can handle HTTPS mirrors.
  • (Ignacio Vera)
  • LUCENE-9232: Fix or suppress 13 resource leak precommit warnings in lucene/replicator
  • (Andras Salamon via Erick Erickson)
  • LUCENE-9398: Always keep BKD index off-heap. BKD reader does not implement Accountable any more.
  • (Ignacio Vera)
  • Build:
  • Upgrade forbiddenapis to version 3.0.1.
  • (Uwe Schindler)
  • LUCENE-9376: Fix or suppress 20 resource leak precommit warnings in lucene/search
  • (Andras Salamon via Erick Erickson)
  • LUCENE-9380: Fix auxiliary class warnings in Lucene
  • (Erick Erickson)
  • LUCENE-9389: Enhance gradle logging calls validation: eliminate getMessage()
  • (Andras Salamon via Erick Erickson)

New in Apache Lucene 8.5.2 (May 26, 2020)

  • LUCENE-9350: Don't cache automata on FuzzyQuery

New in Apache Lucene 8.5.1 (Apr 16, 2020)

  • Bug Fixes:
  • LUCENE-9300: Index corruption with doc values updates and addIndexes.

New in Apache Lucene 8.5.0 (Mar 24, 2020)

  • XYPointField allows you to index points in flat X,Y space and efficiently find documents that fall within a bounding box, distance or arbitrary polygon
  • New query builders on LatLonShape allow you to efficiently find documents with a specific relation to a point or polygon
  • You can now store up to 16 data dimensions in a Point field
  • KoreanTokenizer supports custom dictionaries
  • Binary doc values are now compressed, and term dictionaries have improved compression
  • Index flushes are up to 20% faster if all docvalues updates are updating a single field to the same value
  • The index of stored fields and term vectors is now stored off-heap
  • Query parsers based on QueryBuilder can boost particular terms or synonyms by setting BoostAttribute values on a token stream
  • Intervals queries correctly handle repeated subterms in ordered and unordered sources

New in Apache Lucene 8.4.0 (Dec 30, 2019)

  • LatLonShape now supports the "CONTAINS" relation, which makes it possible to find all indexed shapes that contain the query shape.
  • Concurrent search is getting more efficient by allowing collectors to share information across threads in order to more efficiently skip non-competitive hits.
  • Faster FST lookups on dense nodes.
  • Postings are now decoded using SIMD instructions.
  • LRUQueryCache includes new heuristics that prevent caching from hurting latency too much.
  • LatLonShape builds a more efficient tree that is expected to translate into search speed improvements.
  • BaseDirectoryReader no longer sums up document counts across leaves eagerly, allowing for more efficient reader views that hide a subset of documents.
  • The index on top of BKD trees is now stored off-heap with MMapDirectory.
  • Simple Intervals Queries supports highlighting.
  • Reading DocValues can be interrupted when timeout is exceeded.

New in Apache Lucene 8.3.1 (Dec 5, 2019)

  • Bugfix: MultiTermIntervalsSource.visit() was not calling back to its visitor

New in Apache Lucene 8.3.0 (Nov 3, 2019)

  • New SpanishMinimalStemFilter
  • New "export all terms and doc freqs" feature to Luke with delimiters
  • Composite Matches from multiple subqueries now allow access to their submatches, and a new NamedMatches API allows marking of subqueries and a simple way to find which subqueries have matched on a given document
  • Range Query For Multiple Connected Ranges
  • LatLonDocValuesPointInPolygonQuery for LatLonDocValuesField
  • New UniformSplitPostingsFormat (name "UniformSplit") primarily benefiting in simplicity and extensibility
  • New STUniformSplitPostingsFormat (name "SharedTermsUniformSplit") that shares a single internal term dictionary across fields
  • DisjunctionMaxQuery more efficiently leverages impacts to skip non-competitive hits
  • BooleanQuery with no scoring clause can now early terminate the query when the total hits is not requested
  • Matches on wildcard queries will defer building their full disjunction until a MatchesIterator is pulled
  • spatial-extras quad and packed quad prefix trees now index points faster
  • Add additional leaf node level optimizations in LatLonShapeBoundingBoxQuery
  • Improve performance of WITHIN and DISJOINT queries for Shape queries by doing just one pass whenever possible
  • Introduce shared count based early termination across multiple slices
  • Blocktree's seekExact now short-circuits false if the term isn't in the min-max range of the segment. Large perf gain for ID/time like data when populated sequentially
  • Show SPI names instead of class names in Luke Analysis tab
  • GraphTokenStreamFiniteStrings preserves all Token attributes through its finite strings TokenStreams
  • Introduced SpanPositionRange into XML Query Parser
  • Use a sort key instead of true distance in NearestNeighbor
  • Tessellator labels the edges of the generated triangles according to whether they belong to the original polygon
  • Use exact distance between point and bounding rectangle in FloatPointNearestNeighbor
  • The Korean analyzer now splits tokens on boundaries between digits and alphabetic characters
  • MoreLikeThis is biased for uncommon fields
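
For the Blocktree seekExact entry above, the idea of rejecting a lookup from the segment's min-max term range can be sketched as follows. This is a minimal illustration in plain Java, not Lucene's code: the class `MinMaxTermSegment`, its sorted string array, and the lookup counter are hypothetical stand-ins for a segment's terms dictionary.

```java
/** Illustrative only: a sorted term array standing in for one segment's terms dictionary. */
final class MinMaxTermSegment {
    private final String[] sortedTerms; // non-empty, sorted ascending
    private int dictionaryLookups = 0;  // how often the (assumed costly) lookup actually ran

    MinMaxTermSegment(String... terms) {
        if (terms.length == 0) {
            throw new IllegalArgumentException("segment must contain at least one term");
        }
        sortedTerms = terms.clone();
        java.util.Arrays.sort(sortedTerms);
    }

    /** Returns true if the term exists; terms outside [min, max] are rejected without a lookup. */
    boolean seekExact(String term) {
        String min = sortedTerms[0];
        String max = sortedTerms[sortedTerms.length - 1];
        if (term.compareTo(min) < 0 || term.compareTo(max) > 0) {
            return false; // short-circuit: the term cannot be in this segment
        }
        dictionaryLookups++;
        return java.util.Arrays.binarySearch(sortedTerms, term) >= 0;
    }

    int dictionaryLookups() {
        return dictionaryLookups;
    }
}
```

With sequentially assigned IDs or timestamps, each segment tends to cover a narrow, mostly disjoint range, so most segments can answer a lookup with the range check alone.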

New in Apache Lucene 8.2.0 (Jul 26, 2019)

  • API Changes:
  • Intervals queries have been moved from the sandbox to the queries module.
  • New Features
  • New XYShape Field and Queries for indexing and querying general cartesian geometries.
  • Snowball stemmer/analyzer for the Estonian language.
  • Provide a FeatureSortfield to allow sorting search hits by descending value of a feature.
  • Add new KoreanNumberFilter that can change Hangul character to number and process decimal point.
  • Add doc-value support to range fields.
  • Add monitor subproject (previously Luwak monitoring library) that allows a stream of documents to be matched against a set of registered queries in an efficient manner.
  • Add a numeric range query in sandbox that takes advantage of index sorting.
  • Optimizations
  • Use exponential search instead of binary search in IntArrayDocIdSet#advance method.
  • Use incoming thread for execution if IndexSearcher has an executor. Caller threads now execute at least one search on an index even if an executor is provided, to minimize thread context switching.
  • New storing strategy for BKD tree leaves with low cardinality that can lower storage costs. It can be used at search time to speed up queries.
  • Load frequencies lazily only when needed in BlockDocsEnum and BlockImpactsEverythingEnum.
  • Phrase queries now leverage impacts.
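
The exponential search mentioned above for IntArrayDocIdSet#advance can be sketched in plain Java. This is an illustration of the general technique under simple assumptions (a sorted `int[]` of doc IDs, small enough that index arithmetic cannot overflow); the class and method names are hypothetical and this is not Lucene's implementation.

```java
/** Illustrative exponential (galloping) search over a sorted array of doc IDs. */
final class ExponentialSearchSketch {

    /**
     * Returns the index of the first element at or after position {@code from}
     * whose value is >= target, or docs.length if there is none.
     */
    static int advance(int[] docs, int from, int target) {
        if (from >= docs.length) {
            return docs.length;
        }
        if (docs[from] >= target) {
            return from;
        }
        // Double the probe distance until we overshoot the target or the array end.
        int bound = 1;
        while (from + bound < docs.length && docs[from + bound] < target) {
            bound *= 2;
        }
        // docs[from + bound / 2] < target is known, so search just past it.
        int lo = from + bound / 2 + 1;
        int hi = Math.min(from + bound, docs.length - 1);
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }
}
```

The cost grows with the logarithm of the distance to the target rather than of the whole array, which suits advance calls that usually land close to the current position.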

New in Apache Lucene 8.1.0 (May 16, 2019)

  • A query introspection API has been introduced.
  • Luke, the well-known GUI for inspecting Lucene indexes, is now a Lucene module
  • Merging of dimensional points now uses radix partitioning, which has also been optimized
  • Bugfix: LatLonShapePolygonQuery returns incorrect WITHIN results with shared boundaries
  • TieredMergePolicy#findForcedMerges now tries to create the cheapest merges
  • Build point writers in the BKD tree only when they are needed
  • SynonymQuery can now deboost the document frequency of each term when blending synonym scores
  • ConstantScoreQuery can early terminate if minimum score > constant score (total hits are not requested)
  • DateRangePrefixTree can now parse more precise dates

New in Apache Lucene 8.0.0 (Mar 14, 2019)

  • API changes:
  • LUCENE-8662: TermsEnum.seekExact(BytesRef) is now abstract, and FilterLeafReader.FilterTermsEnum now delegates seekExact(BytesRef).
  • (Jeffery Yuan via Tomás Fernández Löbbe, Simon Willnauer)
  • LUCENE-8469: Deprecated StringHelper.compare has been removed.
  • (Dawid Weiss)
  • LUCENE-8039: Introduce a "delta distance" method set to GeoDistance. This allows distance calculations, especially for paths, to take into account an "excursion" to include the specified point.
  • LUCENE-8007: Index statistics Terms.getSumDocFreq(), Terms.getDocCount() are now required to be stored by codecs. Additionally, TermsEnum.totalTermFreq() and Terms.getSumTotalTermFreq() are now required: if frequencies are not stored they are equal to TermsEnum.docFreq() and Terms.getSumDocFreq(), respectively, because all freq() values equal 1.
  • (Adrien Grand, Robert Muir)
  • LUCENE-8038: Deprecated PayloadScoreQuery constructors have been removed
  • (Alan Woodward)
  • LUCENE-8014: Similarity.computeSlopFactor() and Similarity.computePayloadFactor() have been removed
  • (Alan Woodward)
  • LUCENE-7996: Queries are now required to produce positive scores.
  • (Adrien Grand)
  • LUCENE-8099: CustomScoreQuery, BoostedQuery and BoostingQuery have been removed
  • (Alan Woodward)
  • LUCENE-8012: Explanation now takes Number rather than float
  • (Alan Woodward, Robert Muir)
  • LUCENE-8116: SimScorer now only takes a frequency and a norm as per-document scoring factors.
  • (Adrien Grand)
  • LUCENE-8113: TermContext has been renamed to TermStates, and can now be constructed lazily if term statistics are not required
  • (Alan Woodward)
  • LUCENE-8242: Deprecated method IndexSearcher#createNormalizedWeight() has been removed
  • (Alan Woodward)
  • LUCENE-8267: Memory codecs removed from the codebase (MemoryPostings, MemoryDocValues).
  • (Dawid Weiss)
  • LUCENE-8144: Moved QueryCachingPolicy.ALWAYS_CACHE to the test framework.
  • (Nhat Nguyen via Adrien Grand)
  • LUCENE-8356: StandardFilter and StandardFilterFactory have been removed
  • (Alan Woodward)
  • LUCENE-8373: StandardAnalyzer.ENGLISH_STOP_WORD_SET has been removed
  • (Alan Woodward)
  • LUCENE-8388: Unused PostingsEnum#attributes() method has been removed
  • (Alan Woodward)
  • LUCENE-8405: TopDocs.maxScore is removed. IndexSearcher and TopFieldCollector no longer have an option to compute the maximum score when sorting by field.
  • (Adrien Grand)
  • LUCENE-8411: TopFieldCollector no longer takes a fillFields option, it now always fills fields.
  • (Adrien Grand)
  • LUCENE-8412: TopFieldCollector no longer takes a trackDocScores option. Scores need to be set on top hits via TopFieldCollector#populateScores instead.
  • (Adrien Grand)
  • LUCENE-6228: A new Scorable abstract class has been added, containing only those methods from Scorer that should be called from Collectors. LeafCollector.setScorer() now takes a Scorable rather than a Scorer.
  • (Alan Woodward, Adrien Grand)
  • LUCENE-8475: Deprecated constants have been removed from RamUsageEstimator.
  • (Dimitrios Athanasiou)
  • LUCENE-8483: Scorers may no longer take null as a Weight
  • (Alan Woodward)
  • LUCENE-8352: TokenStreamComponents is now final, and can take a Consumer<Reader> in its constructor
  • (Mark Harwood, Alan Woodward, Adrien Grand)
  • LUCENE-8498: LowerCaseTokenizer has been removed, and CharTokenizer no longer takes a normalizer function.
  • (Alan Woodward)
  • LUCENE-7875: Moved MultiFields static methods out of the class. getLiveDocs is now in MultiBits which is now public. getMergedFieldInfos and getIndexedFields are now in FieldInfos. getTerms is now in MultiTerms. getTermPositionsEnum and getTermDocsEnum were collapsed and renamed to just getTermPostingsEnum and moved to MultiTerms.
  • (David Smiley)
  • LUCENE-8513: MultiFields.getFields is now removed. Please avoid this class, and Fields in general, when possible.
  • (David Smiley)
  • LUCENE-8497: MultiTermAwareComponent has been removed, and in its place TokenFilterFactory and CharFilterFactory now expose type-safe normalize() methods. This decouples normalization from tokenization entirely.
  • (Mayya Sharipova, Alan Woodward)
  • LUCENE-8597: IntervalIterator now exposes a gaps() method that reports the number of gaps between its component sub-intervals. This can be used in a new filter available via Intervals.maxgaps().
  • (Alan Woodward)
  • LUCENE-8609: Remove IndexWriter#numDocs() and IndexWriter#maxDoc() in favor of IndexWriter#getDocStats().
  • (Simon Willnauer)
  • LUCENE-8292: Make TermsEnum fully abstract.
  • (Simon Willnauer)
  • Changes in runtime behavior:
  • LUCENE-8333: Switch MoreLikeThis.setMaxDocFreqPct to use maxDoc instead of numDocs.
  • (Robert Muir, Dawid Weiss).
  • LUCENE-7837: Indices that were created before the previous major version will now fail to open even if they have been merged with the previous major version.
  • (Adrien Grand)
  • LUCENE-8020: Similarities are no longer passed terms that don't exist by queries such as SpanOrQuery, so scoring formulas no longer require divide-by-zero hacks. IndexSearcher.termStatistics/collectionStatistics return null instead of returning bogus values for a non-existent term or field.
  • (Robert Muir)
  • LUCENE-7996: FunctionQuery and FunctionScoreQuery now return a score of 0 when the function produces a negative value.
  • (Adrien Grand)
  • LUCENE-8116: Similarities now score fields that omit norms as if the norm was 1. This might change score values on fields that omit norms.
  • (Adrien Grand)
  • LUCENE-8134: Index options are no longer automatically downgraded.
  • (Adrien Grand)
  • LUCENE-8031: Length normalization correctly reflects omission of term frequencies.
  • (Robert Muir, Adrien Grand)
  • LUCENE-7444: StandardAnalyzer no longer defaults to removing English stopwords
  • (Alan Woodward)
  • LUCENE-8060: IndexSearcher's search and searchAfter methods now only compute total hit counts accurately up to 1,000 in order to enable top-hits optimizations such as block-max WAND (LUCENE-8135).
  • (Adrien Grand)
  • LUCENE-8505: IndexWriter#addIndexes will now fail if the target index is sorted but the candidate is not.
  • (Jim Ferenczi)
  • LUCENE-8535: Highlighter and FVH no longer support ToParentBlockJoinQuery and ToChildBlockJoinQuery out of the box. To highlight block-join queries, a custom WeightedSpanTermExtractor / FieldQuery should be used.
  • (Simon Willnauer, Jim Ferenczi, Julie Tibshirani)
  • LUCENE-8563: BM25 scores don't include the (k1+1) factor in their numerator anymore. This doesn't affect ordering as this is a constant factor which is the same for every document.
  • (Luca Cavanna via Adrien Grand)
  • LUCENE-8509: WordDelimiterGraphFilter will no longer set the offsets of internal tokens by default, preventing a number of bugs when the filter is chained with tokenfilters that change the length of their tokens
  • (Alan Woodward)
  • LUCENE-8633: IntervalQuery scores do not use term weighting any more, the score is instead calculated as a function of the sloppy frequency of the matching intervals.
  • (Alan Woodward, Jim Ferenczi)
  • LUCENE-8635: FSTs can now remain off-heap, accessed via IndexInput, and the default codec's term dictionary (BlockTreeTermsReader) will now leave the FST for the terms index off-heap for non-primary-key fields using MMapDirectory, reducing heap usage for such fields.
  • (Ankit Jain)
  • New features:
  • LUCENE-8340: LongPoint#newDistanceFeatureQuery may be used to boost scores based on how close the value of a long field is to a configurable origin. This is typically useful to boost by recency.
  • (Adrien Grand)
  • LUCENE-8482: LatLonPoint#newDistanceFeatureQuery may be used to boost scores based on the haversine distance of a LatLonPoint field to a provided point. This is typically useful to boost by distance.
  • (Ignacio Vera)
  • LUCENE-8216: Added a new BM25FQuery in sandbox to blend statistics across several fields using the BM25F formula.
  • (Adrien Grand, Jim Ferenczi)
  • LUCENE-8564: GraphTokenFilter is an abstract class useful for token filters that need to read-ahead in the token stream and take into account graph structures. This also changes FixedShingleFilter to extend GraphTokenFilter
  • (Alan Woodward)
  • LUCENE-8612: Intervals.extend() treats an interval as if it covered a wider span than it actually does, allowing users to force minimum gaps between intervals in a phrase.
  • (Alan Woodward)
  • LUCENE-8629: New interval functions: Intervals.before(), Intervals.after(), Intervals.within() and Intervals.overlapping().
  • (Alan Woodward)
  • LUCENE-8622: Adds a minimum-should-match interval function that produces intervals spanning a subset of a set of sources.
  • (Alan Woodward)
  • LUCENE-8645: Intervals.fixField() allows you to report intervals from one field as if they came from another.
  • (Alan Woodward)
  • LUCENE-8646: New interval functions: Intervals.prefix() and Intervals.wildcard()
  • (Alan Woodward)
  • LUCENE-8655: Add a getter to the FunctionScoreQuery class to access the underlying DoubleValuesSource.
  • (Gérald Quaire via Alan Woodward)
  • LUCENE-8697: GraphTokenStreamFiniteStrings correctly handles side paths containing gaps
  • (Alan Woodward)
  • LUCENE-8702: Simplify intervals returned from vararg Intervals factory methods
  • (Alan Woodward)
  • Improvements:
  • LUCENE-7997: Add BaseSimilarityTestCase to sanity check similarities. SimilarityBase switches to 64-bit doubles internally to help avoid common numeric issues. Add missing range checks for similarity parameters. Improve BM25 and ClassicSimilarity's explanations.
  • (Robert Muir)
  • LUCENE-8011: Improved similarity explanations.
  • (Mayya Sharipova via Adrien Grand)
  • LUCENE-4198: Codecs now have the ability to index score impacts.
  • (Adrien Grand)
  • LUCENE-8135: Boolean queries now implement the block-max WAND algorithm in order to speed up selection of top scored documents.
  • (Adrien Grand)
  • LUCENE-8279: CheckIndex now cross-checks terms with norms.
  • (Adrien Grand)
  • LUCENE-8660: TopDocsCollectors now return an accurate count (instead of a lower bound) if the total hit count is equal to the provided threshold.
  • (Adrien Grand, Jim Ferenczi)
  • Optimizations:
  • LUCENE-8040: Optimize IndexSearcher.collectionStatistics, avoiding MultiFields/MultiTerms
  • (David Smiley, Robert Muir)
  • LUCENE-4100: Disjunctions now support faster collection of top hits when the total hit count is not required.
  • (Stefan Pohl, Adrien Grand, Robert Muir)
  • LUCENE-7993: Phrase queries are now faster if total hit counts are not required.
  • (Adrien Grand)
  • LUCENE-8109: Boolean queries propagate information about the minimum competitive score in order to make collection faster if there are disjunctions or phrase queries as sub queries, which know how to leverage this information to run faster.
  • (Adrien Grand)
  • LUCENE-8439: Disjunction max queries can skip blocks to select the top documents if the total hit count is not required.
  • (Jim Ferenczi, Adrien Grand)
  • LUCENE-8204: Boolean queries with a mix of required and optional clauses are now faster if the total hit count is not required.
  • (Jim Ferenczi, Adrien Grand)
  • LUCENE-8448: Boolean queries now propagate the minimum score to their sub-scorers.
  • (Jim Ferenczi, Adrien Grand)
  • LUCENE-8511: MultiFields.getIndexedFields is now optimized; does not call getMergedFieldInfos
  • (David Smiley)
  • LUCENE-8507: TopFieldCollector can now update the minimum competitive score if the primary sort is by relevancy and the total hit count is not required.
  • (Jim Ferenczi)
  • LUCENE-8464: ConstantScoreScorer now implements setMinCompetitiveScore in order to terminate the iterator early if the minimum score is greater than the constant score.
  • (Christophe Bismuth via Jim Ferenczi)
  • LUCENE-8607: MatchAllDocsQuery can shortcut when total hit count is not required
  • (Alan Woodward, Adrien Grand)
  • LUCENE-8585: Index-time jump-tables for DocValues, for O(1) advance when retrieving doc values.
  • (Toke Eskildsen, Adrien Grand)
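
The LUCENE-8563 entry above says that dropping the (k1+1) factor from BM25 does not affect ordering. The following is a minimal sketch of why, using the textbook term-frequency part of BM25 rather than Lucene's BM25Similarity code. The class name, method names, and sample numbers are all hypothetical and chosen only for illustration.

```java
// Sketch only: textbook BM25 term-frequency part, not Lucene's implementation.
final class Bm25FactorSketch {

    /** Length-normalized term-frequency part of BM25, without the (k1 + 1) factor. */
    static double tfWithoutFactor(double freq, double docLen, double avgDocLen, double k1, double b) {
        double norm = k1 * (1 - b + b * docLen / avgDocLen);
        return freq / (freq + norm);
    }

    /** The same quantity with the historical (k1 + 1) factor in the numerator. */
    static double tfWithFactor(double freq, double docLen, double avgDocLen, double k1, double b) {
        return (k1 + 1) * tfWithoutFactor(freq, docLen, avgDocLen, k1, b);
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75, avgDocLen = 100;
        // Two hypothetical documents, described by (term frequency, document length).
        double oldA = tfWithFactor(3, 80, avgDocLen, k1, b);
        double oldB = tfWithFactor(5, 400, avgDocLen, k1, b);
        double newA = tfWithoutFactor(3, 80, avgDocLen, k1, b);
        double newB = tfWithoutFactor(5, 400, avgDocLen, k1, b);
        System.out.println("with factor:    A=" + oldA + " B=" + oldB);
        System.out.println("without factor: A=" + newA + " B=" + newB);
        System.out.println("same ordering:  " + ((oldA > oldB) == (newA > newB)));
    }
}
```

Because every document's score is multiplied by the same positive constant, absolute score values change but the relative order of hits does not. Code that compares raw scores against fixed thresholds may still need adjusting.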

New in Apache Lucene 7.7.0 (Feb 11, 2019)

  • Fix LatLonShape WITHIN queries that fail with multiple search polygons that share the dateline.
  • LatLonShape's within and disjoint queries can return false positives with indexed multi-shapes.
  • ExitableDirectoryReader may now time out queries that run on points such as range queries or geo queries.
  • StandardTokenizer and UAX29URLEmailTokenizer now support Unicode 9.0, and provide Unicode UTS#51 v11.0 Emoji tokenization with the "<EMOJI>" token type.
  • TopFieldCollector can now early-terminate queries when sorting by SortField.DOC.
  • Speed up merging segments of points with data dimensions by only sorting on the indexed dimensions.
  • The KoreanTokenizer no longer splits unknown words on combining diacritics and detects script boundaries more accurately with Character#UnicodeScript#of.
  • Change LatLonShape encoding to use 4 bytes per dimension.
  • BufferedUpdates now uses an optimized storage for buffering docvalues updates that can save up to 80% of the heap used compared to the previous implementation and uses non-object based datastructures.
  • Moving to the default accepted overhead ratio for packed ints in DocValuesFieldUpdates yields up to a 4x performance improvement when applying doc values updates.
  • Doc-value updates get applied faster by sorting with quicksort, which needs to perform fewer swaps than an in-place mergesort.
  • Decrease I/O pressure when merging high dimensional points.

New in Apache Lucene 7.6.0 (Dec 17, 2018)

  • Index sorting corruption due to numeric overflow has been fixed. Indices affected by this bug can be detected by running the CheckIndex command on a 7.6+ release distribution.
  • Better tessellation processing of Polygons including graceful exceptions for detecting invalid shapes.
  • Points codec now supports the ability to designate dimensions as "data only" dimensions that do not affect construction of the index.
  • New Simple WKT Shape Parser builds Lucene geometries (polygons, lines, rectangles) from WKT format.
  • New LatLonShapeLineQuery queries indexed shapes with arbitrary lines.
  • analyzeGraphPhrase query builder creates one phrase query per finite string in the graph based on slop parameter.
  • Performance in PerFieldMergeState#FilterFieldInfos has been improved from O(N) to O(1) lookup time.

New in Apache Lucene 7.5.0 (Sep 24, 2018)

  • IndexWriter#deleteDocs(Query... query) applies deletes to wrong documents if the index is sorted.
  • TieredMergePolicy now respects maxSegmentSizeMB by default when executing findForcedMerges and findForcedDeletesMerges.
  • A new points based Shape Indexing and Searching that decomposes shapes into a triangular mesh and indexes individual triangles as a 6 dimension point.
  • A new ByteBuffer based Directory implementation that aims to replace the deprecated RAMDirectory.
  • The UnifiedHighlighter can now use the MatchesIterator API to highlight any query more accurately.
  • TopFieldComparator can now stop comparing documents if the index is sorted, even if hits still need to be visited to compute the hit count.
  • TieredMergePolicy can control how aggressively deletes should be reclaimed with the new deletesPctAllowed setting.

New in Apache Lucene 7.4.0 (Jun 27, 2018)

  • Upgrading:
  • If you are using the AnalyzingSuggester or FuzzySuggester subclass, and if you explicitly use the preservePositionIncrements=false setting (not the default), then you ought to rebuild your suggester index. If you don't, queries or indexed data with trailing position gaps (e.g. stop words) may not work correctly.
  • (David Smiley, Jim Ferenczi)
  • API Changes:
  • LUCENE-8242: IndexSearcher.createNormalizedWeight() has been deprecated. Instead use IndexSearcher.createWeight(), rewriting the query first.
  • LUCENE-8248: MergePolicyWrapper is renamed to FilterMergePolicy and now also overrides getMaxCFSSegmentSizeMB.
  • LUCENE-8303: LiveDocsFormat is now only responsible for (de)serialization of live docs.
  • Changes in Runtime Behavior:
  • LUCENE-8309: Live docs are no longer backed by a FixedBitSet.
  • LUCENE-8330: Detach IndexWriter from MergePolicy. MergePolicy now instead of requiring IndexWriter as a hard dependency expects a MergeContext which IndexWriter implements.
  • New Features:
  • LUCENE-8200: Allow doc-values to be updated atomically together with a document. Doc-values updates can now be used as a soft-delete mechanism, allowing several versions of a document, or already deleted documents, to be kept around for later reuse. See "IW.softUpdateDocument(...)" for reference.
  • (Simon Willnauer)
  • LUCENE-8197: A new FeatureField makes it easy and efficient to integrate static relevance signals into the final score.
  • (Adrien Grand, Robert Muir)
  • LUCENE-8202: Add a FixedShingleFilter
  • (Alan Woodward, Adrien Grand, Jim Ferenczi)
  • LUCENE-8125: ICUTokenizer support for emoji/emoji sequence tokens.
  • (Robert Muir)
  • LUCENE-8196, LUCENE-8300: A new IntervalQuery in the sandbox allows efficient proximity searches based on minimum-interval semantics.
  • (Alan Woodward, Adrien Grand, Jim Ferenczi, Simon Willnauer, Matt Weber)
  • LUCENE-8233: Add support for soft deletes to IndexWriter delete accounting. Soft deletes are accounted for inside the index writer and therefore also by merge policies. A SoftDeletesRetentionMergePolicy is added that allows soft-deleted documents to be selectively carried over across merges for retention policies.
  • (Simon Willnauer, Mike McCandless, Robert Muir)
  • LUCENE-8237: Add a SoftDeletesDirectoryReaderWrapper that respects soft deletes if the reader is opened from a directory.
  • (Simon Willnauer, Mike McCandless, Uwe Schindler, Adrien Grand)
  • LUCENE-8229, LUCENE-8270: Add a method Weight.matches(LeafReaderContext, doc) that returns an iterator over matching positions for a given query and document. This allows exact hit extraction and will enable implementation of accurate highlighters.
  • (Alan Woodward, Adrien Grand, David Smiley)
  • LUCENE-8249: Implement Matches API for phrase queries
  • (Alan Woodward, Adrien Grand)
  • LUCENE-8246: Allow customizing the number of deletes a merge claims. This helps merge policies in the soft-delete case to correctly implement retention policies without triggering unnecessary merges.
  • (Simon Willnauer, Mike McCandless)
  • LUCENE-8231: A new analysis module (nori) similar to Kuromoji but to handle Korean using mecab-ko-dic and morphological analysis.
  • (Robert Muir, Jim Ferenczi)
  • LUCENE-8265: WordDelimiter/GraphFilter now have an option to skip tokens marked with KeywordAttribute
  • (Mike Sokolov via Mike McCandless)
  • LUCENE-8297: Add IW#tryUpdateDocValues(Reader, int, Fields...). IndexWriter can update doc values for a specific term, but this might affect all documents containing the term. With tryUpdateDocValues, users can update doc-values fields for individual documents. This makes it possible, for instance, to soft-delete individual documents.
  • (Simon Willnauer)
  • LUCENE-8298: Allow DocValues updates to reset a value. Passing a DV field with a null value to IW#updateDocValues or IW#tryUpdateDocValues will now remove the value from the provided document. This allows to undelete a soft-deleted document unless it's been claimed by a merge.
  • (Simon Willnauer)
  • LUCENE-8273: ConditionalTokenFilter allows analysis chains to skip particular token filters based on the attributes of the current token. This generalises the keyword token logic currently used for stemmers and WDF. It is integrated into CustomAnalyzer by using the `when` and `whenTerm` builder methods, and a new ProtectedTermFilter is added as an example.
  • (Alan Woodward, Robert Muir, David Smiley, Steve Rowe, Mike Sokolov)
  • LUCENE-8310: Ensure IndexFileDeleter accounts for pending deletes. Today we fail creating the IndexWriter when the directory has a pending delete. Yet, this is mainly done to prevent writing still existing files more than once. IndexFileDeleter already accounts for that for existing files which we can now use to also take pending deletes into account which ensures that all file generations per segment always go forward.
  • (Simon Willnauer)
  • LUCENE-7960: Add preserveOriginal option to the NGram and EdgeNGram filters.
  • (Ingomar Wesp, Shawn Heisey via Robert Muir)
  • LUCENE-8335: Enforce soft-deletes field up-front. Soft deletes field must be marked as such once it's introduced and can't be changed after the fact.
  • (Nhat Nguyen via Simon Willnauer)
  • LUCENE-8332: New ConcatenateGraphFilter for concatenating all tokens into one (or more in the event of a graph input). This is useful for fast analyzed exact-match lookup, suggesters, and as a component of a named entity recognition system. This was excised out of CompletionTokenStream in the NRT doc suggester.
  • (David Smiley, Jim Ferenczi)
  • Bug Fixes:
  • LUCENE-8221: MoreLikeThis.setMaxDocFreqPct can easily int-overflow on larger indexes.
  • LUCENE-8266: Detect bogus tiles when creating a standard polygon and throw a TileException.
  • (Ignacio Vera)
  • LUCENE-8234: Fixed bug in how spatial relationship is computed for GeoStandardCircle when it covers the whole world.
  • (Ignacio Vera)
  • LUCENE-8236: Filter duplicated points when creating GeoPath shapes to avoid creation of bogus planes.
  • (Ignacio Vera)
  • LUCENE-8243: IndexWriter.addIndexes(Directory[]) did not properly preserve index file names for updated doc values fields
  • (Simon Willnauer, Michael McCandless, Nhat Nguyen)
  • LUCENE-8275: Push up #checkPendingDeletes to Directory to ensure IW fails if the directory has pending deletes files even if the directory is filtered or a FileSwitchDirectory
  • (Simon Willnauer, Robert Muir)
  • LUCENE-8244: Do not leak open file descriptors in SearcherTaxonomyManager's refresh on exception
  • (Mike McCandless)
  • LUCENE-8305: ComplexPhraseQuery.rewrite now handles an embedded MultiTermQuery that rewrites to a MatchNoDocsQuery instead of throwing an exception.
  • (Bjarke Mortensen, Andy Tran via David Smiley)
  • LUCENE-8287: Ensure that empty regex completion queries always return no results.
  • (Julie Tibshirani via Jim Ferenczi)
  • LUCENE-8317: Prevent concurrent deletes from being applied during full flush. Future deletes could potentially be exposed to flushes/commits/refreshes if the amount of RAM used by deletes is greater than half of the IW RAM buffer.
  • (Simon Willnauer)
  • LUCENE-8320: Fix WindowsFS to correctly account for rename and hardlinks.
  • (Simon Willnauer, Nhat Nguyen)
  • LUCENE-8328: Ensure ReadersAndUpdates consistently executes under lock.
  • (Nhat Nguyen via Simon Willnauer)
  • LUCENE-8325: Fixed the smartcn tokenizer to not split UTF-16 surrogate pairs.
  • (chengpohi via Jim Ferenczi)
  • LUCENE-8186: LowerCaseTokenizerFactory now lowercases text in multi-term queries.
  • (Tim Allison via Adrien Grand)
  • LUCENE-8278: Some end-of-input no-scheme domain-only URL tokens are typed as <ALPHANUM> rather than <URL>.
  • (Junte Zhang, Steve Rowe)
  • LUCENE-8355: Prevent IW from opening an already dropped segment while DV updates are written.
  • (Nhat Nguyen via Simon Willnauer)
  • LUCENE-8344: TokenStreamToAutomaton (used by some suggesters) was not ignoring a trailing position increment when the preservePositionIncrement setting is false.
  • (David Smiley, Jim Ferenczi)
  • LUCENE-8357: FunctionScoreQuery.boostByQuery() and boostByValue() were producing truncated Explanations
  • (Markus Jelsma, Alan Woodward)
  • LUCENE-8360: NGramTokenFilter and EdgeNGramTokenFilter did not correctly set position increments in end()
  • (Alan Woodward)
  • Other:
  • LUCENE-8301: Update randomizedtesting to 2.6.0.
  • (Dawid Weiss)
  • LUCENE-8299: Geo3D wrapper uses new polygon method factory that gives better support for polygons with many points (>100).
  • (Ignacio Vera)
  • LUCENE-8261: InterpolatedProperties.interpolate and recursive property references.
  • (Steve Rowe, Dawid Weiss)
  • LUCENE-8228: removed obsolete IndexDeletionPolicy clone() requirements from the javadoc.
  • (Dawid Weiss)
  • LUCENE-8219: Use a realistic estimate of the number of nodes and links in LevensteinAutomaton.java, to save reallocation of arrays.
  • (Christian Ziech)
  • LUCENE-8214: Improve selection of testPoint for GeoComplexPolygon.
  • (Ignacio Vera)
  • SOLR-10912: Add automatic patch validation.
  • (Mano Kovacs, Steve Rowe)
  • LUCENE-8122, LUCENE-8175: Upgrade analysis/icu to ICU 61.1.
  • (Robert Muir, Adrien Grand, Uwe Schindler)
  • LUCENE-8291: Remove QueryTemplateManager utility class from XML queryparser. This class is just a general XML transforming tool (using property files and XSLT) and has nothing to do with query parsing. It can easily be implemented using more sophisticated libraries or using XSL transformers from the JDK. This change also removes the Lucene demo webapp to prevent XSS issues in untested/unmaintained code.
  • (Uwe Schindler)
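
The LUCENE-8325 entry above concerns tokenizers that split UTF-16 surrogate pairs. As a rough illustration of the underlying issue (this is not the smartcn tokenizer code, and the class and method names are hypothetical), the sketch below walks a string by Unicode code point so that a supplementary character is never cut in half.

```java
// Sketch only: shows code-point-aware iteration, not the smartcn tokenizer itself.
final class SurrogatePairSketch {

    /** Splits text into one string per Unicode code point, keeping surrogate pairs together. */
    static String[] splitByCodePoint(String s) {
        String[] out = new String[s.codePointCount(0, s.length())];
        int i = 0;
        for (int offset = 0; offset < s.length(); ) {
            int cp = s.codePointAt(offset);
            out[i++] = new String(Character.toChars(cp));
            // Advance by 1 char for BMP characters, 2 for supplementary ones.
            offset += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // U+20BB7 is a CJK ideograph outside the Basic Multilingual Plane.
        String supplementary = new String(Character.toChars(0x20BB7));
        String text = "a" + supplementary + "b";
        System.out.println("UTF-16 code units: " + text.length());
        System.out.println("code points:       " + splitByCodePoint(text).length);
    }
}
```

Iterating with charAt and a step of one would instead see the high and low surrogate as two separate units, which is the kind of split the fix avoids.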

New in Apache Lucene 7.3.1 (May 15, 2018)

  • The Lucene PMC is pleased to announce the release of Apache Lucene 7.3.1.
  • Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
  • This release contains one build change. The release is available for immediate download at:
  • http://lucene.apache.org/core/mirrors-core-redir.html
  • Please read CHANGES.txt for a full list of changes:
  • https://lucene.apache.org/core/7_3_1/changes/Changes.html

New in Apache Lucene 7.3.0 (Apr 5, 2018)

  • Fixed bug in how CheckIndex verifies doc-value iterators.

New in Apache Lucene 7.2.1 (Jan 15, 2018)

  • Fix advanceExact on SortedNumericDocValues produced by Lucene54DocValuesProducer

New in Apache Lucene 7.2.0 (Dec 22, 2017)

  • API Changes:
  • LUCENE-8039: Introduce a "delta distance" method set to GeoDistance. This allows distance calculations, especially for paths, to take into account an "excursion" to include the specified point.
  • LUCENE-8017, LUCENE-8042: Weight, DoubleValuesSource and related objects now implement a SegmentCacheable interface, with a single method isCacheable(LeafReaderContext) determining whether or not the object may be cached against a LeafReader.
  • (Alan Woodward, Robert Muir)
  • LUCENE-8038: Payload factors for scoring in PayloadScoreQuery are now calculated by a PayloadDecoder, instead of delegating to the Similarity.
  • (Alan Woodward)
  • LUCENE-8014: Similarity.computeSlopFactor() and Similarity.computePayloadFactor() have been deprecated.
  • (Alan Woodward)
  • LUCENE-6278: Scorer.freq() has been removed
  • (Alan Woodward)
  • LUCENE-7736: DoubleValuesSource and LongValuesSource now expose a rewrite(IndexSearcher) function.
  • (Alan Woodward)
  • LUCENE-7998: DoubleValuesSource.fromQuery() allows you to use the scores from a Query as a DoubleValuesSource.
  • (Alan Woodward)
  • LUCENE-8049: IndexWriter.getMergingSegments()'s return type was changed from Collection to Set to more accurately reflect its nature.
  • (David Smiley)
  • LUCENE-8059: TopFieldDocCollector can now early terminate collection when the sort order is compatible with the index order. As a consequence, EarlyTerminatingSortingCollector is now deprecated.
  • New Features:
  • LUCENE-8061: Add convenience factory methods to create BBoxes and XYZSolids directly from bounds objects.
  • LUCENE-7736: IndexReaderFunctions expose various IndexReader statistics as DoubleValuesSources.
  • (Alan Woodward)
  • LUCENE-8068: Allow IndexWriter to write a single DWPT to disk. Adds a flushNextBuffer method to IndexWriter that allows the caller to synchronously move the next pending or the biggest non-pending index buffer to disk. This enables flushing a selected buffer to disk without hijacking an indexing thread. This is useful, for instance, if more than one IW (shards) must be maintained in a single JVM / system.
  • (Simon Willnauer)
  • Bug Fixes:
  • LUCENE-8076: Normalize Vincenty distance calculation for planet models that aren't normalized.
  • (Ignacio Vera)
  • LUCENE-8057: Exact circle bounds computation was incorrect.
  • (Ignacio Vera)
  • LUCENE-8056: Exact circle segment bounding suffered from precision errors.
  • (Karl Wright)
  • LUCENE-8054: Fix the exact circle case where relationships fail when the planet model has c <= ab, because the planes are constructed incorrectly.
  • (Ignacio Vera)
  • LUCENE-7991: KNearestNeighborDocumentClassifier.knnSearch no longer applies a previous boosted field's factor to subsequent unboosted fields.
  • (Christine Poerschke)
  • LUCENE-7999: Switch from int to long to track the name for the next segment to write, so that very long lived indices with very frequent refreshes or commits, and high indexing thread counts, do not overflow an int
  • (Mykhailo Demianenko via Mike McCandless)
  • LUCENE-8025: Use sumTotalTermFreq=sumDocFreq when scoring DOCS_ONLY fields that omit term frequency information, as it is equivalent in that case. Previously bogus numbers were used, and many similarities would completely degrade.
  • (Robert Muir, Adrien Grand)
  • LUCENE-8045: ParallelLeafReader did not correctly report FieldInfo.dvGen
  • (Alan Woodward)
  • LUCENE-8034: Use subtraction instead of addition to sidestep int overflow in SpanNotQuery.
  • (Hari Menon via Mike McCandless)
  • LUCENE-8078: The query cache should not cache instances of MatchNoDocsQuery.
  • (Jon Harper via Adrien Grand)
  • LUCENE-8048: Filesystems do not guarantee order of directories updates
  • (Nikolay Martynov, Simon Willnauer, Erick Erickson)
  • Optimizations:
  • LUCENE-8018: Smaller FieldInfos memory footprint by not retaining unnecessary references to TreeMap entries.
  • (Julian Vassev via Adrien Grand)
  • LUCENE-7994: Use int/int scatter map to gather facet counts when the number of hits is small relative to the number of unique facet labels
  • (Dawid Weiss, Robert Muir, Mike McCandless)
  • LUCENE-8062: GlobalOrdinalsQuery is no longer eligible for caching.
  • (Jim Ferenczi)
  • LUCENE-8058: Large instances of TermInSetQuery are no longer eligible for caching as they could break memory accounting of the query cache.
  • (Adrien Grand)
  • LUCENE-8055: MemoryIndex.MemoryDocValuesIterator returns 2 documents instead of 1.
  • (Simon Willnauer)
  • LUCENE-8043: Fix document accounting in IndexWriter to prevent writing too many documents. Once this happens, Lucene refuses to open the index and throws a CorruptIndexException.
  • (Simon Willnauer, Yonik Seeley, Mike McCandless)
  • Tests:
  • LUCENE-8035: Run tests with JDK-specific options: --illegal-access=deny on Java 9+.
  • (Uwe Schindler)
  • Build:
  • LUCENE-6144: Upgrade Ivy to 2.4.0; 'ant ivy-bootstrap' now removes old Ivy jars in ~/.ant/lib/.
  • (Shawn Heisey, Steve Rowe)
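
The LUCENE-8034 entry above describes using subtraction instead of addition to sidestep int overflow. The sketch below shows the general idea on a made-up position check. It is not the SpanNotQuery code, and the names (end, gap, start) are hypothetical. It assumes all three inputs are non-negative.

```java
// Sketch only: generic overflow-safe comparison, not the actual SpanNotQuery logic.
final class OverflowSketch {

    /** Overflow-prone: end + gap wraps around to a negative int when gap is very large. */
    static boolean fitsBeforeNaive(int end, int gap, int start) {
        return end + gap <= start;
    }

    /** Same check rearranged; for non-negative inputs start - gap cannot overflow. */
    static boolean fitsBeforeSafe(int end, int gap, int start) {
        return end <= start - gap;
    }

    public static void main(String[] args) {
        int end = 10, start = 20, gap = Integer.MAX_VALUE;
        // Mathematically, 10 + 2147483647 <= 20 is false.
        System.out.println("naive: " + fitsBeforeNaive(end, gap, start));
        System.out.println("safe:  " + fitsBeforeSafe(end, gap, start));
    }
}
```

The LUCENE-7999 entry in the same list takes the other common route, widening the counter from int to long, which is the simpler choice when the value itself (rather than a comparison) can exceed the int range.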

New in Apache Lucene 7.1.0 (Oct 17, 2017)

  • Changes in runtime behavior:
  • Resolving of external entities in queryparser/xml/CoreParser is disallowed by default. See SOLR-11477 for details.
  • New features:
  • LUCENE-7970: Add a shape to Geo3D that consists of multiple planes that approximate a true circle, rather than an ellipse, for non-spherical planet models. (Karl Wright, Ignacio Vera)
  • LUCENE-7955: Add support for the concept of "nearest distance" to Geo3D's GeoPath abstraction, which is the distance along the path to the point that is closest to the provided point. (Karl Wright)
  • LUCENE-7906: Add spatial relationships between all currently-defined Geo shapes (Ignacio Vera)
  • LUCENE-7955: Add support for zero-width paths. (Karl Wright)
  • LUCENE-7936: Add serialization and deserialization support to Geo3D. (Karl Wright, Ignacio Vera)
  • LUCENE-7942: Distance computations now have the ability to accurately aggregate distances, rather than just doing sums. (Karl Wright)
  • LUCENE-7934: Add a planet model interface. (Karl Wright)
  • LUCENE-7918: Revamp the API for composites so that it's generic and can be used for many kinds of shapes. (Ignacio Vera)
  • LUCENE-7621: Add CoveringQuery, a query whose required number of matching clauses can be defined per document. (Adrien Grand)
  • LUCENE-7927: Add LongValueFacetCounts, to compute facet counts for individual numeric values (Mike McCandless)
  • LUCENE-7940: Add BengaliAnalyzer. (Md. Abdulla-Al-Sun via Robert Muir)
  • LUCENE-7392: Add point based LatLonBoundingBox as new RangeField Type. (Nick Knize)
  • LUCENE-7951: Spatial-extras has much better Geo3d support by implementing Spatial4j abstractions: SpatialContextFactory, ShapeFactory, BinaryCodec, DistanceCalculator (Ignacio Vera, David Smiley)
  • LUCENE-7973: Update dictionary version for Ukrainian analyzer to 3.9.0 (Andriy Rysin via Dawid Weiss)
  • LUCENE-7974: Add FloatPointNearestNeighbor, an N-dimensional FloatPoint K-nearest-neighbor search implementation. (Steve Rowe)
  • LUCENE-7975: Change the default taxonomy facets cache to a faster byte[] (UTF-8) based cache. (Mike McCandless)
  • LUCENE-7972: DirectoryTaxonomyReader, in Lucene's facet module, now implements Accountable, so you can more easily track how much heap it's using. (Mike McCandless)
  • LUCENE-7982: A new NormsFieldExistsQuery matches documents that have norms in a specified field (Colin Goodheart-Smithe via Mike McCandless)
  • Optimizations:
  • LUCENE-7905: Optimize how OrdinalMap (used by SortedSetDocValuesFacetCounts and others) builds its map (Robert Muir, Adrien Grand, Mike McCandless)
  • LUCENE-7655: Speed up geo-distance queries in case of dense single-valued fields when most documents match. (Maciej Zasada via Adrien Grand)
  • LUCENE-7897: IndexOrDocValuesQuery now requires the range cost to be more than 8x greater than the cost of the lead iterator in order to use doc values. (Murali Krishna P via Adrien Grand)
  • LUCENE-7925: Collapse duplicate SHOULD or MUST clauses by summing up their boosts.
  • (Adrien Grand)
  • LUCENE-7939: MinShouldMatchSumScorer now leverages two-phase iteration in order to be faster when used in conjunctions. (Adrien Grand)
  • LUCENE-7827: AnalyzingInfixSuggester doesn't create "textgrams" when minPrefixChar=0 (Mikhail Khludnev)
  • Bug fixes:
  • LUCENE-7967: The GeoDegeneratePoint isWithin() method needed allowance for numerical precision. (Karl Wright)
  • LUCENE-7965: GeoBBoxFactory was constructing the wrong shape at the poles if the longitude span was greater than 180 degrees. (Karl Wright)
  • LUCENE-7916: Prevent ArrayIndexOutOfBoundsException if ICUTokenizer is used with a different ICU JAR version than it is compiled against. Note that this is not recommended: lucene-analyzers-icu contains binary data structures specific to the ICU/Unicode versions it is built against. (Chris Koenig, Robert Muir)
  • LUCENE-7891: Lucene's taxonomy facets now use a non-buggy LRU cache by default. (Jan-Willem van den Broek via Mike McCandless)
  • LUCENE-7959: Improve NativeFSLockFactory's exception message if it cannot create write.lock for an empty index due to bad permissions/read-only filesystem/etc. (Erick Erickson, Shawn Heisey, Robert Muir)
  • LUCENE-7968: AnalyzingSuggester would sometimes order suggestions incorrectly because it did not properly break ties on the surface forms when both the weights and the analyzed forms were equal. (Robert Muir)
  • LUCENE-7957: ConjunctionScorer.getChildren was failing to return all child scorers (Adrien Grand, Mike McCandless)
  • SOLR-11477: Disallow resolving of external entities in queryparser/xml/CoreParser by default. (Michael Stepankin, Olga Barinova, Uwe Schindler, Christine Poerschke)
  • Build:
  • SOLR-11181: Switch order of maven artifact publishing procedure: deploy first instead of locally installing first, to workaround a double repository push of *-sources.jar and *-javadoc.jar files. (Lynn Monson via Steve Rowe)
  • LUCENE-6673: Maven build fails for target javadoc:jar. (Ramkumar Aiyengar, Daniel Collins via Steve Rowe)
  • LUCENE-7985: Upgrade forbiddenapis to 2.4.1. (Uwe Schindler)
  • Other:
  • LUCENE-7948, LUCENE-7937: Upgrade randomizedtesting to 2.5.3 (minor fixes in test filtering for IDEs). (Mike Sokolov, Dawid Weiss)
  • LUCENE-7933: LongBitSet now validates the numBits parameter (Won Jonghoon, Mike McCandless)
  • LUCENE-7978: Add some more documentation about setting up build environment. (Anton R. Yuste via Uwe Schindler)
  • LUCENE-7983: IndexWriter.IndexReaderWarmer is now a functional interface instead of an abstract class with a single method (Dawid Weiss)
  • LUCENE-5753: Update TLDs recognized by UAX29URLEmailTokenizer. (Steve Rowe)
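
The LUCENE-7925 entry above collapses duplicate SHOULD or MUST clauses by summing their boosts. A minimal sketch of that idea on plain strings follows. It does not use Lucene's BooleanQuery or BoostQuery classes, and the class name, method name, and inputs are hypothetical.

```java
// Sketch only: models clauses as (term, boost) pairs instead of Lucene Query objects.
final class ClauseCollapseSketch {

    /** Sums the boosts of repeated terms, preserving first-seen order. */
    static java.util.Map<String, Float> collapse(String[] terms, float[] boosts) {
        java.util.Map<String, Float> merged = new java.util.LinkedHashMap<>();
        for (int i = 0; i < terms.length; i++) {
            // Insert the boost, or add it to the boost already recorded for this term.
            merged.merge(terms[i], boosts[i], Float::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        String[] terms = {"lucene", "search", "lucene"};
        float[] boosts = {1.0f, 2.0f, 3.0f};
        // Three clauses collapse into two, with the duplicate's boosts summed.
        System.out.println(collapse(terms, boosts));
    }
}
```

The rewrite is sound when a clause's score scales linearly with its boost, so one clause with the summed boost contributes the same total as the duplicates did while being evaluated only once.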

New in Apache Lucene 7.0.1 (Oct 8, 2017)

  • ConjunctionScorer.getChildren was failing to return all child scorers

New in Apache Lucene 7.0.0 (Sep 21, 2017)

  • Doc values switched from random access to iterators.
  • The 7.0 codec now sparsely encodes sparse doc values and length normalization factors ("norms"), which also translates to optimization in both indexing, and search on sparse values. With these changes, you finally only pay for what you actually use with doc values, in index size, indexing performance, etc.
  • Index time boost for documents is now removed.
  • Substantial performance gains for delete and update heavy Lucene usage; see http://blog.mikemccandless.com/2017/07/lucene-gets-concurrent-deletes-and.html for details
  • Query scoring is now simpler with removal of coord factor, and query normalization.
  • Classic query parser no longer splits on whitespaces. This enables better multi-word synonym support.
  • The version of Lucene that created the index segment is now recorded, along with the version that last modified the index.
  • IndexWriter, used to add, update and delete documents in your index, will no longer accept broken token offsets sometimes produced by mis-behaving token filters.
  • IndexReader exposes methods that are typically used to manage resources whose lifetime needs to mimic the lifetime of segments/indexes, typically caches. They have been made much less trappy.
  • The dimensional points API now takes a field name up front to offer per-field points access, matching how the doc values APIs work.
  • The PostingsHighlighter was removed. Migrating to the UnifiedHighlighter should be straight-forward.
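
The first two entries above describe doc values moving from random access to iterators, with sparse fields only storing what is present. The sketch below imitates that access pattern in plain Java. The class is hypothetical and much simpler than Lucene's NumericDocValues, although advanceExact and longValue are named after the corresponding Lucene methods.

```java
// Sketch only: a toy forward-only iterator over sparse per-document values.
final class SparseDocValuesSketch {
    private final int[] docIds;  // sorted ids of the documents that have a value
    private final long[] values; // values[i] belongs to docIds[i]
    private int index = -1;      // current position in docIds, -1 before the first entry

    SparseDocValuesSketch(int[] docIds, long[] values) {
        this.docIds = docIds;
        this.values = values;
    }

    /** Moves forward to target and reports whether it has a value. Targets must not decrease. */
    boolean advanceExact(int target) {
        while (index + 1 < docIds.length && docIds[index + 1] <= target) {
            index++;
        }
        return index >= 0 && docIds[index] == target;
    }

    /** Value of the current document; only valid after advanceExact returned true. */
    long longValue() {
        return values[index];
    }

    public static void main(String[] args) {
        // Only 3 of 10 documents have a value, so only 3 entries are stored.
        SparseDocValuesSketch dv =
            new SparseDocValuesSketch(new int[] {2, 5, 9}, new long[] {20, 50, 90});
        long sum = 0;
        for (int doc = 0; doc < 10; doc++) {
            if (dv.advanceExact(doc)) {
                sum += dv.longValue();
            }
        }
        System.out.println("sum of present values: " + sum);
    }
}
```

Storage here grows with the number of documents that have a value rather than with the total document count, which is the "pay for what you use" property the entry refers to. The trade-off is that callers must visit documents in increasing order.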

New in Apache Lucene 6.6.1 (Sep 8, 2017)

  • Documents with multiple points that should match might not match on a memory index
  • A query which has only one synonym with AND as the default operator would wrongly translate as an AND between the query term and the synonym

New in Apache Lucene 6.6.0 (Sep 6, 2017)

  • A concurrent SortedSet facets implementation
  • spatial-extras HeatmapFacetCounter will now short-circuit its work when Bits.MatchNoBits is passed
  • OfflineSorter now passes the total number of items it will write to getWriter()
  • Move dictionary for Ukrainian analyzer to external dependency
  • SortedSetDocValuesReaderState now implements Accountable so one can see how much RAM it is using
  • OfflineSorter can now run concurrently if you pass it an optional ExecutorService
  • Sorted set facets now use sparse storage when collecting hits, when appropriate
  • PostingsHighlighter has been deprecated in favour of the UnifiedHighlighter

New in Apache Lucene 6.5.1 (Apr 28, 2017)

  • Bug fixes:
  • Fixed join queries to not reference IndexReaders, as it could cause leaks if they are cached.
  • Made LRUQueryCache delegate the scoreSupplier method.
  • Fixed index sorting to work with sparse numeric and binary doc values fields

New in Apache Lucene 6.5.0 (Apr 28, 2017)

  • It is now possible to filter out duplicates in the NRT suggester
  • SimpleQueryString now supports default fuzziness
  • IndexWriter can return the list of visible field names
  • DisjunctionScorer now supports returning the matching children clauses
  • A new FunctionScoreQuery that modifies the internal query's score using the per-document values
  • A new FunctionMatchQuery that returns any documents with a value that matches a predicate
  • A new WordDelimiterGraphFilter that outputs a correct graph structure for multi-token expansion at query time
  • A new PatternTokenizer that uses Lucene's RegExp implementation
  • RangeFieldQuery now supports CROSSES relation
  • A new IndexOrDocValuesQuery that uses either an index (points or terms) or doc values in order to run a (range, geo box and distance) query, depending on which one is more efficient
  • index-time boosts are deprecated
  • Term filters are no longer cached
  • Compound filters are cached earlier than regular queries
  • BKDReader now calls grow on larger increments
  • LatLonPointInPolygonQuery is faster
  • LatLonPointDistanceQuery now skips distance computations more often
  • To-parent block joins now implement two-phase iteration
  • Point ranges that match most documents are faster
  • PointValues#estimatePointCount is faster with Relation.CELL_INSIDE_QUERY
  • Segments are now also sorted during flush, and merging on a sorted index is substantially faster by using some of the same bulk merge optimizations that non-sorted merging uses

New in Apache Lucene 6.4.2 (Apr 28, 2017)

  • Bug Fixes:
  • LUCENE-7698: CommonGramsQueryFilter was producing a disconnected token graph, messing up phrase queries when it was used during query parsing
  • LUCENE-7676: Fixed FilterCodecReader to override more super-class methods. Also added TestFilterCodecReader class.
  • LUCENE-7717: The UnifiedHighlighter and PostingsHighlighter were not highlighting prefix queries with multi-byte characters. TermRangeQuery is affected too.

New in Apache Lucene 6.4.1 (Apr 28, 2017)

  • Highlights:
  • Javadocs now build successfully with Java 8u121
  • Fixed memory leak in the case that TermQuery or SpanTermQuery objects that wrap a TermContext were cached
  • Fixed native memory leak when the codec is configured with the BEST_COMPRESSION option
  • AnalyzingInfixSuggester now only opens an IndexWriter when changes need to be applied

New in Apache Lucene 6.4.0 (Apr 28, 2017)

  • Highlights:
  • Lucene's best efforts to un-map memory mapped files with "MMapDirectory" now work with the latest Java9 early access builds
  • A new similarity "BooleanSimilarity" that gives terms a score that is equal to their query boost
  • The axiomatic family of similarities (6 in total) based on https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
  • A new token filter "SynonymGraphFilter" that outputs a correct graph structure for multi-token synonyms at query time
  • Graph token streams, such as those produced by the "SynonymGraphFilter", are now handled accurately by query parsers
  • A new collector "DocValuesStatsCollector" gives the ability to compute statistics on DocValues field
  • It is now possible to filter "SortedDocValues" and "SortedSetDocValues" terms enum with a compiled automaton
  • The "UnifiedHighlighter" can now highlight fields with queries that don't necessarily refer to that field
  • DrillSideways can now run queries concurrently
  • Index sorting now supports sorting on multi-valued fields using MIN, MAX, etc. selectors
  • Points do not store the implicit split dimension in the 1-dimension case. This saves between 6% of memory for the largest types, such as InetAddressPoint, and 33% for the smaller types, such as HalfFloatPoint.
  • The BKD in-memory index for dimensional points now uses a compressed format, using substantially less RAM in some cases
  • The BKD writing now buffers each leaf block in heap before writing to disk, giving a small speedup in points-heavy use cases
  • "TermAutomatonQuery" now rewrites to more efficient queries when possible
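
The 6% and 33% figures quoted above for 1-dimensional points can be reproduced with a little arithmetic. The sketch below assumes (this is not stated in the notes) that each index node previously spent one byte recording the split dimension next to a split value of bytesPerDim bytes, so the saving is 1 / (1 + bytesPerDim). The value sizes used, 16 bytes for InetAddressPoint and 2 bytes for HalfFloatPoint, and the class name SplitDimSavings are likewise assumptions made for illustration.

```java
/** Hypothetical arithmetic sketch; the one-byte-per-node cost is an assumption. */
final class SplitDimSavings {

    /** Fraction of per-node bytes saved by dropping a 1-byte split dimension. */
    static double savedFraction(int bytesPerDim) {
        return 1.0 / (1 + bytesPerDim);
    }

    public static void main(String[] args) {
        // 16-byte values: 1/17 of each node is saved
        System.out.println("16-byte values: " + 100 * savedFraction(16) + "%");
        // 2-byte values: 1/3 of each node is saved
        System.out.println("2-byte values: " + 100 * savedFraction(2) + "%");
    }
}
```

Under these assumptions the savings come out at 1/17 (about 5.9%) and 1/3 (about 33.3%), which is consistent with the range quoted in the release notes.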

New in Apache Lucene 6.3.0 (Apr 28, 2017)

  • Highlights:
  • A brand new "UnifiedHighlighter" derivative of the PostingsHighlighter that can consume offsets from postings, term vectors, or analysis. It can highlight phrases as accurately as the standard Highlighter. Light term vectors can be used with offsets in postings for fast wildcard (MultiTermQuery) highlighting.
  • SimpleQueryParser now parses '*' to MatchAllDocsQuery.
  • FuzzyQuery now matches all terms within the specified edit distance, even if they are short terms.
  • Points do not store the implicit split dimension in the 1-dimension case. This saves between 6% of memory for the largest types, such as InetAddressPoint, and 33% for the smaller types, such as HalfFloatPoint.
  • Many other changes and bug fixes.

New in Apache Lucene 6.2.1 (Apr 28, 2017)

  • Highlights:
  • LUCENE-7417: The standard Highlighter could throw an IllegalArgumentException when trying to highlight a query containing a degenerate case of a MultiPhraseQuery with one term.
  • LUCENE-7440: Document id skipping (PostingsEnum.advance) could throw an ArrayIndexOutOfBoundsException on large index segments (>1.8B docs) with large skips.
  • LUCENE-7318: Fix backwards compatibility issues around StandardAnalyzer and its components, introduced with Lucene 6.2.0. The moved classes were restored in their original packages: LowerCaseFilter and StopFilter, as well as several utility classes.

New in Apache Lucene 6.2.0 (Apr 28, 2017)

  • The CREATE_NEW flag is passed when creating a file to ensure Lucene is really write-once
  • Index numeric ranges (min and max value in a single field) and search by overlapping range
  • IndexWriter methods return a sequence number indicating effective order of operations across threads
  • UkrainianMorfologikAnalyzer is a new dictionary based analyzer for the Ukrainian language
  • The Polygon class can now be created from a GeoJSON string
  • Compound file creation now verifies checksum of its component files
  • Index time sorting is now a core feature, and supports dimensional points
  • StandardAnalyzer is moved to core and is the default analyzer
  • MatchNoDocsQuery now includes the reason it was created
  • QueryParser can now be told to not pre-split on whitespace
  • MMapDirectory tries harder to prevent SIGSEGV if buggy code tries to execute searches after the index was closed, but it's still best effort
  • MMapDirectory no longer allocates weak references to ease garbage collection
  • Conjunction (MUST, FILTER) queries are faster
  • Dimensional points have much faster (~40%) flush time and use less space in the index
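
The write-once guarantee behind the CREATE_NEW entry above can be seen with the standard java.nio.file API alone. The sketch below is not Lucene's Directory code. The class name WriteOnceSketch and the file name segment.bin are hypothetical; it only shows that opening a file with StandardOpenOption.CREATE_NEW fails if the file already exists.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical illustration of write-once file creation. */
final class WriteOnceSketch {

    /** Creates the file and writes data; throws if the file already exists. */
    static void writeOnce(Path file, byte[] data) throws IOException {
        try (OutputStream out = Files.newOutputStream(
                file, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            out.write(data);
        }
    }

    /** Returns true if a second write to the same path is rejected. */
    static boolean secondWriteIsRejected() {
        try {
            Path dir = Files.createTempDirectory("write-once");
            Path file = dir.resolve("segment.bin");
            writeOnce(file, new byte[] {1, 2, 3});
            try {
                writeOnce(file, new byte[] {4});
                return false; // the existing file was overwritten
            } catch (FileAlreadyExistsException expected) {
                return true;  // CREATE_NEW refused to reuse the existing file
            } finally {
                Files.deleteIfExists(file);
                Files.deleteIfExists(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Failing fast here is the point: an accidental attempt to write the same file twice surfaces as an exception instead of silently replacing earlier data.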

New in Apache Lucene 6.1.0 (Apr 28, 2017)

  • New features:
  • Numerous improvements to LatLonPoint, for indexing a latitude/longitude point and searching by polygon, distance or box, or finding nearest neighbors
  • Geo3D now has simple APIs for creating common shape queries, matching LatLonPoint
  • Optimizations:
  • Faster indexing and searching of points.
  • Faster geo-spatial indexing and searching for LatLonPoint, Geo3D and GeoPoint (see http://home.apache.org/~mikemccand/geobench.html )
  • HardlinkCopyDirectoryWrapper optimizes file copies using hard links
  • In case of contention, the query cache now prefers returning an uncached Scorer rather than waiting on a lock.
  • Bug fixes:
  • BooleanQuery could sometimes assign too low scores to ranges of documents that matched a single clause.
  • Doc values updates could sometimes be applied in the wrong order.
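
The query cache optimization above (under contention, return an uncached Scorer rather than wait on a lock) follows a general pattern that can be sketched with ReentrantLock.tryLock(). This is not the LRUQueryCache implementation. The class NonBlockingCacheSketch and its loader callback are hypothetical, and the real cache deals with scorers, eviction and per-segment state that are omitted here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

/** Hypothetical illustration of "prefer uncached work over waiting on a lock". */
final class NonBlockingCacheSketch<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final ReentrantLock lock = new ReentrantLock();

    /** Uses the cache if the lock is free; otherwise computes without caching. */
    V get(K key, Function<K, V> loader) {
        if (!lock.tryLock()) {
            // Another thread holds the lock: do the work uncached instead of blocking.
            return loader.apply(key);
        }
        try {
            V value = cache.get(key);
            if (value == null) {
                value = loader.apply(key);
                cache.put(key, value);
            }
            return value;
        } finally {
            lock.unlock();
        }
    }

    /** Number of cached entries. */
    int size() {
        lock.lock();
        try {
            return cache.size();
        } finally {
            lock.unlock();
        }
    }
}
```

The trade-off is that a contended caller may repeat work the cache could have served, in exchange for never stalling behind another thread.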

New in Apache Lucene 5.4.0 (Dec 15, 2015)

  • NEW FEATURES:
  • LUCENE-6747: FingerprintFilter is a TokenFilter that outputs a single token which is a concatenation of the sorted and de-duplicated set of input tokens. Useful for normalizing short text in clustering/linking tasks
  • LUCENE-6875: New Serbian Filter
  • LUCENE-6720: New FunctionRangeQuery wrapper around ValueSourceScorer (returned from ValueSource/FunctionValues.getRangeScorer())
  • LUCENE-6724: Add utility APIs to GeoHashUtils to compute neighbor geohash cells
  • LUCENE-6737: Add DecimalDigitFilter which folds unicode digits to basic latin
  • LUCENE-6699: Add integration of BKD tree and geo3d APIs to give fast, very accurate query to find all indexed points within an earth-surface shape
  • LUCENE-6838: Added IndexSearcher#getQueryCache and #getQueryCachingPolicy
  • LUCENE-6844: PayloadScoreQuery can include or exclude underlying span scores from its score calculations
  • LUCENE-6778: Add GeoPointDistanceRangeQuery, to search for points within a "ring" (beyond a minimum distance and below a maximum distance)
  • LUCENE-6874: Add a new UnicodeWhitespaceTokenizer to analysis/common that uses Unicode character properties extracted from ICU4J to tokenize text on whitespace. This tokenizer will split on non-breaking space (NBSP), too
  • API CHANGES:
  • LUCENE-6590: Query.setBoost(), Query.getBoost() and Query.clone() are gone. In order to apply boosts, you now need to wrap queries in a BoostQuery.
  • LUCENE-6716: SpanPayloadCheckQuery now takes a List rather than a Collection.
  • LUCENE-6489: The various span payload queries have been moved to the queries submodule, and PayloadSpanUtil is now in sandbox.
  • LUCENE-6650: The spatial module no longer uses Filter in any way. All spatial Filters now subclass Query. The spatial heatmap/facet API now accepts a Bits parameter to filter counts.
  • LUCENE-6301: org.apache.lucene.search.Filter is now deprecated. You should use Query objects instead of Filters, and the BooleanClause.Occur.FILTER clause in order to let Lucene know that a Query should be used for filtering but not scoring.
  • LUCENE-6939: SpanOrQuery.addClause is now deprecated; clauses should all be provided at construction time.
  • LUCENE-6855: CachingWrapperQuery is deprecated and will be removed in 6.0.
  • LUCENE-6870: DisjunctionMaxQuery#add is now deprecated; clauses should all be provided at construction time.
  • LUCENE-6884: Analyzer.tokenStream() and Tokenizer.setReader() are no longer declared as throwing IOException.
  • LUCENE-6849: Expose IndexWriter.flush() method, to move all in-memory segments to disk without opening a near-real-time reader nor calling fsync
  • LUCENE-6911: Add correct StandardQueryParser.getMultiFields() method, deprecate no-op StandardQueryParser.getMultiFields(CharSequence[]) method.
  • OPTIMIZATIONS:
  • LUCENE-6708: TopFieldCollector does not compute the score several times on the same document anymore.
  • LUCENE-6720: ValueSourceScorer, returned from FunctionValues.getRangeScorer(), now uses TwoPhaseIterator.
  • LUCENE-6756: MatchAllDocsQuery now has a dedicated BulkScorer for better performance when used as a top-level query.
  • LUCENE-6746: DisjunctionMaxQuery, BoostingQuery and BoostedQuery now create sub weights through IndexSearcher so that they can be cached.
  • LUCENE-6754: Optimized IndexSearcher.count for the cases when it can use index statistics instead of collecting all matches.
  • LUCENE-6773: Nested conjunctions now iterate over documents as if clauses were all at the same level.
  • LUCENE-6777: Reuse BytesRef when visiting term ranges in GeoPointTermsEnum to reduce GC pressure
  • LUCENE-6779: Reduce memory allocated by CompressingStoredFieldsWriter to write strings larger than 64kb by an amount equal to string's utf8 size.
  • LUCENE-6850: Optimize BooleanScorer for sparse clauses.
  • LUCENE-6840: Ordinal indexes for SORTED_SET/SORTED_NUMERIC fields and addresses for BINARY fields are now stored on disk instead of in memory.
  • LUCENE-6878: Speed up TopDocs.merge.
  • LUCENE-6885: StandardDirectoryReader (initialCapacity) tweaks
  • LUCENE-6863: Optimized storage requirements of doc values fields when less than 1% of documents have a value.
  • LUCENE-6892: various lucene.index initialCapacity tweaks
  • LUCENE-6276: Added TwoPhaseIterator.matchCost(), which allows the least costly TwoPhaseIterators to be confirmed first.
  • LUCENE-6898: In the default codec, the last stored field value will not be fully read from disk if the supplied StoredFieldVisitor doesn't want it. So put your largest text field value last to benefit.
  • BUG FIXES:
  • LUCENE-6905: Unwrap center longitude for dateline crossing GeoPointDistanceQuery.
  • LUCENE-6817: ComplexPhraseQueryParser.ComplexPhraseQuery does not display slop in toString().
  • LUCENE-6730: Hyper-parameter c is ignored in term frequency NormalizationH1.
  • LUCENE-6742: Lovins & Finnish implementation of SnowballFilter was fixed to behave exactly as specified. A bug in the snowball compiler caused differences in output of the filter in comparison to the original test data. In addition, the performance of those filters was improved significantly.
  • LUCENE-6783: Removed side effects from FuzzyLikeThisQuery.rewrite.
  • LUCENE-6776: Fix geo3d math to handle randomly squashed planet models
  • LUCENE-6792: Fix TermsQuery.toString() to work with binary terms.
  • LUCENE-5503: When Highlighter's WeightedSpanTermExtractor converts a PhraseQuery to an equivalent SpanQuery, it would sometimes use a slop that is too low (no highlight) or determine inOrder wrong.
  • LUCENE-6790: Fix IndexWriter thread safety when one thread is handling a tragic exception but another is still committing
  • LUCENE-6810: Upgrade to Spatial4j 0.5 -- fixes some edge-case bugs in the spatial module. See https://github.com/locationtech/spatial4j/blob/master/CHANGES.md
  • LUCENE-6813: OfflineSorter no longer removes its output Path up front, and instead opens it for write with the StandardCopyOption.REPLACE_EXISTING to overwrite any prior file, so that callers can safely use Files.createTempFile for the output. This change also fixes OfflineSorter's default temp directory when running tests to use mock filesystems so e.g. we detect file handle leaks
  • LUCENE-6813: RangeTreeWriter was failing to close all file handles it opened, leading to intermittent failures on Windows
  • LUCENE-6826: Fix ClassCastException when merging a field that has no terms because they were filtered out by e.g. a FilterCodecReader
  • LUCENE-6823: LocalReplicator should use System.nanoTime as its clock source for checking for expiration
  • LUCENE-6856: The Weight wrapper used by LRUQueryCache now delegates to the original Weight's BulkScorer when applicable.
  • LUCENE-6858: Fix ContextSuggestField to correctly wrap token stream when using CompletionAnalyzer.
  • LUCENE-6872: IndexWriter handles any VirtualMachineError, not just OOM, as tragic.
  • LUCENE-6814: PatternTokenizer no longer hangs onto heap sized to the maximum input string it's ever seen, which can be a large memory "leak" if you tokenize large strings with many threads across many indices
  • LUCENE-6888: Explain output of map() function now also prints default value
  • OTHER:
  • LUCENE-6899: Upgrade randomizedtesting to 2.3.1.
  • LUCENE-6478: Test execution can hang with java.security.debug.
  • LUCENE-6862: Upgrade of RandomizedRunner to version 2.2.0.
  • LUCENE-6857: Validate StandardQueryParser with NOT operator within parentheses.
  • LUCENE-6827: Use explicit capacity ArrayList instead of a LinkedList in MultiFieldQueryNodeProcessor.
  • LUCENE-6812: Upgrade RandomizedTesting to 2.1.17.
  • LUCENE-6174: Improve "ant eclipse" to select right JRE for building.
  • LUCENE-6417, LUCENE-6830: Upgrade ANTLR used in expressions module to version 4.5.1-1.
  • LUCENE-6729: Upgrade ASM used in expressions module to version 5.0.4.
  • LUCENE-6738: remove IndexWriterConfig.[gs]etIndexingChain
  • LUCENE-6755: more tests of ToChildBlockJoinScorer.advance
  • LUCENE-6571: fix some private access level javadoc errors and warnings
  • LUCENE-6768: AbstractFirstPassGroupingCollector.groupSort private member is not needed.
  • LUCENE-6761: MatchAllDocsQuery's Scorers do not expose approximations anymore.
  • LUCENE-6775, LUCENE-6833: Improved MorfologikFilterFactory to allow loading of custom dictionaries from ResourceLoader. Upgraded Morfologik to version 2.0.1. The 'dictionary' attribute has been reverted back and now points at the dictionary resource to be loaded instead of the default Polish dictionary.
  • LUCENE-6797: Make GeoCircle an interface and use a factory to create it, to eventually handle degenerate cases
  • LUCENE-6800: Use XYZSolidFactory to create XYZSolids
  • LUCENE-6798: Geo3d now models degenerate (too tiny) circles as a single point
  • LUCENE-6770: Add javadocs that FSDirectory canonicalizes the path.
  • LUCENE-6795: Fix various places where code used AccessibleObject#setAccessible() without a privileged block. Code without a hard requirement to do reflection was rewritten. This makes Lucene and Solr ready for Java 9 Jigsaw's module system, where reflection on Java's runtime classes is very restricted.
  • LUCENE-6467: Simplify Query.equals.
  • LUCENE-6845: SpanScorer is now merged into Spans
  • LUCENE-6887: DefaultSimilarity is deprecated; use ClassicSimilarity for equivalent behavior, or consider switching to BM25Similarity, which will become the new default in Lucene 6.0
  • LUCENE-6893: factor out CorePlusQueriesParser from CorePlusExtensionsParser
  • LUCENE-6902: Don't retry to fsync files / directories; fail immediately.
  • LUCENE-6801: Clarify JavaDocs of PhraseQuery that it in fact supports terms at the same position (as does MultiPhraseQuery), treated like a conjunction. Added test.
  • BUILD:
  • LUCENE-6732: Improve checker for invalid source patterns to also detect javadoc-style license headers. Use Groovy to implement the checks instead of plain Ant.
  • LUCENE-6594: Update forbiddenapis to 2.0.
  • TESTS:
  • LUCENE-6752: Add Math#random() to forbiddenapis.
  • CHANGES IN BACKWARDS COMPATIBILITY POLICY:
  • LUCENE-6742: The Lovins & Finnish implementations of SnowballFilter were fixed to now behave exactly like the original Snowball stemmer. If you have indexed text using those stemmers you may need to reindex.
  • CHANGES IN RUNTIME BEHAVIOR:
  • LUCENE-6772: MultiCollector now catches CollectionTerminatedException and removes the collector that threw this exception from the list of sub collectors to collect.
  • LUCENE-6784: IndexSearcher's query caching is enabled by default. Run indexSearcher.setQueryCache(null) to disable.
  • LUCENE-6305: BooleanQuery.equals and hashcode do not depend on the order of clauses anymore.
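
The FingerprintFilter entry (LUCENE-6747) under NEW FEATURES above describes its output as a single token built by concatenating the sorted, de-duplicated input tokens. The plain-Java sketch below mirrors only that description. It is not the Lucene TokenFilter. The class name FingerprintSketch, the use of a TreeSet and the caller-supplied separator are assumptions made for illustration.

```java
import java.util.List;
import java.util.TreeSet;

/** Hypothetical illustration of the fingerprint idea; not Lucene's FingerprintFilter. */
final class FingerprintSketch {

    /** Joins the sorted, de-duplicated tokens into one string. */
    static String fingerprint(List<String> tokens, char separator) {
        // A TreeSet both removes duplicates and keeps the tokens in sorted order.
        TreeSet<String> unique = new TreeSet<>(tokens);
        return String.join(String.valueOf(separator), unique);
    }
}
```

For example, the tokens "the", "quick", "the", "brown", "fox" with a space separator yield "brown fox quick the". Two short texts that use the same words in a different order produce the same fingerprint, which is what makes the technique useful for the clustering and linking tasks mentioned in the entry.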

New in Apache Lucene 5.3.1 (Oct 2, 2015)

  • Bug Fixes:
  • Remove classloader hack in MorfologikFilter
  • UsageTrackingQueryCachingPolicy no longer caches trivial queries like MatchAllDocsQuery
  • Fixed BoostingQuery to rewrite wrapped queries
  • Highlights of this Solr release include:
  • security.json is not loaded on server start
  • RuleBasedAuthorization plugin does not work for the collection-admin-edit permission
  • VelocityResponseWriter template encoding issue. Templates must be UTF-8 encoded
  • SimplePostTool (also bin/post) -filetypes "*" now works properly in 'web' mode
  • example/files update-script.js to be Java 7 and 8 compatible.
  • SolrJ could not make requests to handlers with '/admin/' prefix
  • Use of timeAllowed can cause incomplete filters to be cached and incorrect results to be returned on subsequent requests
  • VelocityResponseWriter's $resource.get(key,baseName,locale) to use specified locale.
  • Resolve XSS issue in Admin UI stats page

New in Apache Lucene 5.3.0 (Aug 24, 2015)

  • New features:
  • LUCENE-6485: Add CustomSeparatorBreakIterator to postings highlighter which splits on any character. For example, it can be used with getMultiValueSeparator to render whole field values.
  • LUCENE-6459: Add common suggest API that mirrors Lucene's Query/IndexSearcher APIs for Document based suggester. Adds PrefixCompletionQuery, RegexCompletionQuery, FuzzyCompletionQuery and ContextQuery.
  • LUCENE-6487: Spatial Geo3D API now has a WGS84 ellipsoid world model option.
  • LUCENE-6477: Add experimental BKD geospatial tree doc values format and queries, for fast "bbox/polygon contains lat/lon points"
  • LUCENE-6526: Asserting(Query|Weight|Scorer) now ensure scores are not computed if they are not needed.
  • LUCENE-6481: Add GeoPointField, GeoPointInBBoxQuery, GeoPointInPolygonQuery for simple "indexed lat/lon point in bbox/shape" searching.
  • LUCENE-5954: The segments_N commit point now stores the Lucene version that wrote the commit as well as the Lucene version that wrote the oldest segment in the index, for faster checking of "too old" indices
  • LUCENE-6519: BKDPointInPolygonQuery is much faster by avoiding the per-hit polygon check when a leaf cell is fully contained by the polygon.
  • LUCENE-6549: Add preload option to MMapDirectory.
  • LUCENE-6504: Add Lucene53Codec, with norms implemented directly via the Directory's RandomAccessInput API.
  • LUCENE-6539: Add new DocValuesNumbersQuery, to match any document containing one of the specified long values. This change also moves the existing DocValuesTermsQuery and DocValuesRangeQuery to Lucene's sandbox module, since in general these queries are quite slow and are only fast in specific cases.
  • LUCENE-6577: Give earlier and better error message for invalid CRC.
  • LUCENE-6544: Geo3D: (1) Regularize path & polygon construction, (2) add PlanetModel.surfaceDistance() (ellipsoidal calculation), (3) cache lat & lon in GeoPoint, (4) add thread-safety where missing -- Geo3dShape.
  • LUCENE-6606: SegmentInfo.toString now confesses how the documents were sorted, when SortingMergePolicy was used
  • LUCENE-6524: IndexWriter can now be initialized from an already open near-real-time or non-NRT reader.
  • LUCENE-6578: Geo3D can now compute the distance from a point to a shape, both inner distance and to an outside edge. Multiple distance algorithms are available.
  • LUCENE-6632: Geo3D: Compute circle planes more accurately.
  • LUCENE-6653: Added general purpose BytesTermAttribute to basic token attributes package that can be used for TokenStreams that solely produce binary terms.
  • LUCENE-6365: Add Operations.topoSort, to run topological sort of the states in an Automaton
  • LUCENE-6365: Replace Operations.getFiniteStrings with a more scalable iterator API (FiniteStringsIterator)
  • LUCENE-6589: Add a new org.apache.lucene.search.join.CheckJoinIndex class that can be used to validate that an index has an appropriate structure to run join queries.
  • LUCENE-6659: Remove IndexWriter's unnecessary hard limit on max concurrency
  • LUCENE-6547: Add GeoPointDistanceQuery, matching all points within the specified distance from the center point. Fix GeoPointInBBoxQuery to handle dateline crossing.
  • LUCENE-6694: Add LithuanianAnalyzer and LithuanianStemmer.
  • LUCENE-6695: Added a new BlendedTermQuery to blend statistics across several terms.
  • LUCENE-6706: Added a new PayloadScoreQuery that generalises the behaviour of PayloadTermQuery and PayloadNearQuery to all Span queries.
  • LUCENE-6697: Add experimental range tree doc values format and queries, based on a 1D version of the spatial BKD tree, for a faster and smaller alternative to postings-based numeric and binary term filtering. Range trees can also handle values larger than 64 bits.
  • LUCENE-6647: Add GeoHash string utility APIs
  • LUCENE-6710: GeoPointField now uses full 64 bits (up from 62) to encode lat/lon
  • LUCENE-6580: SpanNearQuery now allows defined-width gaps in its subqueries
  • LUCENE-6712: Use doc values to post-filter GeoPointField hits that fall in boundary cells, resulting in smaller index, faster searches and less heap used for each query
  • API Changes:
  • LUCENE-6508: Simplify the Lock API: there is now just Directory.obtainLock(), which returns a Lock that can be released (or fails with an exception). Add lock verification to IndexWriter. Improve exception messages when locking fails.
  • LUCENE-6529: Removed an optimization in UninvertingReader that was causing incorrect results for Numeric fields using precisionStep
  • LUCENE-6551: Add missing ConcurrentMergeScheduler.getAutoIOThrottle getter
  • LUCENE-6552: Add MergePolicy.OneMerge.getMergeInfo and rename setInfo to setMergeInfo
  • LUCENE-6525: Deprecate IndexWriterConfig's writeLockTimeout.
  • LUCENE-6466: Moved SpanQuery.getSpans() and .extractTerms() to SpanWeight
  • LUCENE-6371, LUCENE-6490: Payload collection from Spans is moved to a more generic SpanCollector framework. Spans no longer implements .hasPayload() and .getPayload() methods, and instead exposes a collect() method that allows the collection of arbitrary postings information. SpanPayloadCheckQuery and SpanPayloadNearCheckQuery have moved from the .spans package to the .payloads package.
  • LUCENE-6583: FilteredQuery is deprecated and will be removed in 6.0. It should be replaced with a BooleanQuery which handles the query as a MUST clause and the filter as a FILTER clause.
  • LUCENE-6553: The postings, spans and scorer APIs no longer take an acceptDocs parameter. Live docs are now always checked on top of these APIs.
  • LUCENE-6634: PKIndexSplitter now takes a Query instead of a Filter to decide how to split an index.
  • LUCENE-6643: GroupingSearch from lucene/grouping was changed to take a Query object to define groups instead of a Filter.
  • LUCENE-6554: ToParentBlockJoinFieldComparator was removed because of a bug with missing values that could not be fixed. ToParentBlockJoinSortField now works with string or numeric doc values selectors. Sorting on anything other than a string or numeric field requires implementing a custom selector.
  • LUCENE-6648: All lucene/facet APIs now take Query objects where they used to take Filter objects.
  • LUCENE-6640: Suggesters now take a BitsProducer object instead of a Filter object to reduce the scope of doc IDs that may be returned, emphasizing the fact that these objects need to support random-access.
  • LUCENE-6646: Make EarlyTerminatingCollector take a Sort object directly instead of a SortingMergePolicy.
  • LUCENE-6649: BitDocIdSetFilter and BitDocIdSetCachingWrapperFilter are now deprecated in favour of BitSetProducer and QueryBitSetProducer, which do not extend oal.search.Filter.
  • LUCENE-6607: Factor out geo3d into its own spatial3d module.
  • LUCENE-6531: PhraseQuery is now immutable and can be built using the PhraseQuery.Builder class.
  • LUCENE-6570: BooleanQuery is now immutable and can be built using the BooleanQuery.Builder class.
  • LUCENE-6702: NRTSuggester: Add a method to inject context values at index time in ContextSuggestField. Simplify ContextQuery logic for extracting contexts and add dedicated method to consider all context values at query time.
  • LUCENE-6719: NumericUtils getMinInt, getMaxInt, getMinLong, getMaxLong now return null if there are no terms for the specified field. Previously these methods returned primitive values and raised an undocumented NullPointerException if there were no terms for the field.
  • Bug fixes:
  • LUCENE-6500: ParallelCompositeReader did not always call closed listeners. This was fixed by LUCENE-6501.
  • LUCENE-6520: Geo3D GeoPath.done() would throw an NPE if adjacent path segments were co-linear.
  • LUCENE-5805: QueryNodeImpl.removeFromParent was doing nothing in a costly manner
  • LUCENE-6533: SlowCompositeReaderWrapper no longer caches its live docs instance since this can prevent future improvements like a disk-backed live docs
  • LUCENE-6558: Highlighters now work with CustomScoreQuery (Cao Manh Dat via Mike McCandless)
  • LUCENE-6560: BKDPointInBBoxQuery now handles "dateline crossing" correctly
  • LUCENE-6564: Change PrintStreamInfoStream to use thread safe Java 8 ISO-8601 date formatting (in Lucene 5.x use Java 7 FileTime#toString as workaround); fix output of tests to use same format.
  • LUCENE-6593: Fixed ToChildBlockJoinQuery's scorer to not refuse to advance to a document that belongs to the parent space.
  • LUCENE-6591: Never write a negative vLong
  • LUCENE-6588: Fix how ToChildBlockJoinQuery deals with acceptDocs.
  • LUCENE-6597: Geo3D's GeoCircle now supports a world-globe diameter.
  • LUCENE-6608: Fix potential resource leak in BigramDictionary.
  • LUCENE-6614: Improve partition detection in IOUtils#spins() so it works with NVMe drives.
  • LUCENE-6586: Fix typo in GermanStemmer, causing possible wrong value for substCount.
  • LUCENE-6658: Fix IndexUpgrader to also upgrade indexes without any segments.
  • LUCENE-6677: QueryParserBase fails to enforce maxDeterminizedStates when creating a WildcardQuery
  • LUCENE-6680: Preserve two suggestions that have same key and weight but different payloads
  • LUCENE-6681: SortingMergePolicy must override MergePolicy.size(...).
  • LUCENE-6682: StandardTokenizer performance bug: scanner buffer is unnecessarily copied when maxTokenLength doesn't change. Also stop silently maxing out buffer size (and effectively also max token length) at 1M chars, but instead throw an exception from setMaxTokenLength() when the given length is greater than 1M chars.
  • LUCENE-6696: Fix FilterDirectoryReader.close() to never close the underlying reader several times.
  • LUCENE-6334: FastVectorHighlighter failed to highlight phrases across more than one value in a multi-valued field.
  • LUCENE-6704: GeoPointDistanceQuery was visiting too many term ranges, consuming too much heap for a large radius
  • SOLR-5882: fix ScoreMode.Min at ToParentBlockJoinQuery
  • LUCENE-6718: JoinUtil.createJoinQuery failed to rewrite queries before creating a Weight.
  • LUCENE-6713: TooComplexToDeterminizeException claims to be serializable but wasn't
  • LUCENE-6723: Fix date parsing problems in Java 9 with date formats using English weekday/month names.
  • LUCENE-6618: Properly set MMapDirectory.UNMAP_SUPPORTED when it is not allowed by security policy.
  • Changes in runtime behavior:
  • LUCENE-6501: The subreader structure in ParallelCompositeReader was flattened, because the current implementation had too many hidden bugs regarding refcounting and close listeners. If you create a new ParallelCompositeReader, it will just take all leaves of the passed readers and form a flat structure of ParallelLeafReaders instead of trying to assemble the original structure of composite and leaf readers.
  • LUCENE-6538: Also include java.vm.version and java.runtime.version in per-segment diagnostics
  • LUCENE-6537: NearSpansOrdered no longer tries to minimize its Span matches. This means that the matching algorithm is entirely lazy. All spans returned by the previous implementation are still reported, but matching documents may now also return additional spans that were previously discarded in preference to shorter overlapping ones.
  • LUCENE-6569: Optimize MultiFunction.anyExists and allExists to eliminate excessive array creation in common 2 argument usage
  • LUCENE-2880: Span queries now score more consistently with regular queries.
  • LUCENE-6601: FilteredQuery now always rewrites to a BooleanQuery which handles the query as a MUST clause and the filter as a FILTER clause. LEAP_FROG_QUERY_FIRST_STRATEGY and LEAP_FROG_FILTER_FIRST_STRATEGY do not guarantee anymore which iterator will be advanced first, it will depend on the respective costs of the iterators. QUERY_FIRST_FILTER_STRATEGY and RANDOM_ACCESS_FILTER_STRATEGY still consume the filter using its random-access API, however the returned bits may be called on different documents compared to before.
  • LUCENE-6542: FSDirectory's ctor now works with security policies or file systems that restrict write access.
  • LUCENE-6651: The default implementation of AttributeImpl#reflectWith(AttributeReflector) now uses AccessController#doPrivileged() to do the reflection. Please consider implementing this method in all your custom attributes, because the method will be made abstract in Lucene 6.
  • LUCENE-6639: LRUQueryCache and CachingWrapperQuery now consider a query as "used" when the first Scorer is pulled instead of when a Scorer is pulled on the first segment of an index.
  • LUCENE-6579: IndexWriter now sacrifices (closes) itself to protect the index when an unexpected, tragic exception strikes while merging.
  • LUCENE-6691: SortingMergePolicy.isSorted now considers FilterLeafReader instances. EarlyTerminatingSortingCollector.terminatedEarly accessor added. TestEarlyTerminatingSortingCollector.testTerminatedEarly test added.
  • LUCENE-6609: Add getSortField impls to many subclasses of FieldCacheSource which return the most direct SortField implementation. In many trivial sort by ValueSource usages, this will result in less RAM, and more precise sorting of extreme values due to no longer converting to double.
  • Optimizations:
  • LUCENE-6548: Some optimizations for BlockTree's intersect with very finite automata
  • LUCENE-6585: Flatten conjunctions and conjunction approximations into parent conjunctions. For example a sloppy phrase query of "foo bar"~5 with a filter of "baz" will internally leapfrog foo,bar,baz as one conjunction.
  • LUCENE-6325: Reduce RAM usage of FieldInfos, and speed up lookup by number, by using an array instead of TreeMap except in very sparse cases
  • LUCENE-6617: Reduce heap usage for small FSTs
  • LUCENE-6616: IndexWriter now lists the files in the index directory only once on init, and IndexFileDeleter no longer suppresses FileNotFoundException and NoSuchFileException. This also improves IndexFileDeleter to delete segments_N files last, so that in the presence of a virus checker, the index is never left in a state where an expired segments_N references non-existing files
  • LUCENE-6645: Optimized the way we merge postings lists in multi-term queries and TermsQuery. This should especially help when there are lots of small postings lists.
  • LUCENE-6668: Optimized storage for sorted set and sorted numeric doc values in the case that there are few unique sets of values.
  • LUCENE-6690: Sped up MultiTermsEnum.next() on high-cardinality fields.
  • LUCENE-6621: Removed two unused variables in analysis/stempel/src/java/org/egothor/stemmer/Compile.java
  • Build:
  • LUCENE-6518: Don't report false thread leaks from IBM J9 ClassCache Reaper in test framework.
  • LUCENE-6567: Simplify payload checking in SpanPayloadCheckQuery
  • LUCENE-6568: Make rat invocation depend on ivy configuration being set up
  • LUCENE-6683: ivy-fail goal directs people to non-existent page
  • LUCENE-6693: Updated Groovy to 2.4.4, Pegdown to 1.5, Svnkit to 1.8.10. Also fixed some PermGen errors while running full build caused by these updates: Tasks are now installed from root's build.xml.
  • LUCENE-6741: Fix jflex files to regenerate the java files correctly.
  • Test Framework:
  • LUCENE-6637: Fix FSTTester to not violate file permissions on -Dtests.verbose=true.
  • LUCENE-6542: LuceneTestCase now has runWithRestrictedPermissions() to run an action with reduced permissions. This can be used to simulate special environments (e.g., read-only dirs). If tests are running without a security manager, an assume cancels test execution automatically.
  • LUCENE-6652: Removed lots of useless Byte(s)TermAttributes all over test infrastructure.
  • LUCENE-6563: Improve MockFileSystemTestCase.testURI to check if a path can be encoded according to local filesystem requirements. Otherwise stop test execution.
  • Changes in backwards compatibility policy:
  • LUCENE-6553: The iterator returned by the LeafReader.postings method now always includes deleted docs, so you have to check for deleted documents on top of the iterator.
  • LUCENE-6633: DuplicateFilter has been deprecated and will be removed in 6.0. DiversifiedTopDocsCollector can be used instead with a maximum number of hits per key equal to 1.
  • LUCENE-6653: The workflow for consuming the TermToBytesRefAttribute was changed: getBytesRef() now does all work and is called on each token, fillBytesRef() was removed. The implementation is free to reuse the internal BytesRef or return a new one on each call.
  • LUCENE-6682: StandardTokenizer.setMaxTokenLength() now throws an exception if a length greater than 1M chars is given. Previously the effective max token length (the scanner's buffer) was capped at 1M chars, but getMaxTokenLength() incorrectly returned the previously requested length, even when it exceeded 1M.
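
The conjunction flattening described under LUCENE-6585 above relies on leapfrogging: each clause is advanced to the current candidate document, and any clause that lands further ahead supplies the next candidate. The following is a minimal, stdlib-only sketch of that idea over sorted doc ID arrays. The class name LeapfrogSketch and its methods are hypothetical and are not Lucene code; real scorers advance iterators using skip data rather than a linear scan.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a leapfrog conjunction; not Lucene's implementation. */
final class LeapfrogSketch {

    /** Returns the first index >= from whose value is >= target, or docs.length if none. */
    static int advance(int[] docs, int from, int target) {
        int i = from;
        while (i < docs.length && docs[i] < target) {
            i++;
        }
        return i;
    }

    /** Intersects any number of sorted, duplicate-free doc ID lists as one flat conjunction. */
    static List<Integer> intersect(int[]... postings) {
        List<Integer> hits = new ArrayList<>();
        if (postings.length == 0 || postings[0].length == 0) {
            return hits;
        }
        int[] pos = new int[postings.length]; // current position in each list
        int candidate = postings[0][0];       // doc ID every list must agree on
        int agreed = 0;                       // consecutive lists positioned on candidate
        int i = 0;
        while (true) {
            pos[i] = advance(postings[i], pos[i], candidate);
            if (pos[i] == postings[i].length) {
                return hits; // one list is exhausted, so no further matches exist
            }
            int doc = postings[i][pos[i]];
            if (doc == candidate) {
                agreed++;
                if (agreed == postings.length) {
                    hits.add(candidate);
                    candidate++; // look for the next match after this one
                    agreed = 0;
                }
            } else {
                candidate = doc; // this list jumped ahead; the others must catch up
                agreed = 1;
            }
            i = (i + 1) % postings.length;
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the postings of foo, bar and the filter term baz.
        int[] foo = {1, 3, 5, 9};
        int[] bar = {3, 4, 9};
        int[] baz = {2, 3, 9, 10};
        System.out.println(intersect(foo, bar, baz)); // prints [3, 9]
    }
}
```

Treating the filter as just another list in the same loop is the point of the flattening: no clause is privileged, and whichever list jumps furthest sets the pace.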

New in Apache Lucene 5.2.1 (Jul 1, 2015)

  • Bug Fixes:
  • LUCENE-6482: Fix class loading deadlock relating to Codec initialization, default codec and SPI discovery.
  • LUCENE-6523: NRT readers now reflect a new commit even if there is no change to the commit user data
  • LUCENE-6527: Queries now get a dummy Similarity when scores are not needed in order to not load unnecessary information like norms.

New in Apache Lucene 5.2.0 (Jun 9, 2015)

  • New features:
  • LUCENE-6308, LUCENE-6385, LUCENE-6391: Span queries now share document conjunction/intersection code with boolean queries, and use two-phased iterators for faster intersection by avoiding loading positions in certain cases.
  • LUCENE-6393: Add two-phase support to SpanPositionCheckQuery and its subclasses: SpanPositionRangeQuery, SpanPayloadCheckQuery, SpanNearPayloadCheckQuery, SpanFirstQuery.
  • LUCENE-6394: Add two-phase support to SpanNotQuery and refactor FilterSpans to just have an accept(Spans candidate) method for subclasses.
  • LUCENE-6373: SpanOrQuery shares disjunction logic with boolean queries, and supports two-phased iterators to avoid loading positions when possible.
  • LUCENE-6352, LUCENE-6472: Added a new query time join to the join module that uses global ordinals, which is faster for subsequent joins between reopens.
  • LUCENE-5879: Added experimental auto-prefix terms to BlockTree terms dictionary, exposed as AutoPrefixPostingsFormat
  • LUCENE-5579: New CompositeSpatialStrategy combines speed of RPT with accuracy of SDV. Includes optimized Intersect predicate to avoid many geometry checks. Uses TwoPhaseIterator.
  • LUCENE-5989: Allow passing BytesRef to StringField to make it easier to index arbitrary binary tokens, and change the experimental StoredFieldVisitor.stringField API to take UTF-8 byte[] instead of String
  • LUCENE-6389: Added ScoreMode.Min that aggregates the lowest child score to the parent hit.
  • LUCENE-6423: New LimitTokenOffsetFilter that limits tokens to those before a configured maximum start offset.
  • LUCENE-6422: New spatial PackedQuadPrefixTree, a generally more efficient choice than QuadPrefixTree, especially for high precision shapes. When used, you should typically disable RPT's pruneLeafyBranches option.
  • LUCENE-6451: Expressions now support bindings keys that look like zero arg functions
  • LUCENE-6083: Add SpanWithinQuery and SpanContainingQuery that return spans inside of / containing other spans.
  • LUCENE-6454: Added distinction between member variable and method in expression helper VariableContext
  • LUCENE-6196: New Spatial "Geo3d" API with partial Spatial4j integration. It is a set of shapes implemented using 3D planar geometry for calculating spatial relations on the surface of a sphere. Shapes include Point, BBox, Circle, Path (buffered line string), and Polygon.
  • LUCENE-6464: Add a new expert lookup method to AnalyzingInfixSuggester to accept an arbitrary BooleanQuery to express how contexts should be filtered.
  • Optimizations:
  • LUCENE-6379: IndexWriter.deleteDocuments(Query...) now detects if one of the queries is MatchAllDocsQuery and just invokes the much faster IndexWriter.deleteAll in that case
  • LUCENE-6388: Optimize SpanNearQuery when payloads are not present.
  • LUCENE-6421: Defer reading of positions in MultiPhraseQuery until they are needed.
  • LUCENE-6392: Highlighter: reduce memory of tokens in TokenStreamFromTermVector, and add maxStartOffset limit.
  • LUCENE-6456: Queries that generate doc id sets that are too large for the query cache are not cached instead of evicting everything.
  • LUCENE-6455: Require a minimum index size to enable query caching in order not to cache e.g. on MemoryIndex.
  • LUCENE-6330: BooleanScorer (used for top-level disjunctions) does not decode norms when not necessary anymore.
  • LUCENE-6350: TermsQuery is now compressed with PrefixCodedTerms.
  • LUCENE-6458: Multi-term queries matching few terms per segment now execute like a disjunction.
  • LUCENE-6360: TermsQuery rewrites to a disjunction when there are 16 or fewer matching terms.
  • Bug fixes:
  • LUCENE-329: Fix FuzzyQuery defaults to rank exact matches highest.
  • LUCENE-6378: Fix all RuntimeExceptions to throw the underlying root cause.
  • LUCENE-6415: TermsQuery.extractTerms is a no-op (used to throw an UnsupportedOperationException).
  • LUCENE-6416: BooleanQuery.extractTerms now only extracts terms from scoring clauses.
  • LUCENE-6409: Fixed integer overflow in LongBitSet.ensureCapacity.
  • LUCENE-6424, LUCENE-6430: Fix many bugs with mockfs filesystems in the test-framework: always consistently wrap Path, fix buggy behavior for globs, implement equals/hashcode for filtered Paths, etc.
  • LUCENE-6426: Fix FieldType's copy constructor to also copy over the numeric precision step.
  • LUCENE-6345: Null check terms/fields in Lucene queries
  • LUCENE-6400: SolrSynonymParser should preserve original token instead of replacing it with a synonym, when expand=true and there is no explicit mapping
  • LUCENE-6449: Don't throw NullPointerException if some segments are missing the field being highlighted, in PostingsHighlighter
  • LUCENE-6427: Added assertion about the presence of ghost bits in (Fixed|Long)BitSet.
  • LUCENE-6468: Fixed NPE with empty Kuromoji user dictionary.
  • LUCENE-6483: Ensure core closed listeners are called on the same cache key as the reader which has been used to register the listener.
  • LUCENE-6486: DocumentDictionary iterator no longer skips documents with no payloads and now returns an empty BytesRef instead
  • LUCENE-6505: NRT readers now reflect segments_N filename and commit user data from previous commits
  • LUCENE-6507: Don't let NativeFSLock.close() release other locks
  • API Changes:
  • LUCENE-6377: SearcherFactory#newSearcher now accepts the previous reader to simplify warming logic during opening new searchers.
  • LUCENE-6410: Removed unused "reuse" parameter to Terms.iterator.
  • LUCENE-6425: Replaced Query.extractTerms with Weight.extractTerms.
  • LUCENE-6446: Simplified Explanation API.
  • LUCENE-6445: Two new methods in Highlighter's TokenSources; the existing methods are now marked deprecated.
  • LUCENE-6484: Removed EliasFanoDocIdSet, which was unused.
  • LUCENE-6497: Allow subclasses of FieldType to check frozen state
  • Other:
  • LUCENE-6413: Test runner should report the number of suites completed/remaining.
  • LUCENE-5439: Add 'ant jacoco' build target.
  • LUCENE-6315: Simplify the private iterator Lucene uses internally when resolving deleted terms to matched docids.
  • LUCENE-6399: Benchmark module's QueryMaker.resetInputs should call setConfig so queries can react to property changes in new rounds.
  • LUCENE-6382: Lucene now enforces that positions never exceed the maximum value IndexWriter.MAX_POSITION.
  • LUCENE-6372: Simplified and improved equals/hashcode of span queries.
  • Build:
  • LUCENE-6420: Update forbiddenapis to v1.8
  • Test framework:
  • LUCENE-6419: Added two-phase iteration assertions to AssertingQuery.
  • LUCENE-6437: Randomly set CPU core count and spins, derived from test's master seed, used by ConcurrentMergeScheduler to set dynamic defaults, for better test randomization and to help tests reproduce
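
Several 5.2.0 entries above (LUCENE-6308, LUCENE-6393, LUCENE-6394, LUCENE-6373, LUCENE-6419) mention two-phase iteration. The idea is to walk a cheap approximation of the matching documents first and run the costly confirmation (for span and phrase queries, loading positions) only on candidates that survive the other clauses. Below is a minimal, stdlib-only sketch of that control flow. TwoPhaseSketch and its parameter names are hypothetical and do not mirror Lucene's TwoPhaseIterator API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical illustration of two-phase iteration; not Lucene's TwoPhaseIterator. */
final class TwoPhaseSketch {

    /**
     * Phase 1 walks the approximation (a sorted superset of the true matches) and
     * applies the cheap clause. Phase 2 runs the costly confirmation only on the
     * candidates that are still alive.
     */
    static List<Integer> search(int[] approximation, IntPredicate cheapClause, IntPredicate costlyConfirm) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : approximation) {
            if (!cheapClause.test(doc)) {
                continue; // rejected without ever paying for the confirmation
            }
            if (costlyConfirm.test(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] confirmations = {0};
        // Docs where all terms occur; only some have them in the required positions.
        int[] approximation = {2, 4, 6, 8, 10};
        List<Integer> hits = search(
                approximation,
                doc -> doc % 4 == 0,                               // stand-in for a filter clause
                doc -> { confirmations[0]++; return doc != 4; });  // stand-in for a position check
        System.out.println(hits + " after " + confirmations[0] + " confirmations");
        // prints [8] after 2 confirmations
    }
}
```

In this run, five documents pass the approximation but only two reach the confirmation step, which is where the saving comes from.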

New in Apache Lucene 5.0.0 (Feb 24, 2015)

  • New features:
  • LUCENE-5945: All file handling converted to NIO.2 apis.
  • LUCENE-5946: SimpleFSDirectory now uses Files.newByteChannel, for portability with custom FileSystemProviders. If you want the old non-interruptible behavior of RandomAccessFile, use RAFDirectory in the misc/ module.
  • SOLR-3359: Added analyzer attribute/property to SynonymFilterFactory.
  • LUCENE-5648: Index and search date ranges, particularly multi-valued ones. It's implemented in the spatial module as DateRangePrefixTree used with NumberRangePrefixTreeStrategy.
  • LUCENE-5895: Lucene now stores a unique id per-segment and per-commit to aid in accurate replication of index files
  • LUCENE-5889: Add commit method to AnalyzingInfixSuggester, and allow just using .add to build up the suggester.
  • LUCENE-5123: Add a "pull" option to the postings writing API, so that a PostingsFormat now receives a Fields instance and it is responsible for iterating through all fields, terms, documents and positions.
  • LUCENE-5268: Full cutover of all postings formats to the "pull" FieldsConsumer API, removing PushFieldsConsumer. Added new PushPostingsWriterBase for single-pass push of docs/positions to the postings format.
  • LUCENE-5906: Use Files.delete everywhere instead of File.delete, so that when things go wrong, you get a real exception message why.
  • LUCENE-5933: Added FilterSpans for easier wrapping of Spans instance.
  • LUCENE-5925: Remove fallback logic from opening commits, instead use Directory.renameFile so that in-progress commits are never visible.
  • LUCENE-5820: SuggestStopFilter should have a factory.
  • LUCENE-5949: Add Accountable.getChildResources().
  • SOLR-5986: Added ExitableDirectoryReader that extends FilterDirectoryReader and enables exiting requests that take too long to enumerate over terms.
  • LUCENE-5911: Add MemoryIndex.freeze() to allow thread-safe searching over a MemoryIndex.
  • LUCENE-5969: Lucene 5.0 has a new index format with mismatched file detection, improved exception handling, and indirect norms encoding for sparse fields.
  • LUCENE-6053: Add Serbian analyzer.
  • LUCENE-4400: Add support for new NYSIIS Apache commons phonetic codec
  • LUCENE-6059: Add Daitch-Mokotoff Soundex phonetic Apache commons phonetic codec, and upgrade to Apache commons codec 1.10.
  • LUCENE-6058: With the upgrade to Apache commons codec 1.10, the experimental BeiderMorseFilter has changed its behavior, so any index using it will need to be rebuilt.
  • LUCENE-6050: Accept MUST and MUST_NOT (in addition to SHOULD) for each context passed to Analyzing/BlendedInfixSuggester
  • LUCENE-5929: Also extract terms to highlight from block join queries.
  • LUCENE-6063: Allow overriding whether/how ConcurrentMergeScheduler stalls incoming threads when merges are falling behind
  • LUCENE-5833: DocumentDictionary now enumerates each value separately in a multi-valued field (not just the first value), so you can build suggesters from multi-valued fields
  • LUCENE-6077: Added a filter cache.
  • LUCENE-6088: TermsFilter implements Accountable.
  • LUCENE-6034: The default highlighter when used with QueryScorer will highlight payload-sensitive queries provided that term vectors with positions, offsets, and payloads are present. This is the only highlighter that can highlight such queries accurately.
  • LUCENE-5914: Add an option to Lucene50Codec to support either BEST_SPEED or BEST_COMPRESSION for stored fields.
  • LUCENE-6119: Add auto-IO-throttling to ConcurrentMergeScheduler, to rate limit IO writes for each merge depending on incoming merge rate.
  • LUCENE-6155: Add payload support to MemoryIndex. The default highlighter's QueryScorer and WeightedSpanTermExtractor now have setUsePayloads(bool).
  • LUCENE-6166: Deletions (alone) can now trigger new merges.
  • LUCENE-6177: Add CustomAnalyzer that allows you to configure analyzers like you do in Solr's index schema. This class has a builder API to configure Tokenizers, TokenFilters, and CharFilters based on their SPI names and parameters as documented by the corresponding factories.
  • Optimizations:
  • LUCENE-5960: Use a more efficient bitset, not a Set, to track visited states
  • LUCENE-5959: Don't allocate excess memory when building automaton in finish.
  • LUCENE-5963: Reduce memory allocations in AnalyzingSuggester.
  • LUCENE-5938: MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE is now faster on queries that match few documents by using a sparse bit set implementation.
  • LUCENE-5969: Refactor merging to be more efficient, checksum calculation is per-segment/per-producer, and norms and doc values merging no longer cause RAM spikes for latent fields.
  • LUCENE-5983: CachingWrapperFilter now uses a new DocIdSet implementation called RoaringDocIdSet instead of WAH8DocIdSet.
  • LUCENE-6022: DocValuesDocIdSet checks live docs before doc values.
  • LUCENE-6030: Add norms patched compression for a small number of common values
  • LUCENE-6040: Speed up EliasFanoDocIdSet through broadword bit selection.
  • LUCENE-6033: CachingTokenFilter now uses ArrayList not LinkedList, and has new isCached() method.
  • LUCENE-6031: TokenSources (in the default highlighter) converts term vectors into a TokenStream much faster, in linear time (not N*log(N)), using less memory, and with reset() implemented. Only one of offsets or positions is required of the term vector.
  • LUCENE-6089, LUCENE-6090: Tune CompressionMode.HIGH_COMPRESSION for better compression and less cpu usage.
  • LUCENE-6034: QueryScorer, used by the default highlighter, needn't re-index the provided TokenStream with MemoryIndex when it comes from TokenSources (term vectors) with offsets and positions.
  • LUCENE-5951: ConcurrentMergeScheduler detects whether the index is on SSD or not and does a better job defaulting its settings. This only works on Linux for now; other OS's will continue to use the previous defaults (tuned for spinning disks).
  • LUCENE-6131: Optimize SortingMergePolicy.
  • LUCENE-6133: Improve default StoredFieldsWriter.merge() to be more efficient.
  • LUCENE-6145: Make EarlyTerminatingSortingCollector able to early-terminate when the sort order is a prefix of the index-time order.
  • LUCENE-6178: Score boolean queries containing MUST_NOT clauses with BooleanScorer2, to use skip list data and avoid unnecessary scoring.
  • API Changes:
  • LUCENE-5900: Deprecated more constructors taking Version in InfixSuggester and ICUCollationKeyAnalyzer, and removed TEST_VERSION_CURRENT from the test framework.
  • LUCENE-4535: oal.util.FilterIterator is now an internal API.
  • LUCENE-4924: DocIdSetIterator.docID() must now return -1 when the iterator is not positioned. This change affects all classes that inherit from DocIdSetIterator, including DocsEnum and DocsAndPositionsEnum.
  • LUCENE-5127: Reduce RAM usage of FixedGapTermsIndex. Remove IndexWriterConfig.setTermIndexInterval, IndexWriterConfig.setReaderTermsIndexDivisor, and termsIndexDivisor from StandardDirectoryReader. These options have been no-ops with the default codec since Lucene 4.0. If you want to configure the interval for this term index, pass it directly in your codec, where it can also be configured per-field.
  • LUCENE-5388: Remove Reader from Tokenizer's constructor and from Analyzer's createComponents. TokenStreams now always get their input via setReader.
  • LUCENE-5527: The Collector API has been refactored to use a dedicated Collector per leaf.
  • LUCENE-5702: The FieldComparator API has been refactored to a per-leaf API, just like Collectors.
  • LUCENE-4246: IndexWriter.close now always closes, even if it throws an exception. The new IndexWriterConfig.setCommitOnClose (default true) determines whether close() should commit before closing.
  • LUCENE-5608, LUCENE-5565: Refactor SpatialPrefixTree/Cell API. Doesn't use Strings as tokens anymore, and now iterates cells on-demand during indexing instead of building a collection. RPT now has more setters.
  • LUCENE-5666: Change uninverted access (sorting, faceting, grouping, etc) to use the DocValues API instead of FieldCache. For FieldCache functionality, use UninvertingReader in lucene/misc (or implement your own FilterReader). UninvertingReader is more efficient: supports multi-valued numeric fields, detects when a multi-valued field is single-valued, reuses caches of compatible types (e.g. SORTED also supports BINARY and SORTED_SET access without insanity). "Insanity" is no longer possible unless you explicitly want it. Rename FieldCache and DocTermOrds classes in the search package to DocValues. Move SortedSetSortField to core and add SortedSetFieldSource to queries/, which takes the same selectors. Add helper methods to DocValues.java that are better suited for search code (never return null, etc).
  • LUCENE-5871: Remove Version from IndexWriterConfig. Use IndexWriterConfig.setCommitOnClose to change the behavior of IndexWriter.close(). The default has been changed to match that of 4.x.
  • LUCENE-5965: CorruptIndexException requires a String or DataInput resource.
  • LUCENE-5972: IndexFormatTooOldException and IndexFormatTooNewException now extend from IOException.
  • LUCENE-5569: *AtomicReader/AtomicReaderContext have been renamed to *LeafReader/LeafReaderContext.
  • LUCENE-5938: Removed MultiTermQuery.ConstantScoreAutoRewrite as MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE is usually better.
  • LUCENE-5924: Rename CheckIndex -fix option to -exorcise. This option does not actually fix the index, it just drops data.
  • LUCENE-5969: Add Codec.compoundFormat, which handles the encoding of compound files. Add getMergeInstance() to codec producer APIs, which can be overridden to return an instance optimized for merging instead of searching. Add Terms.getStats() which can return additional codec-specific statistics about a field. Change instance method SegmentInfos.read() to two static methods: SegmentInfos.readCommit() and SegmentInfos.readLatestCommit().
  • LUCENE-5992: Remove FieldInfos from SegmentInfosWriter.write API.
  • LUCENE-5998: Simplify Field/SegmentInfoFormat to read+write methods.
  • LUCENE-6000: Removed StandardTokenizerInterface. Tokenizers now use their jflex impl directly.
  • LUCENE-6006: Removed FieldInfo.normType since it's redundant: it will be DocValuesType.NUMERIC if the field is indexed and does not omit norms, else null.
  • LUCENE-6013: Removed indexed boolean from IndexableFieldType and FieldInfo, since it's redundant with IndexOptions != null.
  • LUCENE-6021: FixedBitSet.nextSetBit now returns DocIdSetIterator.NO_MORE_DOCS instead of -1 when there are no more bits which are set.
  • LUCENE-5953: Directory and LockFactory APIs were restructured: Locking is now under the responsibility of the Directory implementation. LockFactory is only used by subclasses of BaseDirectory to delegate locking to an impl class. LockFactories are now singletons and are responsible for creating a Lock instance based on a Directory implementation passed to the factory method.
  • LUCENE-6062: Throw exception instead of silently doing nothing if you try to sort/group/etc on a misconfigured field (e.g. no docvalues, no UninvertingReader, etc).
  • LUCENE-6068: LeafReader.fields() never returns null.
  • LUCENE-6082: Remove abort() from codec apis.
  • LUCENE-6084: IndexOutput's constructor now requires a String resourceDescription so its toString is sane
  • LUCENE-6087: Allow passing custom DirectoryReader to SearcherManager
  • LUCENE-6085: Undeprecate SegmentInfo attributes, but add safety so they won't be trappy if codec tries to use them during docvalues updates.
  • LUCENE-6097: Remove dangerous / overly expert IndexWriter.abortMerges and waitForMerges methods.
  • LUCENE-6099: Add FilterDirectory.unwrap and FilterDirectoryReader.unwrap
  • LUCENE-6121: CachingTokenFilter.reset() now propagates to its input if called before incrementToken(). You must call reset() now on this filter instead of doing it a-priori on the input(), which previously didn't work.
  • LUCENE-6147: Make the core Accountables.namedAccountable function public
  • LUCENE-6150: Remove staleFiles set and onIndexOutputClosed() from FSDirectory.
  • LUCENE-6146: Replaced Directory.copy() with Directory.copyFrom().
  • LUCENE-6149: Infix suggesters' highlighting and allTermsRequired can be set at the constructor for non-contextual lookup.
  • LUCENE-6158, LUCENE-6165: IndexWriter.addIndexes(IndexReader...) changed to addIndexes(CodecReader...)
  • LUCENE-6179: Out-of-order scoring is not allowed anymore, so Weight.scoresDocsOutOfOrder and LeafCollector.acceptsDocsOutOfOrder have been removed and boolean queries now always score in order.
  • LUCENE-6212: IndexWriter no longer accepts per-document Analyzer to add/updateDocument. These methods were trappy as they made it easy to accidentally index tokens that were not easily searchable.
  • Bug fixes:
  • LUCENE-5650: Enforce read-only access to any path outside the temporary folder via security manager, and make test temp dirs absolute.
  • LUCENE-5948: RateLimiter now fully inits itself on init.
  • LUCENE-5981: CheckIndex obtains write.lock, since with some parameters it may modify the index, and to prevent false corruption reports, as it does not have the regular "spinlock" of DirectoryReader.open. It now implements Closeable and you must close it to release the lock.
  • LUCENE-6004: Don't highlight the LookupResult.key returned from AnalyzingInfixSuggester (Christian Reuschling, jane chang via Mike McCandless)
  • LUCENE-5980: Don't let document length overflow.
  • LUCENE-5999: Fix backcompat support for StandardTokenizer
  • LUCENE-5961: Fix the exists() method for FunctionValues returned by many ValueSources to behave properly when wrapping other ValueSources which do not exist for the specified document
  • LUCENE-6039: Add IndexOptions.NONE and DocValuesType.NONE instead of using null to mean not index and no doc values, renamed IndexOptions.DOCS_ONLY to DOCS, and pulled IndexOptions and DocValues out of FieldInfo into their own classes in org.apache.lucene.index
  • LUCENE-6043: Fix backcompat support for UAX29URLEmailTokenizer
  • LUCENE-6041: Remove sugar methods FieldInfo.isIndexed and FieldInfo.hasDocValues.
  • LUCENE-6044: Fix backcompat support for token filters with enablePositionIncrements=false. Also fixed backcompat for TrimFilter with updateOffsets=true. These options are supported with a match version before 4.4, and no longer valid at all with 5.0.
  • LUCENE-6042: CustomScoreQuery explain was incorrect in some cases, such as when nested inside a boolean query.
  • LUCENE-6046: Add maxDeterminizedStates safety to determinize (which has an exponential worst case) so that if it would create too many states, it now throws an exception instead of exhausting CPU/RAM.
  • LUCENE-6054: Allow repeating the empty automaton
  • LUCENE-6049: Don't throw cryptic exception writing a segment when the only docs in it had fields that hit non-aborting exceptions during indexing but also had doc values
  • LUCENE-6055: PayloadAttribute.clone() now does a deep clone of the underlying bytes.
  • LUCENE-6060: Remove dangerous IndexWriter.unlock method
  • LUCENE-6062: Pass correct fieldinfos to docvalues producer when the segment has updates.
  • LUCENE-6075: Don't overflow int in SimpleRateLimiter
  • LUCENE-5987: IndexWriter will now forcefully close itself on aborting exception (an exception that would otherwise cause silent data loss).
  • LUCENE-6094: Allow IW.rollback to stop ConcurrentMergeScheduler even when it's stalling because there are too many merges.
  • LUCENE-6105: Don't cache FST root arcs if the number of root arcs is small, or if the cache would be > 20% of the size of the FST.
  • LUCENE-6124: Fix double-close() problems in codec and store APIs.
  • LUCENE-6152: Fix double close problems in OutputStreamIndexOutput.
  • LUCENE-6139: Highlighter: TokenGroup start & end offset getters should have been returning the offsets of just the matching tokens in the group when there's a distinction.
  • LUCENE-6173: NumericTermAttribute and spatial/CellTokenStream do not clone their BytesRef(Builder)s. Also equals/hashCode was missing.
  • LUCENE-6205: Fixed intermittent concurrency issue that could cause FileNotFoundException when writing doc values updates at the same time that a merge kicks off.
  • LUCENE-6192: Fix int overflow corruption case in skip data for high frequency terms in extremely large indices
  • LUCENE-6093: Don't throw NullPointerException from BlendedInfixSuggester for lookups that do not end in a prefix token.
  • LUCENE-6214: Fixed IndexWriter deadlock when one thread is committing while another opens a near-real-time reader and an unrecoverable (tragic) exception is hit.
  • Documentation:
  • LUCENE-5392: Add/improve analysis package documentation to reflect analysis API changes.
  • LUCENE-6057: Improve Sort(SortField) docs
  • LUCENE-6112: Fix compile error in FST package example code
  • Tests:
  • LUCENE-5957: Add option for tests to not randomize codec
  • LUCENE-5974: Add check that backcompat indexes use default codecs
  • LUCENE-5971: Create addBackcompatIndexes.py script to build and add backcompat test indexes for a given lucene version. Also renamed backcompat index files to use Version.toString() in filename.
  • LUCENE-6002: Monster tests no longer fail. Most of them now have an 80 hour timeout, effectively removing the timeout. The tests that operate near the 2 billion limit now use IndexWriter.MAX_DOCS instead of Integer.MAX_VALUE. Some of the slow Monster tests now explicitly choose the default codec.
  • LUCENE-5968: Improve error message when 'ant beast' is run on top-level modules.
  • LUCENE-6120: Fix MockDirectoryWrapper's close() handling.
  • Build:
  • LUCENE-5909: Smoke tester now has better command line parsing and optionally also runs on Java 8.
  • LUCENE-5902: Add bumpVersion.py script to manage version increase after release branch is cut.
  • LUCENE-5962: Rename diffSources.py to createPatch.py and make it work with all text file types.
  • LUCENE-5995: Upgrade ICU to 54.1
  • LUCENE-6070: Upgrade forbidden-apis to 1.7
  • Other:
  • LUCENE-5563: Removed sep layout, which has fallen behind on features and doesn't perform as well as other options.
  • LUCENE-4086: Removed support for Lucene 3.x indexes. See migration guide for more information.
  • LUCENE-5858: Moved Lucene 4 compatibility codecs to 'lucene-backward-codecs.jar'.
  • LUCENE-5915: Remove Pulsing postings format.
  • LUCENE-6213: Add useful exception message when commit contains segments from legacy codecs.
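
Two of the 5.0.0 API changes above tighten the sentinel values that doc ID iteration returns: docID() must return -1 while the iterator is unpositioned (LUCENE-4924), and FixedBitSet.nextSetBit returns DocIdSetIterator.NO_MORE_DOCS rather than -1 when no set bits remain (LUCENE-6021). The sketch below shows a loop written against that contract. DocIdIteratorSketch is a hypothetical stand-in built on a sorted array, not a Lucene class; the only detail taken from Lucene is that NO_MORE_DOCS equals Integer.MAX_VALUE.

```java
/** Hypothetical illustration of the iterator contract; not Lucene's DocIdSetIterator. */
final class DocIdIteratorSketch {

    /** Sentinel meaning "exhausted". Lucene's NO_MORE_DOCS is also Integer.MAX_VALUE. */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs; // sorted ascending, no duplicates
    private int index = -1;
    private int doc = -1;     // stays -1 until nextDoc() or advance() is first called

    DocIdIteratorSketch(int[] sortedDocs) {
        this.docs = sortedDocs.clone();
    }

    /** Returns -1 if unpositioned, NO_MORE_DOCS if exhausted, else the current doc ID. */
    int docID() {
        return doc;
    }

    /** Moves to the next doc ID, or to NO_MORE_DOCS once the list is exhausted. */
    int nextDoc() {
        if (doc == NO_MORE_DOCS) {
            return doc;
        }
        index++;
        doc = index < docs.length ? docs[index] : NO_MORE_DOCS;
        return doc;
    }

    /** Moves past the current doc to the first doc ID >= target (linear scan for brevity). */
    int advance(int target) {
        int d;
        do {
            d = nextDoc();
        } while (d < target);
        return d;
    }

    public static void main(String[] args) {
        DocIdIteratorSketch it = new DocIdIteratorSketch(new int[] {3, 7, 12});
        System.out.println(it.docID()); // prints -1 (not yet positioned)
        // Because the exhausted sentinel is larger than any real doc ID, a single
        // comparison against NO_MORE_DOCS terminates the loop.
        for (int d = it.nextDoc(); d != NO_MORE_DOCS; d = it.nextDoc()) {
            System.out.println(d); // prints 3, then 7, then 12
        }
    }
}
```

Using one large sentinel for exhaustion means callers that compare doc IDs with `<` or take a minimum across iterators need no special case for a finished iterator, which a -1 return value would require.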

New in Apache Lucene 4.10.2 (Dec 24, 2014)

  • Bug fixes:
  • LUCENE-5977: Fix tokenstream safety checks in IndexWriter to properly work across multi-valued fields. Previously some cases across multi-valued fields would happily create a corrupt index.
  • LUCENE-6019: Detect when DocValuesType illegally changes for the same field name. Also added -Dtests.asserts=true|false so we can run tests with and without assertions.

New in Apache Lucene 4.10.1 (Sep 30, 2014)

  • Bug fixes:
  • LUCENE-5934: Fix backwards compatibility for 4.0 indexes.
  • LUCENE-5939: Regenerate old backcompat indexes to ensure they were built with the exact release
  • LUCENE-5952: Improve error messages when version cannot be parsed; don't check for too old or too new major version (it's too low level to enforce here); use simple string tokenizer.
  • LUCENE-5958: Don't let exceptions during checkpoint corrupt the index. Refactor existing OOM handling too, so you don't need to handle OOM special for every IndexWriter method: instead such disasters will cause IW to close itself defensively.
  • LUCENE-5904: Fixed a corruption case that can happen when 1) IndexWriter is uncleanly shut-down (OS crash, power loss, etc.), 2) on startup, when a new IndexWriter is created, a virus checker is holding some of the previously written but unused files open and preventing deletion, 3) IndexWriter writes these files again during the course of indexing, then the files can later be deleted, causing corruption. This case was detected by adding evilness to MockDirectoryWrapper to have it simulate a virus checker holding a file open and preventing deletion
  • LUCENE-5916: Static scope test components should be consistent between tests (and test iterations). Fix for FaultyIndexInput in particular.
  • LUCENE-5975: Fix reading of 3.0-3.3 indexes, where bugs in these old index formats would result in CorruptIndexException "did not read all bytes from file" when reading the deleted docs file.

New in Apache Lucene 4.10.0 (Sep 4, 2014)

  • New Features:
  • LUCENE-5778: Support hunspell morphological description fields/aliases. (Robert Muir)
  • LUCENE-5801: Added (back) OrdinalMappingAtomicReader for merging search indexes that contain category ordinals from separate taxonomy indexes. (Nicola Buso via Shai Erera)
  • LUCENE-4175, LUCENE-5714, LUCENE-5779: Index and search rectangles with spatial BBoxSpatialStrategy using most predicates. Sort documents by relative overlap of query areas or just by indexed shape area. (Ryan McKinley, David Smiley)
  • LUCENE-5806: Extend expressions grammar to support array access in variables. Added helper class VariableContext to parse complex variable into pieces. (Ryan Ernst)
  • LUCENE-5826: Support proper hunspell case handling, LANG, KEEPCASE, NEEDAFFIX, and ONLYINCOMPOUND flags. (Robert Muir)
  • LUCENE-5815: Add TermAutomatonQuery, a proximity query allowing you to create an arbitrary automaton, using terms on the transitions, expressing which sequence of sequential terms (including a special "any" term) are allowed. This is a generalization of MultiPhraseQuery and span queries, and enables "correct" (including position) length search-time graph synonyms. (Mike McCandless)
  • LUCENE-5819: Add OrdsLucene41 block tree terms dict and postings format, to include term ordinals in the index so the optional TermsEnum.ord() and TermsEnum.seekExact(long ord) APIs work. (Mike McCandless)
  • LUCENE-5835: TermValComparator can sort missing values last. (Adrien Grand)
  • LUCENE-5825: Benchmark module can use custom postings format, e.g.: codec.postingsFormat=Memory. (Varun Shenoy, David Smiley)
  • LUCENE-5842: When opening large files (where it's too expensive to compare checksum against all the bytes), retrieve checksum to validate structure of footer; this can detect some forms of corruption such as truncation. (Robert Muir)
  • LUCENE-5739: Added DataInput.readZ(Int|Long) and DataOutput.writeZ(Int|Long) to read and write small signed integers. (Adrien Grand)
  • API Changes:
  • LUCENE-5752: Simplified Automaton API to be immutable. (Mike McCandless)
  • LUCENE-5793: Add equals/hashCode to FieldType. (Shay Banon, Robert Muir)
  • LUCENE-5692: DisjointSpatialFilter is deprecated (used by RecursivePrefixTreeStrategy). (David Smiley)
  • LUCENE-5771: SpatialOperation's predicate names are now aliased to OGC standard names. Thus you can use: Disjoint, Equals, Intersects, Overlaps, Within, Contains, Covers, CoveredBy. The area requirement on the predicates was removed, and Overlaps' definition was fixed. (David Smiley)
  • LUCENE-5850: Made Version handling more robust and extensible. Deprecated Constants.LUCENE_MAIN_VERSION, Constants.LUCENE_VERSION and current Version constants of the form LUCENE_X_Y. Added version constants that include bugfix number of form LUCENE_X_Y_Z. Changed Version.LUCENE_CURRENT to Version.LATEST. CheckIndex now prints the Lucene version used to write each segment. (Ryan Ernst, Uwe Schindler, Robert Muir, Mike McCandless)
  • LUCENE-5836: BytesRef has been split into BytesRef, whose intended usage is to be just a reference to a section of a larger byte[], and BytesRefBuilder, which is a StringBuilder-like class for BytesRef instances. (Adrien Grand)
  • LUCENE-5883: You can now change the MergePolicy instance on a live IndexWriter, without first closing and reopening the writer. This allows you to, e.g., run a special merge with UpgradeIndexMergePolicy without reopening the writer. Also, MergePolicy no longer implements Closeable; if you need to release your custom MergePolicy's resources, you need to implement close() and call it explicitly. (Shai Erera)
  • LUCENE-5859: Deprecate Analyzer constructors taking Version. Use Analyzer.setVersion() to set the version of an analyzer so that it replicates behavior from a specific release. (Ryan Ernst, Robert Muir)
  • Optimizations:
  • LUCENE-5780: Make OrdinalMap more memory-efficient, especially in case the first segment has all values. (Adrien Grand, Robert Muir)
  • LUCENE-5782: OrdinalMap now sorts enums before being built in order to improve compression. (Adrien Grand)
  • LUCENE-5798: Optimize MultiDocsEnum reuse. (Robert Muir)
  • LUCENE-5799: Optimize numeric docvalues merging. (Robert Muir)
  • LUCENE-5797: Optimize norms merging. (Adrien Grand, Robert Muir)
  • LUCENE-5803: Add DelegatingAnalyzerWrapper, an optimized variant of AnalyzerWrapper that doesn't allow to wrap components or readers. This wrapper class is the base class of all analyzers that just delegate to another analyzer, e.g. per field name: PerFieldAnalyzerWrapper and Solr's schema support. (Shay Banon, Uwe Schindler, Robert Muir)
  • LUCENE-5795: MoreLikeThisQuery now only collects the top N terms instead of collecting all terms from the like text when building the query. (Alex Ksikes, Simon Willnauer)
  • LUCENE-5681: Fix RAMDirectory's IndexInput to not do double buffering on slices (causes useless data copying, especially on random access slices). This also improves slices of NRTCachingDirectory, because the cache is based on RAMDirectory. BufferedIndexInput.wrap() was marked with a warning in javadocs. It is almost always a better idea to implement slicing on your own. (Uwe Schindler, Robert Muir)
  • LUCENE-5834: Empty sorted set and numeric doc values are now singletons. (Adrien Grand)
  • LUCENE-5841: Improve performance of block tree terms dictionary when assigning terms to blocks. (Mike McCandless)
  • LUCENE-5856: Optimize Fixed/Open/LongBitSet to remove unnecessary AND. (Robert Muir)
  • LUCENE-5884: Optimize FST.ramBytesUsed. (Adrien Grand, Robert Muir, Mike McCandless)
  • LUCENE-5882: Add Lucene410DocValuesFormat, with faster term lookups for SORTED/SORTED_SET fields. (Robert Muir)
  • LUCENE-5887: Remove WeakIdentityMap caching in AttributeFactory, AttributeSource, and VirtualMethod in favour of Java 7's ClassValue. Always use MethodHandles to create AttributeImpl classes. (Uwe Schindler)
  • Bug Fixes:
  • LUCENE-5796: Fixes the Scorer.getChildren() method for two combinations of BooleanQuery. (Terry Smith via Robert Muir)
  • LUCENE-5790: Fix compareTo in MutableValueDouble and MutableValueBool, this caused incorrect results when grouping on fields with missing values. (海老澤 志信, hossman)
  • LUCENE-5817: Fix hunspell zero-affix handling: previously only zero-strips worked correctly. (Robert Muir)
  • LUCENE-5818, LUCENE-5823: Fix hunspell overgeneration for short strings that also match affixes; words are only stripped to a zero-length string if the FULLSTRIP option is specified in the dictionary. (Robert Muir)
  • LUCENE-5824: Fix hunspell 'long' flag handling. (Robert Muir)
  • LUCENE-5827: Make all Directory implementations correctly fail with IllegalArgumentException if slices are out of bounds. (Uwe Schindler)
  • LUCENE-5838: Fix hunspell when the .aff file has over 64k affixes. (Robert Muir)
  • LUCENE-5844: ArrayUtil.grow/oversize now returns a maximum of Integer.MAX_VALUE - 8 for the maximum array size. (Robert Muir, Mike McCandless)
  • LUCENE-5843: Added IndexWriter.MAX_DOCS which is the maximum number of documents allowed in a single index, and any operations that add documents will now throw IllegalStateException if the max count would be exceeded, instead of silently creating an unusable index. (Mike McCandless)
  • LUCENE-5869: Added restriction to positive values for maxExpansions in FuzzyQuery. (Ryan Ernst)
  • LUCENE-5672: IndexWriter.addIndexes() calls maybeMerge(), to ensure the index stays healthy. If you don't want merging use NoMergePolicy instead. (Robert Muir)
  • LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules. The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences. (Chris Geeringh, Robert Muir, Steve Rowe)
  • LUCENE-5907: Fix corruption case when opening a pre-4.x index with IndexWriter, then opening an NRT reader from that writer, then calling commit from the writer, then closing the NRT reader. This case would remove the wrong files from the index leading to a corrupt index. (Mike McCandless)
  • LUCENE-5908: Fix Lucene43NGramTokenizer to be final
  • Test Framework:
  • LUCENE-5786: Unflushed/ truncated events file (hung testing subprocess). (Dawid Weiss)
  • LUCENE-5881: Add "beasting" of tests: repeats the whole "test" Ant target N times with "ant beast -Dbeast.iters=N". (Uwe Schindler, Robert Muir, Ryan Ernst, Dawid Weiss)
  • Build:
  • LUCENE-5770: Upgrade to JFlex 1.6, which has direct support for supplementary code points - as a result, ICU4J is no longer used to generate surrogate pairs to augment JFlex scanner specifications. (Steve Rowe)
  • SOLR-6358: Remove VcsDirectoryMappings from idea configuration vcs.xml. (Ramkumar Aiyengar via Steve Rowe)
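
LUCENE-5739 above adds DataInput.readZ(Int|Long) and DataOutput.writeZ(Int|Long) for small signed integers. Methods of this kind are typically built on zig-zag encoding, which maps signed values of small magnitude to small non-negative values so that a variable-length encoding stays short. The following is a minimal standalone sketch of that mapping, assuming zig-zag is the technique in use; the class ZigZagSketch and its method names are hypothetical and are not part of Lucene's API.

```java
final class ZigZagSketch {
    // Interleave negative and positive values: 0, -1, 1, -2, 2 map to 0, 1, 2, 3, 4.
    static int encodeInt(int v) {
        return (v << 1) ^ (v >> 31); // arithmetic shift spreads the sign bit
    }

    static int decodeInt(int z) {
        return (z >>> 1) ^ -(z & 1); // low bit carries the sign
    }

    static long encodeLong(long v) {
        return (v << 1) ^ (v >> 63);
    }

    static long decodeLong(long z) {
        return (z >>> 1) ^ -(z & 1L);
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, -1, 1, -2, 2}) {
            int z = encodeInt(v);
            System.out.println(v + " -> " + z + " -> " + decodeInt(z));
        }
    }
}
```

Because the encoded value of a small negative number is itself small, a variable-length writer needs only one or two bytes for it rather than the maximum width.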

New in Apache Lucene 4.9.0 (Jun 26, 2014)

  • Changes in Runtime Behavior:
  • LUCENE-5611: Changing the term vector options for multiple field instances by the same name in one document is no longer accepted; IndexWriter will now throw IllegalArgumentException.
  • LUCENE-5646: Remove rare/undertested bulk merge algorithm in CompressingStoredFieldsWriter.
  • New Features:
  • LUCENE-5610: Add Terms.getMin and Terms.getMax to get the lowest and highest terms, and NumericUtils.get{Min/Max}{Int/Long} to get the minimum numeric values from the provided Terms
  • LUCENE-5675: Add IDVersionPostingsFormat, a postings format optimized for primary-key (ID) fields that also record a version (long) for each ID
  • LUCENE-5680: Add ability to atomically update a set of DocValues fields
  • LUCENE-5717: Add support for multiterm queries nested inside filtered and constant-score queries to postings highlighter
  • LUCENE-5731, LUCENE-5760: Add RandomAccessInput, a random access API for directory. Add DirectReader/Writer, optimized for reading packed integers directly from Directory. Add Lucene49Codec and Lucene49DocValuesFormat that make use of these
  • LUCENE-5743: Add Lucene49NormsFormat, which can compress in some cases such as very short fields
  • LUCENE-5748: Add SORTED_NUMERIC docvalues type, which is efficient for processing numeric fields with multiple values
  • LUCENE-5754: Allow "$" as part of variable and function names in expressions module
  • Changes in Backwards Compatibility Policy:
  • LUCENE-5634: Add reuse argument to IndexableField.tokenStream. This can be used by custom fieldtypes, which don't use the Analyzer, but implement their own TokenStream
  • LUCENE-5640: AttributeSource.AttributeFactory was moved to a top-level class: org.apache.lucene.util.AttributeFactory
  • LUCENE-4371: Removed IndexInputSlicer and Directory.createSlicer() and replaced with IndexInput.slice()
  • LUCENE-5727, LUCENE-5678: Remove IndexOutput.seek, IndexOutput.setLength()
  • API Changes:
  • LUCENE-5756: IndexWriter now implements Accountable and IW#ramSizeInBytes() has been deprecated in favor of IW#ramBytesUsed()
  • LUCENE-5725: MoreLikeThis#like now accepts multiple values per field. The pre-existing method has been deprecated in favor of a variable-arguments method for the like text.
  • LUCENE-5711: MergePolicy accepts an IndexWriter instance on each method rather than holding state against a single IndexWriter instance.
  • LUCENE-5582: Deprecate IndexOutput.length (just use IndexOutput.getFilePointer instead) and IndexOutput.setLength.
  • LUCENE-5621: Deprecate IndexOutput.flush: this is not used by Lucene.
  • LUCENE-5611: Simplified Lucene's default indexing chain / APIs. AttributeSource/TokenStream.getAttribute now returns null if the attribute is not present (previously it threw IllegalArgumentException). StoredFieldsWriter.startDocument no longer receives the number of fields that will be added
  • LUCENE-5632: In preparation for coming Lucene versions, the Version enum constants were renamed to make them more readable. The constant for Lucene 4.9 is now "LUCENE_4_9". Version.parseLeniently() is still able to parse the old strings ("LUCENE_49"). The old identifiers are deprecated and will be removed in Lucene 5.0.
  • LUCENE-5633: Change NoMergePolicy to a singleton with no distinction between compound and non-compound types.
  • LUCENE-5640: The Token class was deprecated. Since Lucene 2.9, TokenStreams are using Attributes, Token is no longer used.
  • LUCENE-5679: Consolidated IndexWriter.deleteDocuments(Term) and IndexWriter.deleteDocuments(Query) with their varargs counterparts.
  • LUCENE-5706: Removed the option to unset a DocValues field through DocValues updates.
  • LUCENE-5700: Added oal.util.Accountable that is now implemented by all classes whose memory usage can be estimated.
  • LUCENE-5708: Remove IndexWriterConfig.clone, so now IndexWriter simply uses the IndexWriterConfig you pass it, and you must create a new IndexWriterConfig for each IndexWriter.
  • LUCENE-5701: Core closed listeners are now available in the AtomicReader API, they used to sit only in SegmentReader.
  • LUCENE-5678: IndexOutput no longer allows seeking, so it is no longer required to use RandomAccessFile to write Indexes. Lucene now uses standard FileOutputStream wrapped with OutputStreamIndexOutput to write index data. BufferedIndexOutput was removed, because buffering and checksumming is provided by FilterOutputStreams, provided by the JDK.
  • LUCENE-5703: BinaryDocValues API changed to work like TermsEnum and not allocate/copy bytes on each access; you are responsible for cloning if you want to keep data around.
  • LUCENE-5695: DocIdSet implements Accountable.
  • LUCENE-5757: Moved RamUsageEstimator's reflection-based processing to RamUsageTester in the test-framework module.
  • LUCENE-5761: Removed DiskDocValuesFormat, it was very inefficient and saved very little RAM over the default codec.
  • LUCENE-5775: Deprecate JaspellLookup.
  • Optimizations:
  • LUCENE-5603: hunspell stemmer more efficiently strips prefixes and suffixes.
  • LUCENE-5599: HttpReplicator did not properly delegate bulk read() to wrapped InputStream.
  • LUCENE-5591: pass an IOContext with estimated flush size when applying DV updates.
  • LUCENE-5634: IndexWriter reuses TokenStream instances for String and Numeric fields by default.
  • LUCENE-5638, LUCENE-5640: TokenStream uses a more performant AttributeFactory by default, that packs the core attributes into one implementation (PackedTokenAttributeImpl), for faster clearAttributes(), saveState(), and restoreState(). In addition, AttributeFactory uses Java 7 MethodHandles for instantiating Attribute implementations.
  • LUCENE-5609: Changed the default NumericField precisionStep from 4 to 8 (for int/float) and 16 (for long/double), for faster indexing time and smaller indices.
  • LUCENE-5670: Add skip/FinalOutput to FST Outputs.
  • LUCENE-4236: Optimize BooleanQuery's in-order scoring. This speeds up some types of boolean queries.
  • LUCENE-5694: Don't score() subscorers in DisjunctionSumScorer or DisjunctionMaxScorer unless score() is called.
  • LUCENE-5720: Optimize DirectPackedReader's decompression.
  • LUCENE-5722: Optimize ByteBufferIndexInput#seek() by specializing implementations. This improves random access as used by docvalues codecs if used with MMapDirectory.
  • LUCENE-5730: FSDirectory.open returns MMapDirectory for 64-bit operating systems, not just Linux and Windows.
  • LUCENE-5703: BinaryDocValues producers don't allocate or copy bytes on each access anymore.
  • LUCENE-5721: Monotonic compression doesn't use zig-zag encoding anymore.
  • LUCENE-5750: Speed up monotonic addressing for BINARY and SORTED_SET docvalues.
  • LUCENE-5751: Speed up MemoryDocValues.
  • LUCENE-5767: OrdinalMap optimizations, that mostly help on low cardinalities.
  • LUCENE-5769: SingletonSortedSetDocValues now supports random access ordinals.
  • Bug fixes:
  • LUCENE-5738: Ensure NativeFSLock prevents opening the file channel for the lock if the lock is already obtained by the JVM. Trying to obtain an already obtained lock in the same JVM can unlock the file, which might allow other processes to lock the file even without explicitly unlocking the FileLock. This behavior is operating system dependent.
  • LUCENE-5673: MMapDirectory: Work around a "bug" in the JDK that throws a confusing OutOfMemoryError wrapped inside IOException if the FileChannel mapping failed because of lack of virtual address space. The IOException is rethrown with more useful information about the problem, omitting the incorrect OutOfMemoryError.
  • LUCENE-5682: NPE in QueryRescorer when Scorer is null
  • LUCENE-5691: DocTermOrds lookupTerm(BytesRef) would return incorrect results if the underlying TermsEnum supports ord() and the insertion point would be at the end.
  • LUCENE-5618, LUCENE-5636: SegmentReader referenced unneeded files following doc-values updates. Now doc-values field updates are written in separate file per field.
  • LUCENE-5684: Make best effort to detect invalid usage of Lucene, when IndexReader is reopened after all files in its index were removed and recreated by the application (the proper way to do this is IndexWriter.deleteAll, or opening an IndexWriter with OpenMode.CREATE)
  • LUCENE-5704: Fix compilation error with Java 8u20.
  • LUCENE-5710: Include the inner exception as the cause and in the exception message when an immense term is hit during indexing
  • LUCENE-5724: CompoundFileWriter was failing to pass through the IOContext in some cases, causing NRTCachingDirectory to cache compound files when it shouldn't, then causing OOMEs.
  • LUCENE-5747: Project-specific settings for the eclipse development environment will prevent automatic code reformatting.
  • LUCENE-5768, LUCENE-5777: Hunspell condition checks containing character classes were buggy.
  • Test Framework:
  • LUCENE-5622: Fail tests if they print over the given limit of bytes to System.out or System.err.
  • LUCENE-5619: Added backwards compatibility tests to ensure we can update existing indexes with doc-values updates.
  • Build:
  • LUCENE-5442: The Ant check-lib-versions target now runs Ivy resolution transitively, then fails the build when it finds a version conflict: when a transitive dependency's version is more recent than the direct dependency's version specified in lucene/ivy-versions.properties. Exceptions are specifiable in lucene/ivy-ignore-conflicts.properties.
  • LUCENE-5715: Upgrade direct dependencies known to be older than transitive dependencies: com.sun.jersey.version:1.8->1.9; com.sun.xml.bind:jaxb-impl:2.2.2->2.2.3-1; commons-beanutils:commons-beanutils:1.7.0->1.8.3; commons-digester:commons-digester:2.0->2.1; commons-io:commons-io:2.1->2.3; commons-logging:commons-logging:1.1.1->1.1.3; io.netty:netty:3.6.2.Final->3.7.0.Final; javax.activation:activation:1.1->1.1.1; javax.mail:mail:1.4.1->1.4.3; log4j:log4j:1.2.16->1.2.17; org.apache.avro:avro:1.7.4->1.7.5; org.tukaani:xz:1.2->1.4; org.xerial.snappy:snappy-java:1.0.4.1->1.0.5
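
LUCENE-5703 above notes that values are no longer allocated or copied on each access and that callers are responsible for cloning anything they want to keep. The following standalone sketch illustrates that general contract with a shared buffer; the ValueSource class, its get method, and its length field are hypothetical stand-ins for illustration and are not Lucene's BinaryDocValues API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ReusedBufferSketch {
    /** Hypothetical source that refills one shared buffer on every call. */
    static final class ValueSource {
        private final String[] values;
        private final byte[] shared = new byte[64];
        int length; // number of valid bytes in the shared buffer

        ValueSource(String... values) {
            this.values = values;
        }

        /** Returns the shared buffer; its contents change on the next call. */
        byte[] get(int doc) {
            byte[] src = values[doc].getBytes(StandardCharsets.UTF_8);
            if (src.length > shared.length) {
                throw new IllegalArgumentException("value too long for this sketch");
            }
            System.arraycopy(src, 0, shared, 0, src.length);
            length = src.length;
            return shared;
        }
    }

    public static void main(String[] args) {
        ValueSource source = new ValueSource("alpha", "beta");

        byte[] kept = source.get(0);                          // reference only, no copy
        byte[] cloned = Arrays.copyOf(kept, source.length);   // explicit clone

        source.get(1); // overwrites the shared buffer

        System.out.println(new String(kept, 0, source.length, StandardCharsets.UTF_8));
        System.out.println(new String(cloned, StandardCharsets.UTF_8));
    }
}
```

The uncloned reference reflects whatever was read last, while the cloned array keeps the earlier value; that is the trade-off a reuse-oriented API makes to avoid per-access allocation.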

New in Apache Lucene 4.8.1 (May 20, 2014)

  • Bug fixes:
  • LUCENE-5639: Fix PositionLengthAttribute implementation in Token class.
  • LUCENE-5635: IndexWriter didn't properly handle IOException on TokenStream.reset(), which could leave the analyzer in an inconsistent state.
  • LUCENE-5599: HttpReplicator did not properly delegate bulk read() to wrapped InputStream.
  • LUCENE-5600: HttpClientBase did not properly consume a connection if a server error occurred.
  • LUCENE-5628: Change getFiniteStrings to iterative not recursive implementation, so that building suggesters on a long suggestion doesn't risk overflowing the stack; previously it consumed one Java stack frame per character in the expanded suggestion. If you are building a suggester this is a nasty trap.
  • LUCENE-5559: Add additional argument validation for CapitalizationFilter and CodepointCountFilter.
  • LUCENE-5641: SimpleRateLimiter would silently rate limit at 8 MB/sec even if you asked for higher rates.
  • LUCENE-5644: IndexWriter clears which threads use which internal thread states on flush, so that if an application reduces how many threads it uses for indexing, that results in a reduction of how many segments are flushed on a full-flush (e.g. to obtain a near-real-time reader).
  • LUCENE-5653: JoinUtil with ScoreMode.Avg on a multi-valued field with more than 256 values would throw exception.
  • LUCENE-5654: Fix various close() methods that could suppress throwables such as OutOfMemoryError, instead returning scary messages that look like index corruption.
  • LUCENE-5656: Fix rare fd leak in SegmentReader when multiple docvalues fields have been updated with IndexWriter.updateXXXDocValue and one hits exception.
  • LUCENE-5660: AnalyzingSuggester.build will now throw IllegalArgumentException if you give it a longer suggestion than it can handle
  • LUCENE-5662: Add missing checks to Field to prevent IndexWriter.abort if a stored value is null.
  • LUCENE-5668: Fix off-by-one in TieredMergePolicy
  • LUCENE-5671: Upgrade ICU version to fix an ICU concurrency problem that could cause exceptions when indexing.
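
LUCENE-5628 above replaces a recursive traversal with an iterative one so that long inputs no longer consume one Java stack frame per character. The following is a minimal sketch of that general technique, using an explicit heap-allocated stack to enumerate the strings accepted by a small acyclic state graph. The class IterativeFiniteStringsSketch, its Frame type, and the map-based graph representation are assumptions made for illustration; they do not reproduce Lucene's automaton classes or its getFiniteStrings implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class IterativeFiniteStringsSketch {
    /** Pending work item: a state to visit plus the characters consumed to reach it. */
    private static final class Frame {
        final int state;
        final String prefix;

        Frame(int state, String prefix) {
            this.state = state;
            this.prefix = prefix;
        }
    }

    /** Enumerates accepted strings; assumes the graph is acyclic, so the language is finite. */
    static List<String> finiteStrings(
            Map<Integer, Map<Character, Integer>> transitions, Set<Integer> accept, int start) {
        List<String> out = new ArrayList<>();
        Deque<Frame> stack = new ArrayDeque<>(); // lives on the heap, not the call stack
        stack.push(new Frame(start, ""));
        while (!stack.isEmpty()) {
            Frame f = stack.pop();
            if (accept.contains(f.state)) {
                out.add(f.prefix);
            }
            Map<Character, Integer> next = transitions.get(f.state);
            if (next == null) {
                continue; // no outgoing transitions
            }
            for (Map.Entry<Character, Integer> e : next.entrySet()) {
                stack.push(new Frame(e.getValue(), f.prefix + e.getKey()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 0 -a-> 1, 1 -b-> 2, 1 -c-> 3; states 2 and 3 accept.
        Map<Integer, Map<Character, Integer>> t = new HashMap<>();
        t.computeIfAbsent(0, k -> new HashMap<>()).put('a', 1);
        t.computeIfAbsent(1, k -> new HashMap<>()).put('b', 2);
        t.computeIfAbsent(1, k -> new HashMap<>()).put('c', 3);
        Set<Integer> accept = new HashSet<>();
        accept.add(2);
        accept.add(3);
        System.out.println(finiteStrings(t, accept, 0));
    }
}
```

The depth of the traversal is bounded by available heap rather than by the thread's stack size, which is the property the entry is after. The sketch copies the prefix at each step for brevity; a production version would avoid that quadratic cost.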

New in Apache Lucene 4.8.0 (Apr 28, 2014)

  • System Requirements:
  • LUCENE-4747, LUCENE-5514: Move to Java 7 as minimum Java version
  • Changes in Runtime Behavior:
  • LUCENE-5472: IndexWriter.addDocument will now throw an IllegalArgumentException if a Term to be indexed exceeds IndexWriter.MAX_TERM_LENGTH. To recreate previous behavior of silently ignoring these terms, use LengthFilter in your Analyzer
  • New Features:
  • LUCENE-5356: Morfologik filter can accept custom dictionary resources
  • LUCENE-5454: Add SortedSetSortField to lucene/sandbox, to allow sorting on multi-valued field
  • LUCENE-5478: CommonTermsQuery now allows to create custom term queries similar to the query parser by overriding a newTermQuery method
  • LUCENE-5477: AnalyzingInfixSuggester now supports near-real-time additions and updates (to change weight or payload of an existing suggestion)
  • LUCENE-5482: Improve default TurkishAnalyzer by adding apostrophe handling suitable for Turkish
  • LUCENE-5479: FacetsConfig subclass can now customize the default per-dim facets configuration
  • LUCENE-5485: Add circumfix support to HunspellStemFilter
  • LUCENE-5224: Add iconv, oconv, and ignore support to HunspellStemFilter
  • LUCENE-5493: SortingMergePolicy, and EarlyTerminatingSortingCollector support arbitrary Sort specifications
  • LUCENE-3758: Allow the ComplexPhraseQueryParser to search ordered or unordered proximity queries
  • LUCENE-5530: ComplexPhraseQueryParser throws ParseException for fielded queries
  • LUCENE-5513: Add IndexWriter.updateBinaryDocValue which lets you update the value of a BinaryDocValuesField without reindexing the document(s)
  • LUCENE-4072: Add ICUNormalizer2CharFilter, which lets you do unicode normalization with offset correction before the tokenizer
  • LUCENE-5476: Add RandomSamplingFacetsCollector for computing facets on a sampled set of matching hits, in cases where there are millions of hits
  • LUCENE-4984: Add SegmentingTokenizerBase, abstract class for tokenizers that want to do two-pass tokenization such as by sentence and then by word
  • LUCENE-5489: Add Rescorer/QueryRescorer, to resort the hits from a first pass search using scores from a more costly second pass search
  • LUCENE-5528: Add context to suggesters (InputIterator and Lookup classes), and fix AnalyzingInfixSuggester to handle contexts. Suggester contexts allow you to filter suggestions
  • LUCENE-5545: Add SortRescorer and Expression.getRescorer, to resort the hits from a first pass search using a Sort or an Expression
  • LUCENE-5558: Add TruncateTokenFilter which truncates terms to the specified length
  • LUCENE-2446: Added checksums to lucene index files. As of 4.8, the last 8 bytes of each file contain a zlib-crc32 checksum. Small metadata files are verified on load. Larger files can be checked on demand via AtomicReader.checkIntegrity. You can configure this to happen automatically before merges by enabling IndexWriterConfig.setCheckIntegrityAtMerge
  • LUCENE-5580: Checksums are automatically verified on the default stored fields format when performing a bulk merge
  • LUCENE-5602: Checksums are automatically verified on the default term vectors format when performing a bulk merge
  • LUCENE-5583: Added DataInput.skipBytes. ChecksumIndexInput can now seek, but only forward
  • LUCENE-5588: Lucene now calls fsync() on the index directory, ensuring that all file metadata is persisted on disk in case of power failure. This does not work on all file systems and operating systems, but Linux and MacOSX are known to work. On Windows, fsyncing a directory is not possible with Java APIs
  • API Changes:
  • LUCENE-5454: Add RandomAccessOrds, an optional extension of SortedSetDocValues that supports random access to the ordinals in a document
  • LUCENE-5468: Move offline Sort (from suggest module) to OfflineSort
  • LUCENE-5493: SortingMergePolicy and EarlyTerminatingSortingCollector take Sort instead of Sorter. BlockJoinSorter is removed, replaced with BlockJoinComparatorSource, which can take a Sort for ordering of parents and a separate Sort for ordering of children within a block
  • LUCENE-5516: MergeScheduler#merge() now accepts a MergeTrigger as well as a boolean that indicates if a new merge was found in the caller thread before the scheduler was called
  • LUCENE-5487: Separated bulk scorer (new Weight.bulkScorer method) from normal scoring (Weight.scorer) for those queries that can do bulk scoring more efficiently, e.g. BooleanQuery in some cases. This also simplified the Weight.scorer API by removing the two confusing booleans
  • LUCENE-5519: TopNSearcher now allows retrieving incomplete results if the max size of the candidate queue is unknown. The queue can still be bound in order to apply pruning while retrieving the top N but will not throw an exception if too many results are rejected to guarantee an absolutely correct top N result. The TopNSearcher now returns a struct-like class that indicates if the result is complete in the sense of the top N or not. Consumers of this API should assert on the completeness if the bounded queue size is known ahead of time
  • LUCENE-4984: Deprecate ThaiWordFilter and smartcn SentenceTokenizer and WordTokenFilter. These filters would not work correctly with CharFilters and could not be safely placed at an arbitrary position in the analysis chain. Use ThaiTokenizer and HMMChineseTokenizer instead
  • LUCENE-5543: Remove/deprecate Directory.fileExists
  • LUCENE-5573: Move docvalues constants and helper methods to o.a.l.index.DocValues
  • LUCENE-5604: Switched BytesRef.hashCode to MurmurHash3 (32 bit). TermToBytesRefAttribute.fillBytesRef no longer returns the hash code. BytesRefHash now uses MurmurHash3 for its hashing
  • Optimizations:
  • LUCENE-5468: HunspellStemFilter uses 10 to 100x less RAM. It also loads all known openoffice dictionaries without error, and supports an additional longestOnly option for a less aggressive approach
  • LUCENE-4848: Use Java 7 NIO2-FileChannel instead of RandomAccessFile for NIOFSDirectory and MMapDirectory. This allows deleting open files on Windows if NIOFSDirectory is used; mmapped files are still locked
  • LUCENE-5515: Improved TopDocs#merge to create a merged ScoreDoc array with a length of at most the specified size, instead of at most from + size as before
  • LUCENE-5529: Spatial search of non-point indexed shapes should be a little faster due to skipping intersection tests on redundant cells
  • Bug fixes:
  • LUCENE-5483: Fix inaccuracies in HunspellStemFilter. Multi-stage affix-stripping, prefix-suffix dependencies, and COMPLEXPREFIXES now work correctly according to the hunspell algorithm. Removed recursionCap parameter, as it's no longer needed; rules for recursive affix application are driven correctly by continuation classes in the affix file
  • LUCENE-5497: HunspellStemFilter properly handles escaped terms and affixes without conditions
  • LUCENE-5505: HunspellStemFilter ignores BOM markers in dictionaries and handles varying types of whitespace in SET/FLAG commands
  • LUCENE-5507: Fix HunspellStemFilter loading of dictionaries with large amounts of aliases etc before the encoding declaration
  • LUCENE-5111: Fix WordDelimiterFilter to return offsets in correct order
  • LUCENE-5555: Fix SortedInputIterator to correctly encode/decode contexts in presence of payload
  • LUCENE-5559: Add missing argument checks to tokenfilters taking numeric arguments
  • LUCENE-5568: Benchmark module's "default.codec" option didn't work
  • SOLR-5983: HTMLStripCharFilter is treating CDATA sections incorrectly
  • LUCENE-5615: Validate per-segment delete counts at write time, to help catch bugs that might otherwise cause corruption
  • LUCENE-5612: NativeFSLockFactory no longer deletes its lock file. This cannot be done safely without the risk of deleting someone else's lock file. If you use NativeFSLockFactory, you may see write.lock hanging around from time to time: it's harmless
  • LUCENE-5624: Ensure NativeFSLockFactory does not leak file handles if it is unable to obtain the lock
  • LUCENE-5626: Fix bug in SimpleFSLockFactory's obtain() that sometimes threw IOException (ERROR_ACCESS_DENIED) on Windows if the lock file was created concurrently. This error is now handled the same way as in NativeFSLockFactory, by returning false
  • LUCENE-5630: Add missing META-INF entry for UpperCaseFilterFactory
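
LUCENE-2446 above states that, as of 4.8, the last 8 bytes of each index file contain a zlib-crc32 checksum. The following standalone sketch shows the general idea of appending and verifying such a trailing checksum with the JDK's java.util.zip.CRC32. It is a simplification for illustration only: the class ChecksumFooterSketch and its methods are hypothetical, the big-endian layout is an assumption of the sketch, and Lucene's real footer format and helper classes are not reproduced here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class ChecksumFooterSketch {
    /** Returns payload followed by an 8-byte big-endian CRC32 of the payload. */
    static byte[] withFooter(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + Long.BYTES)
                .put(payload)
                .putLong(crc.getValue())
                .array();
    }

    /** Recomputes the CRC32 of everything before the footer and compares it to the stored value. */
    static boolean verify(byte[] file) {
        if (file.length < Long.BYTES) {
            return false; // too short to even hold a footer
        }
        int bodyLength = file.length - Long.BYTES;
        CRC32 crc = new CRC32();
        crc.update(file, 0, bodyLength);
        long stored = ByteBuffer.wrap(file, bodyLength, Long.BYTES).getLong();
        return stored == crc.getValue();
    }

    public static void main(String[] args) {
        byte[] file = withFooter("segment data".getBytes(StandardCharsets.UTF_8));
        System.out.println("intact:    " + verify(file));

        file[3] ^= 0x01; // simulate a flipped bit on disk
        System.out.println("corrupted: " + verify(file));
    }
}
```

Verifying requires reading every byte of the file, which is why the entry distinguishes small metadata files (checked on load) from larger files (checked on demand or before merges).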

New in Apache Lucene 4.7.2 (Apr 15, 2014)

  • Bug Fixes:
  • LUCENE-5574: Closing a near-real-time reader no longer attempts to delete unreferenced files if the original writer has been closed; this could cause index corruption in certain cases where index files were directly changed (deleted, overwritten, etc.) in the index directory outside of Lucene.
  • LUCENE-5570: Don't let FSDirectory.sync() create new zero-byte files, instead throw exception if a file is missing.

New in Apache Lucene 4.7.1 (Apr 2, 2014)

  • Changes in Runtime Behavior:
  • LUCENE-5532: AutomatonQuery.equals is no longer implemented as "accepts same language". This was inconsistent with hashCode, and unnecessary for any subclasses in Lucene. If you desire this in a custom subclass, minimize the automaton.
  • Bug Fixes:
  • LUCENE-5450: Fix getField() NPE issues with SpanOr/SpanNear when they have an empty list of clauses. This can happen for example, when a wildcard matches no terms.
  • LUCENE-5473: Throw IllegalArgumentException, not NullPointerException, if the synonym map is empty when creating SynonymFilter
  • LUCENE-5432: EliasFanoDocIdSet: Fix number of index entry bits when the maximum entry is a power of 2.
  • LUCENE-5466: query is always null in countDocsWithClass() of SimpleNaiveBayesClassifier.
  • LUCENE-5502: Fixed TermsFilter.equals that could return true for different filters.
  • LUCENE-5522: FacetsConfig didn't add drill-down terms for association facet fields labels.
  • LUCENE-5520: ToChildBlockJoinQuery would hit ArrayIndexOutOfBoundsException if a parent document had no children
  • LUCENE-5532: AutomatonQuery.hashCode was not thread-safe.
  • LUCENE-5525: Implement MultiFacets.getAllDims, so you can do sparse facets through DrillSideways, for example.
  • LUCENE-5481: IndexWriter.forceMerge used to run a merge even if there was a single segment in the index.
  • LUCENE-5538: Fix FastVectorHighlighter bug with index-time synonyms when the query is more complex than a single phrase.
  • LUCENE-5544: Exceptions during IndexWriter.rollback could leak file handles and the write lock.
  • LUCENE-4978: Spatial RecursivePrefixTree queries could result in false-negatives for indexed shapes within 1/2 maxDistErr from the edge of the query shape. This meant searching for a point by the same point as a query rarely worked.
  • LUCENE-5553: IndexReader#ReaderClosedListener is not always invoked when IndexReader#close() is called or if refCount is 0. If an exception is thrown during internal close or by any of the close listeners, some or all listeners might be missed. This can cause memory leaks if the core listeners are used to clear caches.
  • Build:
  • LUCENE-5511: "ant precommit" / "ant check-svn-working-copy" now work again with any working copy format (thanks to svnkit 1.8.4).

New in Apache Lucene 4.7.0 (Feb 27, 2014)

  • New Features:
  • LUCENE-5336: Add SimpleQueryParser: parser for human-entered queries.
  • LUCENE-5337: Add Payload support to FileDictionary (Suggest) and make it more configurable
  • LUCENE-5329: suggest: DocumentDictionary and DocumentExpressionDictionary are now lenient for dirty documents (missing the term, weight or payload).
  • LUCENE-5404: Add .getCount method to all suggesters (Lookup); persist count metadata on .store(); Dictionary returns InputIterator; Dictionary.getWordIterator renamed to .getEntryIterator.
  • SOLR-1871: The RangeMapFloatFunction accepts an arbitrary ValueSource as target and default values.
  • LUCENE-5371: Speed up Lucene range faceting from O(N) per hit to O(log(N)) per hit using segment trees; this only really starts to matter in practice if the number of ranges is over 10 or so.
  • LUCENE-5379: Add Analyzer for Kurdish.
  • LUCENE-5369: Added an UpperCaseFilter to make UPPERCASE tokens.
  • LUCENE-5345: Add a new BlendedInfixSuggester, which is like AnalyzingInfixSuggester but boosts suggestions that matched tokens with lower positions.
  • LUCENE-4399: When sorting by String (SortField.STRING), you can now specify whether missing values should be sorted first (the default), using SortField.setMissingValue(SortField.STRING_FIRST), or last, using SortField.setMissingValue(SortField.STRING_LAST).
  • LUCENE-5099: QueryNode should have the ability to detach from its node parent. Added QueryNode.removeFromParent() that allows a node to be detached from its parent node.
  • LUCENE-5395, LUCENE-5451: Upgrade to Spatial4j 0.4.1: Parses WKT (including ENVELOPE) with extension "BUFFER"; buffering a point results in a Circle. JTS isn't needed for WKT any more but remains required for Polygons. New Shapes: ShapeCollection and BufferedLineString. Various other improvements and bug fixes too.
  • LUCENE-5415: Add multitermquery (wildcards,prefix,etc) to PostingsHighlighter.
  • LUCENE-3069: Add two memory resident dictionaries (FST terms dictionary and FSTOrd terms dictionary) to improve primary key lookups. The PostingsBaseFormat API is also changed so that term dictionaries get the ability to block encode term metadata, and all dictionary implementations can now plug in any PostingsBaseFormat.
  • LUCENE-5353: ShingleFilter's filler token should be configurable.
  • LUCENE-5320: Add SearcherTaxonomyManager over search and taxonomy index directories (i.e. not only NRT).
  • LUCENE-5410: Add fuzzy and near support via '~' operator to SimpleQueryParser.
  • LUCENE-5426: Make SortedSetDocValuesReaderState abstract to allow custom implementations for Lucene doc values faceting
  • LUCENE-5434: NRT support for file systems that do not have delete on last close or cannot delete while referenced semantics.
  • LUCENE-5418: Drilling down or sideways on a Lucene facet range (using Range.getFilter()) is now faster for costly filters (uses random access, not iteration); range facet counts now accept a fast-match filter to avoid computing the value for documents that are out of bounds, e.g. using a bounding box filter with distance range faceting.
  • LUCENE-5440: Add LongBitSet for managing more than 2.1B bits (otherwise use FixedBitSet).
  • LUCENE-5437: ASCIIFoldingFilter now has an option to preserve the original token and emit it on the same position as the folded token only if the actual token was folded.
  • LUCENE-5408: Add spatial SerializedDVStrategy that serializes a binary representation of a shape into BinaryDocValues. It supports exact geometry relationship calculations.
  • LUCENE-5457: Add SloppyMath.earthDiameter(double latitude) that returns an approximate value of the diameter of the earth at the given latitude.
  • Build:
  • LUCENE-5217, LUCENE-5420: Maven config: get dependencies from Ant+Ivy config; disable transitive dependency resolution for all depended-on artifacts by putting an exclusion for each transitive dependency in the <dependencyManagement> section of the grandparent POM.
  • LUCENE-5322: Clean up / simplify Maven-related Ant targets.
  • LUCENE-5347: Upgrade forbidden-apis checker to version 1.4.
  • LUCENE-4381: Upgrade analysis/icu to 52.1.
  • LUCENE-5357: Upgrade StandardTokenizer and UAX29URLEmailTokenizer to Unicode 6.3; update UAX29URLEmailTokenizer's recognized top level domains in URLs and Emails from the IANA Root Zone Database.
  • LUCENE-5360: Add support for developing in Netbeans IDE.
  • SOLR-5590: Upgrade HttpClient/HttpComponents to 4.3.x.
  • LUCENE-5385: "ant precommit" / "ant check-svn-working-copy" now work for SVN 1.8 or GIT checkouts. The ANT target prints a warning instead of failing. It also instructs the user how to run on SVN 1.8 working copies.
  • LUCENE-5383: fix changes2html to link pull requests
  • LUCENE-5411: Upgrade to released JFlex 1.5.0; stop requiring a locally built JFlex snapshot jar.
  • LUCENE-5465: Solr Contrib "map-reduce" breaks Manifest of all other JAR files by adding a broken Main-Class attribute.
  • Bug fixes:
  • LUCENE-5285: Improved highlighting of multi-valued fields with FastVectorHighlighter.
  • LUCENE-5391: UAX29URLEmailTokenizer should not tokenize no-scheme domain-only URLs that are followed by an alphanumeric character.
  • LUCENE-5405: If an analysis component throws an exception, Lucene logs the field name to the info stream to assist in diagnosis.
  • SOLR-5661: PriorityQueue now refuses to allocate itself if the incoming maxSize is too large
  • LUCENE-5228: IndexWriter.addIndexes(Directory[]) now acquires a write lock in each Directory, to ensure that no open IndexWriter is changing the incoming indices. This also means that you cannot pass the same Directory to multiple concurrent addIndexes calls (which is unusual anyway).
  • LUCENE-5415: SpanMultiTermQueryWrapper didn't handle its boost in hashcode/equals/tostring/rewrite.
  • LUCENE-5409: ToParentBlockJoinCollector.getTopGroups would fail to return any groups when the joined query required more than one rewrite step
  • LUCENE-5398: NormValueSource was incorrectly casting the long value to byte, before calling Similarity.decodeNormValue.
  • LUCENE-5436: ReferenceManager#acquire can result in an infinite loop if the managed resource is abused outside of the ReferenceManager. Decrementing the reference without a corresponding incRef() call can cause an infinite loop. ReferenceManager now throws IllegalStateException if the currently managed resource's ref count is 0.
  • LUCENE-5443: Lucene45DocValuesProducer.ramBytesUsed() may throw ConcurrentModificationException.
  • LUCENE-5444: MemoryIndex didn't respect the analyzer's offset gap, and offsets were corrupted if multiple fields with the same name were added to the memory index.
  • LUCENE-5447: StandardTokenizer should break at consecutive chars matching Word_Break = MidLetter, MidNum and/or MidNumLet
  • LUCENE-5462: RamUsageEstimator.sizeOf(Object) is not used anymore to estimate memory usage of segments. This used to make SegmentReader.ramBytesUsed very CPU-intensive.
  • LUCENE-5461: ControlledRealTimeReopenThread would sometimes wait too long (up to targetMaxStaleSec) when a searcher is waiting for a specific generation, when it should have waited for at most targetMinStaleSec.
  • API Changes:
  • LUCENE-5339: The facet module was simplified/reworked to make the APIs more approachable to new users. Note: when migrating to the new API, you must pass the Document that is returned from FacetsConfig.build() to IndexWriter.addDocument().
  • LUCENE-5405: Make ShingleAnalyzerWrapper.getWrappedAnalyzer() public final
  • LUCENE-5395: The SpatialArgsParser now only reads WKT, no more "lat, lon" etc. but it's easy to override the parseShape method if you wish.
  • LUCENE-5414: DocumentExpressionDictionary was renamed to DocumentValueSourceDictionary and all dependencies to the lucene-expression module were removed from lucene-suggest. DocumentValueSourceDictionary now only accepts a ValueSource instead of a convenience ctor for an expression string.
  • LUCENE-3069: PostingsWriterBase and PostingsReaderBase are no longer responsible for encoding/decoding a block of terms. Instead, they should encode/decode each term to/from a long[] and byte[].
  • LUCENE-5425: FacetsCollector and MatchingDocs use a general DocIdSet, allowing for custom implementations to be used when faceting.
  • Optimizations:
  • LUCENE-5372: Replace StringBuffer by StringBuilder, where possible.
  • LUCENE-5271: A slightly more accurate SloppyMath distance.
  • LUCENE-5399: Deep paging using IndexSearcher.searchAfter when sorting by fields is faster
  • Changes in Runtime Behavior:
  • LUCENE-5362: IndexReader and SegmentCoreReaders now throw AlreadyClosedException if the refCount is incremented but is less than 1.
  • Documentation:
  • LUCENE-5384: Add some tips for making tokenfilters and tokenizers to the analysis package overview.
  • LUCENE-5389: Add more guidance in the analysis documentation package overview.
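
To illustrate the range-faceting speedup described in LUCENE-5371 above, here is a minimal sketch. It is not Lucene's implementation. It assumes inclusive long ranges, and instead of a full segment tree it uses the simpler underlying idea: cut the number line into elementary intervals at every range boundary, locate each hit's interval with one binary search (O(log N) per hit), and roll the interval counts up into per-range counts once at the end. The class name RangeCounterSketch and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Hypothetical sketch: counts hits per (possibly overlapping) inclusive range. */
final class RangeCounterSketch {
    private final long[] mins;        // inclusive lower bounds
    private final long[] maxs;        // inclusive upper bounds
    private final long[] boundaries;  // sorted start points of elementary intervals
    private final int[] leafCounts;   // hits per elementary interval

    RangeCounterSketch(long[] mins, long[] maxs) {
        this.mins = mins.clone();
        this.maxs = maxs.clone();
        TreeSet<Long> cuts = new TreeSet<>();
        for (int i = 0; i < mins.length; i++) {
            cuts.add(mins[i]);
            cuts.add(maxs[i] + 1); // assumption: maxs[i] < Long.MAX_VALUE
        }
        boundaries = cuts.stream().mapToLong(Long::longValue).toArray();
        leafCounts = new int[Math.max(0, boundaries.length - 1)];
    }

    /** Records one hit with a single binary search, independent of how many ranges overlap. */
    void add(long value) {
        int pos = Arrays.binarySearch(boundaries, value);
        int leaf = pos >= 0 ? pos : -pos - 2; // index of the last boundary <= value
        if (leaf >= 0 && leaf < leafCounts.length) {
            leafCounts[leaf]++;
        }
    }

    /** Rolls elementary-interval counts up into one count per range. */
    int[] counts() {
        int[] result = new int[mins.length];
        for (int i = 0; i < mins.length; i++) {
            int lo = Arrays.binarySearch(boundaries, mins[i]);
            int hi = Arrays.binarySearch(boundaries, maxs[i] + 1);
            for (int leaf = lo; leaf < hi; leaf++) {
                result[i] += leafCounts[leaf];
            }
        }
        return result;
    }
}
```

The per-hit cost depends only on the number of distinct boundaries, which is why the gain over checking every range for every hit only shows up once there are more than a handful of ranges.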

New in Apache Lucene 4.6.1 (Jan 28, 2014)

  • Bug fixes:
  • LUCENE-5373: Memory usage of [Lucene40/Lucene42/Memory/Direct]DocValuesFormat was over-estimated.
  • LUCENE-5361: Fixed handling of query boosts in FastVectorHighlighter.
  • LUCENE-5374: IndexWriter processes internal events after it closed itself internally. This rare condition can happen if an IndexWriter has internal changes that were not fully applied yet, such as when index / flush requests happen concurrently with the close or rollback call.
  • LUCENE-5394: Fix TokenSources.getTokenStream to return payloads if they were indexed with the term vectors.
  • LUCENE-5344: Flexible StandardQueryParser behaves differently than ClassicQueryParser.
  • LUCENE-5375: ToChildBlockJoinQuery works harder to detect mis-use, when the parent query incorrectly returns child documents, and throw a clear exception saying so.
  • LUCENE-5401: Field.StringTokenStream#end() calls super.end() now, preventing wrong term positions for fields that use StringTokenStream.
  • LUCENE-5377: IndexWriter.addIndexes(Directory[]) would cause corruption on Lucene 4.6 if any index segments were Lucene 4.0-4.5.

New in Apache Lucene 4.6.0 (Nov 27, 2013)

  • New Features:
  • LUCENE-4906: PostingsHighlighter can now render to custom Object, for advanced use cases where String is too restrictive
  • LUCENE-5133: Changed AnalyzingInfixSuggester.highlight to return Object instead of String, to allow for advanced use cases where String is too restrictive
  • LUCENE-5207, LUCENE-5334: Added expressions module for customizing ranking with script-like syntax. (Jack Conradson, Ryan Ernst, Uwe Schindler via Robert Muir)
  • LUCENE-5180: ShingleFilter now creates shingles with trailing holes, for example if a StopFilter had removed the last token.
  • LUCENE-5219: Add support to SynonymFilterFactory for custom parsers.
  • LUCENE-5235: Tokenizers now throw an IllegalStateException if the consumer does not call reset() before consuming the stream. Previous versions threw NullPointerException or ArrayIndexOutOfBoundsException on a best-effort basis, which was not user-friendly.
  • LUCENE-5240: Tokenizers now throw an IllegalStateException if the consumer neglects to call close() on the previous stream before consuming the next one.
  • LUCENE-5214: Add new FreeTextSuggester, to predict the next word using a simple ngram language model. This is useful for the "long tail" suggestions, when a primary suggester fails to find a suggestion.
  • LUCENE-5251: New DocumentDictionary allows building suggesters via contents of existing field, weight and optionally payload stored fields in an index
  • LUCENE-5261: Add QueryBuilder, a simple API to build queries from the analysis chain directly, or to make it easier to implement query parsers.
  • LUCENE-5270: Add Terms.hasFreqs, to determine whether a given field indexed per-doc term frequencies.
  • LUCENE-5269: Add CodepointCountFilter.
  • LUCENE-5294: Suggest module: add DocumentExpressionDictionary to compute each suggestion's weight using a javascript expression.
  • LUCENE-5274: FastVectorHighlighter now supports highlighting against several indexed fields.
  • LUCENE-5304: SingletonSortedSetDocValues can now return the wrapped SortedDocValues
  • LUCENE-2844: The benchmark module can now test the spatial module. See spatial.alg
  • LUCENE-5302: Make StemmerOverrideMap's methods public
  • LUCENE-5296: Add DirectDocValuesFormat, which holds all doc values in heap as uncompressed java native arrays.
  • LUCENE-5189: Add IndexWriter.updateNumericDocValues, to update numeric DocValues fields of documents, without re-indexing them.
  • LUCENE-5298: Add SumValueSourceFacetRequest for aggregating facets by a ValueSource, such as a NumericDocValuesField or an expression.
  • LUCENE-5323: Add .sizeInBytes method to all suggesters (Lookup).
  • LUCENE-5312: Add BlockJoinSorter, a new Sorter implementation that makes sure to never split up blocks of documents indexed with IndexWriter.addDocuments.
  • LUCENE-5297: Allow to range-facet on any ValueSource, not just NumericDocValues fields.
  • Bug Fixes:
  • LUCENE-5272: OpenBitSet.ensureCapacity did not modify numBits, causing false assertion errors in fastSet.
  • LUCENE-5303: OrdinalsCache did not use coreCacheKey, resulting in over caching across multiple threads.
  • LUCENE-5307: Fix topScorer inconsistency in handling QueryWrapperFilter inside ConstantScoreQuery, which now rewrites to a query removing the obsolete QueryWrapperFilter.
  • LUCENE-5330: IndexWriter didn't process all internal events on #getReader(), #close() and #rollback() which causes files to be deleted at a later point in time. This could cause short-term disk pollution or OOM if in-memory directories are used.
  • LUCENE-5342: Fixed bulk-merge issue in CompressingStoredFieldsFormat which created corrupted segments when mixing chunk sizes. Lucene41StoredFieldsFormat is not impacted.
  • API Changes:
  • LUCENE-5222: Add SortField.needsScores(). Previously it was not possible for a custom Sort that makes use of the relevance score to work correctly with IndexSearcher when an ExecutorService is specified.
  • LUCENE-5275: Change AttributeSource.toString() to display the current state of attributes.
  • LUCENE-5277: Modify FixedBitSet copy constructor to take an additional numBits parameter to allow growing/shrinking the copied bitset. You can use FixedBitSet.clone() if you only need to clone the bitset.
  • LUCENE-5260: Use TermFreqPayloadIterator for all suggesters; those suggesters that can't support payloads will throw an exception if hasPayloads() is true.
  • LUCENE-5280: Rename TermFreqPayloadIterator -> InputIterator, along with associated suggest/spell classes.
  • LUCENE-5157: Rename OrdinalMap methods to clarify API and internal structure.
  • LUCENE-5313: Move preservePositionIncrements from setter to ctor in Analyzing/FuzzySuggester.
  • LUCENE-5321: Remove Facet42DocValuesFormat. Use DirectDocValuesFormat if you want to load the category list into memory.
  • LUCENE-5324: AnalyzerWrapper.getPositionIncrementGap and getOffsetGap can now be overridden.
  • Optimizations:
  • LUCENE-5225: The ToParentBlockJoinQuery only keeps track of the child doc ids and child scores if the ToParentBlockJoinCollector is used.
  • LUCENE-5236: EliasFanoDocIdSet now has an index and uses broadword bit selection to speed-up advance().
  • LUCENE-5266: Improved number of read calls and branches in DirectPackedReader.
  • LUCENE-5300: Optimized SORTED_SET storage for fields which are single-valued.
  • Documentation:
  • LUCENE-5211: Better javadocs and error checking of 'format' option in StopFilterFactory, as well as comments in all snowball formatted files about specifying format option.
  • Changes in backwards compatibility policy:
  • LUCENE-5235: Subclasses of Tokenizer have to call super.reset() when implementing reset(). Otherwise the consumer will get an IllegalStateException because the Reader is not correctly assigned. It is important to never change the "input" field on Tokenizer without using setReader(). The "input" field must not be used outside reset(), incrementToken(), or end() - especially not in the constructor.
  • LUCENE-5204: Directory doesn't have default implementations for LockFactory-related methods, which have been moved to BaseDirectory. If you had a custom Directory implementation that extended Directory, you need to extend BaseDirectory instead.
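
The consumer contract behind LUCENE-5235 and LUCENE-5240 above can be sketched with a toy whitespace tokenizer. This is a hypothetical class, not a Lucene API: ContractTokenizerSketch and its methods only mimic the names setReader(), reset(), incrementToken() and close(), and it takes a String where a real Tokenizer takes a Reader so that the sketch stays free of checked exceptions. The point it shows is that the input only becomes readable inside reset(), so skipping reset() or close() fails fast with an IllegalStateException.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy tokenizer that enforces setReader() -> reset() -> incrementToken()* -> close(). */
final class ContractTokenizerSketch {
    private String pending; // handed over by setReader(), not yet readable
    private String input;   // assigned only inside reset()
    private int pos;
    private String term;

    void setReader(String text) {
        if (input != null) {
            throw new IllegalStateException("close() the previous stream first");
        }
        pending = text;
    }

    void reset() {
        if (pending == null) {
            throw new IllegalStateException("setReader() was not called");
        }
        input = pending;
        pending = null;
        pos = 0;
    }

    boolean incrementToken() {
        if (input == null) {
            throw new IllegalStateException("reset() was not called");
        }
        // Skip leading whitespace, then consume one run of non-whitespace characters.
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
        int start = pos;
        while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) pos++;
        if (start == pos) {
            return false;
        }
        term = input.substring(start, pos);
        return true;
    }

    String term() {
        return term;
    }

    void close() {
        input = null;
    }

    /** The full consumer workflow in the required order. */
    static List<String> tokenize(ContractTokenizerSketch tokenizer, String text) {
        List<String> terms = new ArrayList<>();
        tokenizer.setReader(text);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            terms.add(tokenizer.term());
        }
        tokenizer.close();
        return terms;
    }
}
```

Because the toy keeps the same instance reusable after close(), it also shows why a subclass that overrides reset() without delegating would leave the input unassigned.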

New in Apache Lucene 4.5.1 (Oct 25, 2013)

  • Bug Fixes:
  • LUCENE-4998: Fixed a few places to pass IOContext.READONCE instead of IOContext.READ
  • LUCENE-5242: DirectoryTaxonomyWriter.replaceTaxonomy did not fully reset its state, which could result in exceptions being thrown, as well as incorrect ordinals returned from getParent.
  • LUCENE-5254: Fixed bounded memory leak, where objects like live docs bitset were not freed from a starting reader after reopening to a new reader and closing the original one.
  • LUCENE-5262: Fixed file handle leaks when multiple attempts to open an NRT reader hit exceptions.
  • LUCENE-5263: Transient IOExceptions, e.g. due to disk full or file descriptor exhaustion, hit at unlucky times inside IndexWriter could lead to silently losing deletions.
  • LUCENE-5264: CommonTermsQuery ignored minMustMatch if only high-frequent terms were present in the query and the high-frequent operator was set to SHOULD.
  • LUCENE-5269: Fix bug in NGramTokenFilter where it would sometimes count unicode characters incorrectly.
  • LUCENE-5289: IndexWriter.hasUncommittedChanges was returning false when there were buffered delete-by-Term.

New in Apache Lucene 4.5.0 (Oct 7, 2013)

  • New features:
  • LUCENE-5084: Added new Elias-Fano encoder, decoder and DocIdSet implementations.
  • LUCENE-5081: Added WAH8DocIdSet, an in-memory doc id set implementation based on word-aligned hybrid encoding.
  • LUCENE-5098: New broadword utility methods in oal.util.BroadWord.
  • LUCENE-5030: FuzzySuggester now supports optional unicodeAware (default is false). If true then edits are measured in Unicode code points instead of UTF8 bytes.
  • LUCENE-5118: SpatialStrategy.makeDistanceValueSource() now has an optional multiplier for scaling degrees to another unit.
  • LUCENE-5091: SpanNotQuery can now be configured with pre and post slop to act as a hypothetical SpanNotNearQuery.
  • LUCENE-4985: FacetsAccumulator.create() is now able to create a MultiFacetsAccumulator over a mixed set of facet requests. MultiFacetsAccumulator allows wrapping multiple FacetsAccumulators, allowing to easily mix existing and custom ones. TaxonomyFacetsAccumulator supports any FacetRequest which implements createFacetsAggregator and was indexed using the taxonomy index.
  • LUCENE-5153: AnalyzerWrapper.wrapReader allows wrapping the Reader given to inputReader.
  • LUCENE-5155: FacetRequest.getValueOf and .getFacetArraysSource replaced by FacetsAggregator.createOrdinalValueResolver. This gives better options for resolving an ordinal's value by FacetAggregators.
  • LUCENE-5165: Add SuggestStopFilter, to be used with analyzing suggesters, so that a stop word at the very end of the lookup query, and without any trailing token characters, will be preserved. This enables query "a" to suggest apple.
  • LUCENE-5178: Added support for missing values to DocValues fields. AtomicReader.getDocsWithField returns a Bits of documents with a value, and FieldCache.getDocsWithField forwards to that for DocValues fields. Things like SortField.setMissingValue, FunctionValues.exists, and FieldValueFilter now work with DocValues fields.
  • LUCENE-5124: Lucene 4.5 has a new Lucene45Codec with Lucene45DocValues, supporting missing values and with most datastructures residing off-heap. Added "Memory" docvalues format that works entirely in heap, and "Disk" loads no datastructures into RAM. Both of these also support missing values. Added DiskNormsFormat (in case you want norms entirely on disk).
  • LUCENE-2750: Added PForDeltaDocIdSet, an in-memory doc id set implementation based on the PFOR encoding.
  • LUCENE-5186: Added CachingWrapperFilter.getFilter in order to be able to get the wrapped filter.
  • LUCENE-5197: Added SegmentReader.ramBytesUsed to return approximate heap RAM used by index datastructures.
  • Bug Fixes:
  • LUCENE-5116: IndexWriter.addIndexes(IndexReader...) should drop empty (or all deleted) segments.
  • LUCENE-5132: Spatial RecursivePrefixTree Contains predicate will throw an NPE when there's no indexed data and maybe in other circumstances too.
  • LUCENE-5146: The AnalyzingSuggester sort comparator read part of the input key as the weight. As a result, the sorter never sorted by weight first: the weight is only considered if the inputs are equal, in which case the malformed weight was identical as well.
  • LUCENE-5151: Associations FacetsAggregators could enter an infinite loop when some result documents were missing category associations.
  • LUCENE-5152: Fix MemoryPostingsFormat to not modify borrowed BytesRef from FSTEnum seek/lookup which can cause side effects if done on a cached FST root arc.
  • LUCENE-5160: Handle the case where reading from a file or FileChannel returns -1, which could happen in rare cases where something happens to the file between the time we start the read loop (where we check the length) and when we actually do the read.
  • LUCENE-5166: PostingsHighlighter would throw IOOBE if a term spanned the maxLength boundary, made it into the top-N and went to the formatter.
  • LUCENE-4583: Indexing core no longer enforces a limit on maximum length binary doc values fields, but individual codecs (including the default one) have their own limits
  • LUCENE-3849: TokenStreams now set the position increment in end(), so we can handle trailing holes. If you have a custom TokenStream implementing end() then be sure it calls super.end().
  • LUCENE-5192: IndexWriter could allow adding same field name with different DocValueTypes under some circumstances.
  • LUCENE-5191: SimpleHTMLEncoder in Highlighter module broke Unicode outside BMP because it encoded UTF-16 chars instead of codepoints. The escaping of codepoints > 127 was removed (not needed for valid HTML) and missing escaping for ' and / was added.
  • LUCENE-5201: Fixed compression bug in LZ4.compressHC when the input is highly compressible and the start offset of the array to compress is > 0.
  • LUCENE-5221: SimilarityBase did not write norms the same way as DefaultSimilarity if discountOverlaps == false and index-time boosts are present for the field.
  • LUCENE-5223: Fixed IndexUpgrader command line parsing: -verbose is not required and -dir-impl option now works correctly.
  • LUCENE-5245: Fix MultiTermQuery's constant score rewrites to always return a ConstantScoreQuery to make scoring consistent. Previously it returned an empty unwrapped BooleanQuery, if no terms were available, which has a different query norm.
  • LUCENE-5218: In some cases, trying to retrieve or merge a 0-length binary doc value would hit an ArrayIndexOutOfBoundsException.
  • API Changes:
  • LUCENE-5094: Add ramBytesUsed() to MultiDocValues.OrdinalMap.
  • LUCENE-5114: Remove unused boolean useCache parameter from TermsEnum.seekCeil and .seekExact
  • LUCENE-5128: IndexSearcher.searchAfter throws IllegalArgumentException if searchAfter exceeds the number of documents in the reader.
  • LUCENE-5129: CategoryAssociationsContainer no longer supports null association values for categories. If you want to index categories without associations, you should add them using FacetFields.
  • LUCENE-4876: IndexWriter no longer clones the given IndexWriterConfig. If you need to use the same config more than once, e.g. when sharing between multiple writers, make sure to clone it before passing to each writer.
  • LUCENE-5144: StandardFacetsAccumulator renamed to OldFacetsAccumulator, and all associated classes were moved under o.a.l.facet.old. The intention is to remove it one day, when the features it covers (complements, partitions, sampling) will be migrated to the new FacetsAggregator and FacetsAccumulator API. Also, FacetRequest.createAggregator was replaced by OldFacetsAccumulator.createAggregator.
  • LUCENE-5149: CommonTermsQuery now allows to set the minimum number of terms that should match for its high and low frequent sub-queries. Previously this was only supported on the low frequent terms query.
  • LUCENE-5156: CompressingTermVectors TermsEnum no longer supports ord().
  • LUCENE-5161, LUCENE-5164: Fix default chunk sizes in FSDirectory to not be unnecessarily large (now 8192 bytes); also use chunking when writing to index files. FSDirectory#setReadChunkSize() is now deprecated and will be removed in Lucene 5.0.
  • LUCENE-5170: Analyzer.ReuseStrategy instances are now stateless and can be reused in other Analyzer instances, which was not possible before. Lucene now ships with stateless singletons for per field and global reuse. Legacy code can still instantiate the deprecated implementation classes, but new code should use the constants. Implementors of custom strategies have to take care of new method signatures. AnalyzerWrapper can now be configured to use a custom strategy, too, ideally the one from the wrapped Analyzer. Analyzer adds a getter to retrieve the strategy for this use-case.
  • LUCENE-5173: Lucene never writes segments with 0 documents anymore.
  • LUCENE-5178: SortedDocValues always returns -1 ord when a document is missing a value for the field. Previously it only did this if the SortedDocValues was produced by uninversion on the FieldCache.
  • LUCENE-5183: remove BinaryDocValues.MISSING. In order to determine a document is missing a field, use getDocsWithField instead.
  • Changes in Runtime Behavior:
  • LUCENE-5178: DocValues codec consumer APIs (iterables) return null values when the document has no value for the field.
  • LUCENE-5200: The HighFreqTerms command-line tool returns the true top-N by totalTermFreq when using the -t option; it uses the term statistics (faster) and now always shows totalTermFreq in the output.
  • Optimizations:
  • LUCENE-5088: Added TermFilter to filter docs by a specific term.
  • LUCENE-5119: DiskDV keeps the document-to-ordinal mapping on disk for SortedDocValues.
  • LUCENE-5145: New AppendingPackedLongBuffer, a new variant of the former AppendingLongBuffer which assumes values are 0-based.
  • LUCENE-5145: All Appending*Buffer now support bulk get.
  • LUCENE-5140: Fixed a performance regression of span queries caused by LUCENE-4946.
  • LUCENE-5150: Make WAH8DocIdSet able to inverse its encoding in order to compress dense sets efficiently as well.
  • LUCENE-5159: Prefix-code the sorted/sortedset value dictionaries in DiskDV.
  • LUCENE-5170: Fixed several wrapper analyzers to inherit the reuse strategy of the wrapped Analyzer.
  • LUCENE-5006: Simplified DocumentsWriter and DocumentsWriterPerThread synchronization and concurrent interaction with IndexWriter. DWPT is now only set up once and has no reset logic. All segment publishing and state transition from DWPT into IndexWriter is now done via an Event-Queue processed from within the IndexWriter in order to prevent situations where DWPT or DW calling into IW causes deadlocks.
  • LUCENE-5182: Terminate phrase searches early if max phrase window is exceeded in FastVectorHighlighter to prevent very long running phrase extraction if phrase terms are high frequent.
  • LUCENE-5188: CompressingStoredFieldsFormat now slices chunks containing big documents into fixed-size blocks so that requesting a single field does not necessarily force decompression of the whole chunk.
  • LUCENE-5101: CachingWrapper makes it easier to plug-in a custom cacheable DocIdSet implementation and uses WAH8DocIdSet by default, which should be more memory efficient than FixedBitSet on average as well as faster on small sets.
  • Documentation:
  • LUCENE-4894: remove facet userguide as it was outdated. Partially absorbed into package's documentation and classes javadocs.
  • LUCENE-5206: Clarify FuzzyQuery's unexpected behavior on short terms.
  • Changes in backwards compatibility policy:
  • LUCENE-5141: CheckIndex.fixIndex(Status,Codec) is now CheckIndex.fixIndex(Status). If you used to pass a codec to this method, just remove it from the arguments.
  • LUCENE-5089, SOLR-5126: Update to Morfologik 1.7.1. MorfologikAnalyzer and MorfologikFilter no longer support multiple "dictionaries" as there is only one dictionary available.
  • LUCENE-5170: Changed method signatures of Analyzer.ReuseStrategy to take Analyzer. Closeable interface was removed because the class was changed to be stateless.
  • LUCENE-5187: SlowCompositeReaderWrapper constructor is now private, SlowCompositeReaderWrapper.wrap should be used instead.
  • LUCENE-5101: CachingWrapperFilter doesn't always return FixedBitSet instances anymore. Users of the join module can use oal.search.join.FixedBitSetCachingWrapperFilter instead.
  • Build:
  • SOLR-5159: Manifest includes non-parsed maven variables.
  • LUCENE-5193: Add jar-src as top-level target to generate all Lucene and Solr *-src.jar.
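
The unicodeAware option of FuzzySuggester (LUCENE-5030 above) matters because the same text has different lengths depending on the unit being counted. The following minimal sketch only measures those units with standard JDK calls; it does not use or reproduce FuzzySuggester, and the class name EditUnitsSketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper comparing the two units in which edits can be measured. */
final class EditUnitsSketch {
    /** Length in UTF-8 bytes, the unit used when unicodeAware is false. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Length in Unicode code points, the unit used when unicodeAware is true. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }
}
```

For "na\u00EFve" (naïve) this yields 6 UTF-8 bytes but 5 code points, so replacing the single character ï with i changes two bytes at the byte level while it is one edit when counted in code points.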

New in Apache Lucene 4.4.0 (Jul 24, 2013)

  • Changes in backwards compatibility policy:
  • LUCENE-5085: MorfologikFilter will no longer stem words marked as keywords
  • LUCENE-4955: NGramTokenFilter now emits all n-grams for the same token at the same position and preserves the position length and the offsets of the original token.
  • LUCENE-4955: NGramTokenizer now emits n-grams in a different order (a, ab, b, bc, c) instead of (a, b, c, ab, bc) and doesn't trim trailing whitespaces.
  • LUCENE-5042: The n-gram and edge n-gram tokenizers and filters now correctly handle supplementary characters, and the tokenizers have the ability to pre-tokenize the input stream similarly to CharTokenizer.
  • LUCENE-4967: NRTManager is replaced by ControlledRealTimeReopenThread, for controlling which requests must see which indexing changes, so that it can work with any ReferenceManager
  • LUCENE-4973: SnapshotDeletionPolicy no longer requires a unique String id
  • LUCENE-4946: The internal sorting API (SorterTemplate, now Sorter) has been completely refactored to allow for a better implementation of TimSort.
  • LUCENE-4963: Some TokenFilter options that generate broken TokenStreams have been deprecated: updateOffsets=true on TrimFilter and enablePositionIncrements=false on all classes that inherit from FilteringTokenFilter: JapanesePartOfSpeechStopFilter, KeepWordFilter, LengthFilter, StopFilter and TypeTokenFilter.
  • LUCENE-4963: In order not to take position increments into account in suggesters, you now need to call setPreservePositionIncrements(false) instead of configuring the token filters to not increment positions.
  • LUCENE-3907: EdgeNGramTokenizer now supports maxGramSize > 1024, doesn't trim the input, sets position increment = 1 for all tokens and doesn't support backward grams anymore.
  • LUCENE-3907: EdgeNGramTokenFilter does not support backward grams and does not update offsets anymore.
  • LUCENE-4981: PositionFilter is now deprecated as it can corrupt token stream graphs. Since its main use-case was to make query parsers generate boolean queries instead of phrase queries, it is now advised to use QueryParser.setAutoGeneratePhraseQueries(false) (for simple cases) or to override QueryParser.newFieldQuery.
  • LUCENE-5018: CompoundWordTokenFilterBase and its children DictionaryCompoundWordTokenFilter and HyphenationCompoundWordTokenFilter don't update offsets anymore.
  • LUCENE-5015: SamplingAccumulator no longer corrects the counts of the sampled categories. You should set TakmiSampleFixer on SamplingParams if required (but notice that this means slower search).
  • LUCENE-4933: Replace ExactSimScorer/SloppySimScorer with just SimScorer. Previously there were 2 implementations as a performance hack to support tableization of sqrt(), but this caching is removed, as sqrt is implemented in hardware with modern JVMs and it's faster not to cache.
  • LUCENE-5038: MergePolicy now has a default implementation for useCompoundFile based on segment size and noCFSRatio. The default implementation was pulled up from TieredMergePolicy.
  • LUCENE-5063: FieldCache.get(Bytes|Shorts), SortField.Type.(BYTE|SHORT) and FieldCache.DEFAULT_(BYTE|SHORT|INT|LONG|FLOAT|DOUBLE)_PARSER are now deprecated. These methods/types assume that data is stored as strings although Lucene has much better support for numeric data through (Int|Long)Field, NumericRangeQuery and FieldCache.get(Int|Long)s.
  • LUCENE-5078: TFIDFSimilarity lets you encode the norm value as any arbitrary long. As a result, encode/decodeNormValue were made abstract with their signatures changed. The default implementation was moved to DefaultSimilarity, which encodes the norm as a single-byte value.
  • Bug Fixes:
  • LUCENE-4890: QueryTreeBuilder.getBuilder() only finds interfaces on the most derived class.
  • LUCENE-4997: Internal test framework's tests are sensitive to previous test failures and tests.failfast.
  • LUCENE-4955: NGramTokenizer now supports inputs larger than 1024 chars.
  • LUCENE-4959: Fix incorrect return value in SimpleNaiveBayesClassifier.assignClass.
  • LUCENE-4972: DirectoryTaxonomyWriter created empty commits even if no changes were made.
  • LUCENE-949: AnalyzingQueryParser can't work with leading wildcards.
  • LUCENE-4980: Fix issues preventing mixing of RangeFacetRequest and non-RangeFacetRequest when using DrillSideways.
  • LUCENE-4996: Ensure DocInverterPerField always includes field name in exception messages.
  • LUCENE-4992: Fix constructor of CustomScoreQuery to take FunctionQuery for scoringQueries. Instead use QueryValueSource to safely wrap arbitrary queries and use them with CustomScoreQuery.
  • LUCENE-5016: SamplingAccumulator returned inconsistent label if asked to aggregate a non-existing category. Also fixed a bug in RangeAccumulator if some readers did not have the requested numeric DV field.
  • LUCENE-5028: Remove pointless and confusing doShare option in FST's PositiveIntOutputs
  • LUCENE-5032: Fix IndexOutOfBoundsExc in PostingsHighlighter when multi-valued fields exceed maxLength
  • LUCENE-4933: SweetSpotSimilarity didn't apply its tf function to some queries (SloppyPhraseQuery, SpanQueries).
  • LUCENE-5033: SlowFuzzyQuery was accepting too many terms (documents) when provided minSimilarity is an int > 1
  • LUCENE-5045: DrillSideways.search did not work on an empty index.
  • LUCENE-4995: CompressingStoredFieldsReader now only reuses an internal buffer when there is no more than 32kb to decompress. This prevents out-of-memory errors when working with large stored fields.
  • LUCENE-5062: If the spatial data for a document was comprised of multiple overlapping or adjacent parts then a CONTAINS predicate query might not match when the sum of those shapes contains the query shape but none do individually. A flag was added to use the original faster algorithm.
  • LUCENE-4971: Fixed NPE in AnalyzingSuggester when there are too many graph expansions.
  • LUCENE-5080: Combined setMaxMergeCount and setMaxThreadCount into one setter in ConcurrentMergeScheduler: setMaxMergesAndThreads. Previously these setters would not work unless you invoked them very carefully.
  • LUCENE-5068: QueryParserUtil.escape() does not escape forward slash.
  • LUCENE-5103: A join on a single-valued field with deleted docs scored too few docs.
  • LUCENE-5090: Detect mismatched readers passed to SortedSetDocValuesReaderState and SortedSetDocValuesAccumulator.
  • LUCENE-5120: AnalyzingSuggester modified its FST's cached root arc if payloads were used and the entire output resided on the root arc on the first access. This caused subsequent suggest calls to fail.
  • Optimizations:
  • LUCENE-4936: Improve numeric doc values compression in case all values share a common divisor. In particular, this improves the compression ratio of dates without time when they are encoded as milliseconds since Epoch. Also support TABLE compressed numerics in the Disk codec.
  • LUCENE-4951: DrillSideways uses the new Scorer.cost() method to make better decisions about which scorer to use internally.
  • LUCENE-4976: PersistentSnapshotDeletionPolicy writes its state to a single snapshots_N file, and no longer requires closing
  • LUCENE-5035: Compress addresses in FieldCacheImpl.SortedDocValuesImpl more efficiently.
  • LUCENE-4941: Sort "from" terms only once when using JoinUtil.
  • LUCENE-5050: Close the stored fields and term vectors index files as soon as the index has been loaded into memory to save file descriptors.
  • LUCENE-5086: RamUsageEstimator now uses the official Java 7 API or a proprietary Oracle Java 6 API to get the Hotspot MX bean, preventing AWT classes from being loaded on MacOSX.
  • New Features:
  • LUCENE-5085: MorfologikFilter will no longer stem words marked as keywords
  • LUCENE-5064: Added PagedMutable (internal), a paged extension of PackedInts.Mutable which allows for storing more than 2B values.
  • LUCENE-4766: Added a PatternCaptureGroupTokenFilter that uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.
  • LUCENE-4952: Expose control (protected method) in DrillSideways to force all sub-scorers to be on the same document being collected. This is necessary when using collectors like ToParentBlockJoinCollector with DrillSideways.
  • SOLR-4761: Add SimpleMergedSegmentWarmer, which just initializes terms, norms, docvalues, and so on.
  • LUCENE-4964: Allow arbitrary Query for per-dimension drill-down to DrillDownQuery and DrillSideways, to support future dynamic faceting methods
  • LUCENE-4966: Add CachingWrapperFilter.sizeInBytes()
  • LUCENE-4965: Add dynamic (no taxonomy index used) numeric range faceting to Lucene's facet module
  • LUCENE-4979: LiveFieldValues can work with any ReferenceManager, not just ReferenceManager<IndexSearcher>
  • LUCENE-4975: Added a new Replicator module which can replicate index revisions between server and client.
  • LUCENE-5022: Added FacetResult.mergeHierarchies to merge multiple FacetResult of the same dimension into a single one with the reconstructed hierarchy.
  • LUCENE-5026: Added PagedGrowableWriter, a new internal packed-ints structure that grows the number of bits per value on demand, can store more than 2B values and supports random write and read access.
  • LUCENE-5025: FST's Builder can now handle more than 2.1 billion "tail nodes" while building a minimal FST.
  • LUCENE-5063: FieldCache.DEFAULT.get(Ints|Longs) now uses bit-packing to save memory.
  • LUCENE-5079: IndexWriter.hasUncommittedChanges() returns true if there are changes that have not been committed.
  • SOLR-4565: Extend NorwegianLightStemFilter and NorwegianMinimalStemFilter to handle "nynorsk"
  • LUCENE-5087: Add getMultiValuedSeparator to PostingsHighlighter, for cases where you want a different logical separator between field values. This can be set to e.g. U+2029 PARAGRAPH SEPARATOR if you never want passages to span values.
  • LUCENE-5013: Added ScandinavianFoldingFilterFactory and ScandinavianNormalizationFilterFactory
  • LUCENE-4845: AnalyzingInfixSuggester finds suggestions based on matches to any tokens in the suggestion, not just based on pure prefix matching.
  • API Changes:
  • LUCENE-5077: Make it easier to use compressed norms. Lucene42NormsFormat takes an overhead parameter, so you can easily pass a different value other than PackedInts.FASTEST from your own codec.
  • LUCENE-5097: Analyzer now has an additional tokenStream(String fieldName, String text) method, so wrapping by StringReader for common use is no longer needed. This method uses an internal reusable reader, which was previously only used by the Field class.
  • LUCENE-4542: HunspellStemFilter's maximum recursion level is now configurable.
  • Build:
  • LUCENE-4987: Upgrade randomized testing to version 2.0.10: Test framework may fail internally due to overly aggressive J9 optimizations.
  • LUCENE-5043: The eclipse target now uses the containing directory for the project name. This also enforces UTF-8 encoding when files are copied with filtering.
  • LUCENE-5055: "rat-sources" target now also checks build.xml, ivy.xml, forbidden-api signatures, and parts of resources folders.
  • LUCENE-5072: Automatically patch javadocs generated by JDK versions before 7u25 to work around the frame injection vulnerability (CVE-2013-1571, VU#225657).
  • Tests:
  • LUCENE-4901: TestIndexWriterOnJRECrash should work on any JRE vendor via Runtime.halt().
  • Changes in runtime behavior:
  • LUCENE-5038: New segments written by IndexWriter are now wrapped into CFS by default. DocumentsWriterPerThread doesn't consult MergePolicy anymore to decide if a CFS must be written, instead IndexWriterConfig now has a property to enable / disable CFS for newly created segments.
  • LUCENE-5107: Properties files written by Lucene now use UTF-8 encoding; Unicode is no longer escaped. Reading of legacy properties files with \u escapes is still possible.
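
The NGramTokenizer ordering change described in the LUCENE-4955 entry above can be illustrated without Lucene. The following is a minimal plain-Java sketch, not the tokenizer's actual implementation: the class and method names are hypothetical, and it works on Java chars rather than code points, so it ignores the supplementary-character handling mentioned under LUCENE-5042.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the two n-gram emission orders; not Lucene's NGramTokenizer. */
final class NGramOrderSketch {

    /** Older order: all grams of the smallest size first, then the next size. */
    static List<String> bySize(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int size = minGram; size <= maxGram; size++) {
            for (int start = 0; start + size <= text.length(); start++) {
                grams.add(text.substring(start, start + size));
            }
        }
        return grams;
    }

    /** Newer order: all grams starting at offset 0, then offset 1, and so on. */
    static List<String> byStart(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int size = minGram; size <= maxGram && start + size <= text.length(); size++) {
                grams.add(text.substring(start, start + size));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(bySize("abc", 1, 2));  // [a, b, c, ab, bc]
        System.out.println(byStart("abc", 1, 2)); // [a, ab, b, bc, c]
    }
}
```

Grouping by start offset keeps grams in order of where they begin in the input, which may matter to code that assumed the older size-grouped order.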

New in Apache Lucene 4.3.1 (Jun 20, 2013)

  • Bug Fixes:
  • SOLR-4813: Fix SynonymFilterFactory to allow init parameters for tokenizer factory used when parsing synonyms file.
  • LUCENE-4935: CustomScoreQuery wrongly applied its query boost twice (boost^2).
  • LUCENE-4948: Fixed ArrayIndexOutOfBoundsException in PostingsHighlighter if you had a 64-bit JVM without compressed OOPS: IBM J9, or Oracle with large heap/explicitly disabled.
  • LUCENE-4953: Fixed ParallelCompositeReader to inform ReaderClosedListeners of its synthetic subreaders. FieldCaches keyed on the atomic children will be purged earlier and FC insanity prevented. In addition, ParallelCompositeReader's toString() was changed to better reflect the reader structure.
  • LUCENE-4968: Fixed ToParentBlockJoinQuery/Collector: correctly handle parent hits that had no child matches, don't throw IllegalArgumentEx when the child query has no hits, more aggressively catch cases where childQuery incorrectly matches parent documents
  • LUCENE-4970: Fix boost value of rewritten NGramPhraseQuery.
  • LUCENE-4974: CommitIndexTask was broken if no params were set.
  • LUCENE-4986: Fixed case where a newly opened near-real-time reader fails to reflect a delete from IndexWriter.tryDeleteDocument
  • LUCENE-4991: Fix handling of synonyms in classic QueryParser.getFieldQuery for terms not separated by whitespace. PositionIncrementAttribute was ignored, so with default AND synonyms wrongly became mandatory clauses, and with OR, the coordination factor was wrong.
  • LUCENE-4994: Fix PatternKeywordMarkerFilter to have public constructor.
  • LUCENE-4993: Fix BeiderMorseFilter to preserve custom attributes when inserting tokens with position increment 0.
  • LUCENE-5002: IndexWriter#deleteAll() caused a deadlock in DWPT / DWSC if a DWPT was flushing concurrently while deleteAll() aborted all DWPT. The IW should never wait on DWPT via the flush control while holding on to the IW Lock.
  • Optimizations:
  • LUCENE-4938: Don't use an unnecessarily large priority queue in IndexSearcher methods that take top-N.

New in Apache Lucene 4.3.0 (May 3, 2013)

  • Changes in backwards compatibility policy:
  • LUCENE-4810: EdgeNGramTokenFilter no longer increments position for multiple ngrams derived from the same input token.
  • LUCENE-4822: KeywordTokenFilter is now an abstract class. Subclasses need to implement #isKeyword() in order to mark terms as keywords. The existing functionality has been factored out into a new SetKeywordTokenFilter class.
  • LUCENE-4642: Remove Tokenizer's and subclasses' ctors taking AttributeSource.
  • LUCENE-4833: IndexWriterConfig used to use LogByteSizeMergePolicy when calling setMergePolicy(null) although the default merge policy is TieredMergePolicy. IndexWriterConfig setters now throw an exception when passed null if null is not a valid value.
  • LUCENE-4849: Made ParallelTaxonomyArrays abstract with a concrete implementation for DirectoryTaxonomyWriter/Reader. Also moved it under o.a.l.facet.taxonomy.
  • LUCENE-4876: IndexDeletionPolicy is now an abstract class instead of an interface. IndexDeletionPolicy, MergeScheduler and InfoStream now implement Cloneable.
  • LUCENE-4874: FilterAtomicReader and related classes (FilterTerms, FilterDocsEnum, ...) don't forward anymore to the filtered instance when the method has a default implementation through other abstract methods.
  • LUCENE-4642, LUCENE-4877: Implementors of TokenizerFactory, TokenFilterFactory, and CharFilterFactory now need to provide at least one constructor taking Map to be able to be loaded by the SPI framework (e.g., from Solr). In addition, TokenizerFactory needs to implement the abstract create(AttributeFactory,Reader) method.
  • API Changes:
  • LUCENE-4896: Made PassageFormatter abstract in PostingsHighlighter, made members of DefaultPassageFormatter protected.
  • LUCENE-4844: removed TaxonomyReader.getParent(), you should use TaxonomyReader.getParallelArrays().parents() instead.
  • LUCENE-4742: Renamed spatial 'Node' to 'Cell', along with any method names and variables using this terminology.
  • New Features:
  • LUCENE-4815: DrillSideways now allows more than one FacetRequest per dimension
  • LUCENE-3918: IndexSorter has been ported to 4.3 API and now supports sorting documents by a numeric DocValues field, or reversing the order of the documents in the index. Additionally, apps can implement their own sort criteria.
  • LUCENE-4817: Added KeywordRepeatFilter, which emits each token twice, once as a keyword and once as an ordinary token, allowing stemmers to emit a stemmed version along with the un-stemmed version.
  • LUCENE-4822: PatternKeywordTokenFilter can mark tokens as keywords based on regular expressions.
  • LUCENE-4821: AnalyzingSuggester now uses the ending offset to determine whether the last token was finished or not, so that a query "i " will no longer suggest "Isla de Muerta" for example.
  • LUCENE-4642: Add create(AttributeFactory) to TokenizerFactory and subclasses with ctors taking AttributeFactory.
  • LUCENE-4820: Add payloads to Analyzing/FuzzySuggester, to record an arbitrary byte[] per suggestion
  • LUCENE-4816: Add WholeBreakIterator to PostingsHighlighter for treating the entire content as a single Passage.
  • LUCENE-4827: Add additional ctor to PostingsHighlighter PassageScorer to provide bm25 k1,b,avgdl parameters.
  • LUCENE-4607: Add DocIdSetIterator.cost() and Spans.cost() for optimizing scoring.
  • LUCENE-4795: Add SortedSetDocValuesFacetFields and SortedSetDocValuesAccumulator, to compute topK facet counts from a field's SortedSetDocValues. This method only supports flat (dim/label) facets, is a bit (~25%) slower, has added cost per-IndexReader-open to compute its ordinal map, but it requires no taxonomy index and it tie-breaks facet labels in an understandable (by Unicode sort order) way.
  • LUCENE-4843: Add LimitTokenPositionFilter: don't emit tokens with positions that exceed the configured limit.
  • LUCENE-4832: Add ToParentBlockJoinCollector.getTopGroupsWithAllChildDocs, to retrieve all children in each group.
  • LUCENE-4846: PostingsHighlighter subclasses can override where the String values come from (it still defaults to pulling from stored fields).
  • LUCENE-4853: Add PostingsHighlighter.highlightFields method that takes int[] docIDs instead of TopDocs.
  • LUCENE-4856: If there are no matches for a given field, return the first maxPassages sentences
  • LUCENE-4859: IndexReader now exposes Terms statistics: getDocCount, getSumDocFreq, getSumTotalTermFreq.
  • LUCENE-4862: It is now possible to terminate collection of a single IndexReader leaf by throwing a CollectionTerminatedException in Collector.collect.
  • LUCENE-4752: New SortingMergePolicy (in lucene/misc) that sorts documents before merging segments.
  • LUCENE-4860: Customize scoring and formatting per-field in PostingsHighlighter by subclassing and overriding the getFormatter and/or getScorer methods. This also changes Passage.getMatchTerms() to return BytesRef[] instead of Term[].
  • LUCENE-4839: Added SorterTemplate.timSort, an O(n log n) stable sort algorithm that performs well on partially sorted data.
  • LUCENE-4644: Added support for the "IsWithin" spatial predicate for RecursivePrefixTreeStrategy. It's for matching non-point indexed shapes; if you only have points (1/doc) then "Intersects" is equivalent and faster. See the javadocs.
  • LUCENE-4861: Make BreakIterator per-field in PostingsHighlighter. This means you can override getBreakIterator(String field) to use different mechanisms for e.g. title vs. body fields.
  • LUCENE-4645: Added support for the "Contains" spatial predicate for RecursivePrefixTreeStrategy.
  • LUCENE-4898: DirectoryReader.openIfChanged now allows opening a reader on an IndexCommit starting from a near-real-time reader (previously this would throw IllegalArgumentException).
  • LUCENE-4905: Made the maxPassages parameter per-field in PostingsHighlighter.
  • LUCENE-4897: Added TaxonomyReader.getChildren for traversing a category's children.
  • LUCENE-4902: Added FilterDirectoryReader to allow easy filtering of a DirectoryReader's subreaders.
  • LUCENE-4858: Added EarlyTerminatingSortingCollector to be used in conjunction with SortingMergePolicy, which allows queries on sorted indexes to terminate early when the sort order matches the index order.
  • LUCENE-4904: Added descending sort order to NumericDocValuesSorter.
  • LUCENE-3786: Added SearcherTaxonomyManager, to manage access to both IndexSearcher and DirectoryTaxonomyReader for near-real-time faceting.
  • LUCENE-4915: DrillSideways now allows drilling down on fields that are not faceted.
  • LUCENE-4895: Added support for the "IsDisjointTo" spatial predicate for RecursivePrefixTreeStrategy.
  • LUCENE-4774: Added FieldComparator that allows sorting parent documents based on fields on the child / nested document level.
  • Optimizations:
  • LUCENE-4839: SorterTemplate.merge can now be overridden in order to replace the default implementation which merges in-place by a faster implementation that could require fewer swaps at the expense of some extra memory. ArrayUtil and CollectionUtil override it so that their mergeSort and timSort methods are faster but only require up to 1% of extra memory.
  • LUCENE-4571: Speed up BooleanQuery instances that set minNrShouldMatch by using skipping.
  • LUCENE-4863: StemmerOverrideFilter now uses a FST to represent its overrides in memory.
  • LUCENE-4889: UnicodeUtil.codePointCount implementation replaced with a non-array-lookup version.
  • LUCENE-4923: Speed up BooleanQuery's processing of in-order disjunctions.
  • LUCENE-4926: Speed up DisjunctionMaxQuery.
  • LUCENE-4930: Reduce contention in older/buggy JVMs when using AttributeSource#addAttribute() because java.lang.ref.ReferenceQueue#poll() is implemented using synchronization.
  • Bug Fixes:
  • LUCENE-4868: SumScoreFacetsAggregator used an incorrect index into the scores array.
  • LUCENE-4882: FacetsAccumulator did not allow counting the ROOT category (i.e. counting dimensions).
  • LUCENE-4876: IndexWriterConfig.clone() now clones its MergeScheduler, IndexDeletionPolicy and InfoStream in order to make an IndexWriterConfig and its clone fully independent.
  • LUCENE-4893: Facet counts were multiplied by the number of times FacetsCollector.getFacetResults() was called.
  • LUCENE-4888: Fixed SloppyPhraseScorer, MultiDocs(AndPositions)Enum and MultiSpansWrapper which happened to sometimes call DocIdSetIterator.advance with target
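
The LimitTokenPositionFilter entry (LUCENE-4843) above describes dropping tokens whose position exceeds a configured limit. The following plain-Java sketch models that rule only; it does not use Lucene's TokenStream API. The Token class and method names are hypothetical, and it assumes positions are obtained by summing position increments, with the first token at position 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of limiting tokens by position; not Lucene's LimitTokenPositionFilter. */
final class PositionLimitSketch {

    /** Hypothetical token: its text plus the position increment relative to the previous token. */
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Keeps tokens until the running position exceeds maxPosition. */
    static List<String> limit(List<Token> tokens, int maxPosition) {
        List<String> kept = new ArrayList<>();
        int position = 0;
        for (Token token : tokens) {
            position += token.positionIncrement;
            if (position > maxPosition) {
                break; // increments are never negative, so later tokens cannot fit either
            }
            kept.add(token.text);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
                new Token("quick", 1),
                new Token("fast", 0),   // synonym stacked at the same position as "quick"
                new Token("brown", 1),
                new Token("fox", 1));
        System.out.println(limit(tokens, 2)); // [quick, fast, brown]
    }
}
```

Limiting by position rather than by token count means stacked tokens such as synonyms (increment 0) do not use up the budget.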

New in Apache Lucene 4.2.1 (Apr 4, 2013)

  • Bug Fixes:
  • LUCENE-4713: The SPI components used to load custom codecs or analysis components were fixed to also scan the Lucene ClassLoader in addition to the context ClassLoader, so Lucene is always able to find its own codecs. The special case of a null context ClassLoader is now also supported.
  • LUCENE-4819: seekExact(BytesRef, boolean) did not work correctly with Sorted[Set]DocValuesTermsEnum.
  • LUCENE-4826: PostingsHighlighter was not returning the top N best scoring passages.
  • LUCENE-4854: Fix DocTermOrds.getOrdTermsEnum() to not return negative ord on initial next().
  • LUCENE-4836: Fix SimpleRateLimiter#pause to return the actual time spent sleeping instead of the wakeup timestamp in nanoseconds.
  • LUCENE-4828: BooleanQuery no longer extracts terms from its MUST_NOT clauses.
  • SOLR-4589: Fixed CPU spikes and poor performance in lazy field loading of multivalued fields.
  • LUCENE-4870: Fix bug where an entire index might be deleted by the IndexWriter due to false detection of whether an index exists in the directory when OpenMode.CREATE_OR_APPEND is used. This might also affect applications that set the open mode manually using DirectoryReader#indexExists.
  • LUCENE-4878: Override getRegexpQuery in MultiFieldQueryParser to prevent NullPointerException when regular expression syntax is used with MultiFieldQueryParser.
  • Optimizations:
  • LUCENE-4819: Added Sorted[Set]DocValues.termsEnum(), and optimized the default codec for improved enumeration performance.
  • LUCENE-4854: Speed up TermsEnum of FieldCache.getDocTermOrds.
  • LUCENE-4857: Don't unnecessarily copy stem override map in StemmerOverrideFilter.
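
The LUCENE-4836 entry above distinguishes between returning a wakeup timestamp and returning the time actually spent sleeping. A minimal sketch of the second behavior follows, using only the JDK. The class and method names are hypothetical and this is not the SimpleRateLimiter source.

```java
/** Plain-Java model of a pause that reports elapsed time; not Lucene's SimpleRateLimiter. */
final class PauseSketch {

    /**
     * Sleeps until System.nanoTime() reaches targetNanos (or the thread is interrupted)
     * and returns the number of nanoseconds spent waiting.
     */
    static long pauseUntil(long targetNanos) {
        long start = System.nanoTime();
        long remaining = targetNanos - start;
        while (remaining > 0) {
            try {
                Thread.sleep(remaining / 1_000_000L, (int) (remaining % 1_000_000L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
                break;
            }
            remaining = targetNanos - System.nanoTime();
        }
        // Return a duration, not the wakeup timestamp: nanoTime values have an
        // arbitrary origin and are only meaningful as differences.
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long spent = pauseUntil(System.nanoTime() + 2_000_000L); // ask for about 2 ms
        System.out.println("paused for " + spent + " ns");
    }
}
```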

New in Apache Lucene 4.2.0 (Mar 12, 2013)

  • Changes in backwards compatibility policy:
  • LUCENE-4602: FacetFields now stores facet ordinals in a DocValues field, rather than a payload. This requires either rebuilding existing indexes or doing a one-time migration using FacetsPayloadMigratingReader. Since DocValues support in-memory caching, CategoryListCache was removed too. (Shai Erera, Michael McCandless)
  • LUCENE-4697: FacetResultNode is now a concrete class with public members (instead of getter methods). (Shai Erera)
  • LUCENE-4600: FacetsCollector is now an abstract class with two implementations: StandardFacetsCollector (the old version of FacetsCollector) and CountingFacetsCollector. FacetsCollector.create() returns the most optimized collector for the given parameters. (Shai Erera, Michael McCandless)
  • LUCENE-4700: OrdinalPolicy is now per CategoryListParams, and is no longer an interface, but rather an enum with values NO_PARENTS and ALL_PARENTS. PathPolicy was removed, you should extend FacetFields and DrillDownStream to control which categories are added as drill-down terms. (Shai Erera)
  • LUCENE-4547: DocValues improvements:
  • Simplified codec API: codecs are now only responsible for encoding and decoding docvalues, they do not need to do buffering or RAM accounting.
  • Per-Field support: added PerFieldDocValuesFormat, which allows you to use a different DocValuesFormat per field (like postings).
  • Unified with FieldCache api: DocValues can be accessed via FieldCache API, so it works automatically with grouping/join/sort/function queries, etc.
  • Simplified types: There are only 3 types (NUMERIC, BINARY, SORTED), so it's not necessary to specify for example that all of your binary values have the same length. Instead it's easy for the Codec API to optimize encoding based on any properties of the content. (Simon Willnauer, Adrien Grand, Mike McCandless, Robert Muir)
  • LUCENE-4757: Cleanup and refactoring of FacetsAccumulator, FacetRequest, FacetsAggregator and FacetResultsHandler API. If your application did FacetsCollector.create(), you should not be affected, but if you wrote an Aggregator, then you should migrate it to the per-segment FacetsAggregator. You can still use StandardFacetsAccumulator, which works with the old API (for now). (Shai Erera)
  • LUCENE-4761: Facet packages reorganized. Should be easy to fix your import statements, if you use an IDE such as Eclipse. (Shai Erera)
  • LUCENE-4750: Convert DrillDown to DrillDownQuery, so you can initialize it and add drill-down categories to it. (Michael McCandless, Shai Erera)
  • LUCENE-4759: remove FacetRequest.SortBy; result categories are always sorted by value, while ties are broken by category ordinal. (Shai Erera)
  • LUCENE-4772: Facet associations moved to new FacetsAggregator API. You should override FacetsAccumulator and return the relevant aggregator, for aggregating the association values. (Shai Erera)
  • LUCENE-4748: A FacetRequest on a non-existent field now returns an empty FacetResult instead of skipping it. (Shai Erera, Mike McCandless)
  • LUCENE-4806: The default category delimiter character was changed from U+F749 to U+001F, since the latter uses 1 byte vs 3 bytes for the former. Existing facet indices must be reindexed. (Robert Muir, Shai Erera, Mike McCandless)
  • Optimizations:
  • LUCENE-4687: BloomFilterPostingsFormat now lazily initializes delegate TermsEnum only if needed to do a seek or get a DocsEnum. (Simon Willnauer)
  • LUCENE-4677, LUCENE-4682: unpacked FSTs now use vInt to encode the node target, to reduce their size (Mike McCandless)
  • LUCENE-4678: FST now uses a paged byte[] structure instead of a single byte[] internally, to avoid large memory spikes during building (James Dyer, Mike McCandless)
  • LUCENE-3298: FST can now be larger than 2.1 GB / 2.1 B nodes. (James Dyer, Mike McCandless)
  • LUCENE-4690: Performance improvements and non-hashing versions of NumericUtils.*ToPrefixCoded() (yonik)
  • LUCENE-4715: CategoryListParams.getOrdinalPolicy can now return a different OrdinalPolicy per dimension, to better tune how you index facets. Also added OrdinalPolicy.ALL_BUT_DIMENSION. (Shai Erera, Michael McCandless)
  • LUCENE-4740: Don't track clones of MMapIndexInput if unmapping is disabled. This reduces GC overhead. (Kristofer Karlsson, Uwe Schindler)
  • LUCENE-4733: The default Lucene 4.2 codec now uses a more compact TermVectorsFormat (Lucene42TermVectorsFormat) based on CompressingTermVectorsFormat. (Adrien Grand)
  • LUCENE-3729: The default Lucene 4.2 codec now uses a more compact DocValuesFormat (Lucene42DocValuesFormat). Sorted values are stored in an FST, Numerics and Ordinals use a number of strategies (delta-compression, table-compression, etc), and memory addresses use MonotonicBlockPackedWriter. (Simon Willnauer, Adrien Grand, Mike McCandless, Robert Muir)
  • LUCENE-4792: Reduction of the memory required to build the doc ID maps used when merging segments. (Adrien Grand)
  • LUCENE-4794: Spatial RecursivePrefixTreeStrategy's search filter: Skip calls to termsEnum.seek() when the next term is known to follow the current cell. (David Smiley)
  • New Features:
  • LUCENE-4686: New specialized DGapVInt8IntEncoder for facets (now the default). (Shai Erera)
  • LUCENE-4703: Add simple PrintTaxonomyStats tool to see summary information about the facets taxonomy index. (Mike McCandless)
  • LUCENE-4599: New oal.codecs.compressing.CompressingTermVectorsFormat which compresses term vectors into chunks of documents similarly to CompressingStoredFieldsFormat. (Adrien Grand)
  • LUCENE-4695: Added LiveFieldValues utility class, for getting the current (live, real-time) value for any indexed doc/field. The class buffers recently indexed doc/field values until a new near-real-time reader is opened that contains those changes. (Robert Muir, Mike McCandless)
  • LUCENE-4723: Add AnalyzerFactoryTask to benchmark, and enable analyzer creation via the resulting factories using NewAnalyzerTask. (Steve Rowe)
  • LUCENE-4728: Unknown and not explicitly mapped queries are now rewritten against the highlighting IndexReader to obtain primitive queries before discarding the query entirely. WeightedSpanTermExtractor now builds a MemoryIndex only once even if multiple fields are highlighted. (Simon Willnauer)
  • LUCENE-4035: Added ICUCollationDocValuesField, more efficient support for Locale-sensitive sort and range queries for single-valued fields. (Robert Muir)
  • LUCENE-4547: Added MonotonicBlockPacked(Reader/Writer), which provide efficient random access to large amounts of monotonically increasing positive values (e.g. file offsets). Each block stores the minimum value and the average gap, and values are encoded as signed deviations from the expected value. (Adrien Grand)
  • LUCENE-4547: Added AppendingLongBuffer, an append-only buffer that packs signed long values in memory and provides an efficient iterator API. (Adrien Grand)
  • LUCENE-4540: It is now possible for a codec to represent norms with less than 8 bits per value. For performance reasons this is not done by default, but you can customize your codec (e.g. pass PackedInts.DEFAULT to Lucene42DocValuesConsumer) if you want to make this tradeoff. (Adrien Grand, Robert Muir)
  • LUCENE-4764: A new Facet42Codec and Facet42DocValuesFormat provide faster but more RAM-consuming facet performance. (Shai Erera, Mike McCandless)
  • LUCENE-4769: Added OrdinalsCache and CachedOrdsCountingFacetsAggregator which uses the cache to obtain a document's ordinals. This aggregator is faster than others, but consumes much more RAM. (Michael McCandless, Shai Erera)
  • LUCENE-4778: Add a getter for the delegate in RateLimitedDirectoryWrapper. (Mark Miller)
  • LUCENE-4765: Add a multi-valued docvalues type (SORTED_SET). This is equivalent to building a FieldCache.getDocTermOrds at index-time. (Robert Muir)
  • LUCENE-4780: Add MonotonicAppendingLongBuffer: an append-only buffer for monotonically increasing values. (Adrien Grand)
  • LUCENE-4748: Added DrillSideways utility class for computing both drill-down and drill-sideways counts for a DrillDownQuery. (Mike McCandless)
  • API Changes:
  • LUCENE-4709: FacetResultNode no longer has a residue field. (Shai Erera)
  • LUCENE-4716: DrillDown.query now takes Occur, allowing you to specify whether categories should be OR'ed or AND'ed. (Shai Erera)
  • LUCENE-4695: ReferenceManager.RefreshListener.afterRefresh now takes a boolean indicating whether a new reference was in fact opened, and a new beforeRefresh method notifies you when a refresh attempt is starting. (Robert Muir, Mike McCandless)
  • LUCENE-4794: Spatial RecursivePrefixTreeFilter replaced by IntersectsPrefixTreeFilter and some extensible base classes. (David Smiley)
  • Bug Fixes:
  • LUCENE-4705: Pass on FilterStrategy in FilteredQuery if the filtered query is rewritten. (Simon Willnauer)
  • LUCENE-4712: MemoryIndex#normValues() throws NPE if field doesn't exist. (Simon Willnauer, Ricky Pritchett)
  • LUCENE-4550: Shapes wider than 180 degrees would use too much accuracy for the PrefixTree based SpatialStrategy. For a pathological case of nearly 360 degrees and barely any height, it would generate so many indexed terms (> 500k) that it could even cause an OutOfMemoryError. Fixed. (David Smiley)
  • LUCENE-4704: Make join queries override hashcode and equals methods. (Martijn van Groningen)
  • LUCENE-4724: Fix bug in CategoryPath which allowed passing null or empty string components. This is forbidden now (throws an exception). Note that if you have a taxonomy index created with such strings, you should rebuild it. (Michael McCandless, Shai Erera)
  • LUCENE-4732: Fixed TermsEnum.seekCeil/seekExact on term vectors. (Adrien Grand, Robert Muir)
  • LUCENE-4739: Fixed bugs that prevented FSTs more than ~1.1GB from being saved and loaded (Adrien Grand, Mike McCandless)
  • LUCENE-4717: Fixed bug where Lucene40DocValuesFormat would sometimes write an extra unused ordinal for sorted types. The bug is detected and corrected on-the-fly for old indexes. (Robert Muir)
  • LUCENE-4547: Fixed bug where Lucene40DocValuesFormat was unable to encode segments that would exceed 2GB total data. This could happen in some surprising cases, for example if you had an index with more than 260M documents and a VAR_INT field. (Simon Willnauer, Adrien Grand, Mike McCandless, Robert Muir)
  • LUCENE-4775: Remove SegmentInfo.sizeInBytes() and make MergePolicy.OneMerge.totalBytesSize thread safe (Josh Bronson via Robert Muir, Mike McCandless)
  • LUCENE-4770: If spatial's TermQueryPrefixTreeStrategy was used to search indexed non-point shapes, then there was an edge case where a query should find a shape but it didn't. The fix is the removal of an optimization that simplifies some leaf cells into a parent. The index data for such a field is now ~20% larger. This optimization is still done for the query shape, and for indexed data for RecursivePrefixTreeStrategy. Furthermore, this optimization is enhanced to roll up beyond the bottom cell level. (David Smiley, Florian Schilling)
  • LUCENE-4790: Fix FieldCacheImpl.getDocTermOrds to not bake deletes into the cached datastructure. Otherwise this can cause inconsistencies with readers at different points in time. (Robert Muir)
  • LUCENE-4791: A conjunction of terms (ConjunctionTermScorer) scanned on the lowest frequency term instead of skipping, leading to potentially large performance impacts for many non-random or non-uniform term distributions. (John Wang, yonik)
  • LUCENE-4798: PostingsHighlighter's formatter sometimes didn't highlight matched terms. (Robert Muir)
  • LUCENE-4796, SOLR-4373: Fix concurrency issue in NamedSPILoader and AnalysisSPILoader when doing reload (e.g. from Solr). (Uwe Schindler, Hossman)
  • LUCENE-4802: Don't compute norms for drill-down facet fields. (Mike McCandless)
  • LUCENE-4804: PostingsHighlighter sometimes applied terms to the wrong passage, if they started exactly on a passage boundary. (Robert Muir)
  • Documentation:
  • LUCENE-4718: Fixed documentation of oal.queryparser.classic. (Hayden Muhl via Adrien Grand)
  • LUCENE-4784, LUCENE-4785, LUCENE-4786: Fixed references to deprecated classes SinkTokenizer, ValueSourceQuery and RangeQuery. (Hao Zhong via Adrien Grand)
  • Build:
  • LUCENE-4654: Test duration statistics from multiple test runs should be reused. (Dawid Weiss)
  • LUCENE-4636: Upgrade ivy to 2.3.0 (Shawn Heisey via Robert Muir)
  • LUCENE-4570: Use the Policeman Forbidden API checker, released separately from Lucene and downloaded via Ivy. (Uwe Schindler, Robert Muir)
  • LUCENE-4758: 'ant jar', 'ant compile', and 'ant compile-test' should recurse. (Steve Rowe)
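
Note on LUCENE-4791 above: the entry contrasts scanning a postings list with skipping through it. The following is a minimal sketch of that difference, assuming sorted int arrays as stand-ins for postings lists. The class and method names (ConjunctionSketch, scanTo, skipTo, intersect) are hypothetical and are not Lucene APIs; this is not the ConjunctionTermScorer implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: sorted int arrays stand in for postings lists.
// None of these names are Lucene classes.
class ConjunctionSketch {

    /** Scanning: step through entries one by one until one is >= target. */
    static int scanTo(int[] postings, int from, int target) {
        int i = from;
        while (i < postings.length && postings[i] < target) {
            i++;
        }
        return i;
    }

    /** Skipping: binary-search for the first entry >= target at or after 'from'. */
    static int skipTo(int[] postings, int from, int target) {
        int idx = Arrays.binarySearch(postings, from, postings.length, target);
        return idx >= 0 ? idx : -idx - 1;
    }

    /** Intersects two sorted doc-id lists, driving from the rarer one and skipping in the other. */
    static List<Integer> intersect(int[] rare, int[] common) {
        List<Integer> hits = new ArrayList<>();
        int pos = 0;
        for (int doc : rare) {
            pos = skipTo(common, pos, doc);
            if (pos == common.length) {
                break; // the common list is exhausted, no further matches are possible
            }
            if (common[pos] == doc) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] rare = {3, 50, 900};
        int[] common = {1, 3, 5, 50, 51, 899};
        System.out.println(intersect(rare, common));
    }
}
```

Both helpers return the same position; the difference is only in how many entries are touched, which is where non-uniform term distributions can make scanning expensive.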

New in Apache Lucene 4.1.0 (Jan 23, 2013)

  • Changes in backwards compatibility policy:
  • LUCENE-4514: Scorer's freq() method returns an integer value indicating the number of times the scorer matches the current document. Previously this was only sometimes the case; in other cases it returned a (meaningless) floating point value. Scorer now extends DocsEnum so it has attributes().
  • LUCENE-4543: TFIDFSimilarity's index-time computeNorm is now final to match the fact that its query-time norm usage requires a FIXED_8 encoding. Override lengthNorm and/or encode/decodeNormValue to change the specifics, like Lucene 3.x.
  • LUCENE-3441: The facet module now supports NRT. As a result, the following changes were made:
  • DirectoryTaxonomyReader has a new constructor which takes a DirectoryTaxonomyWriter. You should use that constructor in order to get the NRT support (or the old one for non-NRT).
  • TaxonomyReader.refresh() removed in exchange for TaxonomyReader.openIfChanged static method. Similar to DirectoryReader, the method either returns null if no changes were made to the taxonomy, or a new TR instance otherwise. Instead of calling refresh(), you should write similar code to how you reopen a regular DirectoryReader.
  • TaxonomyReader.openIfChanged (previously refresh()) no longer throws InconsistentTaxonomyException, and supports recreate. InconsistentTaxoEx was removed.
  • ChildrenArrays was pulled out of TaxonomyReader into a top-level class.
  • TaxonomyReader was made an abstract class (instead of an interface), with methods such as close() and reference counting management pulled from DirectoryTaxonomyReader, and made final. The rest of the methods remained abstract.
  • LUCENE-4576: Remove CachingWrapperFilter(Filter, boolean). This recacheDeletes option gave less than 1% speedup at the expense of cache churn (filters were invalidated on reopen if even a single delete was posted against the segment).
  • LUCENE-4575: Replace IndexWriter's commit/prepareCommit versions that take commitData with setCommitData(). That allows committing changes to IndexWriter even if the commitData is the only thing that changes.
  • LUCENE-4565: TaxonomyReader.getParentArray and .getChildrenArrays consolidated into one getParallelTaxonomyArrays(). You can obtain the 3 arrays that the previous two methods returned by calling parents(), children() or siblings() on the returned ParallelTaxonomyArrays.
  • LUCENE-4585: Spatial PrefixTree based Strategies (either TermQuery or RecursivePrefix based) MAY want to re-index if used for point data. If a re-index is not done, then an indexed point is ~1/2 the smallest grid cell larger and as such is slightly more likely to match a query shape.
  • LUCENE-4604: DefaultOrdinalPolicy removed in favor of OrdinalPolicy.ALL_PARENTS. Same for DefaultPathPolicy (now PathPolicy.ALL_CATEGORIES). In addition, you can use OrdinalPolicy.NO_PARENTS to never write any parent category ordinal to the fulltree posting payload (but note that you need a special FacetsAccumulator - see javadocs).
  • LUCENE-4594: Spatial PrefixTreeStrategy no longer indexes center points of non-point shapes. If you want to call makeDistanceValueSource() based on shape centers, you need to do this yourself in another spatial field.
  • LUCENE-4615: Replace IntArrayAllocator and FloatArrayAllocator by ArraysPool. FacetArrays no longer takes those allocators; if you need to reuse the arrays, you should use ReusingFacetArrays.
  • LUCENE-4621: FacetIndexingParams is now a concrete class (instead of DefaultFIP). Also, the entire IndexingParams chain is now immutable. If you need to override a setting, you should extend the relevant class. Additionally, FacetSearchParams is now immutable, and requires all FacetRequests to be specified at initialization time.
  • LUCENE-4647: CategoryDocumentBuilder and EnhancementsDocumentBuilder are replaced by FacetFields and AssociationsFacetFields respectively. CategoryEnhancement and AssociationEnhancement were removed in favor of a simplified CategoryAssociation interface, with CategoryIntAssociation and CategoryFloatAssociation implementations. NOTE: indexes that contain category enhancements/associations are not supported by the new code and should be recreated.
  • LUCENE-4659: Massive cleanup to CategoryPath API. Additionally, CategoryPath is now immutable, so you don't need to clone() it.
  • LUCENE-4670: StoredFieldsWriter and TermVectorsWriter have new finish* callbacks which are called after a doc/field/term has been completely added.
  • LUCENE-4620: IntEncoder/Decoder were changed to do bulk encoding/decoding. As a result, a few other classes such as Aggregator and CategoryListIterator were changed to handle bulk category ordinals.
  • LUCENE-4683: CategoryListIterator and Aggregator are now per-segment. As such their implementations no longer take a top-level IndexReader in the constructor but rather implement a setNextReader.
  • New Features:
  • LUCENE-4226: New experimental StoredFieldsFormat that compresses chunks of documents together in order to improve the compression ratio.
  • LUCENE-4426: New ValueSource implementations (in lucene/queries) for DocValues fields.
  • LUCENE-4410: FilteredQuery now exposes a FilterStrategy that exposes how filters are applied during query execution.
  • LUCENE-4404: New ListOfOutputs (in lucene/misc) for FSTs wraps another Outputs implementation, allowing you to store more than one output for a single input. UpToTwoPositiveIntsOutputs was moved from lucene/core to lucene/misc.
  • LUCENE-3842: New AnalyzingSuggester, for doing auto-suggest using an analyzer. This can create powerful suggesters: if the analyzer removes stop words then "ghost chr..." could suggest "The Ghost of Christmas Past"; if SynonymFilter is used to map wifi and wireless network to hotspot, then "wirele..." could suggest "wifi router"; token normalization like stemmers, accent removal, etc. would allow the suggester to ignore such variations.
  • LUCENE-4446: Lucene 4.1 has a new default index format (Lucene41Codec) that incorporates the previously experimental "Block" postings format for better search performance.
  • LUCENE-3846: New FuzzySuggester, like AnalyzingSuggester except it also finds completions allowing for fuzzy edits in the input string.
  • LUCENE-4515: MemoryIndex now supports adding the same field multiple times.
  • LUCENE-4489: Added consumeAllTokens option to LimitTokenCountFilter
  • LUCENE-4566: Add NRT/SearcherManager.RefreshListener/addListener to be notified whenever a new searcher was opened.
  • SOLR-4123: Add per-script customizability to ICUTokenizerFactory via rule files in the ICU RuleBasedBreakIterator format.
  • LUCENE-4590: Added WriteEnwikiLineDocTask - a benchmark task for writing Wikipedia category pages and non-category pages into separate line files. extractWikipedia.alg was changed to use this task, so now it creates two files.
  • LUCENE-4290: Added PostingsHighlighter to the highlighter module. It uses offsets from the postings lists to highlight documents.
  • LUCENE-4628: Added CommonTermsQuery that executes high-frequency terms in an optional sub-query to prevent slow queries due to "common" terms like stopwords.
  • API Changes:
  • LUCENE-4399: Deprecated AppendingCodec. Lucene's term dictionaries no longer seek when writing.
  • LUCENE-4479: Rename TokenStream.getTokenStream(IndexReader, int, String) to TokenStream.getTokenStreamWithOffsets, and return null on failure rather than throwing IllegalArgumentException.
  • LUCENE-4472: MergePolicy now accepts a MergeTrigger that provides information about the trigger of the merge, i.e. whether the merge was triggered by a segment merge, a full flush, etc.
  • LUCENE-4415: TermsFilter is now immutable. All terms need to be provided as constructor arguments.
  • LUCENE-4520: ValueSource.getSortField no longer throws IOExceptions
  • LUCENE-4537: RateLimiter is now separated from FSDirectory and exposed via RateLimitingDirectoryWrapper. Any Directory can now be rate-limited.
  • LUCENE-4591: CompressingStoredFields{Writer,Reader} now accept a segment suffix as a constructor parameter.
  • LUCENE-4605: Added DocsEnum.FLAG_NONE which can be passed instead of 0 as the flag to .docs() and .docsAndPositions().
  • LUCENE-4617: Remove FST.pack() method. Previously to make a packed FST, you had to make a Builder with willPackFST=true (telling it you will later pack it), create your fst with finish(), and then call pack() to get another FST. Instead just pass true for doPackFST to Builder and finish() returns a packed FST.
  • LUCENE-4663: Deprecate IndexSearcher.document(int, Set). This was not intended to be final, nor named document(). Use IndexSearcher.doc(int, Set) instead.
  • LUCENE-4684: Made DirectSpellChecker extendable.
  • Bug Fixes:
  • LUCENE-1822: BaseFragListBuilder hard-coded 6 char margin is too naive.
  • LUCENE-4468: Fix rareish integer overflows in Lucene41 postings format.
  • LUCENE-4486: Add support for ConstantScoreQuery in Highlighter.
  • LUCENE-4485: When CheckIndex counts terms, terms/docs pairs and tokens, these counts now all exclude deleted documents.
  • LUCENE-4479: Highlighter works correctly for fields with term vector positions, but no offsets.
  • SOLR-3906: JapaneseReadingFormFilter in romaji mode will return romaji even for out-of-vocabulary kana cases (e.g. half-width forms).
  • LUCENE-4504: Fix broken sort comparator in ValueSource.getSortField, used when sorting by a function query.
  • LUCENE-4511: TermsFilter might return wrong results if a field is not indexed or doesn't exist in the index.
  • LUCENE-4521: IndexWriter.tryDeleteDocument could return true (successfully deleting the document) but then on IndexWriter close/commit fail to write the new deletions, if no other changes happened in the IndexWriter instance.
  • LUCENE-4513: Fixed that deleted nested docs are scored into the parent doc when using ToParentBlockJoinQuery.
  • LUCENE-4534: Fixed WFSTCompletionLookup and Analyzing/FuzzySuggester to allow 0 byte values in the lookup keys.
  • LUCENE-4532: DirectoryTaxonomyWriter used a timestamp to denote taxonomy index re-creation, which could cause a bug in case machine clocks were not synced. Instead, it now tracks an 'epoch' version, which is incremented whenever the taxonomy is re-created, or replaced.
  • LUCENE-4544: Fixed off-by-1 in ConcurrentMergeScheduler that would allow 1+maxMergeCount merge threads to be created, instead of just maxMergeCount.
  • LUCENE-4567: Fixed NullPointerException in analyzing, fuzzy, and WFST suggesters when no suggestions were added
  • LUCENE-4568: Fixed integer overflow in PagedBytes.PagedBytesData{In,Out}put.getPosition.
  • LUCENE-4581: GroupingSearch.setAllGroups(true) was failing to actually compute allMatchingGroups
  • LUCENE-4009: Improve TermsFilter.toString
  • LUCENE-4588: Benchmark's EnwikiContentSource was discarding last wiki document and had leaking threads in 'forever' mode.
  • LUCENE-4585: Spatial RecursivePrefixTreeFilter had some bugs that only occurred when shapes were indexed. In what appears to be rare circumstances, documents with shapes near a query shape were erroneously considered a match. In addition, it wasn't possible to index a shape representing the entire globe.
  • LUCENE-4595: EnwikiContentSource had a thread safety problem (NPE) in 'forever' mode
  • LUCENE-4587: fix WordBreakSpellChecker to not throw AIOOBE when presented with 2-char codepoints, and to correctly break/combine terms containing non-latin characters.
  • LUCENE-4596: fix a concurrency bug in DirectoryTaxonomyWriter.
  • LUCENE-4594: Spatial PrefixTreeStrategy would index center-points in addition to the shape to index if it was non-point, in the same field. But sometimes the center-point isn't actually in the shape (consider a LineString), and for highly precise shapes it could cause makeDistanceValueSource's cache to load parts of the shape's boundary erroneously too. So center points aren't indexed any more; you should use another spatial field.
  • LUCENE-4629: IndexWriter failed to delete documents if a document block was being indexed and the Iterator threw an exception. Documents were only rolled back if the actual indexing process failed.
  • LUCENE-4608: Handle large number of requested fragments better.
  • LUCENE-4633: DirectoryTaxonomyWriter.replaceTaxonomy did not refresh its internal reader, which could cause an existing category to be added twice.
  • LUCENE-4461: If you added the same FacetRequest more than once, you would get inconsistent results.
  • LUCENE-4656: Fix regression in IndexWriter to work with empty TokenStreams that have no TermToBytesRefAttribute (commonly provided by CharTermAttribute), e.g., oal.analysis.miscellaneous.EmptyTokenStream.
  • LUCENE-4660: ConcurrentMergeScheduler was taking too long to un-pause incoming threads it had paused when too many merges were queued up.
  • LUCENE-4662: Add missing elided articles and prepositions to FrenchAnalyzer's DEFAULT_ARTICLES list passed to ElisionFilter.
  • LUCENE-4671: Fix CharsRef.subSequence method.
  • LUCENE-4465: Let ConstantScoreQuery's Scorer return its child scorer.
  • Changes in Runtime Behavior:
  • LUCENE-4586: Change default ResultMode of FacetRequest to PER_NODE_IN_TREE. This only affects requests with depth>1. If you execute such requests and rely on the facet results being returned flat (i.e. no hierarchy), you should set the ResultMode to GLOBAL_FLAT.
  • Optimizations:
  • LUCENE-2221: oal.util.BitUtil was modified to use Long.bitCount and Long.numberOfTrailingZeros (which are intrinsics since Java 6u18) instead of pure java bit twiddling routines in order to improve performance on modern JVMs/hardware.
  • LUCENE-4509: Enable stored fields compression by default in the Lucene 4.1 default codec.
  • LUCENE-4536: PackedInts on-disk format is now byte-aligned (it used to be long-aligned), saving up to 7 bytes per array of values.
  • LUCENE-4512: Additional memory savings for CompressingStoredFieldsFormat.
  • LUCENE-4443: Lucene41PostingsFormat no longer writes unnecessary offsets into the skipdata.
  • LUCENE-4459: Improve WeakIdentityMap.keyIterator() to remove GCed keys from backing map early instead of waiting for reap(). This makes test failures in TestWeakIdentityMap disappear, too.
  • LUCENE-4473: Lucene41PostingsFormat encodes offsets more efficiently for low frequency terms (< 128 occurrences).
  • LUCENE-4462: DocumentsWriter now flushes deletes, segment infos and builds CFS files if necessary during segment flush and not during publishing. The latter was a single threaded process while now all IO and CPU heavy computation is done concurrently in DocumentsWriterPerThread.
  • LUCENE-4496: Optimize Lucene41PostingsFormat when requesting a subset of the postings data (via flags to TermsEnum.docs/docsAndPositions) to use ForUtil.skipBlock.
  • LUCENE-4497: Don't write PosVIntCount to the positions file in Lucene41PostingsFormat, as it's always totalTermFreq % BLOCK_SIZE.
  • LUCENE-4498: In Lucene41PostingsFormat, when a term appears in only one document, instead of writing a file pointer to a VIntBlock containing the doc id, just write the doc id.
  • LUCENE-4515: MemoryIndex now uses Byte/IntBlockPool internally to hold terms and posting lists. All index data is represented as consecutive byte/int arrays to reduce GC cost and memory overhead.
  • LUCENE-4538: DocValues now caches direct sources in a ThreadLocal exposed via SourceCache. Users of this API can now simply obtain an instance via DocValues#getDirectSource per thread.
  • LUCENE-4580: DrillDown.query variants return a ConstantScoreQuery with boost set to 0.0f so that document scores are not affected by running a drill-down query.
  • LUCENE-4598: PayloadIterator no longer uses top-level IndexReader to iterate on the posting's payload.
  • LUCENE-4661: Drop default maxThreadCount to 1 and maxMergeCount to 2 in ConcurrentMergeScheduler, for faster merge performance on spinning-magnet drives
  • Documentation:
  • LUCENE-4483: Refer to BytesRef.deepCopyOf in Term's constructor that takes BytesRef.
  • Build:
  • LUCENE-4650: Upgrade randomized testing to version 2.0.8: make the test framework more robust under low memory conditions.
  • LUCENE-4603: Upgrade randomized testing to version 2.0.5: print forked JVM PIDs on heartbeat from hung tests
  • Upgrade randomized testing to version 2.0.4: avoid hangs on shutdown hooks hanging forever by calling Runtime.halt() in addition to Runtime.exit() after a short delay to allow graceful shutdown
  • LUCENE-4451: Memory leak per unique thread caused by RandomizedContext.contexts static map. Upgrade randomized testing to version 2.0.2
  • LUCENE-4589: Upgraded benchmark module's Nekohtml dependency to version 1.9.17, removing the workaround in Lucene's HTML parser for the Turkish locale.
  • LUCENE-4601: Fix ivy availability check to use typefound, so it works if called from another build file.
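
Note on LUCENE-2221 above: the entry replaces hand-written bit twiddling with Long.bitCount and Long.numberOfTrailingZeros. The sketch below compares those two JDK methods against generic textbook routines. The class BitCountSketch and its two helper methods are hypothetical illustrations and are not the code that was removed from oal.util.BitUtil.

```java
// Illustrative sketch: the hand-written routines are generic textbook
// versions, not the code that was removed from BitUtil.
class BitCountSketch {

    /** Classic SWAR population count, written out as pure-Java bit twiddling. */
    static int popCountTwiddle(long x) {
        x = x - ((x >>> 1) & 0x5555555555555555L);
        x = (x & 0x3333333333333333L) + ((x >>> 2) & 0x3333333333333333L);
        x = (x + (x >>> 4)) & 0x0F0F0F0F0F0F0F0FL;
        return (int) ((x * 0x0101010101010101L) >>> 56);
    }

    /** Naive trailing-zero count by shifting; returns 64 for zero, like the JDK method. */
    static int trailingZerosLoop(long x) {
        if (x == 0) {
            return 64;
        }
        int n = 0;
        while ((x & 1L) == 0) {
            x >>>= 1;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, -1L, 0x8000000000000000L, 0xF0F0L, 1L << 40};
        for (long v : samples) {
            // The JDK methods give the same answers and can be compiled to hardware instructions.
            System.out.println(Long.toHexString(v)
                + " bitCount=" + Long.bitCount(v)
                + " trailingZeros=" + Long.numberOfTrailingZeros(v));
        }
    }
}
```

The hand-written and JDK versions agree on every input; the JDK methods are preferable because the JVM can treat them as intrinsics, as the entry notes.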

New in Apache Lucene 4.0.0 (Oct 12, 2012)

  • Changes in backwards compatibility policy:
  • LUCENE-4392: Class org.apache.lucene.util.SortedVIntList has been removed.
  • LUCENE-4393: RollingCharBuffer has been moved to the o.a.l.analysis.util package of lucene-analysis-common.
  • New Features:
  • LUCENE-1888: Added the option to store payloads in the term vectors (IndexableFieldType.storeTermVectorPayloads()). Note that you must store term vector positions to store payloads.
  • LUCENE-3892: Add a new BlockPostingsFormat that bulk-encodes docs, freqs and positions in large (size 128) packed-int blocks for faster search performance. This was from Han Jiang's 2012 Google Summer of Code project
  • LUCENE-4323: Added support for an absolute maximum CFS segment size (in MiB) to LogMergePolicy and TieredMergePolicy.
  • LUCENE-4339: Allow deletes against 3.x segments for easier upgrading. Lucene3x Codec is still otherwise read-only, you should not set it as the default Codec on IndexWriter, because it cannot write new segments.
  • SOLR-3441: ElisionFilterFactory is now MultiTermAware
  • API Changes:
  • LUCENE-4391, LUCENE-4440: All methods of Lucene40Codec but getPostingsFormatForField are now final. To reuse functionality of Lucene40, you should extend FilterCodec and delegate to Lucene40 instead of extending Lucene40Codec.
  • LUCENE-4299: Added Terms.hasPositions() and Terms.hasOffsets(). Previously you had no real way to know that a term vector field had positions or offsets, since this can be configured on a per-field-per-document basis.
  • Removed DocsAndPositionsEnum.hasPayload() and simplified the contract of getPayload(). It returns null if there is no payload, otherwise returns the current payload. You can now call it multiple times per position if you want.
  • Removed FieldsEnum. Fields API instead implements Iterable and exposes Iterator, so you can iterate over field names with for (String field : fields) instead.
  • LUCENE-4152: added IndexReader.leaves(), which lets you enumerate the leaf atomic reader contexts for all readers in the tree.
  • LUCENE-4304: removed PayloadProcessorProvider. If you want to change payloads (or other things) when merging indexes, it's recommended to just use a FilterAtomicReader + IndexWriter.addIndexes. See the OrdinalMappingAtomicReader and TaxonomyMergeUtils in the facets module if you want an example of this.
  • LUCENE-4304: Make CompositeReader.getSequentialSubReaders() protected. To get atomic leaves of any IndexReader use the new method leaves() (LUCENE-4152), which lists AtomicReaderContexts including the doc base of each leaf.
  • LUCENE-4307: Renamed IndexReader.getTopReaderContext to IndexReader.getContext.
  • LUCENE-4316: Deprecate Fields.getUniqueTermCount and remove it from AtomicReader. If you really want the unique term count across all fields, just sum up Terms.size() across those fields. This method only exists so that this statistic can be accessed for Lucene 3.x segments, which don't support Terms.size().
  • LUCENE-4321: Change CharFilter to extend Reader directly, as FilterReader overdelegates (read(), read(char[], int, int), skip, etc). This made it hard to implement CharFilters that were correct. Instead only close() is delegated by default: read(char[], int, int) and correct(int) are abstract so that it's obvious which methods you should implement. The protected inner Reader is 'input' like CharFilter in the 3.x series, instead of 'in'.
  • LUCENE-3309: The expert FieldSelector API, used to load only certain fields in a stored document, has been replaced with the simpler StoredFieldVisitor API.
  • LUCENE-4343: Made Tokenizer.setReader final. This is a setter that should not be overridden by subclasses: per-stream initialization should happen in reset().
  • LUCENE-4377: Remove IndexInput.copyBytes(IndexOutput, long). Use DataOutput.copyBytes(DataInput, long) instead.
  • LUCENE-4355: Simplify AtomicReader's sugar methods such as termDocsEnum, termPositionsEnum, docFreq, and totalTermFreq to only take Term as a parameter. If you want to do expert things such as pass a different Bits as liveDocs, then use the flex apis (fields(), terms(), etc) directly.
  • LUCENE-4425: clarify documentation of StoredFieldVisitor.binaryValue and simplify the api to binaryField(FieldInfo, byte[]).
  • Bug Fixes:
  • LUCENE-4423: DocumentStoredFieldVisitor.binaryField ignored offset and length.
  • LUCENE-4297: BooleanScorer2 would multiply the coord() factor twice for conjunctions: for most users this is no problem, but if you had a customized Similarity that returned something other than 1 when overlap == maxOverlap (always the case for conjunctions), then the score would be incorrect.
  • LUCENE-4298: MultiFields.getTermDocsEnum(IndexReader, Bits, String, BytesRef) did not work at all, it would infinitely recurse.
  • LUCENE-4300: BooleanQuery's rewrite was not always safe: if you had a custom Similarity where coord(1,1) != 1F, then the rewritten query would be scored differently.
  • Don't allow negatives in the positions file. If you have an index from 2.4.0 or earlier with such negative positions, and you already upgraded to 3.x, then to Lucene 4.0-ALPHA or -BETA, you should run CheckIndex. If it fails, then you need to upgrade again to 4.0
  • LUCENE-4303: PhoneticFilterFactory and SnowballPorterFilterFactory load their encoders / stemmers via the ResourceLoader now instead of Class.forName(). Solr users should now no longer have to embed these in its war.
  • SOLR-3737: StempelPolishStemFilterFactory loaded its stemmer table incorrectly. Also, ensure immutability and use only one instance of this table in RAM (lazy loaded) since it's quite large.
  • LUCENE-4310: MappingCharFilter was failing to match input strings containing non-BMP Unicode characters.
  • LUCENE-4224: Add in-order scorer to query time joining and the out-of-order scorer throws an UOE.
  • LUCENE-4333: Fixed NPE in TermGroupFacetCollector when faceting on mv fields.
  • LUCENE-4218: Document.get(String) and Field.stringValue() again return values for numeric fields, like Lucene 3.x and consistent with the documentation.
  • NRTCachingDirectory was always caching a newly flushed segment in RAM, instead of checking the estimated size of the segment to decide whether to cache it.
  • LUCENE-3720: fix memory-consumption issues with BeiderMorseFilter.
  • LUCENE-4401: Fix bug where DisjunctionSumScorer would sometimes call score() on a subscorer that had already returned NO_MORE_DOCS.
  • LUCENE-4411: when sampling is enabled for a FacetRequest, its depth parameter is reset to the default (1), even if set otherwise.
  • LUCENE-4455: Fix bug in SegmentInfoPerCommit.sizeInBytes() that was returning 2X the true size, inefficiently. Also fixed bug in CheckIndex that would report no deletions when a segment has deletions, and vice versa.
  • LUCENE-4456: Fixed double-counting sizeInBytes for a segment (affects how merge policies pick merges); fixed CheckIndex's incorrect reporting of whether a segment has deletions; fixed case where on abort Lucene could remove files it didn't create; fixed many cases where IndexWriter could leave leftover files (on exception in various places, on reuse of a segment name after crash and recovery).
  • Optimizations:
  • LUCENE-4322: Decrease lucene-core JAR size. The core JAR size had increased a lot because of generated code introduced in LUCENE-4161 and LUCENE-3892.
  • LUCENE-4317: Improve reuse of internal TokenStreams and StringReader in oal.document.Field.
  • LUCENE-4327: Support out-of-order scoring in FilteredQuery for higher performance.
  • LUCENE-4364: Optimize MMapDirectory to not make a mapping per-cfs-slice, instead one map per .cfs file. This reduces the total number of maps. Additionally factor out a (package-private) generic ByteBufferIndexInput from MMapDirectory.
  • Build:
  • LUCENE-4406, LUCENE-4407: Upgrade to randomizedtesting 2.0.1. Workaround for broken test output XMLs due to non-XML text unicode chars in strings. Added printing of failed tests at the end of a test run
  • LUCENE-4252: Detect/Fail tests when they leak RAM in static fields
  • LUCENE-4360: Support running the same test suite multiple times in parallel
  • LUCENE-3985: Upgrade to randomizedtesting 2.0.0. Added support for thread leak detection. Added support for suite timeouts.
  • LUCENE-4340: Move all non-default codec, postings format and terms dictionary implementations to lucene/codecs.
  • Documentation:
  • LUCENE-4302: Fix facet userguide to have HTML loose doctype like all other javadocs.
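
Note on LUCENE-4297 above: the entry says the double multiplication of coord() only mattered for a Similarity whose coord is not 1 when overlap == maxOverlap. The arithmetic can be sketched as follows. The Coord interface and the CoordSketch class are hypothetical stand-ins, not Lucene's Similarity API, and the HALF implementation is an invented example of a customized coord.

```java
// Hypothetical stand-ins for illustration; this is not Lucene's Similarity API.
class CoordSketch {

    interface Coord {
        float coord(int overlap, int maxOverlap);
    }

    /** Fraction of clauses matched; equals 1 whenever overlap == maxOverlap. */
    static final Coord DEFAULT = (overlap, maxOverlap) -> overlap / (float) maxOverlap;

    /** An invented custom coord that does not return 1 for a full match. */
    static final Coord HALF = (overlap, maxOverlap) -> 0.5f;

    /** Intended behavior: the coord factor is applied once. */
    static float scoreOnce(float rawScore, Coord c, int overlap, int maxOverlap) {
        return rawScore * c.coord(overlap, maxOverlap);
    }

    /** The buggy behavior described in the entry: the factor is applied twice. */
    static float scoreTwice(float rawScore, Coord c, int overlap, int maxOverlap) {
        float f = c.coord(overlap, maxOverlap);
        return rawScore * f * f;
    }

    public static void main(String[] args) {
        // A conjunction always has overlap == maxOverlap; 3 clauses are assumed here.
        System.out.println("default once:  " + scoreOnce(2.0f, DEFAULT, 3, 3));
        System.out.println("default twice: " + scoreTwice(2.0f, DEFAULT, 3, 3));
        System.out.println("custom once:   " + scoreOnce(2.0f, HALF, 3, 3));
        System.out.println("custom twice:  " + scoreTwice(2.0f, HALF, 3, 3));
    }
}
```

With the default-style coord the two results are identical, which is why most users never saw a problem; with the custom coord the doubled factor changes the score.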

New in Apache Lucene 4.0 Beta (Aug 14, 2012)

  • New features:
  • LUCENE-4249: Changed the explanation of the PayloadTermWeight to use the underlying PayloadFunction's explanation as the explanation for the payload score. (Scott Smerchek via Robert Muir)
  • LUCENE-4069: Added BloomFilteringPostingsFormat for use with low-frequency terms such as primary keys (Mark Harwood, Mike McCandless)
  • LUCENE-4201: Added JapaneseIterationMarkCharFilter to normalize Japanese iteration marks. (Robert Muir, Christian Moen)
  • LUCENE-3832: Added BasicAutomata.makeStringUnion method to efficiently create automata from a fixed collection of UTF-8 encoded BytesRef (Dawid Weiss, Robert Muir)
  • LUCENE-4153: Added option to fast vector highlighting via BaseFragmentsBuilder to respect field boundaries in the case of highlighting for multivalued fields. (Martijn van Groningen)
  • LUCENE-4227: Added DirectPostingsFormat, to hold all postings in memory as uncompressed simple arrays. This uses a tremendous amount of RAM but gives good search performance gains. (Mike McCandless)
  • LUCENE-2510, LUCENE-4044: Migrated Solr's Tokenizer-, TokenFilter-, and CharFilterFactories to the lucene-analysis module. The API is still experimental. (Chris Male, Robert Muir, Uwe Schindler)
  • LUCENE-4230: When pulling a DocsAndPositionsEnum you can now specify whether or not you require payloads (in addition to offsets); turning one or both off may allow some codec implementations to optimize the enum implementation. (Robert Muir, Mike McCandless)
  • LUCENE-4203: Add IndexWriter.tryDeleteDocument(AtomicReader reader, int docID), to attempt deletion by docID as long as the provided reader is an NRT reader, and the segment has not yet been merged away (Mike McCandless).
  • LUCENE-4286: Added option to CJKBigramFilter to always also output unigrams. This can be used for a unigram+bigram approach, or at index-time only for better support of short queries. (Tom Burton-West, Robert Muir)
  • API Changes:
  • LUCENE-4138: update of morfologik (Polish morphological analyzer) to 1.5.3. The tag attribute class has been renamed to MorphosyntacticTagsAttribute and has a different API (carries a list of tags instead of a compound tag). Upgrade of embedded morfologik dictionaries to version 1.9. (Dawid Weiss)
  • LUCENE-4178: set 'tokenized' to true on FieldType by default, so that if you make a custom FieldType and set indexed = true, it's analyzed by the analyzer. (Robert Muir)
  • LUCENE-4220: Removed the buggy JavaCC-based HTML parser in the benchmark module and replaced by NekoHTML. HTMLParser interface was cleaned up while changing method signatures. (Uwe Schindler, Robert Muir)
  • LUCENE-2191: Rename Tokenizer.reset(Reader) to Tokenizer.setReader(Reader). The purpose of this method was always to set a new Reader on the Tokenizer, reusing the object. But the name was often confused with TokenStream.reset(). (Robert Muir)
  • LUCENE-4228: Refactored CharFilter to extend java.io.FilterReader. CharFilters filter another reader and you override correct() for offset correction. (Robert Muir)
  • LUCENE-4240: Analyzer api now just takes fieldName for getOffsetGap. If the field is not analyzed (e.g. StringField), then the analyzer is not invoked at all. If you want to tweak things like positionIncrementGap and offsetGap, analyze the field with KeywordTokenizer instead. (Grant Ingersoll, Robert Muir)
  • LUCENE-4250: Pass fieldName to the PayloadFunction explain method, so it parallels with docScore and the default implementation is correct. (Robert Muir)
  • LUCENE-3747: Support Unicode 6.1.0. (Steve Rowe)
  • LUCENE-3884: Moved ElisionFilter out of org.apache.lucene.analysis.fr package into org.apache.lucene.analysis.util. (Robert Muir)
  • LUCENE-4230: When pulling a DocsAndPositionsEnum you now pass an int flags instead of the previous boolean needOffsets. Currently recognized flags are DocsAndPositionsEnum.FLAG_PAYLOADS and DocsAndPositionsEnum.FLAG_OFFSETS (Robert Muir, Mike McCandless)
  • LUCENE-4273: When pulling a DocsEnum, you can pass an int flags instead of the previous boolean needsFlags; consistent with the changes for DocsAndPositionsEnum in LUCENE-4230. Currently the only flag is DocsEnum.FLAG_FREQS. (Robert Muir, Mike McCandless)
  • LUCENE-3616: TextField(String, Reader, Store) was reduced to TextField(String, Reader), as the Store parameter didn't make sense: if you supplied Store.YES, you would only receive an exception anyway. (Robert Muir)
  • Optimizations:
  • LUCENE-4171: Performance improvements to Packed64. (Toke Eskildsen via Adrien Grand)
  • LUCENE-4184: Performance improvements to the aligned packed bits impl. (Toke Eskildsen, Adrien Grand)
  • LUCENE-4235: Remove enforcing of Filter rewrite for NRQ queries. (Uwe Schindler)
  • LUCENE-4279: Regenerated snowball Stemmers from snowball r554, making them substantially more lightweight. Behavior is unchanged. (Robert Muir)
  • LUCENE-4291: Reduced internal buffer size for JFlex-based tokenizers such as StandardTokenizer from 32kb to 8kb. (Raintung Li, Steven Rowe, Robert Muir)
  • Bug Fixes:
  • LUCENE-4109: BooleanQueries are not parsed correctly with the flexible query parser. (Karsten Rauch via Robert Muir)
  • LUCENE-4176: Fix AnalyzingQueryParser to analyze range endpoints as bytes, so that it works correctly with Analyzers that produce binary non-UTF-8 terms such as CollationAnalyzer. (Nattapong Sirilappanich via Robert Muir)
  • LUCENE-4209: Fix FSTCompletionLookup to close its sorter, so that it won't leave temp files behind in /tmp. Fix SortedTermFreqIteratorWrapper to not leave temp files behind in /tmp on Windows. Fix Sort to not leave temp files behind when /tmp is a separate volume. (Uwe Schindler, Robert Muir)
  • LUCENE-4221: Fix overeager CheckIndex validation for term vector offsets. (Robert Muir)
  • LUCENE-4222: TieredMergePolicy.getFloorSegmentMB was returning the size in bytes, not MB. (Chris Fuller via Mike McCandless)
  • LUCENE-3505: Fix bug (Lucene 4.0alpha only) where boolean conjunctions were sometimes scored incorrectly. Conjunctions of only termqueries where at least one term omitted term frequencies (IndexOptions.DOCS_ONLY) would be scored as if all terms omitted term frequencies. (Robert Muir)
  • LUCENE-2686, LUCENE-3505: Fixed BooleanQuery scorers to return correct freq(). Added support for scorer navigation API (Scorer.getChildren) to all queries. Made Scorer.freq() abstract. (Koji Sekiguchi, Mike McCandless, Robert Muir)
  • LUCENE-4234: Exception when FacetsCollector is used with ScoreFacetRequest, and the number of matching documents is too large. (Gilad Barkai via Shai Erera)
  • LUCENE-4245: Make IndexWriter#close() and MergeScheduler#close() non-interruptible. (Mark Miller, Uwe Schindler)
  • LUCENE-4190: restrict allowed filenames that a codec may create to the patterns recognized by IndexFileNames. This also fixes IndexWriter to only delete files matching this pattern from an index directory, to reduce risk when the wrong index path is accidentally passed to IndexWriter (Robert Muir, Mike McCandless)
  • LUCENE-4277: Fix IndexWriter deadlock during rollback if flushable DWPT instances are already checked out and queued up but not yet flushed. (Simon Willnauer)
  • LUCENE-4282: Automaton FuzzyQuery didn't always deliver all results. (Johannes Christen, Uwe Schindler, Robert Muir)
  • LUCENE-4289: Fix minor idf inconsistencies/inefficiencies in highlighter. (Robert Muir)
  • Changes in Runtime Behavior:
  • LUCENE-4109: Enable position increments in the flexible queryparser by default. (Karsten Rauch via Robert Muir)
  • LUCENE-3616: Field throws exception if you try to set a boost on an unindexed field or one that omits norms. (Robert Muir)
  • Build:
  • LUCENE-4094: Support overriding file.encoding on forked test JVMs (force via -Drandomized.file.encoding=XXX). (Dawid Weiss)
  • LUCENE-4189: Test output should include timestamps (start/end for each test/suite). Added -Dtests.timestamps=[off by default]. (Dawid Weiss)
  • LUCENE-4110: Report long periods of forked JVM inactivity (hung tests/suites). Added -Dtests.heartbeat=[seconds] with the default of 60 seconds. (Dawid Weiss)
  • LUCENE-4160: Added a property to quit the tests after a given number of failures has occurred. This is useful in combination with -Dtests.iters=N (you can start N iterations and wait for M failures, in particular M = 1). -Dtests.maxfailures=M. Alternatively, specify -Dtests.failfast=true to skip all tests after the first failure. (Dawid Weiss)
  • LUCENE-4115: JAR resolution/cleanup should be done automatically for ant clean/eclipse/resolve. (Dawid Weiss)
  • LUCENE-4199, LUCENE-4202, LUCENE-4206: Add a new target "check-forbidden-apis" that parses all generated .class files for use of APIs that rely on the default charset, default locale, or default timezone, and fails the build if violations are found. This ensures that Lucene/Solr does not depend on local configuration options. (Uwe Schindler, Robert Muir, Dawid Weiss)
  • LUCENE-4217: Add the possibility to run tests with Atlassian Clover loaded from IVY. A development License solely for Apache code was added in the tools/ folder, but is not included in releases. (Uwe Schindler)
  • Documentation:
  • LUCENE-4195: Added package documentation and examples for org.apache.lucene.codecs (Alan Woodward via Robert Muir)
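
The LUCENE-4230 and LUCENE-4273 entries above replace a boolean argument with an int flags bit mask. The stand-alone Java sketch below illustrates that pattern only. It does not call Lucene, and the class name, the constant values, and the helper methods are assumptions chosen for illustration, not the library's actual definitions.

```java
/** Stand-alone illustration of an int "flags" bit mask; this is not Lucene code. */
public final class PostingsFlagsSketch {
    // Hypothetical bit values, chosen for illustration only.
    public static final int FLAG_OFFSETS = 1 << 0;
    public static final int FLAG_PAYLOADS = 1 << 1;

    private PostingsFlagsSketch() {}

    /** Returns true if every bit of {@code flag} is set in {@code flags}. */
    public static boolean isSet(int flags, int flag) {
        return (flags & flag) == flag;
    }

    /** Lists the optional features a caller asked for, in a fixed order. */
    public static String describe(int flags) {
        StringBuilder sb = new StringBuilder("positions");
        if (isSet(flags, FLAG_OFFSETS)) {
            sb.append("+offsets");
        }
        if (isSet(flags, FLAG_PAYLOADS)) {
            sb.append("+payloads");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A caller combines the features it needs with bitwise OR.
        System.out.println(describe(0));
        System.out.println(describe(FLAG_OFFSETS));
        System.out.println(describe(FLAG_OFFSETS | FLAG_PAYLOADS));
    }
}
```

A mask of this kind can carry several independent options in one argument, so a further option can be introduced later without changing the method signature again.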

New in Apache Lucene 3.6.1 (Jul 23, 2012)

  • The concurrency of MMapIndexInput.clone() was improved; its earlier implementation had caused a performance regression in comparison to Lucene 3.5.0.
  • MappingCharFilter was fixed to return correct final token positions.
  • QueryParser now supports +/- operators with any amount of whitespace.
  • DisjunctionMaxScorer now implements visitSubScorers().
  • Changed the visibility of Scorer#visitSubScorers() to public, otherwise it's impossible to implement Scorers outside the Lucene package. This is a small backwards break, affecting a few users who implemented custom Scorers.
  • Various analyzer bugs were fixed: Kuromoji no longer produces an invalid token graph when UNK with punctuation is decompounded, SynonymFilter no longer produces an invalid position length, Hunspell dictionaries that use aliasing now load, and streams are closed consistently when loading Hunspell affix files.
  • Various bugs in FST components were fixed: offline sorter minimum buffer size, integer overflow in sorter, and FSTCompletionLookup failing to close its sorter.
  • Fixed a synchronization bug in handling taxonomies in facet module.
  • Various minor bugs were fixed: BytesRef/CharsRef copy methods with nonzero offsets and subSequence off-by-one, TieredMergePolicy returned wrong-scaled floor segment setting.
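
The "wrong-scaled floor segment setting" fix above is a unit mismatch between a setter and a getter; the LUCENE-4222 entry in the 4.0 section describes it as returning bytes instead of MB. The sketch below shows that class of bug, assuming a value stored internally in bytes and exposed in MB. The class and method names are hypothetical, and this is not the TieredMergePolicy source.

```java
/** Stand-alone illustration of a bytes-versus-MB scaling bug; this is not Lucene code. */
public final class FloorSegmentSketch {
    private static final double BYTES_PER_MB = 1024.0 * 1024.0;

    // The setting is stored internally in bytes.
    private long floorSegmentBytes;

    /** Accepts a size in MB and stores it as bytes. */
    public void setFloorSegmentMB(double mb) {
        floorSegmentBytes = (long) (mb * BYTES_PER_MB);
    }

    /** Buggy variant: returns the raw byte count although the caller expects MB. */
    public double getFloorSegmentMBUnscaled() {
        return floorSegmentBytes;
    }

    /** Corrected variant: converts the stored bytes back to MB. */
    public double getFloorSegmentMB() {
        return floorSegmentBytes / BYTES_PER_MB;
    }

    public static void main(String[] args) {
        FloorSegmentSketch policy = new FloorSegmentSketch();
        policy.setFloorSegmentMB(2.0);
        System.out.println("unscaled: " + policy.getFloorSegmentMBUnscaled());
        System.out.println("scaled:   " + policy.getFloorSegmentMB());
    }
}
```

A round trip through the setter and the corrected getter returns the value that was set, which is the property a regression test for this kind of fix would typically check.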

New in Apache Lucene 3.6.0 (Apr 13, 2012)

  • In addition to Java 5 and Java 6, this release now has full Java 7 support (minimum JDK 7u1 required).
  • TypeTokenFilter filters tokens based on their TypeAttribute.
  • Fixed offset bugs in a number of CharFilters, Tokenizers and TokenFilters that could lead to exceptions during highlighting.
  • Added phonetic encoders: Metaphone, Soundex, Caverphone, Beider-Morse, etc.
  • CJKBigramFilter and CJKWidthFilter replace CJKTokenizer.
  • Kuromoji morphological analyzer tokenizes Japanese text, producing both compound words and their segmentation.
  • Static index pruning (Carmel pruning) removes postings with low within-document term frequency.
  • QueryParser now interprets '*' as an open end for range queries.
  • FieldValueFilter excludes documents missing the specified field.
  • CheckIndex and IndexUpgrader allow you to specify the specific FSDirectory implementation to use with the new -dir-impl command-line option.
  • FSTs can now do reverse lookup (by output) in certain cases and can be packed to reduce their size. There is now a method to retrieve top N shortest paths from a start node in an FST.
  • New WFSTCompletionLookup suggester supports finer-grained ranking for suggestions.
  • FST based suggesters now use an offline (disk-based) sort, instead of in-memory sort, when pre-sorting the suggestions.
  • ToChildBlockJoinQuery joins in the opposite direction (parent down to child documents).
  • New query-time joining is more flexible (but less performant) than index-time joins.
  • Added HTMLStripCharFilter to strip HTML markup.
  • Security fix: Better prevention of virtual machine SIGSEGVs when using MMapDirectory: code using cloned IndexInputs of already closed indexes could crash the VM, allowing DoS attacks on your application.
  • Many bug fixes
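
The QueryParser entry above means that a range such as [* TO 100] leaves the lower end open. The stand-alone Java sketch below illustrates only that open-ended semantics. It is not the QueryParser implementation: the class and method names are hypothetical, and it compares numbers for simplicity, whereas a range over text terms would compare the terms themselves.

```java
/** Stand-alone illustration of open-ended range endpoints; this is not Lucene code. */
public final class OpenRangeSketch {
    private OpenRangeSketch() {}

    /**
     * Inclusive range check in which "*" as an endpoint means "unbounded".
     * Any other endpoint must be a decimal integer.
     */
    public static boolean matches(String lower, String upper, long value) {
        boolean lowerOk = "*".equals(lower) || value >= Long.parseLong(lower);
        boolean upperOk = "*".equals(upper) || value <= Long.parseLong(upper);
        return lowerOk && upperOk;
    }

    public static void main(String[] args) {
        // Corresponds to the range [* TO 100]: no lower bound.
        System.out.println(matches("*", "100", -5));
        // Corresponds to the range [10 TO *]: no upper bound.
        System.out.println(matches("10", "*", 5));
    }
}
```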

New in Apache Lucene 3.4.0 (Sep 15, 2011)

  • Fixed a major bug (LUCENE-3418) whereby a Lucene index could easily become corrupted if the OS or computer crashed or lost power.
  • Added a new faceting module (contrib/facet) for computing facet counts (both hierarchical and non-hierarchical) at search time (LUCENE-3079).
  • Added a new join module (contrib/join), enabling indexing and searching of nested (parent/child) documents using BlockJoinQuery/Collector (LUCENE-3171).
  • It is now possible to index documents with term frequencies included but without positions (LUCENE-2048); previously omitTermFreqAndPositions always omitted both.
  • The modular QueryParser (contrib/queryparser) can now create NumericRangeQuery.
  • Added SynonymFilter, in contrib/analyzers, to apply multi-word synonyms during indexing or querying, including parsers to read the wordnet and solr synonym formats (LUCENE-3233).
  • You can now control how documents that don't have a value on the sort field should sort (LUCENE-3390), using SortField.setMissingValue.
  • Fixed a case where term vectors could be silently deleted from the index after addIndexes (LUCENE-3402).