Releases: benbrandt/text-splitter
v0.28.0
v0.27.0
v0.26.0
What's New
- Updated to `icu` v2.0 for all Unicode segmentation.
- Minimum Rust version updated to 1.82.0
Full Changelog: v0.25.1...v0.26.0
v0.25.1
What's New
- Use `memchr` crate instead of `regex` for parsing phase in `TextSplitter`. This should improve performance in how quickly the text is parsed when scanning for newline characters. #650
- Implement `ChunkSizer` trait automatically for many more wrappers and references to types that already implement `ChunkSizer` #649
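
To illustrate what implementing a trait "automatically" for wrappers and references usually means in Rust, here is a minimal sketch of forwarding impls. The `Sizer` trait, the `CharCount` type, and the `measure` function are hypothetical stand-ins written for this example; the crate's actual `ChunkSizer` definition and the exact set of wrappers it covers may differ.

```rust
use std::rc::Rc;
use std::sync::Arc;

/// Hypothetical stand-in for a chunk-sizing trait (not the crate's definition).
trait Sizer {
    fn size(&self, chunk: &str) -> usize;
}

/// A concrete sizer that counts Unicode scalar values.
struct CharCount;

impl Sizer for CharCount {
    fn size(&self, chunk: &str) -> usize {
        chunk.chars().count()
    }
}

// Forwarding impls: a reference or smart pointer to a sizer is itself a sizer.
// `?Sized` lets these also cover trait objects such as `Box<dyn Sizer>`.
impl<T: Sizer + ?Sized> Sizer for &T {
    fn size(&self, chunk: &str) -> usize {
        (**self).size(chunk)
    }
}

impl<T: Sizer + ?Sized> Sizer for Box<T> {
    fn size(&self, chunk: &str) -> usize {
        (**self).size(chunk)
    }
}

impl<T: Sizer + ?Sized> Sizer for Rc<T> {
    fn size(&self, chunk: &str) -> usize {
        (**self).size(chunk)
    }
}

impl<T: Sizer + ?Sized> Sizer for Arc<T> {
    fn size(&self, chunk: &str) -> usize {
        (**self).size(chunk)
    }
}

/// Generic consumer that takes any sizer by value.
fn measure<S: Sizer>(sizer: S, chunk: &str) -> usize {
    sizer.size(chunk)
}
```

With impls like these, a caller can pass `&CharCount`, `Box<dyn Sizer>`, or `Arc<CharCount>` to `measure` without writing any adapter code, which is the kind of convenience the change above describes.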
Full Changelog: v0.25.0...v0.25.1
v0.25.0
Breaking Changes
Rust
- Remove support for `rust-tokenizers` crate. This crate hasn't been updated in several years and brings in dependencies that have security warnings.
What's New
- Use faster encoding method for `tokenizers` library, which improves performance with usage of huggingface tokenizers.
Full Changelog: v0.24.2...v0.25.0
v0.24.2
v0.24.1
What's Changed
Added a new `chunk_char_indices` method to the Rust splitters in #607

```rust
use text_splitter::{ChunkCharIndex, TextSplitter};

fn main() {
    let text = "\r\na̐éö̲\r\n";
    let splitter = TextSplitter::new(3);
    let chunks = splitter.chunk_char_indices(text).collect::<Vec<_>>();

    // Each chunk reports both its byte offset and its character offset.
    assert_eq!(
        vec![
            ChunkCharIndex {
                chunk: "a̐é",
                byte_offset: 2,
                char_offset: 2
            },
            ChunkCharIndex {
                chunk: "ö̲",
                byte_offset: 7,
                char_offset: 5
            }
        ],
        chunks
    );
}
```

This pulls logic from the Python bindings down into the core library. This will be more expensive than just byte offsets, and for most usage in Rust, just having byte offsets is sufficient.
However, when interfacing with other languages or systems that require character offsets, this will track the character offsets for you, accounting for any trimming that may have occurred.
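
To make the difference between the two kinds of offset concrete, the following sketch converts a byte offset into a character offset using only the standard library. The helper name `char_offset_at` is hypothetical and this is not how the crate computes offsets internally. It does show why character offsets cost more: each lookup has to walk the text before the offset.

```rust
/// Convert a byte offset into `text` to a character (Unicode scalar value) offset.
/// Returns `None` if `byte_offset` is out of range or not on a char boundary.
fn char_offset_at(text: &str, byte_offset: usize) -> Option<usize> {
    // `str::get` checks both the bounds and the char boundary for us;
    // counting the chars in the prefix is a linear scan.
    text.get(..byte_offset).map(|prefix| prefix.chars().count())
}
```

Multi-byte characters such as combining marks make the two offsets drift apart, so a consumer that indexes by character (for example, a Python string) cannot use the byte offset directly.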
Full Changelog: v0.24.0...v0.24.1
v0.24.0
What's Changed
Update to pulldown-cmark 0.13.0 to improve Markdown parsing.
Full Changelog: v0.23.0...v0.24.0
v0.23.0
v0.22.0
Breaking Changes
- Revert change to special token behavior in v0.21. This had many unintended side effects, and does not seem to be recommended for chunking.
Full Changelog: v0.21.0...v0.22.0