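How can I make Solr strip line breaks and spaces when storing PDF, Word, and similar documents? I have tried many approaches and none of them work.
For reference:
https://blog.csdn.net/baeiou/article/details/49280349
Alternatively, does anyone have modified source code that runs without errors?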
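Written with the help of GPT. Solr is a search engine and does not rewrite document content on its own, but you can preprocess the text before uploading it to Solr to strip line breaks and spaces.
The following Java example shows how to use the Apache Tika library to extract text from PDF and Word documents and clean up line breaks and spaces before the document is uploaded to Solr: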
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.commons.lang3.StringUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class SolrIndexing {
    public static void main(String[] args) throws Exception {
        // Path to PDF or Word document
        String filePath = "/path/to/document.pdf";
        // Extract text from PDF or Word document
        String text = extractText(filePath);
        // Trim the ends and collapse each run of whitespace (spaces, tabs, line breaks) into one space
        text = StringUtils.normalizeSpace(text);
        // Index document in Solr
        indexDocument(text);
    }

    private static String extractText(String filePath) throws Exception {
        // AutoDetectParser chooses the concrete parser from the detected file type
        Parser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        // -1 disables BodyContentHandler's default 100,000-character write limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        // try-with-resources closes the stream even when parsing fails
        try (InputStream input = new FileInputStream(new File(filePath))) {
            // The Parser interface requires a ParseContext as the fourth argument
            parser.parse(input, handler, metadata, new ParseContext());
        } catch (SAXException | TikaException e) {
            throw new Exception("Error parsing document", e);
        }
        // Return extracted text
        return handler.toString();
    }

    private static void indexDocument(String text) {
        // Code for indexing document in Solr goes here
    }
}
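This example uses Apache Tika to extract text from PDF and Word documents, and StringUtils.normalizeSpace() from Apache Commons Lang to clean the whitespace: it trims both ends and collapses each run of whitespace, line breaks included, into a single space. The cleaned text is then passed to Solr for indexing. Note that indexDocument() is left unimplemented here because it depends on your specific Solr setup.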
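If collapsing whitespace to single spaces is not enough (for example, Chinese text extracted from a PDF often should have no spaces at all), the cleanup step can be sketched with the JDK alone. This is a minimal sketch, not part of the answer above: the class name WhitespaceCleaner and its two methods are hypothetical, and the (?U) flag is an assumption about the input, chosen so that the non-breaking space (U+00A0) and the ideographic space (U+3000) are also treated as whitespace.

```java
import java.util.regex.Pattern;

public class WhitespaceCleaner {
    // (?U) turns on UNICODE_CHARACTER_CLASS, so \s also matches
    // U+00A0 (non-breaking space) and U+3000 (ideographic space).
    private static final Pattern WHITESPACE = Pattern.compile("(?U)\\s+");

    // Replace every whitespace run with one space and trim the ends.
    public static String collapse(String text) {
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    // Remove every whitespace character, including line breaks.
    public static String removeAll(String text) {
        return WHITESPACE.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String raw = "  第一行\r\n\r\n第二行\t\u3000end  ";
        System.out.println(collapse(raw));
        System.out.println(removeAll(raw));
    }
}
```

Either method could be applied to the extracted text in place of StringUtils.normalizeSpace(), depending on whether word boundaries need to survive.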
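The following answer was written jointly by the GPT-3.5 model and the blogger 波罗歌:
You can use Solr's Update Request Processors (URPs) to process documents before they are indexed. A JavaScript-based processor can also be used to read the document content and apply any further processing.
The steps are as follows: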
In solrconfig.xml, define the chain:
<updateRequestProcessorChain name="stripSpaces">
  <!-- Trim leading and trailing whitespace from every *_s field -->
  <processor class="solr.TrimFieldUpdateProcessorFactory">
    <str name="fieldRegex">.*_s$</str>
  </processor>
  <!-- \s+ already matches \r, \n and tabs, so one pass turns every whitespace run into a single space -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">text</str>
    <str name="pattern">\s+</str>
    <str name="replacement"> </str>
  </processor>
  <!-- Drop *_s values that are empty after trimming -->
  <processor class="solr.RemoveBlankFieldUpdateProcessorFactory">
    <str name="fieldRegex">.*_s$</str>
  </processor>
  <!-- A chain must end with RunUpdateProcessorFactory, otherwise the document is never written to the index -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
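To make the effect of the stripSpaces processors concrete, here is a rough plain-Java model of the three steps (trim the *_s fields, collapse whitespace in text, drop empty *_s values). It is only a sketch under simplifying assumptions: the class UrpChainSketch is hypothetical, a document is modelled as a map of single string values, and none of this is Solr's API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UrpChainSketch {
    // Applies the three steps in chain order to a copy of the document.
    public static Map<String, String> apply(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);

        // Step 1 (trim): strip leading/trailing whitespace from *_s fields.
        out.replaceAll((name, value) -> name.endsWith("_s") ? value.trim() : value);

        // Step 2 (regex replace): collapse whitespace runs in the "text" field.
        out.computeIfPresent("text", (name, value) -> value.replaceAll("\\s+", " "));

        // Step 3 (remove blank): drop *_s fields whose value is now empty.
        out.entrySet().removeIf(e -> e.getKey().endsWith("_s") && e.getValue().isEmpty());

        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "123");
        doc.put("text", "first line\r\n\r\nsecond\tline");
        doc.put("content_s", "  some text  ");
        doc.put("empty_s", "   ");
        System.out.println(apply(doc));
    }
}
```

The order matters: the whitespace-only field is removed only because trimming has already reduced it to an empty string.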
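The stripSpaces chain uses the following processors:
TrimFieldUpdateProcessorFactory: removes leading and trailing whitespace from the selected fields.
RegexReplaceProcessorFactory: replaces text that matches a regular expression.
RemoveBlankFieldUpdateProcessorFactory: removes field values that are empty strings, so whitespace-only values are dropped once they have been trimmed.
In Solr's solrconfig.xml, include the newly defined processing in the updateRequestProcessorChain you use: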
<updateRequestProcessorChain name="mycustomchain">
  <processor class="solr.UUIDUpdateProcessorFactory"/>
  <!-- Strip an honorific prefix from the title, then trim what is left -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">title</str>
    <str name="pattern">^(Mr|Mrs|Dr)\.?\s+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="solr.TrimFieldUpdateProcessorFactory">
    <str name="fieldName">title</str>
    <str name="fieldName">title_graphic</str>
  </processor>
  <!-- Newly added whitespace cleanup: stripSpaces cannot be referenced as a processor here, so its regex step is repeated -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">text</str>
    <str name="pattern">\s+</str>
    <str name="replacement"> </str>
  </processor>
  <!-- With no field selector, empty values are removed from every field -->
  <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <!-- RunUpdateProcessorFactory must come last -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
In schema.xml, define the fields and dynamic fields that the processors in the URP chain operate on:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="name" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="file" type="string" indexed="false" stored="true" />
<dynamicField name="*_s" type="string" indexed="true" stored="true" multiValued="true" />
<copyField source="name" dest="_text_"/>
<copyField source="title" dest="_text_"/>
<copyField source="*_s" dest="_text_"/>
<!-- Solr reads update processor chains only from solrconfig.xml, so this chain goes there and not in schema.xml -->
<updateRequestProcessorChain name="myupdatechain">
  <processor class="solr.UUIDUpdateProcessorFactory"/>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">title</str>
    <str name="pattern">^(Mr|Mrs|Dr)\.?\s+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="solr.TrimFieldUpdateProcessorFactory">
    <str name="fieldName">title</str>
    <str name="fieldName">title_graphic</str>
  </processor>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">text</str>
    <str name="pattern">\s+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
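Finally, select the stripSpaces chain when you send the update. Solr chooses the chain from the update.chain request parameter on the URL (for example /solr/mycore/update?update.chain=stripSpaces), not from a key inside the JSON body:
{
  "add": {
    "doc": {
      "id": "123",
      "name": "test.pdf",
      "file": "path/to/test.pdf",
      "title": "This is a Test Document",
      "content_s": "some text"
    },
    "overwrite": true,
    "commitWithin": 1000
  }
}
With update.chain=stripSpaces on the request URL, this document passes through the stripSpaces processors before it is indexed.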
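For readers sending the update from Java rather than by hand, the request can be sketched with the JDK's built-in HTTP client (Java 11 or later). This is a minimal sketch under assumptions: the class UpdateWithChain and its helper methods are hypothetical, the base URL and the core name mycore are placeholders, and it relies only on Solr's standard update.chain and commit request parameters.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class UpdateWithChain {
    // Builds the /update URL and names the processor chain via update.chain.
    public static URI updateUri(String solrBase, String core, String chain) {
        String query = "update.chain=" + URLEncoder.encode(chain, StandardCharsets.UTF_8)
                + "&commit=true";
        return URI.create(solrBase + "/" + core + "/update?" + query);
    }

    // Wraps a JSON array of documents in a POST request.
    public static HttpRequest buildRequest(URI uri, String jsonDocs) {
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonDocs, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder document; \\r\\n is the JSON escape for a line break.
        String docs = "[{\"id\":\"123\",\"text\":\"line one\\r\\n\\r\\nline two\"}]";
        URI uri = updateUri("http://localhost:8983/solr", "mycore", "stripSpaces");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(uri, docs), HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```

Running main requires a reachable Solr instance with a matching core and chain; a client library such as SolrJ would be the more usual choice in a real application.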
I hope this helps you solve the problem.
If my answer solved your problem, please accept it!
remove_whitespace (field: text, pattern: \\s+)
curl 'http://localhost:8983/solr/mycore/update?commit=true' -H 'Content-type:application/json' -d '
[
  {
    "id": "1",
    "title": "My Document",
    "text": "This is a document with\r\n\r\nmultiple lines and\ttabs."
  }
]'
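When a request body like the one above is built in code from extracted text rather than typed by hand, line breaks and tabs have to be escaped before they go into the JSON string. The helper below is a minimal sketch of that escaping using only the JDK; the class name JsonEscape is hypothetical, and a JSON library or SolrJ would normally do this for you.

```java
public class JsonEscape {
    // Returns the input as a double-quoted JSON string literal.
    public static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \ uXXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        String text = "This is a document with\r\n\r\nmultiple lines and\ttabs.";
        System.out.println("[{\"id\":\"1\",\"text\":" + quote(text) + "}]");
    }
}
```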