I have a PHP application that uses a Solr database. The problem appears when I make a /terms request (terms doc).
The parts of the document that interest us are:
poi: "Bistriţa",
...
text: [
    "ddt",
    "Numeric",
    "/14/Gagaga 2/11/Economics/17/datenow",
    "/20/Daniel_same/11/Economics/17/datenow",
    "0/Gagaga 2",
    "1/Gagaga 2/Economics",
    "2/Gagaga 2/Economics/datenow",
    "0/Daniel_same",
    "1/Daniel_same/Economics",
    "2/Daniel_same/Economics/datenow",
    "ppla",
    "seat of a first-order administrative division",
    "/19/Daniel_same/1071/Plurinational State of Bolivia/2269/Cuba/2272/Bistriţa",
    "0/Daniel_same",
    "1/Daniel_same/Plurinational State of Bolivia",
    "2/Daniel_same/Plurinational State of Bolivia/Cuba",
    "3/Daniel_same/Plurinational State of Bolivia/Cuba/Bistriţa",
    "0/Undefined_activity",
    "Year",
    "0/1999",
    "0/1999",
    "Measured",
    "",
    "utf8"
],
And the request is
http://localhost:8080/solr/terms
?wt=json
&indent=true
&terms.sort=count
&terms.mincount=1
&terms.limit=10
&terms.regex.flag=case_insensitive
&terms.regex=.*bi.*
&terms.fl=text
The response is
{
    responseHeader: {
        status: 0,
        QTime: 4
    },
    terms: {
        text: [
            "bistriå",
            16
        ]
    }
}
The problem is that the returned term is truncated. I was expecting "BistriÅ£a", which is the UTF-8 encoding of the city name Bistrița, but the result seems to be cut off at the special character.
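As a side check on the encoding, here is a minimal Python sketch (not Solr code) of how the strings in the question relate to each other. It assumes the city name uses U+0163 (ţ), and the `rough_tokens` helper is a hypothetical stand-in that splits on non-word characters and lowercases; it only approximates what a tokenizer followed by a lowercase filter might do. Under those assumptions, the UTF-8 bytes of the name read as Latin-1 give "BistriÅ£a", and splitting that string at the "£" sign leaves "bistriå", which would be consistent with the text field holding mis-decoded bytes.

```python
import re

# City name with U+0163 (t with cedilla), written as an escape to avoid ambiguity.
city = "Bistri\u0163a"

# Encode as UTF-8, then (incorrectly) decode the same bytes as Latin-1.
mojibake = city.encode("utf-8").decode("latin-1")


def rough_tokens(text):
    """Hypothetical approximation of tokenizing plus lowercasing:
    lowercase the text and split it on non-word characters."""
    return re.findall(r"\w+", text.lower())


print(mojibake)                # the mis-decoded form of the name
print(rough_tokens(mojibake))  # "£" is not a word character, so the term splits there
print(rough_tokens(city))      # the correctly decoded name stays in one piece
```

If the correctly decoded name stays whole in a check like this, it may be worth verifying the encoding of the data sent to Solr before changing the analyzer chain.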
The weird thing is that if I make the same request with the field name "poi" instead of "text", I get a correct response:
http://localhost:8080/solr/terms
?wt=json
&indent=true
&terms.sort=count
&terms.mincount=1
&terms.limit=10
&terms.regex.flag=case_insensitive
&terms.regex=.*bi.*
&terms.fl=poi
The response is
{
    responseHeader: {
        status: 0,
        QTime: 4
    },
    terms: {
        poi: [
            "Bistriţa",
            16
        ]
    }
}
So the word is not truncated.
The big difference between the two fields is their type: poi has the string type and text has the text_general type. The text_general type is defined in the schema like this:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
I can provide more details if asked. Not sure what I can add now and not bloat the question too much.
You probably want to consider using the ASCIIFoldingFilterFactory in your text_general field type to handle the special character appropriately. Additionally, please refer to the Language Analysis support provided by Solr, which may be of use to you.
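A minimal sketch of that suggestion, based on the text_general definition from the question. The placement of the new filter after LowerCaseFilterFactory in both analyzers is an assumption, not something the original schema prescribes; everything else is copied from the question.

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- assumed placement: fold accented characters to their ASCII equivalents -->
        <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- same filter at query time so query terms are folded the same way -->
        <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
</fieldType>
```

Changing the index-time analyzer only affects documents indexed afterwards, so the existing documents would need to be reindexed. Note also that folding replaces a correctly decoded ţ with a plain t, so the /terms output for this field would show the folded form rather than the original spelling.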