Classify the test set according to the real situation of the target users rather than the classification of the knowledge base itself; otherwise the test set may end up far from actual usage.

Don't ignore the coverage of form

When preparing test sets, some product designers or content experts think that everything will be fine as long as the content covers multiple fields. In practice, we found that it is just as important to cover as many forms of real user input as possible, because the model performs differently when it responds to different forms of input. Take the knowledge question-and-answer function: even for a question related to the product manager's business analysis, users may ask it in different ways, and each way reflects a different user intention. The following figure shows the classification of question forms we used when building the knowledge question-and-answer function, for reference by partners. So the second point in preparing the test set is that, in addition to the coverage of content themes, the coverage of form also needs attention.

Coverage of expression habits

In the knowledge question-and-answer function, users' expression habits should also be taken into consideration. For the same intention, users with different language habits will produce different expressions. For example, some people are used to inverted word order, and some are used to terse phrasing. So when preparing the test set, you also need to include questions with different expression habits, preferably real user input, to check that the LLM can understand them.
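The three coverage checks described above (content theme, question form, expression habits) can be audited mechanically once every test question carries a few tags. The following is a minimal sketch, not the authors' tooling: the `QuestionCase` fields, the topic and form labels, and the `min_styles` threshold are all hypothetical and should be replaced with the categories that fit your own users.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class QuestionCase:
    """One test question plus the tags used for the coverage audit."""
    question: str
    topic: str    # content theme, classified from the users' point of view
    form: str     # question form (hypothetical label)
    intent: str   # what the user is trying to achieve
    style: str    # expression habit (hypothetical label)
    from_real_user: bool = False


def missing_topic_form_pairs(cases, topics, forms):
    """Return every (topic, form) combination that has no test question."""
    covered = {(c.topic, c.form) for c in cases}
    return [pair for pair in product(topics, forms) if pair not in covered]


def intents_with_few_styles(cases, min_styles=2):
    """Return intents that are phrased in fewer than `min_styles` styles."""
    styles_by_intent = defaultdict(set)
    for c in cases:
        styles_by_intent[c.intent].add(c.style)
    return sorted(i for i, s in styles_by_intent.items() if len(s) < min_styles)


def real_user_share(cases):
    """Fraction of test questions taken from real user input."""
    return sum(c.from_real_user for c in cases) / len(cases) if cases else 0.0


# Illustrative labels only; they are assumptions, not the article's taxonomy.
TOPICS = ["pricing", "onboarding"]
FORMS = ["direct_question", "keyword_only"]

cases = [
    QuestionCase("How much does the pro plan cost?", "pricing",
                 "direct_question", "get_price", "plain", True),
    QuestionCase("pro plan price", "pricing",
                 "keyword_only", "get_price", "terse", True),
    QuestionCase("How do I invite a teammate?", "onboarding",
                 "direct_question", "invite_user", "plain"),
]

gaps = missing_topic_form_pairs(cases, TOPICS, FORMS)
thin = intents_with_few_styles(cases)

print("Uncovered (topic, form) pairs:", gaps)
print("Intents with too few phrasings:", thin)
print("Share of real user input:", round(real_user_share(cases), 2))
```

A report like this only shows where the test set is thin. Deciding which gaps matter still depends on how the target users actually ask questions.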
The second pitfall: measurement dimensions. Details determine success or failure

The three axes of the measurement standard: accuracy, sufficiency, and relevance

Next, let's look at whether there are any pitfalls in these three dimensions.

Fuzzy accuracy

Whether accuracy itself can be accurately evaluated is a difficult point. Because of the natural-language nature of large models, the output is not always deterministic or quantitatively assessable. To reduce this problem, the solution we are currently exploring is to divide the correctness standard into three categories, which makes evaluation more consistent. Must be correct: for the output content