From 7274420ecde7729dd6e7ad038ffdea5f7ca915e1 Mon Sep 17 00:00:00 2001 From: writinwaters <93570324+writinwaters@users.noreply.github.com> Date: Tue, 12 Nov 2024 19:56:56 +0800 Subject: [PATCH] Updated RAGFlow UI (#3362) ### What problem does this PR solve? ### Type of change - [x] Documentation Update --- docker/README.md | 10 +++++-- docs/configurations.md | 10 +++++-- web/src/locales/en.ts | 59 ++++++++++++++++++------------------ 3 files changed, 41 insertions(+), 38 deletions(-) diff --git a/docker/README.md b/docker/README.md index ae872654..6731d658 100644 --- a/docker/README.md +++ b/docker/README.md @@ -102,13 +102,19 @@ The [.env](./.env) file contains important environment variables for Docker. > - `RAGFLOW_IMAGE=swr.cn-north-4.myhuaweicloud.com/infiniflow/ragflow:dev` or, > - `RAGFLOW_IMAGE=registry.cn-hangzhou.aliyuncs.com/infiniflow/ragflow:dev`. -### Miscellaneous +### Timezone - `TIMEZONE` The local time zone. Defaults to `'Asia/Shanghai'`. + +### Hugging Face mirror site + - `HF_ENDPOINT` The mirror site for huggingface.co. It is disabled by default. You can uncomment this line if you have limited access to the primary Hugging Face domain. -- `MACOS`   + +### macOS + +- `MACOS` Optimizations for macOS. It is disabled by default. You can uncomment this line if your OS is macOS. ## 🐋 Service configuration diff --git a/docs/configurations.md b/docs/configurations.md index afa3e060..315f608a 100644 --- a/docs/configurations.md +++ b/docs/configurations.md @@ -123,13 +123,19 @@ If you cannot download the RAGFlow Docker image, try the following mirrors. - `RAGFLOW_IMAGE=registry.cn-hangzhou.aliyuncs.com/infiniflow/ragflow:dev`. ::: -### Miscellaneous +### Timezone - `TIMEZONE` The local time zone. Defaults to `'Asia/Shanghai'`. + +### Hugging Face mirror site + - `HF_ENDPOINT` The mirror site for huggingface.co. It is disabled by default. You can uncomment this line if you have limited access to the primary Hugging Face domain. 
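For reference, the three variables described above can be sketched as a `.env` fragment. This is illustrative only: the mirror URL is an example of a Hugging Face mirror, not a recommendation from this PR.

```shell
# Illustrative .env fragment for the variables documented above.
# Values are examples; adjust them to your environment.

# Local time zone used by the RAGFlow containers.
TIMEZONE='Asia/Shanghai'

# Uncomment to route Hugging Face downloads through a mirror site
# if access to the primary huggingface.co domain is limited:
# HF_ENDPOINT=https://hf-mirror.com

# Uncomment on macOS to enable macOS-specific optimizations:
# MACOS=1
```

Commented-out lines leave the corresponding feature disabled, matching the defaults described above.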
-- `MACOS`   + +### macOS + +- `MACOS` Optimizations for macOS. It is disabled by default. You can uncomment this line if your OS is macOS. ## Service configuration diff --git a/web/src/locales/en.ts b/web/src/locales/en.ts index 7a8e9816..adfd893a 100644 --- a/web/src/locales/en.ts +++ b/web/src/locales/en.ts @@ -200,43 +200,39 @@ export default { methodEmpty: 'This will display a visual explanation of the knowledge base categories', book: `

Supported file formats are DOCX, PDF, TXT.

- Since a book is long and not all the parts are useful, if it's a PDF, - please setup the page ranges for every book in order eliminate negative effects and save computing time for analyzing.

`, + For each book in PDF format, set the page ranges to remove irrelevant content and reduce analysis time.

`, laws: `

Supported file formats are DOCX, PDF, TXT.

- Legal documents have a very rigorous writing format. We use text feature to detect split point. + Legal documents typically follow a rigorous writing format. We use text features to identify split points.

- The chunk granularity is consistent with 'ARTICLE', and all the upper level text will be included in the chunk. + Chunk granularity is consistent with 'ARTICLE', ensuring that all upper-level text is included in the chunk.

`, manual: `

Only PDF is supported.

We assume that the manual has a hierarchical section structure, using the lowest section titles as basic unit for chunking documents. Therefore, figures and tables in the same section will not be separated, which may result in larger chunk sizes.

`, naive: `

Supported file formats are DOCX, EXCEL, PPT, IMAGE, PDF, TXT, MD, JSON, EML, HTML.

-

This method apply the naive ways to chunk files:

+

This method chunks files in the 'naive' way:

-

  • Successive text will be sliced into pieces using vision detection model.
  • -
  • Next, these successive pieces are merge into chunks whose token number is no more than 'Token number'.
  • `, +
  • Use a vision detection model to split the text into smaller segments.
  • +
  • Then, combine adjacent segments until the token count exceeds the threshold specified by 'Chunk token number', at which point a chunk is created.
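As a rough illustration of the two steps above (a minimal sketch, not RAGFlow's actual implementation), the merge step can be written in Python. `merge_segments` is a hypothetical name, and whitespace splitting stands in for the real tokenizer:

```python
# Sketch of the merge step: adjacent segments (already produced by a
# layout/vision detection model) are combined greedily; a new chunk is
# started whenever adding the next segment would exceed the token budget.
def merge_segments(segments, max_tokens, count_tokens=lambda s: len(s.split())):
    chunks, current, current_tokens = [], [], 0
    for seg in segments:
        seg_tokens = count_tokens(seg)
        if current and current_tokens + seg_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(seg)
        current_tokens += seg_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a budget of 4 whitespace tokens, `["a b", "c d", "e f"]` merges into `["a b c d", "e f"]`; a single oversized segment still becomes its own chunk rather than being dropped.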
  • `, paper: `

    Only PDF file is supported.

    - If our model works well, the paper will be sliced by it's sections, like abstract, 1.1, 1.2, etc.

    - The benefit of doing this is that LLM can better summarize the content of relevant sections in the paper, - resulting in more comprehensive answers that help readers better understand the paper. - The downside is that it increases the context of the LLM conversation and adds computational cost, - so during the conversation, you can consider reducing the ‘topN’ setting.

    `, - presentation: `

    The supported file formats are PDF, PPTX.

    - Every page will be treated as a chunk. And the thumbnail of every page will be stored.

    - All the PPT files you uploaded will be chunked by using this method automatically, setting-up for every PPT file is not necessary.

    `, + Papers will be split by section, such as abstract, 1.1, 1.2.

    + This approach enables the LLM to summarize the paper more effectively and provide more comprehensive, understandable responses. + However, it also increases the context for AI conversations and adds to the computational cost for the LLM. So during a conversation, consider reducing the value of 'topN'.

    `, + presentation: `

    Supported file formats are PDF, PPTX.

    + Every page in the slides is treated as a chunk, with its thumbnail image stored.

    + This chunk method is automatically applied to all uploaded PPT files, so you do not need to specify it manually.

    `, qa: `

    This chunk method supports EXCEL and CSV/TXT file formats.

  • - If the file is in Excel format, it should consist of two columns + If a file is in Excel format, it should contain two columns without headers: one for questions and the other for answers, with the question column preceding the answer column. Multiple sheets are - acceptable as long as the columns are correctly structured. + acceptable, provided the columns are properly structured.
  • - If the file is in CSV/TXT format, it must be UTF-8 encoded with TAB - used as the delimiter to separate questions and answers. + If a file is in CSV/TXT format, it must be UTF-8 encoded with TAB as the delimiter to separate questions and answers.
  • @@ -245,25 +241,20 @@ export default {

    `, - resume: `

    The supported file formats are DOCX, PDF, TXT. + resume: `

    Supported file formats are DOCX, PDF, TXT.

    - The résumé comes in a variety of formats, just like a person’s personality, but we often have to organize them into structured data that makes it easy to search. -

    - Instead of chunking the résumé, we parse the résumé into structured data. As a HR, you can dump all the résumé you have, - the you can list all the candidates that match the qualifications just by talk with 'RAGFlow'. + Résumés in various formats are parsed and organized into structured data to facilitate candidate search for recruiters.

    `, - table: `

    EXCEL and CSV/TXT format files are supported.

    - Here're some tips: + table: `

    Supported file formats are EXCEL and CSV/TXT.

    + Here are some prerequisites and tips: