Handling specific data formats

Boox is designed primarily to work with plain text. To index other formats, extract the relevant text from each document before adding it to Boox. This section shows how to do that for HTML and Markdown, and points to libraries for DOCX, PPTX, XLSX, ODT, ODP, ODS, and PDF.

HTML

If your dataset contains HTML documents, you'll need to extract the plain text content before adding it to Boox. You can use a library like stophtml to do this:

import Boox from 'boox'
import stophtml from 'stophtml'

const html = `
<h1>My HTML Document</h1>

<p>This is some content with <b>important</b> text.</p>
`

// Create a Boox instance and add the document
const boox = new Boox({
  features: ['title', 'content']
})

await boox.addDocument({
  id: 'doc1',
  title: 'My HTML Document',
  content: stophtml(html)
})
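
When indexing many HTML files, the title can often be taken from the markup instead of being typed by hand. The following is a minimal sketch, not part of Boox or stophtml: `htmlToDocument` is a hypothetical helper, and `stripTags` is a parameter standing in for whatever HTML-to-text function you use (for example, the `stophtml` import above). The regex-based `<h1>` lookup is a simplification that assumes simple, well-formed markup.

```javascript
// Hypothetical helper: builds a { id, title, content } object from raw HTML.
// `stripTags` is any function that turns an HTML string into plain text.
function htmlToDocument(id, html, stripTags) {
  // Use the first <h1>...</h1> as the title (simplistic; assumes simple markup).
  const match = html.match(/<h1[^>]*>([\s\S]*?)<\/h1>/i)
  const title = match ? stripTags(match[1]).trim() : ''

  return { id, title, content: stripTags(html).trim() }
}
```

The returned object has the same shape as the document passed to `boox.addDocument` above, so it could be used as `await boox.addDocument(htmlToDocument('doc1', html, stophtml))`.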

Markdown

Similarly, if your dataset contains Markdown documents, you can use a library like marked-plaintify to extract the plain text content:

import Boox from 'boox'
import { Marked } from 'marked'
import markedPlaintify from 'marked-plaintify'

const markdown = `
# My Markdown Document

This is some content with **important** text.
`

// Convert the Markdown to plain text
const marked = new Marked({ gfm: true }).use(markedPlaintify())
const plaintext = marked.parse(markdown)

// Create a Boox instance and add the document
const boox = new Boox({
  features: ['title', 'content']
})

await boox.addDocument({
  id: 'doc1',
  title: 'My Markdown Document',
  content: plaintext
})

Other data formats

You can handle other formats, such as DOCX, PPTX, XLSX, ODT, ODP, ODS, and PDF, in the same way, using a library such as officeParser, office-text-extractor, or word-extractor. The key is to extract the relevant text from each document before adding it to Boox.
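
If a dataset mixes several of these formats, one way to organize the extraction step is to choose an extractor by file extension. The sketch below is an assumption-laden illustration, not an API of Boox or of the libraries named above: `createExtractorRegistry`, `register`, and `extract` are hypothetical names, and each extractor is assumed to be an async function that takes a file path and resolves to plain text. You would supply the wrappers around your chosen library yourself.

```javascript
// Hypothetical registry mapping file extensions to text extractors.
// Each extractor takes a file path and resolves to a plain-text string.
function createExtractorRegistry() {
  const extractors = new Map()

  return {
    // Register one extractor for one or more extensions, e.g. ['docx', 'odt'].
    register(extensions, extractor) {
      for (const ext of extensions) extractors.set(ext.toLowerCase(), extractor)
    },

    // Extract plain text from a file, choosing the extractor by extension.
    async extract(filePath) {
      const dot = filePath.lastIndexOf('.')
      const ext = dot === -1 ? '' : filePath.slice(dot + 1).toLowerCase()
      const extractor = extractors.get(ext)
      if (!extractor) throw new Error(`No extractor registered for ".${ext}"`)
      return extractor(filePath)
    }
  }
}
```

The string returned by `extract` would then be used as the `content` field of the document passed to `boox.addDocument`, as in the HTML and Markdown examples.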