Apache Tika integration built in Scala for indexing OneDrive files into Elasticsearch.
Because Windows search functionality just doesn’t cut it.
Well…right now it doesn’t. It’s a work in progress. But the idea is that Apache Tika can parse my .doc, .pdf, .docx, .pptx, .ppt, and even OneNote files so that they are machine readable. Then I will use an Elasticsearch Java client to index all of these files.
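To make the parse-then-index idea concrete, here is a minimal sketch, not the project's actual code. It assumes Elasticsearch is reachable on its default port 9200, uses a hypothetical index name `onedrive`, and talks to the plain REST API through the JDK 11+ HTTP client instead of a dedicated Elasticsearch client, to stay dependency-free. The `extract` parameter is a placeholder for the Tika call; with Tika on the classpath, something like `p => new Tika().parseToString(p.toFile)` should fit there.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.file.Path
import java.security.MessageDigest

object IndexSketch {
  // Escape a string so it can sit inside a JSON string literal.
  def jsonEscape(s: String): String = {
    val sb = new StringBuilder
    s.foreach {
      case '"'  => sb.append("\\\"")
      case '\\' => sb.append("\\\\")
      case '\n' => sb.append("\\n")
      case '\r' => sb.append("\\r")
      case '\t' => sb.append("\\t")
      case c if c < ' ' => sb.append('\\').append('u').append("%04x".format(c.toInt))
      case c    => sb.append(c)
    }
    sb.toString
  }

  // One document per file: where it lives and the text extracted from it.
  // The field names "path" and "content" are assumptions for this sketch.
  def toJson(path: Path, text: String): String =
    s"""{"path":"${jsonEscape(path.toString)}","content":"${jsonEscape(text)}"}"""

  // Stable, URL-safe document id: SHA-1 of the path, as 40 hex characters.
  // Re-indexing the same file then overwrites the old document.
  def docId(path: Path): String =
    MessageDigest.getInstance("SHA-1")
      .digest(path.toString.getBytes("UTF-8"))
      .map(b => "%02x".format(b & 0xff))
      .mkString

  // Build (but do not send) a PUT /<index>/_doc/<id> request.
  def indexRequest(baseUrl: String, index: String, id: String, json: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(s"$baseUrl/$index/_doc/$id"))
      .header("Content-Type", "application/json")
      .PUT(HttpRequest.BodyPublishers.ofString(json))
      .build()

  // Extract text from one file, send it to Elasticsearch, return the HTTP status.
  def indexFile(path: Path,
                extract: Path => String,
                baseUrl: String = "http://localhost:9200",
                index: String = "onedrive"): Int = {
    val req = indexRequest(baseUrl, index, docId(path), toJson(path, extract(path)))
    HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString()).statusCode()
  }
}
```

Deriving the document id from the path is a design choice for this sketch: it keeps repeated indexing runs idempotent, at the cost of leaving a stale document behind when a file is renamed.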
For a better user experience, I can then make an easy GUI (probably a simple web server on localhost, using Play Framework? Maybe Electron? React Native?). Ideally it would run on Windows 10 (unfortunately a requirement, but this is the primary use case and the rationale for building it in the first place) and index often enough that it finds changes made in the last couple of days.
Honestly, Windows 10 File Explorer search works well enough (if you use content:"your phrase") to get by for a quick filename search or minor content search, but it bugs out too often and doesn’t give consistent results. This tool is for when you want to find all docs that mention a topic, and you want it to work reliably and not miss anything. So indexing even once a day is probably sufficient.
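A once-a-day run only needs to look at files that changed since the previous run. The sketch below shows one way that scan could work, using only the JDK and Scala 2.13+ (for `scala.jdk.CollectionConverters`). The object name, the extension list, and in particular `.one` for OneNote files are assumptions for illustration, not something taken from the project.

```scala
import java.nio.file.{Files, Path}
import java.time.Instant
import scala.jdk.CollectionConverters._

object ChangeScan {
  // File types worth handing to Tika (lower case, no dot). Assumed list.
  val extensions: Set[String] = Set("doc", "docx", "pdf", "ppt", "pptx", "one")

  // True if the file name ends in one of the supported extensions (case-insensitive).
  def hasSupportedExtension(p: Path): Boolean = {
    val name = p.getFileName.toString.toLowerCase
    val dot = name.lastIndexOf('.')
    dot >= 0 && extensions.contains(name.substring(dot + 1))
  }

  // Walk the tree under `root` and keep supported files modified after `cutoff`.
  def changedSince(root: Path, cutoff: Instant): List[Path] = {
    val stream = Files.walk(root)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p) && hasSupportedExtension(p))
        .filter(p => Files.getLastModifiedTime(p).toInstant.isAfter(cutoff))
        .toList
    } finally {
      stream.close() // Files.walk holds directory handles until closed
    }
  }
}
```

For a daily run, the cutoff could be `Instant.now().minus(java.time.Duration.ofDays(1))`, or a couple of days to leave some overlap in case a run is missed.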
(see sbt instructions)
docker-compose up -d
sbt compile
And then:
sbt run
Just a quick script to get up and running, as a POC
./tika-test.sh
You should get a result something like this:
https://www.mythicsoft.com/agentransack/information/#features
Hmm…Actually maybe you should…
Got some ideas from this blog and the related project:
TikaTest.scala is based almost entirely on this gist: https://gist.github.com/dportabella/f6bee43ab543798813e0
sbt new scala/scala-seed.g8