Title: Functional Spec for the Exocortex
Tags: specs

kExocortex is a tool for capturing and retaining knowledge, making it
searchable.

This is the initial top-level draft to sort out the high-level vision.

## Summary

The more you learn, the harder it is to recall specific things. Fortunately,
computers are generally pretty good at remembering things. kExocortex is
my attempt at building a knowledge graph for long-term memory.

In addition to the functionality of note-taking systems like
[Dendron](https://dendron.so), I'd like to keep track of what I call artifacts.
An artifact is a source of some knowledge; it might be a PDF copy of a book, an
image, or a URL.

In a perfect world, I would have a local copy of everything with a remote backup.
The remote backup lets me restore the exocortex in the event of data loss.

## Usage sketches

### Research mode

If I am researching a topic, I have a top-level node that contains all the
research I'm working on. I can link artifacts to a note, including URLs. One of
the reasons it makes sense to attach a URL to a document is that I can reuse
it, as well as go back and search URLs based on tags or categories. It would
make sense to tag any artifacts with relevant tags from the note once it is saved.

For example, let's say that I am researching graph databases. In Dendron, this
note lives under `comp.database.graph`. I might find this O'Reilly book on
[Neo4J](https://go.neo4j.com/rs/710-RRC-335/images/Neo4j_Graph_Algorithms.pdf)
that discusses graph algorithms. I might link it here, and I might link it
under a Neo4J-specific node. I would store the PDF in an artifact repository,
adding relevant tags (such as "graph-database", "neo4j", "oreilly") and
categorizing it under books, PDFs, comp/database/graph/neo4j.

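A minimal sketch of what the artifact record for this example might carry, in Python. The `Artifact` class, its field names, and the `comp.database.neo4j` node name are assumptions for illustration; the spec does not define a data structure yet.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Artifact:
    """A source of knowledge: a PDF copy of a book, an image, a URL."""

    title: str
    source_url: str
    tags: set[str] = field(default_factory=set)
    categories: set[str] = field(default_factory=set)
    linked_nodes: set[str] = field(default_factory=set)

    def link(self, node: str) -> None:
        """Attach this artifact to a knowledge graph node by name."""
        self.linked_nodes.add(node)


book = Artifact(
    title="O'Reilly book on Neo4J graph algorithms",
    source_url="https://go.neo4j.com/rs/710-RRC-335/images/Neo4j_Graph_Algorithms.pdf",
    tags={"graph-database", "neo4j", "oreilly"},
    categories={"books", "PDFs", "comp/database/graph/neo4j"},
)
book.link("comp.database.graph")
book.link("comp.database.neo4j")  # hypothetical Neo4J-specific node name
```

Keeping tags and categories as sets means linking the same artifact from a second node never duplicates its metadata.
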
Going forward, if I want to revisit the book, I don't have to find it online
again. It's easily accessible from the artifact repository.

The user interface for the knowledge graph should show a list of associated
artifacts.

Nodes are also timestamped; I am leaning towards keeping track of every time a
page was edited (but probably not the edits). If I know I was researching
graph databases last week, and I log the URLs I was reading as artifacts,
I have a better history of what I was reading.

### Reading from a mobile device

Sometimes I'm on my iPad or phone, and I want to save the link I'm reading. I
should be able to stash documents, URLs, etc., in the artifact repository. This
implies a remote endpoint where I can enter a URL and a tag, and have them
entered into the artifact repository later.

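The "stash now, ingest later" behaviour could look roughly like the sketch below. `Capture` and `CaptureInbox` are hypothetical names, and the in-memory list stands in for whatever storage the remote endpoint would really use.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Capture:
    url: str
    tag: str
    received_at: datetime


class CaptureInbox:
    """Holds captures from remote devices until the repository ingests them."""

    def __init__(self) -> None:
        self._pending: list[Capture] = []

    def stash(self, url: str, tag: str) -> Capture:
        """Record a URL and tag sent from a phone or tablet."""
        capture = Capture(url, tag, datetime.now(timezone.utc))
        self._pending.append(capture)
        return capture

    def drain(self) -> list[Capture]:
        """Return everything pending, oldest first, and empty the inbox."""
        pending, self._pending = self._pending, []
        return pending


inbox = CaptureInbox()
inbox.stash("https://dendron.so", "notetaking")
ingested = inbox.drain()  # what the artifact repository would pick up later
```
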
### Cataloging artifacts

If I've entered a bunch of artifacts, I should be able to see a list of ones
that need categorizing or that aren't attached to a node.

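As a rough illustration, the "needs attention" list could be a single query over the artifact metadata. The table and column names below are assumptions made for this sketch, not the schema of any prototype.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    -- Hypothetical schema, just enough to express the query.
    CREATE TABLE artifacts (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE artifact_categories (artifact_id INTEGER, category TEXT);
    CREATE TABLE node_artifacts (node TEXT, artifact_id INTEGER);

    INSERT INTO artifacts (id, title) VALUES
        (1, 'Neo4J graph algorithms book'),
        (2, 'Link saved from phone'),
        (3, 'Categorized but unattached image');
    INSERT INTO artifact_categories VALUES (1, 'books'), (3, 'images');
    INSERT INTO node_artifacts VALUES ('comp.database.graph', 1);
    """
)

# Artifacts with no category, or not attached to any node.
NEEDS_ATTENTION = """
    SELECT a.id, a.title
    FROM artifacts AS a
    WHERE NOT EXISTS (
              SELECT 1 FROM artifact_categories AS c WHERE c.artifact_id = a.id)
       OR NOT EXISTS (
              SELECT 1 FROM node_artifacts AS n WHERE n.artifact_id = a.id)
    ORDER BY a.id
"""

needs_attention = db.execute(NEEDS_ATTENTION).fetchall()
```

With the sample rows, artifact 1 is both categorized and attached, so only artifacts 2 and 3 are returned.
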
### Autotagging

The interface should search the text of a note to identify any tags. This
brings up an important feature: notes consist of cells, and each cell has a
type. The primary use case is to support markdown formatting and code blocks,
while not touching the code blocks during autotagging. For example,

```
---
node: today.2022.02.21
---

I figured out how to get Cayley running in production.

\```
cayleyd --some flag --other flag
\```
```

The exocortex would see Cayley, identify that as a node, and add the tags for
that node to this one. It might see production and add that as a tag, e.g. for
ops-related stuff.

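The cell model and the autotagging pass described above might be sketched as follows. The `Cell` type, the `KNOWN_TAGS` index, and the specific tags attached to each word are all hypothetical; the point is only that code cells are skipped.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Cell:
    kind: str  # "markdown" or "code"
    text: str


# Hypothetical index: lowercase trigger word -> tags to apply.
KNOWN_TAGS = {
    "cayley": {"cayley", "graph-database"},
    "production": {"ops"},
}


def autotag(cells: list[Cell], known: dict[str, set[str]]) -> set[str]:
    """Collect tags for words found in markdown cells; code cells are skipped."""
    tags: set[str] = set()
    for cell in cells:
        if cell.kind != "markdown":
            continue
        for word in re.findall(r"[A-Za-z0-9_-]+", cell.text.lower()):
            tags |= known.get(word, set())
    return tags


note = [
    Cell("markdown", "I figured out how to get Cayley running in production."),
    Cell("code", "cayleyd --some flag --other flag"),
]
tags = autotag(note, KNOWN_TAGS)
```

A real implementation would presumably look words up against existing nodes in the knowledge graph rather than a hard-coded dictionary.
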
### Fast capture

I should be able to enter a quick note, which would go under a daily node tree.
Something like `quick.2022-02-27.1534`.

This would get autotagged. Quick notes might also get a metadata tag indicating
whether I went back and integrated them into the rest of the knowledge graph.

One way I could use this might be to text or email a note, or to have a quick
capture program on my computer.

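A small sketch of how a quick note could derive its node name and carry the "integrated" flag. `QuickNote` and its fields are assumptions; only the `quick.2022-02-27.1534` name format comes from the text above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class QuickNote:
    text: str
    created: datetime
    integrated: bool = False  # set once merged into the rest of the graph

    @property
    def node(self) -> str:
        """Node name under the daily tree: quick.YYYY-MM-DD.HHMM."""
        return self.created.strftime("quick.%Y-%m-%d.%H%M")


note = QuickNote("look into Cayley", datetime(2022, 2, 27, 15, 34))
name = note.node
```
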
## Requirements & Assumptions

What should it do? What assumptions are being made about it? What's
considered "in scope" and what won't the project try to do?

Does it need to be compatible with any existing solutions or systems?

If it's a daemon, how are you going to manage it?

What are the dependencies that are assumed to be available?

## System Design

### Major components

The system has two logical components: the artifact repository and the
knowledge graph.

#### Artifact repository

There should be, at a minimum, a local artifact repository. It will have its
own tooling and UI for interaction, as well as being linked to the knowledge
graph.

Previous prototypes stored artifact metadata in SQLite, and the contents of the
artifacts in a blob store. The blob store is a content-addressable system for
retrieving arbitrary data. A remote option might use an S3-equivalent like
Minio.

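To make "content-addressable" concrete, here is a minimal local blob store sketch. The choice of SHA-256, the two-character directory fan-out, and the `BlobStore` name are assumptions; the prototypes may have done this differently.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


class BlobStore:
    """Stores blobs on disk under the SHA-256 hex digest of their contents."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _path(self, digest: str) -> Path:
        # Fan out by the first two hex characters to keep directories small.
        return self.root / digest[:2] / digest

    def put(self, data: bytes) -> str:
        """Write the blob if it is new and return its content address."""
        digest = hashlib.sha256(data).hexdigest()
        path = self._path(digest)
        if not path.exists():  # identical content is stored only once
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        """Read a blob back by its content address."""
        return self._path(digest).read_bytes()


store = BlobStore(Path(tempfile.mkdtemp()))
blob_id = store.put(b"%PDF-1.7 example bytes")
```

Because the identifier is derived from the contents, the metadata database only needs to store `blob_id`, and the same identifier can be used as the object key on a remote store.
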
#### Knowledge graph

The knowledge graph stores nodes. The current model stores the graph in SQLite,
using an external file sync (e.g. Syncthing) to sync the databases across
machines.

### Data model

Previous prototypes used separate SQLite databases for the artifact repository
and the knowledge graph.

#### Single SQLite database

The concern with a single SQLite database is that it would be accessed by two
different systems, causing potential locking issues.

This could be solved by a single unified backend server; this is the preferred
approach.

#### Split SQLite databases

The original prototype split the databases for performance reasons. However,
this was not based on any empirical evidence.

The major downside to this is that tags and categories are not shared between
the artifact repository and the knowledge graph. Categories might make sense
for splitting; e.g. an artifact category might be 'PDF' while a node might have
the category 'Research'. However, tags should be shared between both systems.

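For comparison, a single database can keep one `tags` table that both systems reference while categories stay per-system. The schema below is a hypothetical sketch, not the layout of any existing prototype.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    -- One shared tags table; categories stay separate per system.
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE artifacts (id INTEGER PRIMARY KEY, title TEXT NOT NULL,
                            category TEXT);
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE,
                        category TEXT);
    CREATE TABLE artifact_tags (artifact_id INTEGER REFERENCES artifacts(id),
                                tag_id INTEGER REFERENCES tags(id));
    CREATE TABLE node_tags (node_id INTEGER REFERENCES nodes(id),
                            tag_id INTEGER REFERENCES tags(id));

    INSERT INTO tags VALUES (1, 'neo4j');
    INSERT INTO artifacts VALUES (1, 'Neo4J graph algorithms book', 'PDF');
    INSERT INTO nodes VALUES (1, 'comp.database.graph', 'Research');
    INSERT INTO artifact_tags VALUES (1, 1);
    INSERT INTO node_tags VALUES (1, 1);
    """
)

# Everything carrying a given tag, whether artifact or node.
TAGGED = """
    SELECT 'artifact' AS kind, a.title AS label
    FROM artifacts AS a
    JOIN artifact_tags AS atag ON atag.artifact_id = a.id
    JOIN tags AS t ON t.id = atag.tag_id
    WHERE t.name = :tag
    UNION ALL
    SELECT 'node', n.name
    FROM nodes AS n
    JOIN node_tags AS ntag ON ntag.node_id = n.id
    JOIN tags AS t ON t.id = ntag.tag_id
    WHERE t.name = :tag
"""

tagged = db.execute(TAGGED, {"tag": "neo4j"}).fetchall()
```

With split databases, the same lookup would need two queries against two files and a tag list kept in sync by hand.
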
#### PostgreSQL database

Another option is to use PostgreSQL. This brings a heavy ops cost, while
enabling a variety of replication and backup strategies.

### Architectural overview

[![The exocortex architecture](/files/i/t/exo-arch.jpg)](/files/i/exo-arch.jpg)

There is a backend server, `exod`, that will have a gRPC endpoint for
communicating with frontends. The approach allows for a reverse-proxy front end
on a public server over Tailscale for remote devices. `exod` also maintains a
local blob store, the database, and a connection to a remote Minio server for
backing up blobs and retrieving missing blobs.

If a standard HTTP API is needed, it can be added in later. One potential use
for this is for retrieving blobs (e.g. GET /artifacts/blob/id/...).

## Supportability

### Failure scenarios

#### Data corruption

If the data is corrupted locally, a local import from the remote end would
restore it. Alternatively, it may be restored from local backups.

If the data is corrupted remotely, a local export to the remote end would
restore it.

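If blob identifiers are content hashes, corruption on either side can be detected by re-hashing and repaired from the other side. The sketch below assumes SHA-256 identifiers and uses plain dictionaries as stand-ins for the local blob store and the remote bucket; `corrupted` and `restore` are hypothetical helpers.

```python
from __future__ import annotations

import hashlib


def corrupted(store: dict[str, bytes]) -> set[str]:
    """Return ids whose contents no longer hash to their own id."""
    return {
        blob_id
        for blob_id, data in store.items()
        if hashlib.sha256(data).hexdigest() != blob_id
    }


def restore(damaged: dict[str, bytes], good: dict[str, bytes]) -> list[str]:
    """Overwrite corrupted blobs in `damaged` with verified copies from `good`."""
    repaired = []
    for blob_id in sorted(corrupted(damaged)):
        copy = good.get(blob_id)
        # Only trust the other side if its copy still matches the id.
        if copy is not None and hashlib.sha256(copy).hexdigest() == blob_id:
            damaged[blob_id] = copy
            repaired.append(blob_id)
    return repaired


data = b"artifact contents"
blob_id = hashlib.sha256(data).hexdigest()
remote = {blob_id: data}
local = {blob_id: b"bit rot"}  # simulate local corruption
repaired = restore(local, remote)
```

Swapping the arguments, `restore(remote, local)`, covers the case where the remote copy is the damaged one. This only addresses blob contents; the SQLite metadata would still need its own backup and restore path.
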
### Platform support

The main program would primarily run on Linux, but I'd like to be able
to use it on my Windows desktop too.

### Packaging and deployment

## Security

The gRPC endpoint should be authenticated. The system is intended to operate
over localhost or a local network, so the use of TLS is probably untenable.
[minica](https://github.com/jsha/minica) is an option, but then key rotation
needs to be built in.

A possible workaround is to only enable authentication (HTTP basic auth will
suffice) on the reverse proxy, which will also have TLS.

## Project Dependencies

The software should rely on no external sources, except for the software
packages that it uses. This can be mitigated with vendoring.

## Open Issues

* If I track each time a page was edited, does it make sense to roll this up?
  e.g. I don't track edits to the second, but maybe to the hour or day.

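If edit times were rolled up, it could be as simple as truncating each timestamp to its hour or day and deduplicating. `roll_up` is a hypothetical helper, and the sample timestamps are made up for illustration.

```python
from __future__ import annotations

from datetime import datetime


def roll_up(edits: list[datetime], unit: str = "hour") -> list[datetime]:
    """Collapse edit timestamps to one entry per hour or per day."""
    if unit == "hour":
        buckets = {t.replace(minute=0, second=0, microsecond=0) for t in edits}
    elif unit == "day":
        buckets = {
            t.replace(hour=0, minute=0, second=0, microsecond=0) for t in edits
        }
    else:
        raise ValueError(f"unsupported unit: {unit}")
    return sorted(buckets)


edits = [
    datetime(2022, 2, 21, 9, 5, 12),
    datetime(2022, 2, 21, 9, 48, 3),
    datetime(2022, 2, 21, 14, 30, 0),
    datetime(2022, 2, 22, 8, 0, 41),
]
hourly = roll_up(edits, "hour")  # three distinct hours
daily = roll_up(edits, "day")    # two distinct days
```
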
## Milestones

1. Specifications
   a. Write up a spec for the artifact repository data structures.
   b. Write up a spec for the knowledge graph data structures.
2. Core systems
   a. Build the artifact repository server.
   b. Build the backend for the knowledge graph.
   c. Build rough CLI interfaces to both.
3. Build the user interfaces.
   a. Simple note taking.
   b. Artifact upload and searching by tag, content type, title.

## Review History

This may not be applicable, but it's usually nice to have someone else
sanity check this.

Keep a table of who reviewed the doc and when, for in-person reviews. Consider
having at least 1 in-person review.