Title: Functional Spec for the Exocortex
Tags: specs

kExocortex is a tool for capturing and retaining knowledge, making it
searchable.

This is the initial top-level draft to sort out the high-level vision.

## Summary

The more you learn, the harder it is to recall specific things. Fortunately,
computers are generally pretty good at remembering things. kExocortex is
my attempt at building a knowledge graph for long-term memory.

In addition to offering functionality like that of notetaking systems such as
[Dendron](https://dendron.so), I'd like to keep track of what I call artifacts.
An artifact is a source of some knowledge; it might be a PDF copy of a book, an
image, or a URL.

In a perfect world, I would have a local copy of everything with a remote backup.
The remote backup lets me restore the exocortex in the event of data loss.

## Usage sketches

### Research mode

If I am researching a topic, I have a top-level node that contains all the
research I'm working on. I can link artifacts, including URLs, to a note. One
of the reasons it makes sense to attach a URL to a document is that I can
reuse it, as well as go back and search URLs by tag or category. It would
make sense to tag any artifacts with relevant tags from the note once it is
saved.

For example, let's say that I am researching graph databases. In Dendron, this
note lives under `comp.database.graph`. I might find this O'Reilly book on
[Neo4J](https://go.neo4j.com/rs/710-RRC-335/images/Neo4j_Graph_Algorithms.pdf)
that discusses graph algorithms. I might link it here, and I might link it
under a Neo4J-specific node. I would store the PDF in an artifact repository,
adding relevant tags (such as "graph-database", "neo4j", "oreilly") and
categorizing it under books, PDFs, and comp/database/graph/neo4j.

Going forward, if I want to revisit the book, I don't have to find it online
again. It's easily accessible from the artifact repository.

The user interface for the knowledge graph should show a list of associated
artifacts.

Nodes are also timestamped; I am leaning towards keeping track of every time a
page was edited (but probably not the edits themselves). If I know I was
researching graph databases last week, and I log the URLs I was reading as
artifacts, I have a better history of what I was reading.

### Reading from a mobile device

Sometimes I'm on my iPad or phone, and I want to save the link I'm reading. I
should be able to stash documents, URLs, etc., in the artifact repository. This
implies a remote endpoint where I can enter a URL and a tag, and have that
entered into the artifact repository later.

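As a sketch of what that capture endpoint might look like: a small HTTP
handler that accepts a URL and tags and queues them for later ingestion. All
of the names here are hypothetical, not a settled API.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"strings"
)

// pendingArtifact is a capture waiting to be ingested into the
// artifact repository.
type pendingArtifact struct {
	URL  string   `json:"url"`
	Tags []string `json:"tags"`
}

// queue holds captures until a cataloging pass (not shown) drains it.
var queue = make(chan pendingArtifact, 64)

func capture(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "POST only", http.StatusMethodNotAllowed)
		return
	}
	art := pendingArtifact{
		URL:  r.FormValue("url"),
		Tags: strings.Split(r.FormValue("tags"), ","),
	}
	if art.URL == "" {
		http.Error(w, "missing url", http.StatusBadRequest)
		return
	}
	queue <- art
	json.NewEncoder(w).Encode(art)
}

func main() {
	http.HandleFunc("/capture", capture)
	log.Fatal(http.ListenAndServe("localhost:8080", nil))
}
```
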
### Cataloging artifacts

If I've entered a bunch of artifacts, I should be able to see a list of ones
that need categorizing or that aren't attached to a node.

### Autotagging

The interface should search the text of a note to identify any tags. This
brings up an important feature: notes consist of cells, and each cell has a
type. The primary use case is to support markdown formatting and code blocks,
while not touching the code blocks during autotagging. For example,

```
---
node: today.2022.02.21
---

I figured out how to get Cayley running in production.

\```
cayleyd --some flag --other flag
\```
```

The exocortex would see "Cayley", identify that as a node, and add the tags for
that node to this one. It might see "production" and add that as a tag, e.g.
for ops-related stuff.

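A minimal sketch of the cell model and an autotagging pass over it, assuming
tags are matched against a known set; the types and `knownTags` here are
hypothetical stand-ins for whatever the real design uses.

```go
package main

import (
	"fmt"
	"strings"
)

// CellType distinguishes prose from code so autotagging can skip code.
type CellType int

const (
	MarkdownCell CellType = iota
	CodeCell
)

// Cell is one block of a note; a note is a sequence of cells.
type Cell struct {
	Type CellType
	Text string
}

// autotag scans only the markdown cells of a note for known tags,
// leaving code cells untouched.
func autotag(cells []Cell, knownTags []string) []string {
	var found []string
	for _, c := range cells {
		if c.Type == CodeCell {
			continue // never autotag inside code blocks
		}
		text := strings.ToLower(c.Text)
		for _, tag := range knownTags {
			if strings.Contains(text, strings.ToLower(tag)) {
				found = append(found, tag)
			}
		}
	}
	return found
}

func main() {
	note := []Cell{
		{MarkdownCell, "I figured out how to get Cayley running in production."},
		{CodeCell, "cayleyd --some flag --other flag"},
	}
	fmt.Println(autotag(note, []string{"cayley", "production"}))
	// Prints: [cayley production]
}
```
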
### Fast capture

I should be able to enter a quick note, which would go under a daily node tree.
Something like `quick.2022-02-27.1534`.

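Generating that node name is just a timestamp format; a quick sketch:

```go
package main

import (
	"fmt"
	"time"
)

// quickNode returns a daily-tree node name like quick.2022-02-27.1534.
func quickNode(t time.Time) string {
	return t.Format("quick.2006-01-02.1504")
}

func main() {
	fmt.Println(quickNode(time.Now()))
}
```
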
This would get autotagged. Quick notes might also get a metadata tag indicating
whether I went back and integrated them into the rest of the knowledge graph.

One way I could use this might be to text or email a note, or to have a quick
capture program on my computer.

## Requirements & Assumptions

What should it do? What assumptions are being made about it? What's
considered "in scope" and what won't the project try to do?

Does it need to be compatible with any existing solutions or systems?

If it's a daemon, how are you going to manage it?

What are the dependencies that are assumed to be available?

## System Design

### Major components

The system has two logical components: the artifact repository and the
knowledge graph.

#### Artifact repository

There should be, at a minimum, a local artifact repository. It will have its
own tooling and UI for interaction, and it will also be linked to the
knowledge graph.

Previous prototypes stored artifact metadata in SQLite, and the contents of the
artifacts in a blob store. The blob store is a content-addressable system for
retrieving arbitrary data. A remote option might use an S3-equivalent like
Minio.

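Content addressing means a blob's name is derived from its contents, so
identical data always lands in the same place. A minimal sketch of how the
local store might derive paths; the on-disk layout is a hypothetical choice.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
	"path/filepath"
)

// blobPath derives the storage path for a blob from its SHA-256 digest,
// sharding by the first two hex digits to keep directories small.
func blobPath(root string, data []byte) string {
	sum := fmt.Sprintf("%x", sha256.Sum256(data))
	return filepath.Join(root, sum[:2], sum)
}

// putBlob writes a blob into the store; identical content always maps
// to the same path, so writes are naturally deduplicated.
func putBlob(root string, data []byte) (string, error) {
	path := blobPath(root, data)
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return "", err
	}
	return path, os.WriteFile(path, data, 0o644)
}

func main() {
	path, err := putBlob("blobs", []byte("%PDF-1.7 ..."))
	if err != nil {
		panic(err)
	}
	fmt.Println(path)
}
```
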
#### Knowledge graph

The knowledge graph stores nodes. The current model stores the graph in SQLite,
using an external file sync (e.g. syncthing) to sync the databases across
machines.

### Data model

Previous prototypes used separate SQLite databases for the artifact repository
and the knowledge graph.

#### Single SQLite database

The concern with a single SQLite database is that it would be accessed by two
different systems, causing potential locking issues.

This could be solved by having a single unified backend server own the
database; this is the preferred approach.

#### Split SQLite databases

The original prototype split the databases for performance reasons. However,
this wasn't based on any empirical evidence.

The major downside to this is that tags and categories are not shared between
the artifact repository and the knowledge graph. Categories might make sense
to keep split; e.g. an artifact category might be 'PDF' while a node might have
the category 'Research'. However, tags should be shared between both systems.

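To make the shared-tags point concrete, here is a hypothetical unified schema
for the preferred single-database model; none of these table or column names
are settled.

```go
// Package exo sketches a unified SQLite schema: artifacts and nodes get
// their own tables (and their own categories), but both reference a
// single shared tag table. Illustrative only, not a settled design.
package exo

const schema = `
CREATE TABLE tags (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);

CREATE TABLE artifacts (
    id       INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    category TEXT,          -- e.g. 'PDF'
    blob     TEXT NOT NULL  -- content address in the blob store
);

CREATE TABLE nodes (
    id       INTEGER PRIMARY KEY,
    name     TEXT UNIQUE NOT NULL, -- e.g. 'comp.database.graph'
    category TEXT                  -- e.g. 'Research'
);

-- The same tag rows serve both systems.
CREATE TABLE artifact_tags (
    artifact INTEGER REFERENCES artifacts(id),
    tag      INTEGER REFERENCES tags(id)
);

CREATE TABLE node_tags (
    node INTEGER REFERENCES nodes(id),
    tag  INTEGER REFERENCES tags(id)
);
`
```
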
#### PostgreSQL database

Another option is to use PostgreSQL. This brings a heavy ops cost, while
enabling a variety of replication and backup strategies.

### Architectural overview

[![The exocortex architecture](/files/i/t/exo-arch.jpg)](/files/i/exo-arch.jpg)

There is a backend server, `exod`, that will have a gRPC endpoint for
communicating with frontends. This approach allows for a reverse-proxy front
end on a public server, reachable over Tailscale from remote devices. It also
maintains a local blob store, the database, and a connection to a remote minio
server for backing up blobs and retrieving missing blobs.

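The actual gRPC surface would be defined in protobuf; as a rough sketch of the
operations `exod` might expose, here is a hand-written Go interface standing
in for the generated service. Every method here is a hypothetical placeholder.

```go
package exo

import "context"

// Exod sketches the RPC surface the backend might expose to frontends.
// In practice this would be generated from a .proto definition.
type Exod interface {
	// Capture queues a URL or quick note for later cataloging.
	Capture(ctx context.Context, url string, tags []string) error

	// PutArtifact stores a blob and its metadata, returning the
	// blob's content address.
	PutArtifact(ctx context.Context, data []byte, tags []string) (string, error)

	// GetBlob retrieves a blob by content address, falling back to
	// the remote store if it's missing locally.
	GetBlob(ctx context.Context, addr string) ([]byte, error)

	// Search finds nodes and artifacts by tag, category, or full text.
	Search(ctx context.Context, query string) ([]string, error)
}
```
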
If a standard HTTP API is needed, it can be added later. One potential use
for this is retrieving blobs (e.g. `GET /artifacts/blob/id/...`).

## Supportability

### Failure scenarios

#### Data corruption

If the data is corrupted locally, a local import from the remote end would
restore it. Alternatively, it may be restored from local backups.

If the data is corrupted remotely, a local export to the remote end would
restore it.

### Platform support

The main program will primarily run on Linux, but I'd like to be able to use
it on my Windows desktop too.

### Packaging and deployment

## Security

The gRPC endpoint should be authenticated. The system is intended to operate
over localhost or a local network, so obtaining publicly-trusted TLS
certificates is probably untenable. [minica](https://github.com/jsha/minica)
is an option, but then key rotation needs to be built in.

A possible workaround is to only enable authentication (HTTP basic auth will
suffice) on the reverse proxy, which will also have TLS.

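A sketch of basic auth at the reverse proxy, assuming a small Go proxy sitting
in front of `exod`; the credentials and upstream address are placeholders, and
TLS termination is omitted for brevity.

```go
package main

import (
	"crypto/subtle"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// withBasicAuth rejects requests that lack the expected credentials
// before they ever reach the backend.
func withBasicAuth(user, pass string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok ||
			subtle.ConstantTimeCompare([]byte(u), []byte(user)) != 1 ||
			subtle.ConstantTimeCompare([]byte(p), []byte(pass)) != 1 {
			w.Header().Set("WWW-Authenticate", `Basic realm="exocortex"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// exod's local endpoint; the address is a placeholder.
	upstream, _ := url.Parse("http://localhost:8080")
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	// In deployment this listener would also terminate TLS
	// (e.g. via http.ListenAndServeTLS).
	log.Fatal(http.ListenAndServe(":8443", withBasicAuth("user", "s3cret", proxy)))
}
```
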
## Project Dependencies

The software should rely on no external sources except for the software
packages that it uses, and that risk can be mitigated with vendoring.

## Open Issues

* If I track each time a page was edited, does it make sense to roll this up?
  e.g. I don't track edits to the second, but maybe to the hour or day; see
  the sketch below.

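Rolling timestamps up is a one-line truncation; a quick sketch:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	edited := time.Date(2022, 2, 27, 15, 34, 56, 0, time.UTC)
	// Truncate drops the finer-grained components, so edits within the
	// same hour (or day) collapse to a single recorded timestamp.
	fmt.Println(edited.Truncate(time.Hour))      // 2022-02-27 15:00:00 +0000 UTC
	fmt.Println(edited.Truncate(24 * time.Hour)) // 2022-02-27 00:00:00 +0000 UTC
}
```
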
## Milestones

1. Specifications
    a. Write up a spec for the artifact repository data structures.
    b. Write up a spec for the knowledge graph data structures.
2. Core systems
    a. Build the artifact repository server.
    b. Build the backend for the knowledge graph.
    c. Build rough CLI interfaces to both.
3. Build the user interfaces.
    a. Simple note taking.
    b. Artifact upload and searching by tag, content type, and title.

## Review History

This may not be applicable, but it's usually nice to have someone else
sanity-check this.

Keep a table of who reviewed the doc and when, for in-person reviews. Consider
having at least one in-person review.