From 8d6b87ba73a6bb54a156afc84d292fa1879a5e70 Mon Sep 17 00:00:00 2001
From: Kyle Isom
Date: Wed, 23 Feb 2022 22:57:05 -0800
Subject: [PATCH] add spec to additional location.
---
 content/pages/spec.html | 238 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 238 insertions(+)
 create mode 100644 content/pages/spec.html

diff --git a/content/pages/spec.html b/content/pages/spec.html
new file mode 100644
index 0000000..29e5c2d
--- /dev/null
+++ b/content/pages/spec.html
@@ -0,0 +1,238 @@
Title: Functional Spec for the Exocortex
Tags: specs

kExocortex is a tool for capturing and retaining knowledge, making it
searchable.

This is the initial top-level draft to sort out the high-level vision.

## Summary

The more you learn, the harder it is to recall specific things. Fortunately,
computers are generally pretty good at remembering things. kExocortex is
my attempt at building a knowledge graph for long-term memory.

In addition to offering functionality similar to notetaking systems such as
[Dendron](https://dendron.so), I'd like to keep track of what I call artifacts.
An artifact is a source of some knowledge; it might be a PDF copy of a book, an
image, or a URL.

In a perfect world, I would have a local copy of everything with a remote backup.
The remote backup lets me restore the exocortex in the event of data loss.

## Usage sketches

### Research mode

If I am researching a topic, I have a top-level node that contains all the
research I'm working on. I can link artifacts to a note, including URLs. One of
the reasons it makes sense to attach a URL to a document is that I can reuse
it, as well as go back and search URLs by tag or category. It would
make sense to tag any artifacts with relevant tags from the note once it is saved.

For example, let's say that I am researching graph databases. In Dendron, this
note lives under `comp.database.graph`.
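As an aside on these dot-separated names: the hierarchy they encode is easy to
recover mechanically, which the UI can use to show or create parent notes. A
minimal sketch (the `ancestors` helper is illustrative, not part of any
existing tooling):

```python
def ancestors(node_name: str) -> list[str]:
    """Expand a dot-separated node name into its chain of ancestors,
    from the root node down to the node itself."""
    parts = node_name.split(".")
    return [".".join(parts[:i + 1]) for i in range(len(parts))]

print(ancestors("comp.database.graph"))
# ['comp', 'comp.database', 'comp.database.graph']
```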
I might find this O'Reilly book on
[Neo4J](https://go.neo4j.com/rs/710-RRC-335/images/Neo4j_Graph_Algorithms.pdf)
that discusses graph algorithms. I might link it here, and I might link it
under a Neo4J-specific node. I would store the PDF in an artifact repository,
adding relevant tags (such as "graph-database", "neo4j", "oreilly") and
categorizing it under books, PDFs, and comp/database/graph/neo4j.

Going forward, if I want to revisit the book, I don't have to find it online
again; it's easily accessible from the artifact repository.

The user interface for the knowledge graph should show a list of associated
artifacts.

Nodes are also timestamped; I am leaning towards keeping track of every time a
page was edited (but probably not the edits themselves). If I know I was
researching graph databases last week, and I log the URLs I was reading as
artifacts, I have a better history of what I was reading.

### Reading from a mobile device

Sometimes I'm on my iPad or phone, and I want to save the link I'm reading. I
should be able to stash documents, URLs, etc., in the artifact repository. This
implies a remote endpoint where I can enter a URL and a tag and have them
added to the artifact repository later.

### Cataloging artifacts

If I've entered a bunch of artifacts, I should be able to see a list of those
that need categorizing or that aren't attached to a node.

### Autotagging

The interface should search the text of a note to identify any tags. This
brings up an important feature: notes consist of cells, and each cell has a
type. The primary use case is to support markdown formatting and code blocks
while leaving the code blocks untouched during autotagging. For example:

```
---
node: today.2022.02.21
---

I figured out how to get Cayley running in production.

\```
cayleyd --some flag --other flag
\```
```

The exocortex would see Cayley, identify it as a node, and add the tags for
that node to this one.
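The cell model makes this tractable: split a note into cells, then autotag only
the markdown ones. A rough sketch, assuming a simple index from known node
names to their tags (the names and index shape here are illustrative):

```python
import re

FENCE = "`" * 3  # spelled this way so the example nests cleanly here

def split_cells(note: str) -> list[tuple[str, str]]:
    """Split note text into (type, body) cells; fenced blocks become
    'code' cells and everything else is 'markdown'."""
    cells, kind, buf = [], "markdown", []
    for line in note.splitlines():
        if line.strip().startswith(FENCE):  # a fence toggles the cell type
            cells.append((kind, "\n".join(buf)))
            kind, buf = ("code" if kind == "markdown" else "markdown"), []
        else:
            buf.append(line)
    cells.append((kind, "\n".join(buf)))
    return cells

def autotag(note: str, node_index: dict[str, set[str]]) -> set[str]:
    """Collect tags from any known node mentioned in a markdown cell."""
    tags = set()
    for kind, body in split_cells(note):
        if kind != "markdown":
            continue  # code cells are never autotagged
        for name, node_tags in node_index.items():
            if re.search(rf"\b{re.escape(name)}\b", body, re.IGNORECASE):
                tags |= node_tags
    return tags

note = "\n".join([
    "I figured out how to get Cayley running in production.",
    "",
    FENCE,
    "cayleyd --some flag --other flag",
    FENCE,
])
node_index = {"cayley": {"graph-database", "cayley"}}
print(sorted(autotag(note, node_index)))
# ['cayley', 'graph-database']
```

Note that `cayleyd` inside the code cell contributes nothing: the cell is
skipped outright, so fence handling does the work rather than the regex.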
It might see "production" and add that as a tag, e.g. for
ops-related stuff.

### Fast capture

I should be able to enter a quick note, which would go under a daily node tree,
something like `quick.2022-02-27.1534`.

This would get autotagged. Quick notes might also get a metadata tag indicating
whether I went back and integrated them into the rest of the knowledge graph.

One way I could use this might be to text or email a note, or to have a quick
capture program on my computer.

## Requirements & Assumptions

What should it do? What assumptions are being made about it? What's
considered "in scope" and what won't the project try to do?

Does it need to be compatible with any existing solutions or systems?

If it's a daemon, how are you going to manage it?

What are the dependencies that are assumed to be available?

## System Design

### Major components

The system has two logical components: the artifact repository and the
knowledge graph.

#### Artifact repository

There should be, at a minimum, a local artifact repository. It will have its
own tooling and UI for interaction, as well as being linked to the knowledge
graph.

Previous prototypes stored artifact metadata in SQLite and the contents of the
artifacts in a blob store. The blob store is a content-addressable system for
retrieving arbitrary data. A remote option might use an S3 equivalent like
Minio.

#### Knowledge graph

The knowledge graph stores nodes. The current model stores the graph in SQLite,
using an external file sync tool (e.g. Syncthing) to sync the databases across
machines.

### Data model

Previous prototypes used separate SQLite databases for the artifact repository
and the knowledge graph.

#### Single SQLite database

The concern with a single SQLite database is that it would be accessed by two
different systems, causing potential locking issues.
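For context on how far a shared SQLite file can be pushed before that becomes a
real problem: WAL mode plus a busy timeout lets two writers coexist for light
workloads. A rough sketch (the schema and file name are illustrative):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "exocortex.db")

# Two connections stand in for the two systems sharing one database.
repo = sqlite3.connect(path, timeout=5.0)   # artifact repository
graph = sqlite3.connect(path, timeout=5.0)  # knowledge graph

for conn in (repo, graph):
    # WAL lets readers proceed alongside a writer; the busy timeout
    # makes a blocked writer wait instead of failing immediately.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")

repo.execute("CREATE TABLE IF NOT EXISTS tags (name TEXT PRIMARY KEY)")
repo.commit()

# Both systems write to the shared tags table.
repo.execute("INSERT OR IGNORE INTO tags VALUES ('graph-database')")
repo.commit()
graph.execute("INSERT OR IGNORE INTO tags VALUES ('neo4j')")
graph.commit()

print(sorted(name for (name,) in graph.execute("SELECT name FROM tags")))
# ['graph-database', 'neo4j']
```

This only helps on a single host; once remote frontends are in play, a single
backend process owning the database is simpler to reason about.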
This could be solved by a single unified backend server; this is the preferred
approach.

#### Split SQLite databases

The original prototype split the databases for performance reasons. However,
this was not based on any empirical evidence.

The major downside to this is that tags and categories are not shared between
the artifact repository and the knowledge graph. Categories might make sense
for splitting; e.g. an artifact category might be 'PDF' while a node might have
the category 'Research'. However, tags should be shared between both systems.

#### PostgreSQL database

Another option is to use PostgreSQL. This brings a heavy ops cost, while
enabling a variety of replication and backup strategies.

### Architectural overview

[![The exocortex architecture](/files/i/t/exo-arch.jpg)](/files/i/exo-arch.jpg)

There is a backend server, `exod`, that will have a gRPC endpoint for
communicating with frontends. This approach allows for a reverse-proxy front
end on a public server, reachable over Tailscale from remote devices. `exod`
also maintains a local blob store, the database, and a connection to a remote
Minio server for backing up blobs and retrieving missing blobs.

If a standard HTTP API is needed, it can be added later. One potential use
for this is retrieving blobs (e.g. `GET /artifacts/blob/id/...`).

## Supportability

### Failure scenarios

#### Data corruption

If the data is corrupted locally, a local import from the remote end would
restore it. Alternatively, it may be restored from local backups.

If the data is corrupted remotely, a local export to the remote end would
restore it.

### Platform support

The main program would run primarily on Linux, but I'd like to be able
to use it on my Windows desktop too.

### Packaging and deployment

## Security

The gRPC endpoint should be authenticated.
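Whatever the transport ends up being, the credential check itself should be
constant-time. A minimal sketch using a shared token (the token scheme is an
assumption, not a settled design):

```python
import hmac
import secrets

# Provisioned out-of-band to trusted frontends; illustrative only.
SHARED_TOKEN = secrets.token_hex(32)

def authenticated(presented: str) -> bool:
    """Compare tokens in constant time to avoid timing side channels."""
    return hmac.compare_digest(presented, SHARED_TOKEN)

print(authenticated(SHARED_TOKEN), authenticated("wrong-token"))
# True False
```

With gRPC this check would live in a server interceptor; with a reverse-proxy
workaround, the same comparison applies to the basic-auth password.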
The system is intended to operate over localhost or a local network, so
obtaining publicly-trusted TLS certificates is probably untenable.
[minica](https://github.com/jsha/minica) is an option, but then key rotation
needs to be built in.

A possible workaround is to only enable authentication (HTTP basic auth will
suffice) on the reverse proxy, which will also have TLS.

## Project Dependencies

The software should rely on no external sources except for the software
packages that it uses; that dependency can be mitigated with vendoring.

## Open Issues

* If I track each time a page was edited, does it make sense to roll this up?
  E.g. I don't track edits to the second, but maybe to the hour or day.

## Milestones

1. Specifications
   a. Write up a spec for the artifact repository data structures.
   b. Write up a spec for the knowledge graph data structures.
2. Core systems
   a. Build the artifact repository server.
   b. Build the backend for the knowledge graph.
   c. Build rough CLI interfaces to both.
3. Build the user interfaces.
   a. Simple note taking.
   b. Artifact upload and searching by tag, content type, and title.

## Review History

This may not be applicable, but it's usually nice to have someone else
sanity-check this.

Keep a table of who reviewed the doc and when, for in-person reviews. Consider
having at least one in-person review.