Title: Functional Spec for the Exocortex
Tags: specs

kExocortex is a tool for capturing and retaining knowledge, making it searchable. This is the initial top-level draft to sort out the high-level vision.

## Summary

The more you learn, the harder it is to recall specific things. Fortunately, computers are generally pretty good at remembering things. kExocortex is my attempt at building a knowledge graph for long-term memory.

In addition to offering functionality like that of note-taking systems such as [Dendron](https://dendron.so), I'd like to keep track of what I call artifacts. An artifact is a source of some knowledge; it might be a PDF copy of a book, an image, or a URL. In a perfect world, I would have a local copy of everything with a remote backup. The remote backup lets me restore the exocortex in the event of data loss.

## Usage sketches

### Research mode

If I am researching a topic, I have a top-level node that contains all the research I'm working on. I can link artifacts to a note, including URLs. One reason it makes sense to attach a URL to a note is that I can reuse it, and later search URLs by tag or category. It would make sense to tag any artifacts with relevant tags from the note once it is saved.

For example, let's say that I am researching graph databases. In Dendron, this note lives under `comp.database.graph`. I might find this O'Reilly book on [Neo4J](https://go.neo4j.com/rs/710-RRC-335/images/Neo4j_Graph_Algorithms.pdf) that discusses graph algorithms. I might link it here, and I might link it under a Neo4J-specific node. I would store the PDF in an artifact repository, adding relevant tags (such as "graph-database", "neo4j", "oreilly") and categorizing it under books, PDFs, and comp/database/graph/neo4j. Going forward, if I want to revisit the book, I don't have to find it online again; it's easily accessible from the artifact repository.

The user interface for the knowledge graph should show a list of associated artifacts. Nodes are also timestamped; I am leaning towards keeping track of every time a page was edited (but probably not the edits themselves). If I know I was researching graph databases last week, and I log the URLs I was reading as artifacts, I have a better history of what I was reading.

### Reading from a mobile device

Sometimes I'm on my iPad or phone, and I want to save the link I'm reading. I should be able to stash documents, URLs, etc., in the artifact repository. This implies a remote endpoint where I can enter a URL and a tag, and have them entered into the artifact repository later.

### Cataloging artifacts

If I've entered a bunch of artifacts, I should be able to see a list of those that need categorizing or that aren't attached to a node.

### Autotagging

The interface should search the text of a note to identify any tags. This brings up an important feature: notes consist of cells, and each cell has a type. The primary use case is to support markdown formatting and code blocks, while leaving code blocks untouched during autotagging. For example,

```
---
node: today.2022.02.21
---

I figured out how to get Cayley running in production.

\```
cayleyd --some flag --other flag
\```
```

The exocortex would see "Cayley", identify that as a node, and add the tags for that node to this one. It might see "production" and add that as a tag, e.g. for ops-related stuff.
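To make the cell model concrete, here is a minimal sketch of what that pass might look like in Go. The `Cell` and `Note` types and the `knownTags` index are hypothetical stand-ins; the real data structures are deferred to the specs listed under Milestones.

```go
// Hypothetical sketch of the cell-aware autotagging pass. Cell, Note,
// and the known-tag index are placeholders for the real data model.
package main

import (
	"fmt"
	"strings"
)

type CellType int

const (
	CellMarkdown CellType = iota
	CellCode
)

type Cell struct {
	Type CellType
	Body string
}

type Note struct {
	Node  string
	Cells []Cell
}

// knownTags maps a lowercased term found in note text to the tags it
// implies; e.g. a node name maps to that node's tags.
var knownTags = map[string][]string{
	"cayley":     {"graph-database", "cayley"},
	"production": {"ops"},
}

// Autotag scans only markdown cells, skipping code cells entirely,
// and collects the tags implied by any known terms in the text.
func Autotag(n Note) []string {
	seen := map[string]bool{}
	var tags []string
	for _, c := range n.Cells {
		if c.Type != CellMarkdown {
			continue // never tag based on code blocks
		}
		for _, word := range strings.Fields(c.Body) {
			word = strings.ToLower(strings.Trim(word, ".,;:!?\"'()"))
			for _, t := range knownTags[word] {
				if !seen[t] {
					seen[t] = true
					tags = append(tags, t)
				}
			}
		}
	}
	return tags
}

func main() {
	note := Note{
		Node: "today.2022.02.21",
		Cells: []Cell{
			{CellMarkdown, "I figured out how to get Cayley running in production."},
			{CellCode, "cayleyd --some flag --other flag"},
		},
	}
	fmt.Println(Autotag(note)) // [graph-database cayley ops]
}
```

The important property is that the tagger iterates over cells and skips anything that isn't markdown, so code blocks can never contribute tags.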
### Fast capture

I should be able to enter a quick note, which would go under a daily node tree — something like `quick.2022-02-27.1534`. This would get autotagged. Quick notes might also get a metadata tag indicating whether I went back and integrated them into the rest of the knowledge graph. One way I could use this might be to text or email a note, or to have a quick-capture program on my computer.

## Requirements & Assumptions

What should it do? What assumptions are being made about it? What's considered "in scope" and what won't the project try to do? Does it need to be compatible with any existing solutions or systems? If it's a daemon, how are you going to manage it? What are the dependencies that are assumed to be available?

## System Design

### Major components

The system has two logical components: the artifact repository and the knowledge graph.

#### Artifact repository

There should be, at a minimum, a local artifact repository. It will have its own tooling and UI for interaction, as well as being linked to the knowledge graph.

Previous prototypes stored artifact metadata in SQLite, and the contents of the artifacts in a blob store. The blob store is a content-addressable system for storing and retrieving arbitrary data. A remote option might use an S3-equivalent like Minio.

#### Knowledge graph

The knowledge graph stores nodes. The current model stores the graph in SQLite, using an external file sync (e.g. syncthing) to sync the databases across machines.

### Data model

Previous prototypes used separate SQLite databases for the artifact repository and the knowledge graph.

#### Single SQLite database

The concern with a single SQLite database is that it would be accessed by two different systems, causing potential locking issues. This could be solved by a single unified backend server; this is the preferred approach.

#### Split SQLite databases

The original prototype split the databases for performance reasons; however, this was not based on any empirical evidence. The major downside is that tags and categories are not shared between the artifact repository and the knowledge graph. Categories might make sense to split; e.g. an artifact category might be "PDF" while a node might have the category "Research". However, tags should be shared between both systems.

#### PostgreSQL database

Another option is to use PostgreSQL. This brings a heavy ops cost, while enabling a variety of replication and backup strategies.

### Architectural overview

[![The exocortex architecture](/files/i/t/exo-arch.jpg)](/files/i/exo-arch.jpg)

There is a backend server, `exod`, that will have a gRPC endpoint for communicating with frontends. This approach allows for a reverse-proxied front end on a public server over Tailscale for remote devices. `exod` also maintains a local blob store, the database, and a connection to a remote Minio server for backing up blobs and retrieving missing blobs.

If a standard HTTP API is needed, it can be added later. One potential use for this is retrieving blobs (e.g. `GET /artifacts/blob/id/...`).
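Because the blob store is content-addressable, the retrieval endpoint falls out almost for free: the ID in the URL is the hash of the blob's contents, which is also its storage key. The following Go sketch assumes an on-disk layout (`/var/lib/exod/blobs`, sharded by hash prefix) and a `Put` helper that are illustrative only, not settled design.

```go
// Hypothetical sketch of an HTTP blob-retrieval endpoint for exod.
// Blobs are content-addressed: each is stored under the hex SHA-256
// of its contents, so the ID in the URL is also the storage key.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"log"
	"net/http"
	"os"
	"path/filepath"
	"strings"
)

const blobRoot = "/var/lib/exod/blobs" // assumed location of the local blob store

// Put writes a blob into the store and returns its content address.
func Put(data []byte) (string, error) {
	sum := sha256.Sum256(data)
	id := hex.EncodeToString(sum[:])
	// Shard by the first two hex digits to keep directories small.
	dir := filepath.Join(blobRoot, id[:2])
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	return id, os.WriteFile(filepath.Join(dir, id), data, 0o644)
}

// serveBlob handles GET /artifacts/blob/id/<hex-sha256>.
func serveBlob(w http.ResponseWriter, r *http.Request) {
	id := strings.TrimPrefix(r.URL.Path, "/artifacts/blob/id/")
	// Reject anything that isn't a well-formed hash before touching disk.
	if _, err := hex.DecodeString(id); err != nil || len(id) != sha256.Size*2 {
		http.Error(w, "malformed blob ID", http.StatusBadRequest)
		return
	}
	http.ServeFile(w, r, filepath.Join(blobRoot, id[:2], id))
}

func main() {
	// Store an example blob, then serve the store over HTTP.
	id, err := Put([]byte("hello, exocortex"))
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("stored example blob at /artifacts/blob/id/%s", id)

	http.HandleFunc("/artifacts/blob/id/", serveBlob)
	log.Fatal(http.ListenAndServe("localhost:8080", nil)) // localhost-only; see Security
}
```

Content addressing also gives integrity checking for free: after fetching a missing blob from the remote Minio backup, `exod` can hash what it received and compare against the ID it asked for.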
## Supportability

### Failure scenarios

#### Data corruption

If the data is corrupted locally, a local import from the remote end would restore it; alternatively, it may be restored from local backups. If the data is corrupted remotely, a local export to the remote end would restore it.

### Platform support

The main program would primarily run on Linux, but I'd like to be able to use it on my Windows desktop too.

### Packaging and deployment

## Security

The gRPC endpoint should be authenticated. The system is intended to operate over localhost or a local network, so ordinary CA-issued TLS certificates are probably untenable. [minica](https://github.com/jsha/minica) is an option for running a private CA, but then key rotation needs to be built in. A possible workaround is to only enable authentication (HTTP basic auth will suffice) on the reverse proxy, which will also have TLS.

## Project Dependencies

The software should rely on no external sources, except for the software packages that it uses; that dependency can be mitigated with vendoring.

## Open Issues

* If I track each time a page was edited, does it make sense to roll this up? E.g., don't track edits to the second, but maybe to the hour or day.

## Milestones

1. Specifications
   a. Write up a spec for the artifact repository data structures.
   b. Write up a spec for the knowledge graph data structures.
2. Core systems
   a. Build the artifact repository server.
   b. Build the backend for the knowledge graph.
   c. Build rough CLI interfaces to both.
3. Build the user interfaces.
   a. Simple note taking.
   b. Artifact upload and searching by tag, content type, and title.

## Review History

This may not be applicable, but it's usually nice to have someone else sanity-check this. Keep a table of who reviewed the doc and when, for in-person reviews. Consider having at least one in-person review.