Title: Artifact data scheme

An artifact can be thought of as a source of knowledge. For example, if I am keeping notes on a research paper, the artifact is that paper.

At a minimum, an artifact should have a standard header with metadata. It should store some authorship information (e.g. citation information). An artifact will have snapshots, which indicate current content either at a specific point in time or in a specific format. A website might have snapshots for different times it was scraped; a book might have snapshots for different editions of the book or for different formats (e.g. PDF and EPUB).

The header

This datatype will be common to all objects, including structures later in the knowledge graph itself. In many cases, such as a blob, the tags will be empty as they will be inherited implicitly through the parent Artifact type.

type Header struct {
	ID         string
	Type       ObjectType
	Created    int64
	Modified   int64
	Categories []string
	Tags       []string
	Meta       Metadata
}

Metadata

Metadata is a mapping of keys to values. These values might not be strings; consider the case where we'd want to track filesize or something like that. Each value is therefore stored as string contents alongside a type that says how to interpret them. Metadata is defined as

type Value struct {
	Contents string
	Type     string
}

type Metadata map[string]Value
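As a rough sketch of how a typed value might be read back out, the following stores a file size as a string and decodes it on demand. The `Int64` helper and the type name "int64" are assumptions made for illustration; the spec doesn't fix a vocabulary for `Value.Type`.

```go
package main

import (
	"fmt"
	"strconv"
)

// Value and Metadata mirror the definitions above.
type Value struct {
	Contents string
	Type     string
}

type Metadata map[string]Value

// Int64 is a hypothetical helper: it decodes the entry stored under key,
// provided the entry exists and is tagged with the assumed type name "int64".
func (m Metadata) Int64(key string) (int64, error) {
	v, ok := m[key]
	if !ok {
		return 0, fmt.Errorf("metadata: no entry for %q", key)
	}
	if v.Type != "int64" {
		return 0, fmt.Errorf("metadata: %q has type %q, not int64", key, v.Type)
	}
	return strconv.ParseInt(v.Contents, 10, 64)
}

func main() {
	meta := Metadata{
		"filesize": {Contents: "4608", Type: "int64"},
		"language": {Contents: "en", Type: "string"},
	}

	size, err := meta.Int64("filesize")
	fmt.Println(size, err)

	// Asking for a string entry as an integer reports a type mismatch.
	_, err = meta.Int64("language")
	fmt.Println(err)
}
```

Keeping the contents as a string means the map stays trivially serialisable; the cost is that every consumer has to agree on the type names.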

Blobs

With these two types defined, we can define a blob. A Blob has a header, a content type, and some data.

type Blob struct {
	ID     string
	Format string // MIME type
	Body   io.ReadCloser
}

Citations

A citation can be thought of as the bibliographic information for the artifact. Nothing in this should be strictly required. A citation occurs at the artifact level, but it could also occur at the snapshot level. This is like having base information (such as author and publisher) that applies to all of the snapshots, while the snapshot might override attributes like the specific edition.

Publishers

A starting point is the publisher type.

type Publisher struct {
	Header  Header
	Name    string
	Address string
}

This is simple enough; the publisher really just needs a name and address, and it gets a Header whose Metadata can be used to inject any additional fields.

Citations defined

Putting some of these pieces together:

type Citation struct {
	Header    Header
	DOI       string
	Title     string
	Year      int
	Published time.Time
	Authors   []string
	Publisher *Publisher
	Source    string
	Abstract  string
}

We are strictly interested in containing the fields; the presentation layer can handle linking to the DOI, for example.
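The idea of a snapshot-level citation overriding the artifact-level one could be sketched as a merge. This is only one possible approach: `Merge` is a hypothetical helper, the `Citation` here is trimmed to a few fields, and treating zero values as "not set" is an assumption (it can't express overriding a field with an empty value).

```go
package main

import "fmt"

// Citation is trimmed to a few fields from the definition above.
type Citation struct {
	DOI     string
	Title   string
	Year    int
	Authors []string
	Source  string
}

// Merge is a hypothetical helper: it returns a copy of base in which every
// non-zero field of override replaces the base value. Either argument may
// be nil. The Authors slice is shared with its source, not deep-copied.
func Merge(base, override *Citation) Citation {
	var out Citation
	if base != nil {
		out = *base
	}
	if override == nil {
		return out
	}
	if override.DOI != "" {
		out.DOI = override.DOI
	}
	if override.Title != "" {
		out.Title = override.Title
	}
	if override.Year != 0 {
		out.Year = override.Year
	}
	if len(override.Authors) > 0 {
		out.Authors = override.Authors
	}
	if override.Source != "" {
		out.Source = override.Source
	}
	return out
}

func main() {
	// Artifact-level citation: applies to every snapshot.
	base := &Citation{Title: "An Example Book", Year: 2001, Authors: []string{"A. Author"}}
	// Snapshot-level citation: only the year differs for this edition.
	edition := &Citation{Year: 2010}

	merged := Merge(base, edition)
	fmt.Printf("%s (%d), %v\n", merged.Title, merged.Year, merged.Authors)
}
```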

Snapshots

So we have the basic pieces in place now to define a snapshot:

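type Snapshot struct {
	Header     Header
	ArtifactID string    // ID of the artifact this snapshot belongs to
	Stored     time.Time // when the snapshot was stored
	DateTime   time.Time // the time the snapshot represents
	Citation   *Citation
	Blobs      map[MIME]*Blob // keyed by the blob's MIME type
}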

It needs to know the ID of the artifact that it belongs to. We track the time it was stored; this could be a Unix timestamp, but for consistency with the other fields, we'll keep it as a standard time. DateTime is the time the snapshot represents; it can be built from the year in the citation if needed, or it can be more precise.

One design choice here that could be questioned is the use of the MIME type as the key for each blob, which allows only one blob per MIME type in a snapshot. The example I can think of here is the no-bs-guide-to-math-and-physics, which has a pair of PDFs: one for reading on a tablet, and one for printing. I think that could be solved by using a MIME type parameter like "application/pdf; format=screen".
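To check that the parameter idea holds up, here is a minimal sketch using the standard library's `mime.ParseMediaType`, which splits a media type from its parameters. The `blobFormat` helper and the "format" parameter name are assumptions taken from the example above, not a registered MIME parameter.

```go
package main

import (
	"fmt"
	"mime"
)

// blobFormat is a hypothetical helper: it splits a blob key such as
// "application/pdf; format=screen" into the bare media type and the value
// of the assumed "format" parameter (empty if the parameter is absent).
func blobFormat(key string) (string, string, error) {
	mediaType, params, err := mime.ParseMediaType(key)
	if err != nil {
		return "", "", err
	}
	return mediaType, params["format"], nil
}

func main() {
	keys := []string{
		"application/pdf; format=screen",
		"application/pdf; format=print",
		"application/epub+zip",
	}
	for _, key := range keys {
		mediaType, format, err := blobFormat(key)
		fmt.Printf("%q -> type=%q format=%q err=%v\n", key, mediaType, format, err)
	}
}
```

Because the full string (parameter included) is the map key, both PDFs can sit side by side in Blobs, while the presentation layer can still recover the bare media type.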

The artifact type

Combining these together, we have the artifact type itself.

type Artifact struct {
	ID      string
	Type    ArtifactType
	Latest  time.Time // latest snapshot
	History map[time.Time]*Snapshot
}

The Type is an enumeration that can be added to; a few known types to start with are

  • Unknown
  • Custom
  • Article
  • Book
  • URL
  • Paper
  • Video
  • Image

If the type is "Custom", the Header should have a metadata entry for "ArtifactType" that names the custom type.
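The enumeration might look something like the following in Go. The constant names, the `String` method, and the `DisplayType` helper are all hypothetical; the only things taken from the spec are the list of type names and the "ArtifactType" metadata key.

```go
package main

import "fmt"

// Value and Metadata mirror the definitions above.
type Value struct {
	Contents string
	Type     string
}

type Metadata map[string]Value

// ArtifactType enumerates the known kinds of artifact.
type ArtifactType int

const (
	ArtifactTypeUnknown ArtifactType = iota
	ArtifactTypeCustom
	ArtifactTypeArticle
	ArtifactTypeBook
	ArtifactTypeURL
	ArtifactTypePaper
	ArtifactTypeVideo
	ArtifactTypeImage
)

// artifactTypeNames is indexed by ArtifactType; new types are appended
// here and to the const block in the same order.
var artifactTypeNames = [...]string{
	"Unknown", "Custom", "Article", "Book", "URL", "Paper", "Video", "Image",
}

// String returns the type's name, falling back to "Unknown" for values
// outside the known range.
func (t ArtifactType) String() string {
	if t < 0 || int(t) >= len(artifactTypeNames) {
		return "Unknown"
	}
	return artifactTypeNames[t]
}

// DisplayType is a hypothetical helper: for Custom artifacts it returns the
// "ArtifactType" metadata entry if one is present, otherwise the enum name.
func DisplayType(t ArtifactType, meta Metadata) string {
	if t == ArtifactTypeCustom {
		if v, ok := meta["ArtifactType"]; ok && v.Contents != "" {
			return v.Contents
		}
	}
	return t.String()
}

func main() {
	meta := Metadata{"ArtifactType": {Contents: "Podcast", Type: "string"}}
	fmt.Println(DisplayType(ArtifactTypeCustom, meta))
	fmt.Println(DisplayType(ArtifactTypeBook, nil))
}
```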

Latest should refer to the most recent Snapshot.DateTime in its collection of snapshots.

Timestamps

All timestamps should be suitable for referencing dates prior to epoch 0; they should be encoded in UTC and locally converted. For example, if the client is uploading a new artifact, it should convert its local time to UTC, then send this to the server. We can enforce this in Go using the Local timezone, but it's not foolproof.

Next steps

  • Define protobufs.
  • Define a SQL schema.