add stream post, artifact data schema.

2022-03-03 22:02:50 -08:00 · 2022-03-03 22:02:50 -08:00 · 4d52b2734b
parent 7c6af747ca
commit 4d52b2734b
3 changed files with 145 additions and 4 deletions
--- a/content/pages/specs/artifacts.md
+++ b/content/pages/specs/artifacts.md
@@ -0,0 +1,141 @@
+Title: Artifact data schema
+
+An artifact can be thought of as a source of knowledge. For example, if I am
+keeping notes on a research paper, the artifact is that paper.
+
+At a minimum, an artifact should have a standard header with metadata. It
+should store some authorship information (e.g. citation information). An
+artifact will have snapshots, which indicate current content either at a
+specific point in time or in a specific format. A website might have snapshots
+for different times it was scraped; a book might have snapshots for different
+editions of the book or for different formats (e.g. PDF and EPUB).
+
+## The header
+
+This datatype will be common to all objects, including structures later in the
+knowledge graph itself. In many cases, such as a blob, the tags will be empty
+as they will be inherited implicitly through the parent Artifact type.
+
+```go
+type Header struct {
+	ID         string
+	Type       ObjectType
+	Created    int64
+	Modified   int64
+	Categories []string
+	Tags       []string
+	Meta       Metadata
+}
+```
+
+## Metadata
+
+Metadata is a mapping of keys to values. The values are not always strings
+(consider tracking a file size), so each value stores its contents as a
+string alongside a type name. Metadata is defined as:
+
+```go
+type Value struct {
+	Contents string
+	Type     string
+}
+
+type Metadata map[string]Value
+```
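A minimal sketch, not taken from the commit, of how a typed value such as a file size could be read back out of `Metadata`. The `Int64` helper and the type names `"int"` and `"string"` are assumptions made for illustration; the spec does not define the set of allowed `Type` strings.

```go
package main

import (
	"fmt"
	"strconv"
)

// Value and Metadata mirror the definitions in the spec.
type Value struct {
	Contents string
	Type     string
}

type Metadata map[string]Value

// Int64 is a hypothetical helper: it returns the entry for key as an
// integer, provided the entry exists and is tagged with the assumed
// "int" type name.
func (m Metadata) Int64(key string) (int64, error) {
	v, ok := m[key]
	if !ok {
		return 0, fmt.Errorf("metadata: no entry for %q", key)
	}
	if v.Type != "int" {
		return 0, fmt.Errorf("metadata: %q has type %q, not int", key, v.Type)
	}
	return strconv.ParseInt(v.Contents, 10, 64)
}

func main() {
	meta := Metadata{
		"filesize": {Contents: "1048576", Type: "int"},
		"language": {Contents: "en", Type: "string"},
	}
	size, err := meta.Int64("filesize")
	fmt.Println(size, err) // 1048576 <nil>
}
```

Keeping `Contents` as a string means the map serializes the same way regardless of the value's type; the cost is that every typed read has to parse and can fail.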
+
+## Blobs
+
+With these two types defined, we can define a blob. A Blob has a header, a content type, and some data.
+
+```go
+type Blob struct {
+	ID     string
+	Format string // MIME type
+	Body   io.ReadCloser
+}
+```
+
+## Citations
+A citation can be thought of as the bibliographic information for the artifact. Nothing in this should be strictly required. A citation occurs at the artifact level, but it could also occur at the snapshot level. This is like having base information (such as author and publisher) that applies to all of the snapshots, while the snapshot might override attributes like the specific edition.
+
+### Publishers
+A starting point is the publisher type.
+
+```go
+type Publisher struct {
+	Header  Header
+	Name    string
+	Address string
+}
+```
+
+This is simple enough; the publisher really just needs a name and address, and it gets a Header whose Metadata can be used to inject any additional fields.
+
+### Citations defined
+Putting some of these pieces together:
+```go
+type Citation struct {
+	Header    Header
+	DOI       string
+	Title     string
+	Year      int
+	Published time.Time
+	Authors   []string
+	Publisher *Publisher
+	Source    string
+	Abstract  string
+}
+```
+
+This type is only concerned with holding the fields; the presentation layer can handle linking to the DOI, for example.
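A minimal sketch, not taken from the commit, of the "base information plus snapshot override" idea described under Citations. The `merge` helper is hypothetical, `Citation` is trimmed to three of the spec's fields, and treating a zero value as "not set" is an assumption of this sketch.

```go
package main

import "fmt"

// Citation is trimmed to a few of the spec's fields for this sketch.
type Citation struct {
	Title   string
	Year    int
	Authors []string
}

// merge is a hypothetical helper: it returns a copy of base in which every
// field that override sets (to a non-zero value) replaces the base value.
func merge(base Citation, override *Citation) Citation {
	if override == nil {
		return base
	}
	out := base
	if override.Title != "" {
		out.Title = override.Title
	}
	if override.Year != 0 {
		out.Year = override.Year
	}
	if len(override.Authors) > 0 {
		out.Authors = override.Authors
	}
	return out
}

func main() {
	// Artifact-level citation: applies to every snapshot.
	base := Citation{Title: "An Example Book", Year: 2010, Authors: []string{"A. Author"}}
	// Snapshot-level citation: only the year differs for a later edition.
	secondEdition := &Citation{Year: 2014}
	fmt.Printf("%+v\n", merge(base, secondEdition))
}
```

One limitation of the zero-value convention is that a snapshot cannot explicitly clear a field set at the artifact level; that would need pointer fields or an explicit "unset" marker.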
+
+## Snapshots
+So we have the basic pieces in place now to define a snapshot:
+
+```go
+type Snapshot struct {
+	Header     Header
+	ArtifactID string
+	Stored     time.Time
+	DateTime   time.Time
+	Citation   *Citation
+	Blobs      map[MIME]*Blob
+}
+```
+
+It needs to know the ID of the artifact that it belongs to. We track the time it was stored, which could be a Unix timestamp, but for consistency with the other fields we'll keep it as a standard time. DateTime is the time used for the snapshot; it can be built off the year from the citation if needed, or it could be more refined.
+
+One design choice here that could be questioned is the use of the MIME type as the key for the blobs. The example I can think of here is the [[no-bs-guide-to-math-and-physics]], which has a pair of PDFs: one for reading on a tablet, and one for printing. I think that could be solved by using a [[MIME types|media type]] parameter like "application/pdf; format=screen".
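A minimal sketch, not taken from the commit, of how a media type with a `format` parameter could be taken apart using Go's standard `mime.ParseMediaType`. The `splitMediaType` helper name is hypothetical, and the `format` parameter is the one proposed above rather than a registered parameter.

```go
package main

import (
	"fmt"
	"mime"
)

// splitMediaType is a hypothetical helper: it returns the base media type
// and the value of the "format" parameter (empty if the parameter is absent).
func splitMediaType(s string) (string, string, error) {
	base, params, err := mime.ParseMediaType(s)
	if err != nil {
		return "", "", err
	}
	return base, params["format"], nil
}

func main() {
	keys := []string{
		"application/pdf; format=screen",
		"application/pdf; format=print",
		"application/epub+zip",
	}
	for _, k := range keys {
		base, format, err := splitMediaType(k)
		fmt.Printf("%q -> base=%q format=%q err=%v\n", k, base, format, err)
	}
}
```

With this approach both PDFs can live in the same blob map under distinct keys, while code that only cares about the base type can still group them as `application/pdf`.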
+
+## The artifact type
+Combining these together, we have the artifact type itself.
+
+```go
+type Artifact struct {
+	ID      string
+	Type    ArtifactType
+	Latest  time.Time // latest snapshot
+	History map[time.Time]*Snapshot
+}
+```
+
+The Type is an enumeration that can be added to; a few known types to start with are:
+* Unknown
+* Custom
+* Article
+* Book
+* URL
+* Paper
+* Video
+* Image
+
+If the type is "Custom", the Header should have a metadata entry for "ArtifactType" that defines the custom type.
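A minimal sketch, not taken from the commit, of how the enumeration could be written in Go. The constant names and the `String` method are assumptions; only the list of type names comes from the spec.

```go
package main

import "fmt"

// ArtifactType enumerates the known kinds of artifact. Unknown is the zero
// value, so an uninitialized ArtifactType reads as Unknown.
type ArtifactType int

const (
	ArtifactTypeUnknown ArtifactType = iota
	ArtifactTypeCustom
	ArtifactTypeArticle
	ArtifactTypeBook
	ArtifactTypeURL
	ArtifactTypePaper
	ArtifactTypeVideo
	ArtifactTypeImage
)

// artifactTypeNames is indexed by ArtifactType; new types are appended to
// both the constant block and this table.
var artifactTypeNames = [...]string{
	"Unknown", "Custom", "Article", "Book", "URL", "Paper", "Video", "Image",
}

// String returns the name of the type; out-of-range values map to "Unknown".
func (t ArtifactType) String() string {
	if t < 0 || int(t) >= len(artifactTypeNames) {
		return "Unknown"
	}
	return artifactTypeNames[t]
}

func main() {
	fmt.Println(ArtifactTypeBook, ArtifactType(42)) // Book Unknown
}
```

Appending new types at the end keeps the numeric values of existing types stable, which matters once they are persisted or sent over the wire.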
+
+The Latest should refer to the most recent Snapshot.DateTime in its collection of snapshots.
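A minimal sketch, not taken from the commit, of computing Latest from the snapshot history. It assumes History is keyed by each snapshot's DateTime (the spec does not say what the key is); `latest` and `demoLatestYear` are hypothetical names, and `Snapshot` is trimmed to the one field needed here.

```go
package main

import (
	"fmt"
	"time"
)

// Snapshot is trimmed to the one field this sketch needs.
type Snapshot struct {
	DateTime time.Time
}

// latest returns the most recent DateTime among the snapshots, or the zero
// time if there are none.
func latest(history map[time.Time]*Snapshot) time.Time {
	var newest time.Time
	first := true
	for _, snap := range history {
		if first || snap.DateTime.After(newest) {
			newest, first = snap.DateTime, false
		}
	}
	return newest
}

// demoLatestYear builds a two-edition history and returns the year of the
// latest snapshot.
func demoLatestYear() int {
	history := map[time.Time]*Snapshot{}
	for _, year := range []int{2010, 2019} {
		t := time.Date(year, time.January, 1, 0, 0, 0, 0, time.UTC)
		history[t] = &Snapshot{DateTime: t}
	}
	return latest(history).Year()
}

func main() {
	fmt.Println(demoLatestYear()) // 2019
}
```

Because `time.Time` values compare with `==` when used as map keys, two values for the same instant in different locations are different keys; normalizing keys with `UTC()` (and `Round(0)` to strip any monotonic reading) before insertion avoids that.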
+
+## Timestamps
+All timestamps should be suitable for referencing dates prior to the Unix epoch; they should be encoded in UTC and locally converted. For example, if the client is uploading a new artifact, it should convert its local time to UTC, then send this to the server. We can enforce this in Go using the `Local` timezone, but it's not foolproof.
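A minimal sketch, not taken from the commit, of the two timestamp requirements: converting a client's local time to UTC before sending it, and representing a date before 1970. The helper names are hypothetical, and the fixed -08:00 zone stands in for whatever the client's local zone happens to be.

```go
package main

import (
	"fmt"
	"time"
)

// toWire is a hypothetical client-side helper: it normalizes a timestamp to
// UTC before it is sent to the server. The instant is unchanged; only the
// attached location differs.
func toWire(t time.Time) time.Time {
	return t.UTC()
}

// demoWire converts a fixed local time (UTC-8) and formats it as RFC 3339.
func demoWire() string {
	pst := time.FixedZone("PST", -8*60*60)
	local := time.Date(2022, time.March, 3, 22, 2, 50, 0, pst)
	return toWire(local).Format(time.RFC3339)
}

// preEpochSeconds shows that time.Time handles dates before 1970: the Unix
// timestamp is simply negative.
func preEpochSeconds() int64 {
	return time.Date(1687, time.July, 5, 0, 0, 0, 0, time.UTC).Unix()
}

func main() {
	fmt.Println(demoWire())            // 2022-03-04T06:02:50Z
	fmt.Println(preEpochSeconds() < 0) // true
}
```

If timestamps are stored as `int64` seconds (as in the Header's Created and Modified fields), the storage layer needs a signed column type so that negative values survive a round trip.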
+
+## Next steps
+* Define protobufs.
+* Define a SQL schema.
--- a/content/pages/specs/index.md
+++ b/content/pages/specs/index.md
@@ -2,3 +2,4 @@ Title: Design docs
 Tags: specs

 * [Top-level functional spec](/specs/functional.html)
+* [Artifact data spec](/specs/artifacts.html)
--- a/content/posts/stream-0x03.md
+++ b/content/posts/stream-0x03.md
@@ -1,12 +1,11 @@
 Title: Stream 0x03
 Slug: stream-0x03
-Date: 2022-03-01
-Modified: 2022-03-01
+Date: 2022-03-03
+Modified: 2022-03-03 22:03 PST
 Category: 
 Tags: stream
 Authors: kyle
 Summary: Stream notes for tonight's stream.
-Status: draft

 Tonight's work focused on adding in a mirror between local storage and
 a remote S3 (Minio) instance. The basic flow goes something like:
@@ -21,7 +20,7 @@ a remote S3 (Minio) instance. The basic flow goes something like:
 8. Launch a goroutine that waits as long as the backoff says before
   putting the item back on the work queue.

-Next stream (2021/03/03), we'll look at designing the artifact, maybe
+Next stream (2022/03/08), we'll look at designing the artifact, maybe
 working on some of the protobuf definitions.

 ### References