From 4d52b2734b3f00e652b37846eee7f329f2baa5b8 Mon Sep 17 00:00:00 2001 From: Kyle Isom Date: Thu, 3 Mar 2022 22:02:50 -0800 Subject: [PATCH] add stream post, artifact data schema. --- content/pages/specs/artifacts.md | 141 +++++++++++++++++++++++++++++++ content/pages/specs/index.md | 1 + content/posts/stream-0x03.md | 7 +- 3 files changed, 145 insertions(+), 4 deletions(-) create mode 100644 content/pages/specs/artifacts.md diff --git a/content/pages/specs/artifacts.md b/content/pages/specs/artifacts.md new file mode 100644 index 0000000..2fa278a --- /dev/null +++ b/content/pages/specs/artifacts.md @@ -0,0 +1,141 @@ +Title: Artifact data scheme + +An artifact can be thought of as a source of knowledge. For example, if I am +keeping notes on a research paper, the artifact is that paper. + +At a minimum, an artifact should have a standard header with metadata. It +should store some authorship information (e.g. citation information). An +artifact will have snapshots, which indicate current content either at a +specific point in time or in a specific format. A website might have snapshots +for different times it was scraped; a book might have snapshots for different +editions of the book or for different formats (e.g. PDF and EPUB). + +## The header + +This datatype will be common to all objects, including structures later in the +knowledge graph itself. In many cases, such as a blob, the tags will be empty +as they will be inherited implicitly through the parent Artifact type. + +```go +type Header struct { + ID string + Type ObjectType + Created int64 + Modified int64 + Categories []string + Tags []string + Meta Metadata +} +``` + +## Metadata + +Metadata is a mapping of keys to values. These values might not be integers; +consider the case where we'd want to track filesize or something like that. 
+Metadata is defined as + +```go +type Value struct { + Contents string + Type string +} + +type Metadata map[string]Value +``` + +## Blobs + +With these two types defined, we can define a blob. A Blob has a header, a content type, and some data. + +```go +type Blob struct { + ID string + Format string // MIME type + Body io.ReadCloser +} +``` + +## Citations +A citation can be thought of as the bibliographic information for the artifact. Nothing in this should be strictly required. A citation occurs at the artifact level, but it could also occur at the snapshot level. This is like having base information (such as author and publisher) that applies to all of the snapshots, while the snapshot might override attributes like the specific edition. + +### Publishers +A starting point is the publisher type. + +```go +type Publisher struct { + Header Header + Name string + Address string +} +``` + +This is simple enough; the publisher really just needs a name and address, and it gets a Header whose Metadata can be used to inject any additional fields. + +### Citations defined +Putting some of these pieces together: +```go +type Citation struct { + Header Header + DOI string + Title string + Year int + Published time.Time + Authors []string + Publisher *Publisher + Source string + Abstract string +} +``` + +We are strictly interested in containing the fields; the presentation layer can handle linking to the DOI, for example. + +## Snapshots +So we have the basic pieces in place now to define a snapshot: + +```go +type Snapshot struct { + Header Header + ArtifactID string + Stored time.Time + DateTime time.Time + Citation *Citation + Blobs map[MIME]*Blob +} +``` + +It needs to know the ID of the artifact that it belongs to. We track the time it was stored --- which could be a unix timestamp, but for consistency with the other fields, we'll keep it as a standard time. 
DateTime is the time used for the snapshot; it can be built from the year in the citation if needed, or it could be more refined. + +One design choice here that could be questioned is the use of the MIME type associated with the blob. The example I can think of here is the [[no-bs-guide-to-math-and-physics]], which has a pair of PDFs: one for reading on a tablet, and one for printing. I think that could be solved by using a [[MIME types|media type]] parameter like "application/pdf; format=screen". + +## The artifact type +Combining these together, we have the artifact type itself. + +```go +type Artifact struct { + ID string + Type ArtifactType + Latest time.Time // latest snapshot + History map[time.Time]*Snapshot +} +``` + +The Type is an enumeration that can be added to; a few known types to start with are +* Unknown +* Custom +* Article +* Book +* URL +* Paper +* Video +* Image + +If the type is "Custom", the Header should have a metadata entry for "ArtifactType" to custom define it. + +The Latest should refer to the most recent Snapshot.DateTime in its collection of snapshots. + +## Timestamps +All timestamps should be suitable for referencing dates prior to epoch 0; they should be encoded in UTC and locally converted. For example, if the client is uploading a new artifact, it should convert its local time to UTC, then send this to the server. We can enforce this in Go using the `Local` timezone, but it's not foolproof. + +## Next steps +* Define protobufs. +* Define a SQL schema. 
diff --git a/content/pages/specs/index.md b/content/pages/specs/index.md index 7653760..c1bf718 100644 --- a/content/pages/specs/index.md +++ b/content/pages/specs/index.md @@ -2,3 +2,4 @@ Title: Design docs Tags: specs * [Top-level functional spec](/specs/functional.html) +* [Artifact data spec](/specs/artifacts.html) diff --git a/content/posts/stream-0x03.md b/content/posts/stream-0x03.md index bb0f24f..3a36603 100644 --- a/content/posts/stream-0x03.md +++ b/content/posts/stream-0x03.md @@ -1,12 +1,11 @@ Title: Stream 0x03 Slug: stream-0x03 -Date: 2022-03-01 -Modified: 2022-03-01 +Date: 2022-03-03 +Modified: 2022-03-03 22:03 PST Category: Tags: stream Authors: kyle Summary: Stream notes for tonight's stream. -Status: draft Tonight's work focused on adding in a mirror between local storage and a remote S3 (Minio) instance. The basic flow goes something like: @@ -21,7 +20,7 @@ a remote S3 (Minio) instance. The basic flow goes something like: 8. Launch a goroutine that waits as long as the backoff says before putting the item back on the work queue. -Next stream (2021/03/03), we'll look at designing the artifact, maybe +Next stream (2022/03/08), we'll look at designing the artifact, maybe working on some of the protobuf definitions. ### References