add stream post, artifact data schema.

Kyle Isom 2022-03-03 22:02:50 -08:00
parent 7c6af747ca
commit 4d52b2734b
3 changed files with 145 additions and 4 deletions

@@ -0,0 +1,141 @@
Title: Artifact data schema
An artifact can be thought of as a source of knowledge. For example, if I am
keeping notes on a research paper, the artifact is that paper.
At a minimum, an artifact should have a standard header with metadata. It
should store some authorship information (e.g. citation information). An
artifact will have snapshots, which indicate current content either at a
specific point in time or in a specific format. A website might have snapshots
for different times it was scraped; a book might have snapshots for different
editions of the book or for different formats (e.g. PDF and EPUB).
## The header
This datatype will be common to all objects, including structures later in the
knowledge graph itself. In many cases, such as a blob, the tags will be empty
as they will be inherited implicitly through the parent Artifact type.
```go
type Header struct {
	ID         string
	Type       ObjectType
	Created    int64
	Modified   int64
	Categories []string
	Tags       []string
	Meta       Metadata
}
```
## Metadata
Metadata is a mapping of keys to values. These values might not be strings;
consider the case where we'd want to track file size or something like that,
so each value carries a type alongside its contents.
Metadata is defined as
```go
type Value struct {
	Contents string
	Type     string
}

type Metadata map[string]Value
```
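As a minimal sketch of how a typed value might be read back out, the helper below decodes a metadata entry as an integer. The `Int64` method and the `"int"` and `"string"` type tags are assumptions made for illustration; the schema above doesn't fix the tag names.

```go
package main

import (
	"fmt"
	"strconv"
)

// Value and Metadata mirror the definitions above.
type Value struct {
	Contents string
	Type     string
}

type Metadata map[string]Value

// Int64 looks up key and decodes its contents as a base-10 integer.
// The "int" type tag is hypothetical; the schema does not define tag names.
func (m Metadata) Int64(key string) (int64, error) {
	v, ok := m[key]
	if !ok {
		return 0, fmt.Errorf("metadata: no entry for %q", key)
	}
	if v.Type != "int" {
		return 0, fmt.Errorf("metadata: %q has type %q, not int", key, v.Type)
	}
	return strconv.ParseInt(v.Contents, 10, 64)
}

func main() {
	meta := Metadata{
		"filesize": {Contents: "1048576", Type: "int"},
		"language": {Contents: "en", Type: "string"},
	}

	size, err := meta.Int64("filesize")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("filesize in bytes:", size)
}
```

Keeping the contents as a string means the storage layer never needs to know about the value types; only the caller that asks for an integer has to parse one.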
## Blobs
With these two types defined, we can define a blob. A Blob has a header, a content type, and some data.
```go
type Blob struct {
	ID     string
	Format string // MIME type
	Body   io.ReadCloser
}
```
## Citations
A citation can be thought of as the bibliographic information for the artifact. None of its fields should be strictly required. A citation normally lives at the artifact level, but it can also appear at the snapshot level: the artifact-level citation carries base information (such as author and publisher) that applies to all of the snapshots, while a snapshot's citation can override attributes like the specific edition.
### Publishers
A starting point is the publisher type.
```go
type Publisher struct {
	Header  Header
	Name    string
	Address string
}
```
This is simple enough; the publisher really just needs a name and address, and it gets a Header whose Metadata can be used to inject any additional fields.
### Citations defined
Putting some of these pieces together:
```go
type Citation struct {
	Header    Header
	DOI       string
	Title     string
	Year      int
	Published time.Time
	Authors   []string
	Publisher *Publisher
	Source    string
	Abstract  string
}
```
We are only interested in storing the fields here; the presentation layer can handle things like linking to the DOI.
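The override behaviour for snapshot-level citations could be sketched roughly as follows. This is only one possible approach: the `Merge` function is hypothetical, the `Citation` here is trimmed to a few fields, and treating a zero value as "not set" is an assumption of the sketch rather than part of the schema. The book data is made up.

```go
package main

import "fmt"

// Citation is a trimmed-down version of the Citation type above, keeping
// only the fields needed to illustrate the override.
type Citation struct {
	DOI     string
	Title   string
	Year    int
	Authors []string
	Source  string
}

// Merge returns a copy of base with every non-zero field of override applied
// on top. A zero value means "not set" (an assumption of this sketch). The
// Authors slice is shared with its source rather than deep-copied.
func Merge(base, override *Citation) Citation {
	var out Citation
	if base != nil {
		out = *base
	}
	if override == nil {
		return out
	}
	if override.DOI != "" {
		out.DOI = override.DOI
	}
	if override.Title != "" {
		out.Title = override.Title
	}
	if override.Year != 0 {
		out.Year = override.Year
	}
	if len(override.Authors) > 0 {
		out.Authors = override.Authors
	}
	if override.Source != "" {
		out.Source = override.Source
	}
	return out
}

func main() {
	// Artifact-level citation: applies to every snapshot.
	base := &Citation{Title: "Example Book", Year: 2001, Authors: []string{"A. Author"}}

	// Snapshot-level citation: a later edition only needs to override the year.
	secondEdition := &Citation{Year: 2005}

	fmt.Printf("%+v\n", Merge(base, secondEdition))
}
```

One consequence of the zero-value convention is that a snapshot can't blank out a field that the artifact-level citation sets; that would need pointers or an explicit "unset" marker.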
## Snapshots
So we have the basic pieces in place now to define a snapshot:
```go
type Snapshot struct {
	Header     Header
	ArtifactID string
	Stored     time.Time
	DateTime   time.Time
	Citation   *Citation
	Blobs      map[MIME]*Blob
}
```
It needs to know the ID of the artifact that it belongs to. We track the time it was stored, which could be a Unix timestamp, but for consistency with the other fields we'll keep it as a standard time. DateTime is the time used for the snapshot; it can be built from the year in the citation if needed, or it could be more precise.
One design choice here that could be questioned is the use of the MIME type as the key for each blob. The example I can think of here is the [[no-bs-guide-to-math-and-physics]], which has a pair of PDFs: one for reading on a tablet, and one for printing. I think that could be solved by using a [[MIME types|media type]] parameter like "application/pdf; format=screen".
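To check that the parameter idea is workable, here is a small sketch using the standard library's `mime.ParseMediaType` and `mime.FormatMediaType`. The `blobVariant` helper is hypothetical, and `format` is just the parameter name from the example above, not a registered parameter for `application/pdf`.

```go
package main

import (
	"fmt"
	"mime"
)

// blobVariant splits a media type string into its base type and the value of
// its "format" parameter. The format is empty if the parameter is absent.
func blobVariant(mediaType string) (string, string, error) {
	base, params, err := mime.ParseMediaType(mediaType)
	if err != nil {
		return "", "", err
	}
	return base, params["format"], nil
}

func main() {
	base, format, err := blobVariant("application/pdf; format=screen")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("base type:", base)
	fmt.Println("format:", format)

	// Going the other way: build the key for the print variant.
	printKey := mime.FormatMediaType("application/pdf", map[string]string{"format": "print"})
	fmt.Println("print key:", printKey)
}
```

With this, the two PDFs get distinct keys in the blob map, while anything that only cares about the base type can still treat both as `application/pdf`.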
## The artifact type
Combining these together, we have the artifact type itself.
```go
type Artifact struct {
	ID      string
	Type    ArtifactType
	Latest  time.Time // latest snapshot
	History map[time.Time]*Snapshot
}
```
The Type is an enumeration that can be added to; a few known types to start with are
* Unknown
* Custom
* Article
* Book
* URL
* Paper
* Video
* Image
If the type is "Custom", the Header should have a metadata entry for "ArtifactType" that defines the custom type.
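A sketch of how the enumeration and the "Custom" lookup might fit together. The constant names, the `TypeName` helper, and the "Podcast" value are all made up for illustration; only the list of types and the "ArtifactType" metadata key come from the text above.

```go
package main

import "fmt"

// ArtifactType enumerates the known artifact types listed above.
type ArtifactType int

const (
	ArtifactTypeUnknown ArtifactType = iota
	ArtifactTypeCustom
	ArtifactTypeArticle
	ArtifactTypeBook
	ArtifactTypeURL
	ArtifactTypePaper
	ArtifactTypeVideo
	ArtifactTypeImage
)

// artifactTypeNames is indexed by ArtifactType, so its order must match the
// constants above.
var artifactTypeNames = []string{
	"Unknown", "Custom", "Article", "Book", "URL", "Paper", "Video", "Image",
}

// String returns the display name, falling back to "Unknown" for values
// outside the known range.
func (t ArtifactType) String() string {
	if t < 0 || int(t) >= len(artifactTypeNames) {
		return artifactTypeNames[ArtifactTypeUnknown]
	}
	return artifactTypeNames[t]
}

// Value and Metadata mirror the definitions from the Metadata section.
type Value struct {
	Contents string
	Type     string
}

type Metadata map[string]Value

// TypeName resolves the name to show for an artifact. For Custom artifacts it
// prefers the "ArtifactType" metadata entry when one is present.
func TypeName(t ArtifactType, meta Metadata) string {
	if t == ArtifactTypeCustom {
		if v, ok := meta["ArtifactType"]; ok && v.Contents != "" {
			return v.Contents
		}
	}
	return t.String()
}

func main() {
	meta := Metadata{"ArtifactType": {Contents: "Podcast", Type: "string"}}
	fmt.Println(TypeName(ArtifactTypeBook, nil))
	fmt.Println(TypeName(ArtifactTypeCustom, meta))
}
```

Putting Unknown at zero means an artifact whose type was never set reads as Unknown rather than as some real type.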
Latest should refer to the most recent Snapshot.DateTime in its collection of snapshots.
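One way to keep Latest in step with History is to update it whenever a snapshot is added. The `AddSnapshot` method and the `date` helper below are hypothetical, and the types are trimmed to the relevant fields. The key is normalized with `UTC()` and `Round(0)` because Go compares `time.Time` map keys on location and monotonic clock reading as well as the instant.

```go
package main

import (
	"fmt"
	"time"
)

// Snapshot and Artifact are trimmed to the fields this sketch needs.
type Snapshot struct {
	DateTime time.Time
}

type Artifact struct {
	ID      string
	Latest  time.Time
	History map[time.Time]*Snapshot
}

// AddSnapshot stores snap under its DateTime and advances Latest when the new
// snapshot is more recent than anything seen so far.
func (a *Artifact) AddSnapshot(snap *Snapshot) {
	if a.History == nil {
		a.History = make(map[time.Time]*Snapshot)
	}
	// Normalize the key so equal instants always map to the same entry.
	key := snap.DateTime.UTC().Round(0)
	a.History[key] = snap
	if a.Latest.IsZero() || key.After(a.Latest) {
		a.Latest = key
	}
}

// date builds a UTC midnight timestamp for the examples.
func date(year, month, day int) time.Time {
	return time.Date(year, time.Month(month), day, 0, 0, 0, 0, time.UTC)
}

func main() {
	a := &Artifact{ID: "example-artifact"}
	a.AddSnapshot(&Snapshot{DateTime: date(2015, 6, 1)})
	a.AddSnapshot(&Snapshot{DateTime: date(2019, 1, 15)})
	a.AddSnapshot(&Snapshot{DateTime: date(2017, 3, 2)}) // older; Latest stays put

	fmt.Println("latest:", a.Latest.Format("2006-01-02"))
	fmt.Println("snapshots:", len(a.History))
}
```

Because Latest is also a key into History, looking up the newest snapshot is a single map access.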
## Timestamps
All timestamps should be suitable for referencing dates prior to epoch 0; they should be encoded in UTC, with any conversion to or from local time done on the client. For example, if the client is uploading a new artifact, it should convert its local time to UTC, then send this to the server. We can enforce this in Go using the `Local` timezone, but it's not foolproof.
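A rough sketch of the round trip, assuming timestamps go over the wire as Unix seconds (this matches the int64 fields in the Header, but the wire format isn't decided yet). All of the function names here are hypothetical. The location check is one reading of why enforcement isn't foolproof: a time in the Local zone fails it even on a machine whose local zone is UTC.

```go
package main

import (
	"fmt"
	"time"
)

// toWire converts a client-side time to UTC and encodes it as Unix seconds.
// Dates before 1970 come out negative.
func toWire(t time.Time) int64 {
	return t.UTC().Unix()
}

// fromWire decodes Unix seconds into a time in the UTC location.
func fromWire(sec int64) time.Time {
	return time.Unix(sec, 0).UTC()
}

// isUTC reports whether t carries the UTC location. It compares locations,
// not offsets, so a Local time is rejected even if the local offset is zero.
func isUTC(t time.Time) bool {
	return t.Location() == time.UTC
}

// utcDate and inZone are small helpers for building example values.
func utcDate(year, month, day int) time.Time {
	return time.Date(year, time.Month(month), day, 0, 0, 0, 0, time.UTC)
}

func inZone(t time.Time, name string, offsetSeconds int) time.Time {
	return t.In(time.FixedZone(name, offsetSeconds))
}

func main() {
	// A publication date well before the epoch still round-trips.
	old := utcDate(1905, 6, 30)
	fmt.Println("wire value:", toWire(old))
	fmt.Println("decoded:", fromWire(toWire(old)))

	// The same instant seen from a client at UTC-8 encodes identically.
	now := utcDate(2022, 3, 4)
	local := inZone(now, "PST", -8*60*60)
	fmt.Println("client time is UTC:", isUTC(local))
	fmt.Println("decoded time is UTC:", isUTC(fromWire(toWire(local))))
	fmt.Println("same wire value:", toWire(local) == toWire(now))
}
```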
## Next steps
* Define protobufs.
* Define a SQL schema.

@@ -2,3 +2,4 @@ Title: Design docs
Tags: specs
* [Top-level functional spec](/specs/functional.html)
+* [Artifact data spec](/specs/artifacts.html)

@@ -1,12 +1,11 @@
Title: Stream 0x03
Slug: stream-0x03
-Date: 2022-03-01
-Modified: 2022-03-01
+Date: 2022-03-03
+Modified: 2022-03-03 22:03 PST
Category:
Tags: stream
Authors: kyle
Summary: Stream notes for tonight's stream.
-Status: draft
Tonight's work focused on adding in a mirror between local storage and
a remote S3 (Minio) instance. The basic flow goes something like:
@@ -21,7 +20,7 @@ a remote S3 (Minio) instance. The basic flow goes something like:
8. Launch a goroutine that waits as long as the backoff says before
putting the item back on the work queue.
-Next stream (2021/03/03), we'll look at designing the artifact, maybe
+Next stream (2021/03/08), we'll look at designing the artifact, maybe
working on some of the protobuf definitions.
working on some of the protobuf definitions.
### References