Skip to main content

A Theory of Information Systems

This chapter documents a theory of information systems, upon which Ampersand generates information systems.

The purpose of an information system is to make data meaningful to its users. Users have their own tasks and responsibilities and may work from different locations and on different moments. This collective use by multiple users serves a purpose which we will loosely call "the business". To preserve meaning, the business maintains semantic constraints on the data, amidst of all changes that are going on around them.

Information systems are typically used by actors (both users and computers) who are distributed and work with data all the time. As a consequence, the data in a system changes continually. In practice, actors "talk to" the system through an ingress mechanism, which connects each user to the right interface(s). The ingress function is provided by the deployment platform, so it is beyond the scope of this paper. Also, every interface runs independent of other interfaces, meaning that each interface can be stopped, (re)started and substituted without disrupting the other interfaces.

Every system contains a dataset, which represents the state of the system. Every interface produces and consumes events that may change the state of the system. This state is represented in a persistent store, aka the database. Events that the system detects may cause the state to change.

Anatomy of an information system

To keep our theory technology independent, datasets are assumed to contain triples. This makes the theory valid for any kind of database that triples can represent, such as SQL databases, object-oriented databases, graph databases, triple stores, and other no-SQL databases. The system's semantics is represented as constraints, which most database management systems refer to as integrity rules. The purpose of integrity rules is to keep the constraints they represent satisfied.

An information system is a combination of dataset, schema, and functionality. Each is defined in a separate section.

Datasets

A dataset D\mathscr{D} describes a set of structured data, which is typically stored persistently in a database of some kind. The notation DS\mathscr{D}_{\mathscr{S}} refers to the dataset of a particular information system S\mathscr{S}. The purpose of a dataset is to describe the data of a system at one point in time. Before defining datasets, let us first define the constituent notions of atom, concept, relation, and triple.

Atoms serve as data elements. Atoms are values without internal structure of interest, meant to represent atomic data elements (e.g. dates, strings, numbers, etc.) in a database. From a business perspective, atoms represent concrete items of the world, such as Peter, 1, or the king of France. By convention throughout the remainder of this paper, variables aa, bb, and cc represent atoms. All atoms are taken from an infinite set called A\mathbb{A}.

Concepts are names that group atoms of the same type. All concepts are taken from an infinite set C\mathbb{C}. C\mathbb{C} and A\mathbb{A} are disjoint. For example, a developer might choose to classify Peter and Melissa as Person, and 074238991 as a TelephoneNumber. In this example, Person and TelephoneNumber are concepts. In the sequel, variables AA, BB, CC, DD will represent concepts.

The relation inst:A×C\text{\it inst}:\mathbb{A}\times\mathbb{C} is the instance relation between atoms and concepts. The expression a inst Ca\ \text{\it inst}\ C means that atom aa is an instance of concept CC. This relation is used in the type system, in which inst\text{\it inst} assigns a concept to every atom in the dataset.

Relations serve to organize and store data, to allow a developer to represent facts. In this paper, variables rr, ss, and dd represent relations. All relations are taken from an infinite set R\mathbb{R}. R\mathbb{R} is disjoint from C\mathbb{C} and A\mathbb{A}. Every relation rr has a name, a source concept, and a target concept. The notation r=nA,Br=\text{\it n}_{\langle{A},{B}\rangle} denotes that relation rr has name nn, source concept AA, and target concept BB.

Triples serve to represent data. A triple is an element of A×R×A\mathbb{A}\times\mathbb{R}\times\mathbb{A}.

A dataset D\mathscr{D} is a tuple T,inst\langle{\mathcal{T}},{\text{\it inst}}\rangle that satisfies:

a,nA,B,bT: a inst A  b inst B\forall\langle{a},{\text{\it n}_{\langle{A},{B}\rangle}},{b}\rangle\in\mathcal{T}:\ a\ \text{\it inst}\ A\ \wedge\ b\ \text{\it inst}\ B

This equation defines what it means for a triple to be well-typed. For example, Peter,phonePerson,TelephoneNumber,074238991\langle{\text{{\tt\small Peter}}},{\text{\it phone}_{\langle{\tt Person},{\tt TelephoneNumber}\rangle}},{\text{{\tt\small 074238991}}}\rangle is a triple. The equation says that Peter is an instance of Person and 074238991 is an instance of TelephoneNumber. In practice, users can say that Peter has telephone number 074238991. So, the "thing" that Peter refers to (which is Peter) has 074238991 as a telephone number. This "meaning from practice" has no consequences in the formal world. Users are entirely free to attach a practical meaning to a triple.

The notations TD\mathcal{T}_{\mathscr{D}} and instD\text{\it inst}_{\mathscr{D}} are used to disambiguate T\mathcal{T} and inst\text{\it inst} when necessary. Every dataset is an element of an infinite set called D\mathbb{D}. To save writing in the sequel, the notation a r ba\ r\ b means that a,r,bT\langle{a},{r},{b}\rangle\in\mathcal{T}.

A relation rr can serve as a container of pairs, as defined by the function popr:DP{A×A}\text{\it pop}_r:\mathbb{D}\rightarrow\cal{P}\{\mathbb{A}\times\mathbb{A}\}. It defines a set of pairs, also known as the population of rr:

popr(D) = {a,b a,r,bTD}\text{\it pop}_{r}(\mathscr{D})\ =\ \{ \langle{a},{b}\rangle\mid\ \langle{a},{r},{b}\rangle\in\mathcal{T}_{\mathscr{D}}\}

So, for every dataset D\mathscr{D}:

a,bpopnA,B(D)  a inst A  b inst B\langle{a},{b}\rangle\in\text{\it pop}_{\text{\it n}_{\langle{A},{B}\rangle}}(\mathscr{D})\ \Rightarrow\ a\ \text{\it inst}\ A\ \wedge\ b\ \text{\it inst}\ B

For a developer, this means that the type of an atom depends only on the relation in which it resides; not on the actual population of the database. This allows for static typing, which has well established advantages for the software engineering process [@HanenbergKRTS14; @Petersen2014].

Schemas

To design an information system, a developer defines a schema in which she defines concepts, relations, and rules. Together, the concepts, relations, and rules capture the semantics of an information system [@Spivak2012]. Each rule in a schema serves to constrain the dataset, to ensure its semantic integrity. Every rule is an element of an infinite set called U\mathbb{U}, which is disjoint from A\mathbb{A}, C\mathbb{C}, R\mathbb{R}, and D\mathbb{D}. In this paper, variables uu and vv represent rules.

A type system in Ampersand [@vdWoude2011] identifies certain rules that are provably false. It also ensures that all rules can be interpreted within the set of relations in the schema. Due to this type system, the Ampersand compiler only generates code for scripts without type errors. As a result, for every rule uu in the schema, there is a predicate satu(D)\text{\it sat}_{u}(\mathbb{D}) that says whether dataset D\mathscr{D} satisfies rule uu. A discussion of the type system is out of scope because this paper only makes use of the predicate satu(D)\text{\it sat}_{u}(\mathbb{D}).

Rules serve as invariants, i.e. constraints that the data in the database must satisfy at all times. At times in which the dataset does not satisfy a rule, that rule is said to be broken. In those cases, something or someone must take action to fix the situation and restore invariance of that rule.

A schema Z\mathscr{Z} is a triple C,R,U\langle{\mathcal{C}},{\mathcal{R}},{\mathcal{U}}\rangle, in which C\mathcal{C} is a finite set of concepts, R\mathcal{R} is a finite set of relations, and U\mathcal{U} is a finite set of rules. The notation ZS\mathscr{Z}_{\mathscr{S}} refers to the schema of information system S\mathscr{S}. Disambiguation sometimes requires to write CZ\mathcal{C}_{\mathscr{Z}}, RZ\mathcal{R}_{\mathscr{Z}}, and UZ\mathcal{U}_{\mathscr{Z}} rather than C\mathcal{C}, R\mathcal{R}, and U\mathcal{U} respectively.

In order to do static type checking, all concepts and relations must be "known" in the schema: nA,BRAC  BCa,nA,B,bTnA,BR\begin{aligned} \text{\it n}_{\langle{A},{B}\rangle}\in\mathcal{R}&\Rightarrow&A\in\mathcal{C}\ \wedge\ B\in\mathcal{C}\\ \langle{a},{\text{\it n}_{\langle{A},{B}\rangle}},{b}\rangle\in\mathcal{T}&\Rightarrow&\text{\it n}_{\langle{A},{B}\rangle}\in\mathcal{R} \end{aligned}

We say that a triple a,r,b\langle{a},{r},{b}\rangle befits schema C,R,U\langle{\mathcal{C}},{\mathcal{R}},{\mathcal{U}}\rangle if it satisfies these requirements.

A special kind of rule is used for specialization. The rule ABA\preceq B (pronounce: AA is a BB) states that any instance of AA is an instance of BB as well.

satAB(D)  a:a inst Aa inst B\text{\it sat}_{A\preceq B}(\mathscr{D})\ \Leftrightarrow\ \forall a: a\ \text{\it inst}\ A\rightarrow a\ \text{\it inst}\ B

This is called specialization, but it is also known as generalization or subtyping. You need specialization to allow statements such as: "An employee is a person" or "A human is a mammal". The specialization rules in a schema form a partial order \preceq, meaning that \preceq is reflexive, transitive, and antisymmetric. Specialization rules are special for two reasons: Firstly, the system keeps ABA\preceq B satisfied automatically by changing inst\text{\it inst} whenever needed. This means that a system that contains rule ABA\preceq B can never be brought into a state in which satAB\text{\it sat}_{A\preceq B} is False. This allows for certain run-time optimizations, which are beyond the scope of this paper. A second property that sets specialization rules apart, is that they play a role in the Ampersand compiler. For example, code that determines equality between atoms only works for atoms that are an instance of the same concept. To ensure that this does not result in problems at run-time, users are required to write all specialization rules explicitly.

As a result, if an atom is an instance of concept AA and ABA\preceq B, this atom has all properties that atoms of type BB have.

Information Systems

This section defines the notion of information system. As before, a suffix disambiguates the elements of this definition when needed.

Definition 1 (information system).

An information system S\mathscr{S} is a tuple D,Z,O,maint\langle\mathscr{D},\mathscr{Z},\mathcal{O},\text{\it maint}\rangle, in which:

  • dataset D=T,inst\mathscr{D}=\langle{\mathcal{T}},{\text{\it inst}}\rangle as defined previously;

  • schema Z=C,R,U\mathscr{Z}=\langle{\mathcal{C}},{\mathcal{R}},{\mathcal{U}}\rangle as defined previously;

  • O\mathcal{O} is a set of roles;

  • maint:O×U\text{\it maint}: \mathcal{O}\times\mathcal{U} is the maintainance relation between roles and rules.

A role is a name that identifies a group of users. It serves as a placeholder for a person or a machine (i.e. an actor) that works with the dataset (i.e. create, read, update, or delete triples). The purpose of a role is to mention an individual user (human) or an automated actor (bot) without knowing who that user is.

The system is at rest when every rule in the schema is satisfied:

uU: satu(D)\forall u\in\mathcal{U}:\ \text{\it sat}_{u}(\mathscr{D})

However, the system continually registers events that insert or delete triples in its dataset, which may cause rules to be broken. This requires a reaction to restore invariance. Example: if a rule says that every request must be acknowledged, the receipt of a request (which is an event) will violate this rule. Sending a message that acknowledges this receipt (another event) will cause the rule to be satisfied again. During the time between the receipt of the request and sending the acknowledgment, the rule is (temporarily) not satisfied.

Ampersand assumes two ways of enforcing satisfaction of rules: automatic or manual. Manual enforcement is specified by the developer by assigning a rule uu to a role oo. Any actor with role oo, whether person or machine, can restore invariance of rule uu by inserting or deleting the right triples. For this reason, every rule is assigned to at least one role by means of the relation maint\text{\it maint}:

uU oO: o maint u\forall u\in\mathcal{U}\ \exists o\in\mathcal{O}:\ o\ \text{\it maint}\ u

o maint uo\ \text{\it maint}\ u means that role oo must keep rule uu satisfied, i.e. to keep satu(D)\text{\it sat}_{u}(\mathscr{D}) true. As long as rule uu is broken, the system should notify actors in the role(s) that are administered in maint\text{\it maint} to do something about it.

Automatic enforcement is specified by the developer with a special syntax, the enforcement rule. An enforcement rule specifies not only the rule, but also the way to restore invariance. The system features an engine that restores invariance of all enforcement rules without unneccessary delay. Whenever the dataset violates an enforcement rule, this engine will make other changes to the dataset to satisfy that rule. Hence, the rule is not satisfied during a small, finite amount of time.

The system contains a dedicated role for the enforcement engine, so every rule has at least one role in the relation maint\text{\it maint}, means that all roles together are keeping all rules satisfied.

Example

Having defined an information system in mathematical terms, let us discuss a small example. It is written in the language Ampersand to make it more appealing to read. Let us first define a dataset of just a handful of triples and three relations.

RELATION takes[Student*Course] =
[ ("Peter", "Management")
; ("Susan", "Business IT")
; ("John", "Business IT")
]

This declaration introduces a relation with the name takes, source concept Student, and target concept Course. The informal meaning of this relation is that it states which students are taking which courses.

The example system also has a second relation that states which modules are part of which course.

RELATION isPartOf[Module*Course] =
[ ("Finance", "Management")
; ("Business Rules", "Business IT")
; ("Business Analytics", "Business IT")
; ("IT-Governance", "Management")
]

The third and last relation states which students are enrolled for which module. It is left empty for now.

RELATION isEnrolledFor[Student*Module]

A rule, EnrollRule completes the semantics It states that a student can enroll for any module that is part of a course he or she takes. In Ampersand, which is a syntactically sugared form of relation algebra [@JoostenRAMiCS2017], each rule has a name and each rule has a role to maintain its invariance:

RULE EnrollRule: isEnrolledFor |- takes;isPartOf~
ROLE Administrator MAINTAINS EnrollRule

The semantics of this rule defines satEnrollRule(D)\text{\it sat}_{\tt EnrollRule}(\mathscr{D}) as:

s,mpopisEnrolledFor(D) cCourse:s isEnrolledFor m  s takes c  m isPartOf c\begin{aligned} \forall \langle{s},{m}\rangle\in\text{\it pop}_{\tt isEnrolledFor}(\mathscr{D})\ \exists c\in\text{\tt Course}:\\ s\ \text{\tt isEnrolledFor}\ m\ \rightarrow\ s\ \text{\tt takes}\ c\ \wedge\ m\ \text{\tt isPartOf}\ c \end{aligned}

Rule EnrollRule is satisfied because relation isEnrolledForStudent,Module\text{\it \tt isEnrolledFor}_{\langle{\tt Student},{\tt Module}\rangle} is empty.

Now let us check the requirements to verify that this example defines an information system. The Ampersand compiler generates a dataset D\mathscr{D}, which contains a set of triples and a relation inst\text{\it inst}. It defines the set of triples T\mathcal{T} as:

"Peter",takesStudent,Course,"Management""Susan",takesStudent,Course,"Business IT""John",takesStudent,Course,"Business IT""Finance",isPartOfModule,Course,"Management""Business Rules",isPartOfModule,Course,"Business IT""Business Analytics",isPartOfModule,Course,"Business IT""IT-Governance",isPartOfModule,Course,"Management"\begin{array}{llr} \langle{\text{\tt "Peter"}},&{\text{\it \text{\tt takes}}_{\langle{\text{\tt Student}},{\text{\tt Course}}\rangle}},&{\text{\tt "Management"}}\rangle\\ \langle{\text{\tt "Susan"}},&{\text{\it \text{\tt takes}}_{\langle{\text{\tt Student}},{\text{\tt Course}}\rangle}},&{\text{\tt "Business IT"}}\rangle\\ \langle{\text{\tt "John"}},&{\text{\it \text{\tt takes}}_{\langle{\text{\tt Student}},{\text{\tt Course}}\rangle}},&{\text{\tt "Business IT"}}\rangle\\ \langle{\text{\tt "Finance"}},&{\text{\it \text{\tt isPartOf}}_{\langle{\text{\tt Module}},{\text{\tt Course}}\rangle}},&{\text{\tt "Management"}}\rangle\\ \langle{\text{\tt "Business Rules"}},&{\text{\it \text{\tt isPartOf}}_{\langle{\text{\tt Module}},{\text{\tt Course}}\rangle}},&{\text{\tt "Business IT"}}\rangle\\ \langle{\text{\tt "Business Analytics"}},&{\text{\it \text{\tt isPartOf}}_{\langle{\text{\tt Module}},{\text{\tt Course}}\rangle}},&{\text{\tt "Business IT"}}\rangle\\ \langle{\text{\tt "IT-Governance"}},&{\text{\it \text{\tt isPartOf}}_{\langle{\text{\tt Module}},{\text{\tt Course}}\rangle}},&{\text{\tt "Management"}}\rangle \end{array}

The relation inst\text{\it inst} contains the pairs:

"Finance",Module"BusinessRules",Module"BusinessAnalytics",Module"ITGovernance",Module"Management",Course"BusinessIT",Course"Peter",Student"Susan",Student"John",Student\begin{array}{lcr} \langle{\tt "Finance"}&,&{\tt Module}\rangle\\ \langle{\tt "Business Rules"}&,&{\tt Module}\rangle\\ \langle{\tt "Business Analytics"}&,&{\tt Module}\rangle\\ \langle{\tt "IT-Governance"}&,&{\tt Module}\rangle\\ \langle{\tt "Management"}&,&{\tt Course}\rangle\\ \langle{\tt "Business IT"}&,&{\tt Course}\rangle\\ \langle{\tt "Peter"}&,&{\tt Student}\rangle\\ \langle{\tt "Susan"}&,&{\tt Student}\rangle\\ \langle{\tt "John"}&,&{\tt Student}\rangle \end{array}

The tuple T,inst\langle{\mathcal{T}},{\text{\it inst}}\rangle is well-typed, so this is a dataset D\mathscr{D} as introduced before.

The Ampersand compiler generates a schema Z\mathscr{Z}, which contains concepts, relations and rules. It defines the set of concepts:

C={Module,Course,Student}\mathcal{C}=\{ {\tt Module}, {\tt Course}, {\tt Student}\}

It defines the set of relations:

R = {takesStudent,Course,isPartOfModule,Course,isEnrolledForStudent,Module }\begin{array}{ll} \mathcal{R}\ =\ \{&\text{\it \tt takes}_{\langle{\tt Student},{\tt Course}\rangle},\\ &\text{\it \tt isPartOf}_{\langle{\tt Module},{\tt Course}\rangle},\\ &\text{\it \tt isEnrolledFor}_{\langle{\tt Student},{\tt Module}\rangle}\ \} \end{array}

And, it defines the set of rules U\mathcal{U} to contain just one rule: EnrollRule. This example contains no specialization. So, the schema Z=C,R,U\mathscr{Z}=\langle{\mathcal{C}},{\mathcal{R}},{\mathcal{U}}\rangle satisfies all requirements mentioned above.

Now let us check the definition of information system. Ampersand generates a set of roles O={Administrator}\mathcal{O}=\{{\tt Administrator}\}. The relation maint\text{\it maint} contains one pair only, Administrator,EnrollRule\langle{{\tt Administrator}},{{\tt EnrollRule}}\rangle, instructing the run-time engine to forward all violations of EnrollRule to an administrator so she can keep EnrollRule satisfied. We can see that the only rule, EnrollRule, is satisfied, which completes the requirements for S=D,Z,O,maint\mathscr{S}=\langle\mathscr{D},\mathscr{Z},\mathcal{O},\text{\it maint}\rangle. This concludes the argument that S\mathscr{S} is an information system.

Relation to other theories

The following table compares the language used in the world of information systems with related worlds. It is compared with Ampersand, because we use Ampersand to design information systems. It is compared with the world of software, because Ampersand generates software. It is compared with the world of model theory [ref required], because the formal theory of Ampersand can be understood formally in model theory.

information systemAmpersandsoftwaremodel theory
rulesscriptprogramtheory
datapopulationstatemodel
formal statementtermconditionterm
generatorgeneratorcompiler
storedatabasedatabase
interfaceinterfaceinterface

The following theoretical topics are relevant for Ampersand.

  1. Relation Algebra [Maddux 2006] Ampersand uses heterogeneous relation algebras with specialization as a language to specify information systems. This is existing theory. It is relevant for people who specify information systems. It allows them to formalize business rules. Thus, they can ensure on design time that an information system complies with business rules.

  2. Rewrite systems The Ampersand compiler manipulates relation algebra terms. For this purpose, it uses rewrite systems. This is existing theory. It is relevant for people who make the Ampersand-compiler generate efficient code.

  3. Type systems The Ampersand compiler ensures that every type-correct specification can be built [1]. For this purpose it has a type system, which signals type-errors in Ampersand-scripts. The Ampersand-compiler generates code only for scripts without type errors.

  4. Generating computations It is possible to generate code to keep constraints satisfied. The theory for this is under development. It is relevant for automating tasks in information systems.

  5. Compiler theory Ampersand is a compiler. Its syntax is parsed by the Parsec-module of Haskell. A hand-written generator embodies the semantics. The theory is existing. This subject is relevant for people who wish to change the language of Ampersand.

Each of the following sections treats on of the topics mentioned above (work to be done).

[1] Michels, G., Joosten, S., van der Woude, J., Joosten, S.: Ampersand: ApplyingRelation Algebra in Practice. In: de Swart, H. (ed.) RAMICS 2011. LNCS, vol. 6663,pp. 280–293. Springer, Heidelberg (2011)