modeling data on the web note 1
what is a (core) data model
- flat: CSV(format是text file), word
- table: relational table(table is the core data structure)
- tree based: XML
- graph based: RDF
Data structure
- unstructured data e.g., image, video, text. These data is informationally opaque(不透明)
- semi-structured data e.g. JSON, XML. These data is informationally transparent(透明)
key aspects of (what is ) data model
- Underlying Data Structure,”CoreDataModel”
- Data Integrity(完整性)
- Data Manipulation(操作)
- Data sharing
- a complete data representation and manipulation approach
- a particular data representation for a domain or application,also called the domain model
data
different shapes
- arrary-like
- tree-like
- Graph-like
- Document-like
different volumes
- small to big
different velocities(速度)
- Static/offline(离线) to streaming(流)
different use patterns
- Many readers/few writers or the reverse or other
polyglot persistence
use a variety of different data storage technologies for different kinds of data
首先考虑如何操作数据,然后才会确定哪种技术是最好的选择。应用程序使用不同的技术基于数据的使用方式来管理它们自己的数据。
即便是single core data model 也会有不同特征,多种domain models,版本。
Tables in a relational model
- table is a set of tuples
- tuple is a set of key-value pairs
操纵CSV
import csv
with open("../Adresses/mod2-uk-500.csv") as csvfile:
line_count = 0
myreader = csv.reader(csvfile, delimiter=',', quotechar='t') for row in myreader:
if line_count == 0:
line_count += 1
else:
print(f' Candidate {line_count}: Firstname {row[0]} Lastname {row[1]} City {row[4]}')
line_count += 1
print(f'Processed {line_count -1} Candidates.')
what is formalism
- syntax: what can we write?
- semantics: what does our writing mean?
- with precise (mathematical) definitions
- be clear about what we mean
- allow the determination of key properties
- to abstract away from particular implementions
Modelling With SQL
- format: 不同的格式(表的设计(模式)不同 format就不同)
- schemas: create statements. eg create table t1. with certain constraints
- query: select statements. e.g. slect name form X
- SQL is (mostly) closed over tables (Functions 例外)
Filtering in SQL
Filtering Columns
Keep all columns:
SELECT * FROM People
Just a single column:
SELECT county FROM People
Multiple columns:
SELECT name, county FROM People
Rename columns:
SELECT street_address AS address FROM People
Filtering rows
Selecting specific tuples
Equality:
SELECT * FROM People WHERE surname = "Smith"
Range:
SELECT * FROM People
WHERE heartrate > 95
Compound criteria:
SELECT * FROM People
WHERE heartrate > 95 AND county="Kent"
Join in SQL
An inner join is a join filtered on common columns
SELECT * FROM People, Phone
INNER JOIN ON People.person_id = Phone.person_id
An outer join is like an inner join but it returns also rows that do not have a match in the other table (it will return also people who have no phone!)
SELECT * FROM People, Phone
RIGHT OUTER JOIN ON People.person_id = Phone.person_id
Cost
- Considering lots of matches (think indexes)
- Generating large intermediate tables
NULL
null is a distinguished value which can mean:
- “Value not yet known”
- “Not applicable to this entity”
- “Value undefined”
- check out LSQL
- Key property: Unequal to everything
Null Unequal to everything So, Null can stand for
- an unknown value
- an attribute that is not applicable to the current row
- the empty set
If you have no nulls in your base tables you can’t get them in tables derived by inner join
So we need outer joins