A datasource can be anything from a simple text file to a large database. The raw data can come from observation logs, sensors, transactions, or user behaviour.
Types of Datasources:
- Open data
- Text files
- Excel files
- SQL databases
- NoSQL databases
- Multimedia
- Web scraping
Open data – Open data is data that can be used, reused, and redistributed freely by anyone for any purpose.
Text files – Large amounts of data come in text format from logs, sensors, e-mails, and transactions. There are several formats for text files, such as CSV (comma delimited), TSV (tab delimited), Extensible Markup Language (XML), and JavaScript Object Notation (JSON) (see the Data formats section).
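As a rough illustration of how these formats relate, the sketch below reads the same two records from CSV, TSV, JSON, and XML using only the Python standard library. The sample data, the field names (`sensor`, `value`), and the helper names `read_delimited` and `read_xml` are made up for this example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same two sensor readings in four text formats.
csv_text = "sensor,value\na1,20.5\nb2,19.0\n"
tsv_text = "sensor\tvalue\na1\t20.5\nb2\t19.0\n"
json_text = '[{"sensor": "a1", "value": "20.5"}, {"sensor": "b2", "value": "19.0"}]'
xml_text = (
    "<readings>"
    "<reading><sensor>a1</sensor><value>20.5</value></reading>"
    "<reading><sensor>b2</sensor><value>19.0</value></reading>"
    "</readings>"
)


def read_delimited(text, delimiter):
    """Parse delimited text (CSV or TSV) into a list of dicts keyed by header."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


def read_xml(text):
    """Turn each <reading> element into a dict of child tag -> text."""
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in reading}
            for reading in root.findall("reading")]


csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")
json_rows = json.loads(json_text)
xml_rows = read_xml(xml_text)

# All four formats carry the same records once parsed.
print(csv_rows == tsv_rows == json_rows == xml_rows)
```

Note that CSV, TSV, and XML deliver every value as a string, so numeric fields such as `value` would still need converting before analysis.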
Excel files – Excel has some good points, such as filtering and aggregation functions, and with Visual Basic for Applications (VBA) you can run Structured Query Language (SQL)-like queries against the sheets or against an external database. We can easily convert Excel files (.xls) into a text file format such as CSV, TSV, or even XML.
SQL databases – A database is an organized collection of data. SQL is a database language for managing and manipulating data in Relational Database Management Systems (RDBMS). A Database Management System (DBMS) is responsible for maintaining the integrity and security of stored data, and for recovering information if the system fails.
The SQL language is split into two subsets of instructions: the Data Definition Language (DDL) and the Data Manipulation Language (DML). The data is organized in schemas (databases) and divided into tables related by logical relationships, and we retrieve the data by running queries against the schema.
DDL allows us to create, delete, and alter database tables. We can also define keys to specify relationships between tables, and implement constraints between database tables.
- CREATE TABLE : This command creates a new table
- ALTER TABLE : This command alters a table
- DROP TABLE : This command deletes a table
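The DDL commands above can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch against an in-memory SQLite database; the table and column names (`customer`, `purchase`, `email`) are invented for illustration, and other database systems may differ in syntax details.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# CREATE TABLE: define tables, keys, and constraints.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE purchase (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),  -- relationship
        amount      REAL CHECK (amount >= 0)                   -- constraint
    )
    """
)

# ALTER TABLE: add a column to an existing table.
conn.execute("ALTER TABLE customer ADD COLUMN email TEXT")
customer_columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]

# DROP TABLE: delete a table and its data.
conn.execute("DROP TABLE purchase")
remaining_tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

print(customer_columns)
print(remaining_tables)
conn.close()
```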
DML is a language which enables users to access and manipulate data.
- SELECT : This command retrieves data from the database
- INSERT INTO : This command inserts new data into the database
- UPDATE : This command modifies data in the database
- DELETE : This command deletes data in the database
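The four DML commands can be sketched in the same way with `sqlite3`. The `reading` table and its contents are hypothetical; the `?` placeholders are the module's parameter substitution, which avoids building SQL strings by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reading (sensor TEXT, value REAL)")

# INSERT INTO: add new rows.
conn.executemany(
    "INSERT INTO reading (sensor, value) VALUES (?, ?)",
    [("a1", 20.5), ("b2", 19.0), ("c3", 21.5)],
)

# UPDATE: modify existing rows that match a condition.
conn.execute("UPDATE reading SET value = ? WHERE sensor = ?", (19.5, "b2"))

# DELETE: remove rows that match a condition.
conn.execute("DELETE FROM reading WHERE sensor = ?", ("c3",))

# SELECT: retrieve what is left.
rows = conn.execute("SELECT sensor, value FROM reading ORDER BY sensor").fetchall()
print(rows)
conn.close()
```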
NoSQL databases – Not only SQL (NoSQL) is a term used for several technologies where the nature of the data does not require a relational model. NoSQL technologies are designed to work with very large quantities of data while offering high availability, scalability, and performance.
The most common types of NoSQL data stores are:
- Document store: Data is stored and organized as a collection of documents. The model schema is flexible and each document in a collection can have any number of fields. For example, MongoDB stores documents as BSON (a binary form of JSON) and CouchDB uses JSON documents.
- Key-value store: Data is stored as key-value pairs without a predefined schema. Values are retrieved by their keys. For example, Dynamo and Amazon SimpleDB. Apache Cassandra and HBase are often listed here as well, although they are more precisely described as wide-column stores.
- Graph-based store: Data is stored in graph structures with nodes, edges, and properties, drawing on graph theory for storing and retrieving data. These kinds of databases are excellent for representing social network relationships. For example, Neo4j, InfoGrid, and Horton.
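To make the three data models more concrete, the sketch below mimics each one with plain Python structures. These are in-memory stand-ins only, not the client API of any of the products named above, and all names and data (`documents`, `kv_store`, `follows`, `friends_of_friends`) are hypothetical.

```python
# Document store: a collection of documents whose fields may differ.
documents = [
    {"_id": 1, "name": "Ana", "tags": ["admin"]},
    {"_id": 2, "name": "Ben", "email": "ben@example.com"},
]
# Query by field presence; there is no fixed schema to consult.
with_email = [doc["name"] for doc in documents if "email" in doc]

# Key-value store: the value is opaque and can only be reached through its key.
kv_store = {"session:42": b"opaque payload"}
payload = kv_store.get("session:42")

# Graph store: nodes and directed edges, here "who follows whom".
follows = {"ana": {"ben"}, "ben": {"cy"}, "cy": {"ana"}}


def friends_of_friends(graph, start):
    """Return nodes exactly two hops from start, excluding direct links and start."""
    direct = graph[start]
    reachable = set()
    for person in direct:
        reachable |= graph[person]
    return reachable - direct - {start}


print(with_email, payload, friends_of_friends(follows, "ana"))
```

The two-hop traversal is the kind of query that is awkward to express with relational joins but natural in a graph model.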
Multimedia – Datasources include directly perceivable media such as audio, image, and video. Some of the applications for these kinds of datasources are as follows:
- Content-based image retrieval
- Content-based video retrieval
- Movie and video classification
- Face recognition
- Speech recognition
- Audio and music classification
Web scraping – Web scraping refers to using a program to process the HTML of a web page and extract data from it for further manipulation.
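A minimal sketch of the idea, using the standard library's `html.parser`: it pulls link targets and link text out of an HTML string. The page content and the `LinkExtractor` class are made up for this example, and a real scraper would first download the page and would usually need to cope with messier markup.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page content standing in for a downloaded web page.
page = (
    "<html><body><h1>Datasets</h1><ul>"
    '<li><a href="/data/a.csv">Set A</a></li>'
    '<li><a href="/data/b.json">Set <b>B</b></a></li>'
    "</ul></body></html>"
)

parser = LinkExtractor()
parser.feed(page)
parser.close()
print(parser.links)
```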