Computer Science 310 notes, part 1
Notes from Bruce Bolden’s Computer Science 310 course, “Programming Languages”.
21 January 2005
Smalltalk 80 - bits of history: http://www.iam.unibe.ch/~dvcasse/FreeBooks.html, Glenn Krasner
Declarative Programming
Evaluation of functions over partial data structures. Sometimes called stateless programming, as opposed to stateful (imperative) programming.
Two main declarative paradigms, namely functional and logic programming. It encompasses programming over complete values (Scheme and Lisp). Also encompasses logic programming (Prolog) when search is not used.
- Oz
- Escher
Functional Programming
A style that emphasizes the evolution of expressions rather than commands. The expressions in these languages are formed by using functions to combine basic values. A functional language supports and encourages programming in a functional style.
- Lisp / Scheme
- ML
- Haskell
- Erlang (cell phones)
Correction via Mickael Remond:
“Erlang is not a cell phone language. It’s a general purpose language invented by Ericsson, initially for huge complex telco machine. It is now used for web development, banking application, 3D modeler (Wings3D), etc. It is really ahead of the competition for developping robust, scalable distributed applications.”
Logic Programming
Find solutions to problems given a set of rules.
- Prolog
- Oz
- Godel
- Mercury
Scripting Languages
Typically small and powerful, sometimes cryptic.
- awk
- Perl
- Python
- Ruby
- Parrot (merge Perl and Python)
Markup Languages
Typically used in document preparation.
- nroff / troff
- TeX (LaTeX) (tools MetaFont and MetaPost)
- PostScript
- SGML
- HTML
- XML
Special Purpose Languages
- lex / yacc
- Maple / Mathematica
- Matlab
- SQL
- VHDL / Verilog
Languages we’ll look at:
- C
- lex and yacc
- Scheme (Common Lisp)
- Java or Objective-C
- Prolog
- Probably one of Perl, Python, or Ruby
Assignment for next week: Read Chapter 1, spend some time looking at languages, review linked lists / recursion
24 January 2005
C
- Developed in the 70’s by Dennis Ritchie
- Unix written in C (a little assembly)
- C is a subset of C++, or C++ is a superset of C (C with classes)
- Still very popular
What is the same (between C and C++)?
- Basic data types
- Comments (C++ added the inline comment //)
- General syntax
- Naming conventions
Basic (primitive) data types
- char
- int (short, long)
- double (float)
http://google.com/search?q=%22binary+coded+decimal%22
Syntax
- Function definitions (parameter passing is different)
- Conditionals (if, switch)
- Loops (for, while, do{ }while)
- Arrays
- Abstract Data Types (struct, union)
Union Example
typedef union {
    uint32_t my_int;
    uint8_t  my_byte[4];
} endian_tester;

endian_tester et;
et.my_int = 0x0a0b0c0d;
if(et.my_byte[0] == 0x0a) {
    printf("On a big-endian system.\n");
} else {
    printf("On a little-endian system.\n");
}
http://linuxjournal.com/articles.php?sid=6788
What’s different (between C and C++)?
- Compiler invocation (cc instead of CC, gcc instead of g++)
- I/O
- Variables must be declared at the top of a function (before a statement)
- Parameter passing (must use pointers)
- Strings are character arrays
- Memory manipulation
- No generics (templates)
- Macros (C preprocessor)
I/O
- …is function based in C
- input - scanf, read
- output - printf, write
Reading in a character
scanf("%c", &c);
Reading in integers
scanf("%d%d", &i, &j);
Notice the ampersand for “address of”.
Printing an integer
printf("%3d\n", i);
Notice the format descriptor: %3d
Format characters
- %c - character
- %d - integer
- %e - single precision (exponential)
- %f - single precision
- %g - floating point (exponential if needed)
- %o - octal
- %u - unsigned
- %x - hexadecimal
- %hd - short
- %ld - long
- %lf - double
- %s - string
- %% - literal %
- %10.3lf - double with a field width of 10 and a decimal precision of 3
- %20s - string with a field width of 20
Macros
#define max(a,b) a<b ? b:a
This makes sense for numbers but not letters. There’s no type checking with macros, so don’t do this.
26 January 2005
File I/O
- All files are represented by one type, FILE*, defined in <stdio.h>
- header - stdio.h
- input - stdin
- output - stdout
- error - stderr
Function definitions
Prototypes were an addition to the ANSI-standard.
/* Old (K&R) C style */
void PrintInt(a, s)
int a;
char *s;
{
    printf("%s: %d", s, a);
}

/* ANSI C style */
void PrintInt(int a, char *s)
{
    printf("%s: %d", s, a);
}
Use the ANSI style!
Parameter Passing
Parameters are passed by value in C. Use pointers to change the actual value of a variable.
void Swap(int *a, int *b)
{
int iTmp = *a;
*a = *b;
*b = iTmp;
}
Usage: Swap(&i, &j); swaps i and j.
File operation code
Opening a file for input:
/* C++ */
ifstream fIn;
fIn.open(fName, ios::in);
if(!fIn)
{
cerr << "Unable to open: " << fName << endl;
exit(-1);
}
/* C */
FILE *fpIn;
fpIn = fopen(fName, "r");
if(fpIn == NULL)
{
printf("Unable to open: %s\n", fName);
exit(-1);
}
Opening a file for output:
/* C++ */
ofstream fOut;
fOut.open(fName, ios::out);
if(!fOut)
{
cerr << "Unable to open: " << fName << endl;
exit(-1);
}
/* C */
FILE *fpOut;
fpOut = fopen(fName, "w");
if(fpOut == NULL)
{
printf("Unable to open: %s\n", fName);
exit(-1);
}
Appending to a file:
/* C++ */
ofstream fOut;
fOut.open(fName, ios::out|ios::app);
if(!fOut)
{
cerr << "Unable to open: " << fName << endl;
exit(-1);
}
/* C */
FILE *fpOut;
fpOut = fopen(fName, "a");
if(fpOut == NULL)
{
    printf("Unable to open: %s\n", fName);
    exit(-1);
}
Closing files:
/* C++ */
fIn.close();
fOut.close();
/* C */
fclose(fpIn);
fclose(fpOut);
Copy a file to standard output
We’re going to do this one character at a time.
#include <stdio.h>

int main(int argc, char **argv)
{
    FILE *fp;
    int c;
    if(argc > 1 && (fp = fopen(*++argv, "r")) != NULL)
    {
        while((c = getc(fp)) != EOF)
        {
            putc(c, stdout);
        }
        fclose(fp);   /* only close a file that was actually opened */
    }
    return 0;
}
String manipulation
Strings are character arrays in C. The standard string function prototypes are defined in <string.h>. Typical string manipulation functions: strlen, strcat, strcmp, etc.
Read from / write to strings.
- Read an integer: sscanf(s, "%d", &i);
- Write an integer: sprintf(s, "%d", i);
28 January 2005
Dynamic memory
- allocate - malloc, calloc, realloc
- deallocate - free
Example: One dimensional array manipulation
/* C++ */
int *pA;
pA = new int[N];
delete []pA;
/* C */
int *pA;
pA = (int*)malloc(N * sizeof(int));
free((void*)pA);
Note: Use of casting. Treat dynamically allocated array just as if declared statically.
Example: Two dimensional array manipulation
/* C++ */
int **arr2D;
arr2D = new int *[rsize];
for(int i = 0; i < rsize; i++)
{
arr2D[i] = new int[csize];
if(arr2D[i] == NULL)
{
/* ERROR */
}
}
for(int i = 0; i < rsize; i++)
{
    delete [] arr2D[i];
}
delete [] arr2D;
/* C */
int i;
int **arr2D;
arr2D = (int**)malloc(rsize * sizeof(int*));
for(i = 0; i < rsize; i++)
{
arr2D[i] = (int*)malloc(csize * sizeof(int));
if(arr2D[i] == NULL)
{
/* ERROR */
}
}
for(i = 0; i < rsize; i++)
{
free((void*)(arr2D[i]));
}
free((void*)arr2D);
Note: with this row-pointer scheme each row is allocated separately, so the rows need not be contiguous, and arr2D[i][j] still works because arr2D[i] is a pointer to its row. If you instead allocate the whole array as one contiguous block with a single malloc, you must compute the indices yourself (e.g., arr[i * csize + j]).
31 January 2005
Passing around a pointer:
struct Node
{
int x;
struct Node *next;
};
typedef struct Node *NodePtr;
Adding nodes recursively to a linked list:
void AddNodeRecursive(NodePtr *h, int x)
{
    if(*h != NULL)
    {
        AddNodeRecursive(&(*h)->next, x);
    }
    else
    {
        NodePtr n;
        n = (NodePtr)malloc(sizeof(struct Node));
        n->x = x;
        n->next = NULL;
        *h = n;
    }
}
Usage:
NodePtr head = NULL;
AddNodeRecursive(&head, 2);
Programming log
Keep a programming log for all your assignments. Just kind of a day by day journal about what you do and how much time you spend doing it.
Back to the language stuff
What is language? A notation for the transfer of information.
- Natural Language - English, Chinese, etc.
- Artificial - Symbols represent things / ideas
- Representations - Made vs. real world
Humans are language oriented.
Programming languages must be readable: communicate with others / yourself - not just the machine!
Language provides a shorthand for more complex ideas (abstraction).
[insert picture of villagers fighting a bear]
Language is a tool!
- Morse code
- SMS
- Klingon battle language - shorter
- Status - thumbs up, thumbs down
A good notation sets us free to work at a higher level. We don’t want to work at the 1s and 0’s level on a computer.
Sapir-Whorf hypothesis
“The language you know controls what you can think.”
Restriction:
- Newspeak - Orwell’s 1984, control language -> control the way people speak
Expansion:
- Loglan - logical language, promotes logical thinking
Programming language mechanisms
A language should provide a means for combining simple ideas to form more complex ideas. Every programming language has three mechanisms for accomplishing this:
- Primitive expression - represent the simple entities the language is concerned with
- Means of combination - compound elements built from simpler ones
- Means of abstraction - manipulate compound elements as units.
2 February 2005
Program / Machine model
Program is tied to a specific machine. You can’t compile code on Windows and run it on my Mac.
Program / Virtual Machine model
Program runs on a virtual machine and on our actual hardware we have some program that handles the instructions from the virtual machine.
Language categories
- Imperative
- Object-Oriented
- Functional
- Logic
- Scripting
- Markup
- Special purpose
Ten reasons to study programming languages
“The language you know limits your abilities.”
It’d probably be more accurate to say, “The language you think in limits your abilities.”
- Express algorithmic concepts to computers
- Express algorithmic concepts to yourself
- Express algorithmic concepts to others
- Understand how languages work
- Select the right language (for the task)
- Improve algorithmic vocabulary
- Understand domain specific languages
- Easier to learn new languages
- Easier to design new languages
- Communicate effectively (intelligently) with others in the industry (dynamic allocation, garbage collection, etc.)
Book recommendation: Prime Obsession by John Derbyshire
Language classification
- Machine - direct one-to-one correspondence with hardware
- Assembly - symbolic representation of the machine language
- High level - symbols in language map to more than one machine instruction
- General purpose
- Domain specific
Benefits of high level languages
- Less work
- Abstraction
- Portable
- Readable
- Maintainable
- Libraries of code (e.g. OpenGL)
Commonly used languages
- Business
- COBOL
- RPG (Report Generation Language)
- Excel
- Science
- FORTRAN
- Gaming
- C (Quake)
- Lisp
- Python
- Mathematics
- Maple
- Mathematica
- MATLAB
- Statistics
- SPSS
- Statistica
- Simulation
- Simula
- GPSS
- ACSL (Advanced Continuous System Language)
- Publishing
- HTML
- XML
- LaTeX
- TeX
- PostScript
- Expert systems
- OPS5
- Mycin
- Scripting
- Perl
- Tcl
- Python
- Ruby (Rails)
- shells
4 February 2005
Side note: You can use mail bruceb < source.c from a sun-sol.cs.uidaho.edu box to send source to Bolden.
Paradigms
- Imperative
- Functional
- Object Oriented
- Logic
- Multi-paradigm
Imperative languages
Focus on machine state: a linear sequence of actions (statements) that transforms the state of the machine.
Languages:
- Algol
- Fortran
- C
- Pascal
Functional languages
Functional (or applicative) languages focus on transforming data by applying functions, rather than on mutating state.
(Print(Sort(lookup(db, "John"), age), 2, 6))
Languages:
- Lisp (Scheme)
- ML
- Haskell
- Mathematica
Object Oriented languages
OO PLs focus on associating the transformation of data with objects (data types). Actions associated with objects.
Languages:
- Smalltalk (Squeak)
- Java
- C++
- Ruby
Logic languages
Logic (or rule-based) languages
if dog is hungry, feed dog
if dog is thirsty, "water" dog
Languages:
- Prolog
- OPS5
- lex, yacc (language tools)
Multi-paradigm languages
Allow a programmer to use / mix paradigms.
- C++: Imperative, OO
- Leda: Imperative, Functional, Logic (Timothy Budd, Oregon State)
- Oz: Declarative, OO, Logic, Constraint
Side notes
- Octal dump: od a.out
- Generate assembly: gcc -S file.c
- Preprocess only (expand macros): gcc -E file.c
- C preprocessor: cpp
7 February 2005
[overhead slide of the history of programming languages]
There’s lots of languages, and there’s a heritage there.
Important language concepts
- Humans are language oriented
- Languages must be readable
- Communicate with others / yourself–not just the machine
Program with intent. There’s some intention behind the code. What is it? Make that clear in both the comments and the code itself.
Programming language standards
- Proprietary standards: license usage, etc. Corporate driven.
- Java
- PostScript
- Consensus standards: driven by public organizations, e.g., ANSI, ISO, GNU.
- ANSI C
- Ada
- GNU CC (compiler collection)
Compiler as a standard
- What the compiler says!
- Some use some compiler as the standard for the language.
- Compliance: given a program that uses only standard features of the language, the compiler yields correct results
Good language features
- Right level of abstraction
- Natural notation
- Complete enough to do the job
- Self consistent–avoid special cases
- Support extensible abstractions
- Supports coding (useful throughout program life)
- Miscellaneous (standards, portability)
Book recommendation: “Who Moved My Cheese?”
If you want to see how not to write code, visit http://thedailywtf.com/
Principle of least astonishment
You don’t want to be surprised by what the code does.
Translator
Source -> Translator -> Target
- Translator: translates one language to another (source to target)
- Source: code (expressions) in source language
- Target: object code (often not human readable)
Interpreter
Source -> Interpreter (on Machine) -> Object -> Run-time System (on Machine)
- Interpreter: translates source language to intermediate object language then executes it
Java interpreter
Java Source -> Compiler (on Machine) -> Byte code -> JVM (on Machine2)
Compile Java source on one machine, run it on another with a JVM. Version matters!
C# - Intermediate Language
Compiler details
Source -> Compiler -> Target -> Linker -> Libraries -> Loader -> Executable
- Compiler: translates the source language to object language
- Cross compiling: often used for embedded systems
Execution details
# File A:
main()
call A1()
call B2()
A1()
# File B:
B1()
B2()
- Linker / Loader: resolves addresses
- Object code is relocatable
- Independently compilable units desirable
REPL
Typical operation of an interpreter:
Read - Eval - Print - Loop
Interprets code as entered.
Scheme (Lisp), ML work in this manner. Jython (Python with Java) also works in this manner. Fast turn around!
Recursive GCD source
/* Recursive GCD (Greatest Common Divisor) */
int GCD( int m, int n )
{
    if( m < n )
        return GCD( n, m );
    else if( n == 0 )
        return m;
    else
        return GCD( n, m % n );
}
Iterative GCD code
/* m >= 0, n > 0: computes q = m / n and r = m % n by repeated subtraction */
q = 0;
r = m;
while( r >= n )
{
r = r - n;
q = q + 1;
}
Typical compiler tasks
- Preprocessor (language dependent)
- Tokenize
- Analyze
- Semantic analysis
- Optimization (high level)
- Low level optimization
- Generate assembly / object code
9 February 2005
[insert PDF about the evolution of language]
Typical compiler tasks
- Preprocessor (language dependent)
- Tokenize
- Analyze
- Semantic analysis
- Optimization (high level)
- Low level optimization
- Generate Assembly / Object code
Compiler task details
- Structuring
  - Lexical analysis
    - Scanning
    - Conversion
  - Syntactic analysis
    - Parsing
    - Tree construction
- Translation
  - Semantic analysis
    - Name analysis
    - Type analysis
  - Transformation
    - Data mapping
    - Action mapping
- Encoding
  - Code generation
    - Execution order determination
    - Register allocation
    - Instruction selection
  - Assembly
    - Internal address resolution
    - External address resolution
    - Instruction encoding
Preprocessor
Preprocessor converts HLL -> HLL
- C: cpp, e.g., #define, #include
- Early C++ compilers -> C
- Early Ada compilers -> C
Structuring
Structuring accepts the sequence of characters that constitute the source program and builds a tree that represents it internally.
Lexical analysis
The first step in structuring a source program is lexical analysis, which accepts a sequence of characters and yields a sequence of basic symbols.
Lexical analysis can be broken into two subtasks:
- Scanning (breaking source into tokens)
- Classification (classify tokens (keyword / variable))
Easy to write (relatively speaking). A number of tools and well known techniques make this possible.
Syntactic analysis
The syntactic analyzer accepts the sequence of basic symbols delivered by the lexical analyzer and builds the source program tree.
- Structure of program
- Symbol table
- Operation precedence
Translation
The translator converts the source program, which is an algorithm stated in terms of source language concepts, into an equivalent algorithm stated in terms of the target language concepts.
Semantic analysis
The purpose of the semantic analyzer is to obtain information that is available at certain places in the program (identifier (variable) declarations) and move that information to other places in the program (identifier uses).
The name analyzer establishes an entry in the definition table database for each definition and attaches the key of the appropriate definition to each identifier.
- Name analysis
- Type analysis
Transformation
The transformation process builds the target program tree on the basis of the information contained in the source program tree (as decorated during semantic analysis) and the definition table.
Encoding
Encoding determines the order in which constructs will be executed (subject to constraints in the tree), allocates resources to hold intermediate results, and selects the instructions that carry out the computation using those resources.
High level optimization
Eliminate common subexpressions.
x[i] = x[i] + f(3);
x[i] += f(3);
Low level optimization
store r1, M
load M, r1
Summary
A compiler translates a source program into an equivalent target program. The compiler maintains three major data structures:
- A source program tree
- A definition table
- A target program tree
Syntax / Semantics
- Syntax
- The symbols used to write a program
- The language form - what is allowed
- Semantics
- The actions that occur when a program is executed
- The meaning of what is said (difficult)
- Programming language implementation
- Syntax -> Semantics
- Transform program syntax into machine instructions that can be executed to cause the correct sequence of actions to occur
Terminology
- Grammar: formal description of the syntax for a language
- Alphabet: collection of symbols in the language, plus symbols in the grammar
- Production: a rule for replacement of non-terminals with other symbols from the alphabet
- Non-terminals: a subset of the alphabet each having a production
- Terminals: symbols in the alphabet that are not non-terminals
- Start Symbol: the place to start
- Parse Tree: internal representation
- Derivation: the creation of a specific parse tree
BNF
Backus Naur Form (Backus Normal Form)
- John Backus - Fortran
- Peter Naur - Algol
- Describes a language formally
- EBNF: An extended form is often used!
More details: http://www.garshol.priv.no/download/text/bnf.html
11 February 2005
- Grammars are often defined using BNF (or EBNF) rules.
- Rules are defined using terminals and non-terminals and special meta-symbols.
BNF meta-symbols:
- ::= means “is defined as” (some people use an arrow)
- | like “or”, used to separate alternatives
- < > (angle brackets) often used to denote non-terminals
Extended BNF (EBNF) makes writing grammars easier.
Any EBNF production (rule) can be translated into an equivalent set of BNF productions. EBNF is not more powerful than BNF, just more convenient.
BNF example:
digit ::= 0|1|2|3|4|5|6|7|8|9
EBNF example:
digit ::= 0-9
What is syntax?
- Expresses the form (or grammar) of a program in a portable language.
- It does not express what a program means.
- In other words, syntax expresses what a program looks like, but not what it does.
The elements of syntax for a language (including programming languages) are:
- An alphabet of symbols that comprise the basic elements of the grammar.
- The symbols are comprised of two sets–terminals and non-terminal symbols.
- Terminals cannot be broken down.
- Non-terminals can be broken down.
- A set of grammar rules that express how symbols can be combined to make legal sentences of the language.
The grammar rules are of the form: non-terminal symbols ::= a list of zero or more terminals or non-terminals.
We use the rules to recognize (parse) and / or generate legal sentences of the language.
Equivalent forms in which to express syntax:
- BNF
- Syntax graphs
- Other slight variations of notation
All notations commonly used for programming language syntax have the same power (i.e., all produce context-free syntax).
A grammar for a small subset of English
Consider the sentence, “Mary hits John.”
A simple grammar that could be used to parse or generate this sentence is:
<sentence> ::= <subject> <predicate> .
<subject> ::= Mary
<predicate> ::= <verb> <object>
<verb> ::= hits
<object> ::= John
How do we use rules to parse or generate a sentence like “Mary hits John.”?
- Consider each rule to be a production that defines how some part of a sentence is recognized or formed.
- A trace of the application of the production rules can be shown in tree form as follows:
[Visualize this as a tree. Think Study of Languages class, the syntax and grammar stuff from there.]
<sentence> -> <subject> & <predicate>
<subject> -> Mary
<predicate> -> <verb> & <object>
<verb> -> hits
<object> -> John
A number of concepts that enhance the expressibility of a grammar:
- Alternation
- Express languages that have an infinite number of sentences or sentences that are infinitely long.
- The use of recursion in Context Free Grammars (CFGs)
Alternation - add | Steve to the <object> rule
Extend by adding additional non-terminal symbols
<subject> ::= Mary | John
<object> ::= Mary | John
We can now write “Mary hits Mary.” or “John hits John.”
A more convenient way is to change rules
<subject> ::= <noun>
...
<object> ::= <noun>
<noun> ::= John | Mary | Steve
Use recursion to form an infinite number:
<object> ::= John | John again | John again and again | ...
Notice the pattern for these infinite length sentences:
<object> ::= John | John <repeat>
<repeat> ::= again | again and <repeat>
14 February 2005
[Bolden’s solution to our programming assignment]
- Do a breakdown - How many lines of code per function / total?
- Variable stuff - How many globals, structs, consts, etc.?
C doesn’t know about true or false:
#define TRUE 1
#define FALSE 0
Prototypes
- Group prototypes together via their functions
- I/O stuff
- Stack routines
- Memory stuff
- All variables in the prototypes have names!
16 February 2005
Nerd Quiz
http://www.wxplotter.com/ft_nq.php?im
[graph about optimization] Basically, don’t bother.
Notational Expression Grammar
<expr> ::= <term> | <expr> <add op> <term>
<term> ::= <factor> | <term> <mul op> <factor>
<factor> ::= (<expr>) | <id> | literal
<op> ::= <add op> | <mul op>
<add op> ::= + | -
<mul op> ::= * | /
<id> ::= identifier
C Identifier Grammar (EBNF)
<identifier> ::= <nondigit> (<nondigit> | <digit>)*
<nondigit> ::= _ | a-z | A-Z
<digit> ::= 0-9
Parse trees
Show how a sentence is derived in a language.
Production: P ::= x1 x2 x3
[picture of tree with P as root node and x1, x2, x3 as leaves]
Definitions of grammar
Formally define a grammar by a 4-tuple.
G = (V<sub>N</sub>, V<sub>T</sub>, S, P)
Where V<sub>N</sub> and V<sub>T</sub> are disjoint sets of non-terminal and terminal symbols, respectively. S is the distinguished symbol of V<sub>N</sub>, commonly called the start (or goal) symbol. The set V = V<sub>T</sub> U V<sub>N</sub> is called the vocabulary of the grammar. P is a finite set of productions.
Naming: The form of the grammar is important.
- Different grammars can generate the same language.
- Tools are sensitive to the form of the grammar.
- Restriction on the types of rules can make automatic parser generation easier.
Derivation examples
Grammar
E ::= E+E | E*E | (E) | id
id ::= number
Sentential form (input to parser)
E: id * id + id
Leftmost derivation
- In the current “string”, choose leftmost non-terminal.
- Choose any production for the chosen non-terminal.
- In the string, replace the non-terminal by the right hand side of the rule.
- Repeat until no more non-terminals exist.
Example: id * id + id
E
==> E + E
==> E * E + E
==> id * E + E
==> id * id + E
==> id * id + id
It matches our expression!
Rightmost derivation
- In the current “string”, choose rightmost non-terminal.
- Choose any production for the chosen non-terminal.
- In the string, replace the non-terminal by the right hand side of the rule.
- Repeat until no more non-terminals exist.
Example: id * id + id
E
==> E + E
==> E + id
==> E * E + id
==> E * id + id
==> id * id + id
It matches our expression!
What if we had chosen to do E * E?
E
==> E * E
==> id * E
==> id * E + E
==> id * id + E
==> id * id + id
Our grammar is ambiguous. We can get to the same place through several different trees. You don’t want to have an ambiguous grammar!
18 February 2005
Peanuts Grammar
<sentence> ::= <subject> <predicate>
<subject> ::= <article> <noun> | <noun>
<predicate> ::= <verb> <object>
<article> ::= a | the
<noun> ::= Linus | Charlie | Snoopy | blanket | dog | song
<verb> ::= holds | pets | sings
<object> ::= <article> <noun> | <noun>
Top down parsing vs. bottom up parsing
Noam Chomsky - linguist / anarchist
CNF (Chomsky Normal Form)
Top down parsing
Attempt to construct a syntax tree by starting at the root of the tree (start symbol) and proceeding downward toward the leaves (the symbols forming the strings).
<sentence> ==> <subject> <predicate>
==> <noun> <predicate>
==> Linus <predicate>
==> Linus <verb> <object>
==> Linus holds <object>
==> Linus holds <article> <noun>
==> Linus holds the <noun>
==> Linus holds the blanket
You can associate a rule (production) with each of the steps.
[Insert graphic of the parse tree for the sentence above]
Top down, left to right seems common. Do they do things differently in foreign countries?
Bottom up parsing
We try to construct starting from the leaves and moving up to the root.
Linus holds the blanket
<noun> holds the blanket
<subject> holds the blanket
<subject> <verb> the blanket
<subject> <verb> <article> blanket
<subject> <verb> <article> <noun>
<subject> <verb> <object>
<subject> <predicate>
<sentence>
Derivation notes
- A parse tree has
- Terminals at the leaves
- Non-terminals at the interior nodes
- An in-order traversal of the leaves is the original input
- The parse tree shows the association of operations, even if the input string does not
Parsing
Given an expression, find its parse tree.
- Ambiguity
  - Expression 27 - 4 + 3 can be parsed two ways
  - Problem: 27 - (4 + 3) != (27 - 4) + 3
- Ways to resolve ambiguity
  - Precedence
    - Group * before +
    - Parse 3 * 4 + 2 as (3 * 4) + 2
  - Associativity
    - Parenthesize operators of equal precedence to left (or right)
    - Parse 3 - 4 + 5 as (3 - 4) + 5
Ambiguous grammars
Language to generate strings of the form a<sup>n</sup>b<sup>n</sup>
S -> aS<sub>2</sub> | S<sub>1</sub>b
S<sub>1</sub> -> a | aS<sub>1</sub>b
S<sub>2</sub> -> b | aS<sub>2</sub>b
An unambiguous grammar
Language to generate strings of the form a<sup>n</sup>b<sup>n</sup>
S -> aSb
S -> epsilon
Or more simply,
S -> aSb | ab
23 February 2005
Some discussion of interviewing and what happened at the ACM meeting yesterday.
Lexical analysis
“Programs must be written for people to read, and only incidentally for machines to execute.”
Abelson & Sussman, SICP, preface to the first edition
What is lexical analysis?
- Process of breaking input into tokens
- Sometimes called “scanning” or “tokenizing”
- Identifies tokens in input string (stream)
Issues in lexical analysis:
- Lookahead
- Ambiguities
Specifying lexers:
- Regular expressions (next time)
- Examples of regex
Tokens
What’s a token? - A syntactic category
Examples:
- English
- noun
- verb
- adjective
- A programming language
- Identifiers
- Comment
- Keyword
- Whitespace
- Operator
- Numbers (integer, real)
Recall
The goal of lexical analysis is to:
- Partition the input string into lexemes
- Identify the token type, and perhaps value of each lexeme
- Left-to-right scan -> lookahead sometimes required
http://dictionary.reference.com/search?q=lexeme
We still need:
- A way to describe the lexemes of each token
- A way to resolve ambiguities
Example:
- Is if one keyword or two variables i and f?
- Is == one operator or two equal signs = =?
Examples of lexical analysis
Goal: Partition input string into substrings. The substrings are tokens (or lexemes)
What happens?
if( i <= 9 )
dx = 1;
else
dx = 5;
The input is a stream of characters:
i f ( i < = 9 ) \newline
\tab d x = 1 ; \newline
e l s e \newline
\tab d x = 5 ; \newline
Defining tokens more precisely
Token categories correspond to sets of strings, some of them finite, like keywords, but some unbounded (variable / function name).
- Identifier: strings of letters or digits, starting with a letter
- Integer: a non-empty string of digits
- Keyword: if, else, for, etc.
- Whitespace: a non-empty sequence of spaces (blanks), tabs, or newlines
How are tokens created?
- Created in a first pass over the source code. The lexer classifies program substrings by role.
- Compilers frequently add additional information (line and column), so errors can be reported by location. This can complicate the construction of a compiler.
- Tokens are read until the end of file is reached, or some error threshold is reached.
Designing a lexical analyzer
- Define a finite set of tokens
- Tokens describe our items of interest
- Depends on language and parser design
Token creation
An implementation reads characters until it finds the end of a token (longest possible single token). It then returns a package of up to 3 items.
- What kind of token is it? Identifier? Number? Keyword?
- If it is an identifier, which one exactly? Foo? Bar?
- Where did this token appear in the source code? (line / column)
- Whitespace and comments are skipped.
Onwards to part 2…