%% This LaTeX-file was created by <ceci> Tue May 12 11:26:19 1998
%% LyX 0.12 (C) 1995-1998 by Matthias Ettrich and the LyX Team

%% Do not edit this file unless you know what you are doing.
\documentclass{article}
\usepackage[T1]{fontenc}

\makeatletter


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
\newcommand{\LyX}{L\kern-.1667em\lower.25em\hbox{Y}\kern-.125emX\spacefactor1000}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
\newenvironment{lyxcode}
  {\begin{list}{}{
    \setlength{\rightmargin}{\leftmargin}
    \raggedright
    \setlength{\itemsep}{0pt}
    \setlength{\parsep}{0pt}
    \ttfamily}%
   \item[]}
  {\end{list}}

\makeatother

\begin{document}


\title{Implementing Write Support in dtfs}
\smallskip{}


\author{Christian Czezatke}

\maketitle
\begin{abstract}
This document gives a short overview of the support for read/write operations
in dtfs. It shows how the dtfs core can be implemented without making any assumptions
about the traditional file system's indirect addressing scheme. It is recommended
to have a look at chapter 1 of ``Implementing dtfs'', although that document
is no longer completely up to date.
\end{abstract}
\tableofcontents


\section{Integrating dtfs Writes in Linux}
\smallskip{}

dtfs reads blocks from the device by means of the traditional Linux buffer cache.
Writes, however, do not go through the buffer cache. Instead,
dtfs implements its own write cache, the segment manager. This is necessary
for two reasons:
\smallskip{}

\begin{enumerate}
\item It would be very hard to implement the special write semantics that are required
for a log-structured file system by using the Linux buffer cache for writes.
\smallskip{}

The reason is that the Linux buffer cache is implemented in a way that
makes it impossible for the file system to determine the actual sequence of
writes performed on the device. However, control over the actual
write sequence is required because
\smallskip{}

\begin{itemize}
\item a log-structured file system has to guarantee transaction-like semantics when
writing out a partial segment. \smallskip{}

\item a log-structured file system has to write out data in large chunks (and not
just block by block) in order to achieve good performance.\smallskip{}

\end{itemize}
\item When using the traditional Linux buffer cache, a block must be addressed by
the device it belongs to and its \emph{physical} address on that device. However,
this addressing scheme causes several problems when used with dtfs that will
be discussed in this article.\smallskip{}

\end{enumerate}
In the long run, the standard Linux buffer cache and dtfs's write cache should
be merged again.


\section{dtfs and Traditional Filesystem's Indirect Data}
\smallskip{}


\subsection*{Dealing With Different Kinds Of Indirect Data}

A filesystem must use some kind of indirect information to refer to the actual
data blocks in a file. The problem is that different kinds of traditional file
systems use different ways to represent this information. Traditional Unix file
systems use an addressing scheme that involves indirect blocks containing pointers
to the actual data blocks, while other file systems use different structures
to represent this information.
\smallskip{}

Since the dtfs core should be able to support different traditional file systems
it would be beneficial if the dtfs core could be implemented in a way that avoids
making any assumption about the structure of this indirect information.
\smallskip{}

Furthermore, the Linux VFS layer does not make any assumptions about this indirect
information either. Blocks within a file are addressed by the inode they belong
to and their logical block number within the file. 
\smallskip{}

The BSD LFS implementation uses a special technique to map the indirect blocks
used by the FFS into the 4.4BSD cache, which accesses blocks by the vnode number
of their owner and their logical offset within the file (essentially
the same view as the one used by the Linux VFS interface).
\smallskip{}

Since the Linux VFS layer does not deal with indirect information directly,
there should be no need to deal with it within the generic
dtfs code either. The handling of indirect information should therefore be left to the
filesystem-specific part only.
\smallskip{}


\subsection*{The Role Of the Segment Manager}

Dirty file blocks are accumulated in the segment manager. The segment manager
must be able to provide the following services to other layers of dtfs:

\begin{itemize}
\item Fast access to a (modified) block of a file indexed by the inode number it belongs
to and the logical block number within the file.
\item Access blocks in a file-by-file way: It must be possible to retrieve all the
dirty blocks of a file ordered by their logical block number. This is required
for writing out the blocks to a partial segment.
\end{itemize}
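
The two services listed above could be provided by a structure along the following
lines. This is only a sketch in C and not the dtfs implementation: the names
\texttt{segmgr\_find}, \texttt{segmgr\_add} and \texttt{segmgr\_first}, the hash table
size and the sorted per-file list are assumptions made for illustration.
\smallskip{}

\begin{verbatim}
```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SEGMGR_BUCKETS 64            /* hypothetical hash table size */

struct dirty_block {
    uint32_t log_blocknum;           /* logical block number within the file */
    void *data;                      /* block contents, opaque in this sketch */
    struct dirty_block *next;        /* next dirty block of the same file */
};

struct dirty_file {
    unsigned inode_num;
    struct dirty_block *blocks;      /* kept sorted by log_blocknum */
    struct dirty_file *next;         /* hash chain */
};

struct segmgr {
    struct dirty_file *bucket[SEGMGR_BUCKETS];
};

/* Find the per-file record for an inode, optionally creating it. */
static struct dirty_file *segmgr_file(struct segmgr *sm, unsigned inode_num,
                                      int create)
{
    struct dirty_file **head = &sm->bucket[inode_num % SEGMGR_BUCKETS];
    struct dirty_file *f;

    for (f = *head; f != NULL; f = f->next)
        if (f->inode_num == inode_num)
            return f;
    if (!create)
        return NULL;
    f = calloc(1, sizeof(*f));
    if (f == NULL)
        return NULL;
    f->inode_num = inode_num;
    f->next = *head;
    *head = f;
    return f;
}

/* File-by-file access: first dirty block; follow ->next for ascending order. */
struct dirty_block *segmgr_first(struct segmgr *sm, unsigned inode_num)
{
    struct dirty_file *f;

    assert(sm != NULL);
    f = segmgr_file(sm, inode_num, 0);
    return f != NULL ? f->blocks : NULL;
}

/* Indexed access: look up one dirty block by (inode number, logical block). */
struct dirty_block *segmgr_find(struct segmgr *sm, unsigned inode_num,
                                uint32_t log_blocknum)
{
    struct dirty_block *b = segmgr_first(sm, inode_num);

    while (b != NULL && b->log_blocknum < log_blocknum)
        b = b->next;
    return (b != NULL && b->log_blocknum == log_blocknum) ? b : NULL;
}

/* Insert (or replace) a dirty block, keeping the per-file list sorted. */
struct dirty_block *segmgr_add(struct segmgr *sm, unsigned inode_num,
                               uint32_t log_blocknum, void *data)
{
    struct dirty_file *f;
    struct dirty_block **pos, *b;

    assert(sm != NULL);
    f = segmgr_file(sm, inode_num, 1);
    if (f == NULL)
        return NULL;
    for (pos = &f->blocks; *pos != NULL; pos = &(*pos)->next)
        if ((*pos)->log_blocknum >= log_blocknum)
            break;
    if (*pos != NULL && (*pos)->log_blocknum == log_blocknum) {
        (*pos)->data = data;         /* block is already dirty: replace it */
        return *pos;
    }
    b = malloc(sizeof(*b));
    if (b == NULL)
        return NULL;
    b->log_blocknum = log_blocknum;
    b->data = data;
    b->next = *pos;
    *pos = b;
    return b;
}
```
\end{verbatim}

Keeping each file's dirty blocks in a sorted list makes the write-out order trivial,
but lookups within a large file are linear. A real segment manager would probably
want a faster per-file index.
\smallskip{}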

\section{Integrating Reads and Writes}
\smallskip{}


\subsection*{Accessing blocks through the Address Translation Layer (ATL)}

Every read operation performed by a tradfs using the dtfs layer must be performed
by calling either
\smallskip{}

\texttt{atl\_bread(uint~inode\_num,~uint32~log\_blocknum,~int~create,~int~want\_modify);}
\smallskip{}

or
\smallskip{}

\texttt{atl\_getblk(uint~inode\_num,~uint32~log\_blocknum);}
\smallskip{}

The difference between these two calls is that a call to \texttt{atl\_getblk}
indicates that the caller is not interested in the current content of
the file's block (for example, because the caller is going to overwrite the block anyway)
but only needs a handle for the block.\footnote{
Check with getblk implementation of the kernel.
} 
\smallskip{}

Locking blocks is not necessary since the kernel serializes write access to
an inode and all the data structures associated with it in a higher layer (have
a look at fs/read\_write.c) by using semaphores.\footnote{
However, on an SMP system, writes to different files can happen concurrently,
so care should be taken when updating the segment manager's data structures.
} 
\smallskip{}


\subsection*{Writing Out A Segment -- An Outline}

Every write operation to a file will finally happen to a block that has been
accessed by a call to \texttt{atl\_bread} with the \texttt{want\_modify}
flag set or by a call to \texttt{atl\_getblk}. So all dirty blocks are aggregated
within the segment manager.
\smallskip{}

The segment manager groups dirty blocks by the file they belong to. Since inodes
are mapped into files, a modification to an inode will finally boil down to
a simple file operation too. When a segment gets written out, the dirty blocks
get written on a per-file basis. The segment manager writes out the dirty data
blocks and adds their new physical location to a database within the address
translation layer.
\smallskip{}

When the writing of dirty data blocks for a certain file is finished, a function
in the filesystem abstraction layer (FAL) will be called that rebuilds the indirect
data of that file and writes it out. The physical address of these indirect
data blocks is added to a special address space in the fixup database of the
address translation layer. 
\smallskip{}

After that the next dirty file is written out.
\smallskip{}
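
The outline above can be sketched as a loop over the dirty files. The following C
fragment is a simplified model under several assumptions: no device I/O is performed,
blocks are simply assigned consecutive physical addresses, and the names
\texttt{write\_segment}, \texttt{fal\_rebuild\_fn}, \texttt{fal\_rebuild\_one\_block}
and the layout of the fixup database are hypothetical.
\smallskip{}

\begin{verbatim}
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FIXUPS 128               /* hypothetical capacity */

enum fixup_kind { FIXUP_DATA, FIXUP_INDIRECT };

struct fixup {
    enum fixup_kind kind;            /* which address space the entry is in */
    unsigned inode_num;
    uint32_t index;                  /* logical block or indirect-block index */
    uint32_t phys_addr;              /* new on-disk location */
};

struct atl {
    struct fixup entry[MAX_FIXUPS];
    size_t count;
};

struct dirty_file {
    unsigned inode_num;
    const uint32_t *log_blocknums;   /* dirty blocks, ascending */
    size_t nblocks;
};

/* FAL hook: rebuilds a file's indirect data and returns how many
 * indirect blocks have to be written for it. */
typedef uint32_t (*fal_rebuild_fn)(unsigned inode_num);

static void atl_add(struct atl *atl, enum fixup_kind kind, unsigned inode_num,
                    uint32_t index, uint32_t phys_addr)
{
    struct fixup *e;

    assert(atl->count < MAX_FIXUPS);
    e = &atl->entry[atl->count++];
    e->kind = kind;
    e->inode_num = inode_num;
    e->index = index;
    e->phys_addr = phys_addr;
}

/* Lay out all dirty files starting at seg_start, file by file.
 * Returns the first physical block after the written data. */
uint32_t write_segment(struct atl *atl, const struct dirty_file *files,
                       size_t nfiles, uint32_t seg_start,
                       fal_rebuild_fn fal_rebuild)
{
    uint32_t next = seg_start;
    uint32_t k, nindirect;
    size_t i, j;

    for (i = 0; i < nfiles; i++) {
        /* 1. data blocks of this file, in logical order */
        for (j = 0; j < files[i].nblocks; j++)
            atl_add(atl, FIXUP_DATA, files[i].inode_num,
                    files[i].log_blocknums[j], next++);
        /* 2. indirect data rebuilt by the FAL, recorded separately */
        nindirect = fal_rebuild(files[i].inode_num);
        for (k = 0; k < nindirect; k++)
            atl_add(atl, FIXUP_INDIRECT, files[i].inode_num, k, next++);
    }
    return next;
}

/* Stand-in FAL hook for experiments: every file needs one indirect block. */
uint32_t fal_rebuild_one_block(unsigned inode_num)
{
    (void)inode_num;
    return 1;
}
```
\end{verbatim}

With two dirty files and a segment starting at block 100, the first file's data and
indirect blocks are placed before any block of the second file, which mirrors the
per-file ordering described above.
\smallskip{}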


\subsection*{Delayed Indirect Information Update}

When dtfs runs out of free blocks in the partial segment before all dirty files
are written out, it may decide to continue writing out the data in another partial
segment or may decide to delay the pending write operations further. 
\smallskip{}

No special measures need to be taken in the first case, when dtfs continues to write
out dirty blocks into a new partial segment.

However, the second case is a bit more complicated to deal with: In this case
a situation may arise in which some data blocks of a dirty file are already
written out (so that they are not in the segment manager any more), but not
the updated indirect data. 
\smallskip{}

In order to be able to locate these new data blocks while the metadata information
is not written out, the entries for these blocks in the address translation
layer must remain there until all the metadata information is finally written
out.\footnote{
However, all this could be avoided if we decide to write out \emph{all} the
available dirty blocks once we have started writing. This has the drawback that
many partial segments will be written that are smaller than a logical segment.
However, maybe we can simply live with it since this means only one additional
block to be written out.
} 
\smallskip{}
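
The rule that entries must stay in the address translation layer until the indirect
data has been written can be illustrated with a small C sketch. The fixed-size table
and the names \texttt{atl\_record}, \texttt{atl\_lookup} and
\texttt{atl\_metadata\_written} are assumptions made for this example and not part
of dtfs.
\smallskip{}

\begin{verbatim}
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ATL_SLOTS 64                 /* hypothetical capacity */

struct atl_entry {
    int in_use;
    unsigned inode_num;
    uint32_t log_blocknum;
    uint32_t phys_addr;
};

struct atl {
    struct atl_entry slot[ATL_SLOTS];
};

/* Remember where a data block was written.
 * Returns 0 on success, -1 if the table is full. */
int atl_record(struct atl *atl, unsigned inode_num, uint32_t log_blocknum,
               uint32_t phys_addr)
{
    struct atl_entry *free_slot = NULL;
    size_t i;

    assert(atl != NULL);
    for (i = 0; i < ATL_SLOTS; i++) {
        struct atl_entry *e = &atl->slot[i];

        if (!e->in_use) {
            if (free_slot == NULL)
                free_slot = e;
        } else if (e->inode_num == inode_num &&
                   e->log_blocknum == log_blocknum) {
            e->phys_addr = phys_addr;    /* block was written again */
            return 0;
        }
    }
    if (free_slot == NULL)
        return -1;
    free_slot->in_use = 1;
    free_slot->inode_num = inode_num;
    free_slot->log_blocknum = log_blocknum;
    free_slot->phys_addr = phys_addr;
    return 0;
}

/* Returns 1 and stores the address if the block has a pending entry. */
int atl_lookup(const struct atl *atl, unsigned inode_num,
               uint32_t log_blocknum, uint32_t *phys_addr)
{
    size_t i;

    assert(atl != NULL && phys_addr != NULL);
    for (i = 0; i < ATL_SLOTS; i++) {
        const struct atl_entry *e = &atl->slot[i];

        if (e->in_use && e->inode_num == inode_num &&
            e->log_blocknum == log_blocknum) {
            *phys_addr = e->phys_addr;
            return 1;
        }
    }
    return 0;
}

/* Called once the file's indirect data is on disk: from now on the
 * traditional file system can locate the blocks itself. */
void atl_metadata_written(struct atl *atl, unsigned inode_num)
{
    size_t i;

    assert(atl != NULL);
    for (i = 0; i < ATL_SLOTS; i++)
        if (atl->slot[i].in_use && atl->slot[i].inode_num == inode_num)
            atl->slot[i].in_use = 0;
}
```
\end{verbatim}

Entries are dropped per file and only by \texttt{atl\_metadata\_written}, so a block
that has left the segment manager stays reachable for as long as the on-disk indirect
data is stale.
\smallskip{}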


\section{Delayed Indirect Information Update And Read Accesses}

Every read access to any block of a file will finally be performed by using
a call to atl\_bread. From the write algorithm outlined in the section above,
the following pseudo-code implementation of atl\_bread can be derived:
\smallskip{}

\begin{lyxcode}
atl\_bread(inode\_num,~log\_blocknum,~create,~want\_modify)

\{

~~~~~block~=~find\_block\_in\_segmgr(inode,~log\_blocknum,~create);

~~~~~if~(found(block))

~~~~~~~~return~block;

~~~~~

~~~~~phys\_blockaddr~=~lookup\_phys\_blockaddress\_in\_atl(inode,~

~~~~~~~~~~~~~~~~~~~~~~~~~log\_blocknum);

~~

~~~~~if~(found(phys\_blockaddr))~\{

~~~~~~~~~~~~~block~=~bread(inode->device,~phys\_blockaddr);

~~~~~~~~~~~~~if~(want\_modify)

~~~~~~~~~~~~~~~~~~~~~transfer\_block\_to\_segmgr(inode,~log\_blocknum,~

~~~~~~~~~~~~~~~~~~~~~~~~block);

~~~~~~~~~~~~~return~block;

~~~~~\}

~~

~~~~~block~=~fal\_readblock(inode,~log\_blocknum);

~~~~~if~(!found(block))~\{

~~~~~~~~if~(!create)

~~~~~~~~~~~~~~~~return~NO\_BLOCK;

~~~~~~~~return~get\_new\_block\_from\_segmgr(inode,~

~~~~~~~~~~~~~~~~~~~~~~~~~~~log\_blocknum);

~~~~~\}

~~

~~~~~if~(want\_modify)

~~~~~~~~transfer\_block\_to\_segmgr(inode,~log\_blocknum,~block);

~~~~~return~block;

\}~
\end{lyxcode}
So a block is first looked for in the segment manager. If it cannot be found
there, we try to look up its current physical block address in the address translation
layer. This takes care of situations in which only a part of the dirty blocks
of a file have already been written out and the on-disk indirect information
has not been updated.
\smallskip{}

If we can determine the physical address of the requested block on the device
by that means, the respective block is read in directly by calling the device
read function \texttt{bread}.\footnote{
maybe use ll\_rw\_block.
} 
\smallskip{}

If the block can be found in neither the segment manager nor the address
translation layer, the traditional file system is asked to read in the block
by calling \texttt{fal\_readblock}.
\smallskip{}

If this fails too, then the block is not yet a part of the file. This might
happen when new data is appended at the end of a file and a new disk block must
be allocated to hold the new data.
\smallskip{}

By using this scheme it is not necessary to update the indirect information
of the traditional file system at the same time the data is written by the application
program.
\smallskip{}


\section{Advantages Of This Approach}

When the application program writes data to a file, the blocks affected by the
write will be found in the segment manager without using the traditional filesystem's
indirect information. So it does not matter that this information may not be
up-to-date. Whenever data is accessed that has not been modified, it will not
be found in the segment manager and \texttt{fal\_readblock} might use the indirect
information of the traditional file system to locate the requested data. 
\smallskip{}

However, this does not cause a problem since the information needed from the
indirect information for locating a block that has not been modified is still
accurate.\footnote{
This holds true as long as the writing of block A in a file does not affect
the on-disk location of another, unmodified block B in the same file, which
should not be too hard a limitation.
} 
\smallskip{}

This also means that it is not necessary to map the indirect blocks somehow
into the logical block addressing scheme (as it is done by the BSD LFS implementation)
that is used by dtfs since it is not necessary to access them from dtfs code. 
\smallskip{}

Another benefit of this approach is that it avoids incorporating a certain addressing
scheme used by a traditional file system into dtfs-specific code, thus easing
the future porting of other traditional file systems that might use a different
scheme of indirection.
\smallskip{}

Furthermore, this strategy avoids any consistency problems in indirect block
information: The Linux VFS layer serializes write accesses to a file, but it
does not hinder any read accesses to that file while a write to this file is
in progress. Since the indirect information of the traditional file system is
not used to locate modified blocks of a file, no problems can arise from inconsistencies
in these data structures as they could happen for a read operation that is performed
while a write is in progress.
\smallskip{}

\appendix


\section{Changes}

\$Id: writesupport.lyx,v 1.2 1998/05/12 09:24:52 ceci Exp \$

\$Log: writesupport.lyx,v \$

Revision 1.1 1998/05/12 09:14:06 ceci: Initial revision



\end{document}
