14

Multimedia

1. Define multimedia.

Ans: The word multimedia is made up of two separate words: multi, meaning many, and media, meaning the ways through which information may be transmitted. Therefore, multimedia can be described as an integration of multiple media elements (such as text, graphics, audio, video and animation) used to convey the given information, so that it can be presented in an attractive and interactive manner. In simple words, multimedia means being able to communicate in more than one way.

2. What are the different categories of Internet audio/video services?

Ans: Earlier, the use of the Internet was limited to sending/receiving text and images; however, nowadays, it is vastly being used for audio and video services. These audio and video services are broadly classified into three categories, namely, streaming stored audio/video, streaming live audio/video and real-time interactive audio/video. Here, the term streaming implies that the user can listen to or view the audio/video file after its downloading has begun.

Streaming Stored Audio/Video: In this category, the audio/video files are kept stored on a server in a compressed form. The users can download these stored files whenever required using the Internet; that is why this category is also termed as on-demand audio/video. Some examples of stored audio/video include songs, movies and video clips.
Streaming Live Audio/Video: In this category, as the term live implies, the users can listen to or view audio/video that are broadcast through the Internet. Some examples of live audio/video applications include Internet radio and Internet TV.
Real Time (Interactive) Audio/Video: In this category, users can communicate with each other in an interactive manner using the Internet. This category of Internet audio/video is used for real-time interactive audio/video applications such as Internet telephony, online video chatting and Internet teleconferencing.

3. How does streaming live audio/video differ from streaming stored audio/video?

Ans: The major difference between streaming live audio/video and streaming stored audio/video is that in the former, the communication is live rather than on demand and the audio/video is broadcast to multiple users at the same time, whereas in the latter, the communication is meant for a single user (unicast) and occurs on demand when the user downloads the file.

4. Explain the approaches that can be used to download streaming stored audio/video files.

Ans: The streaming stored audio/video files are stored on a server in compressed form. In order to listen to or view these files, one needs to download them. There are mainly four approaches that can be used to download streaming audio/video files. These approaches are discussed as follows:

Using a Web Server

This is the simplest approach, in which the compressed audio/video file stored on a Web server is downloaded just like a text file. The following steps are involved in downloading a file from the Web server.

  1. The browser on the client machine establishes a TCP connection with the Web server on which the desired audio/video file is stored.
  2. After the connection has been established, the browser requests for desired file by sending a GET message to the Web server using the HTTP services.
  3. The Web server in response retrieves the desired compressed file from the disk and sends it to the browser on the client machine.
  4. The browser writes the received audio/video file to the disk.
  5. The browser uses some helper application such as Windows Media Player that retrieves the contents of file from the disk block by block and plays back the file.

This approach does not involve streaming. Thus, the file cannot be played back on the media player until it has been downloaded entirely.
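To make the sequence concrete, here is a minimal sketch in Python; the URL and filename are hypothetical examples, and the whole file is fetched over HTTP and written to disk before any player could touch it.

```python
# Minimal sketch of the first approach. The URL and filename are
# hypothetical; any HTTP-reachable audio/video file would do.
import urllib.request

MEDIA_URL = "http://example.com/media/song.mp3"   # hypothetical URL

# Steps 1-3: open a TCP connection, send GET and receive the
# complete compressed file in the HTTP response.
with urllib.request.urlopen(MEDIA_URL) as response:
    data = response.read()        # blocks until the whole file arrives

# Step 4: write the received audio/video file to disk.
with open("song.mp3", "wb") as f:
    f.write(data)

# Step 5: only now can a helper application (media player) read the
# file from disk block by block and play it back.
```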

Using a Web Server with Metafile

This approach overcomes the limitation of the first approach by allowing the media player on the client machine to connect directly with the Web server, so that the file can be played back as soon as its downloading starts. Two files are stored on the Web server: the original audio/video file and a metafile that contains the link (URL) to the desired file. The following steps are involved in this approach.

  1. The browser on the client machine sends the GET message using the HTTP services to access the Web server.
  2. The Web server in response sends back the metafile.
  3. The browser passes the metafile to the media player.
  4. The media player reads the URL in the metafile and sends a GET message using the HTTP services to access the audio/video file.
  5. The Web server now responds to the media player and not to the browser.
  6. The file is played back on the media player as it is being fetched from the Web server.
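A minimal sketch of this indirection follows, assuming for simplicity a metafile that holds only the media URL on a single line (real metafile formats such as .m3u carry more structure); the playback routine is a hypothetical stand-in.

```python
# Minimal sketch of the metafile approach. URLs are hypothetical and
# the metafile is assumed to hold just the media URL on one line.
import urllib.request

METAFILE_URL = "http://example.com/media/song.m3u"   # hypothetical

def play(chunk: bytes) -> None:
    """Hypothetical stand-in for handing a block of data to a decoder."""

# Steps 1-3: the browser fetches the small metafile and passes it on.
with urllib.request.urlopen(METAFILE_URL) as response:
    media_url = response.read().decode().strip()     # URL of the real file

# Steps 4-6: the media player (not the browser) requests the media file
# and can begin playback while the body is still being downloaded.
with urllib.request.urlopen(media_url) as response:
    while chunk := response.read(4096):              # read block by block
        play(chunk)
```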

Using a Media Server

In the second approach, both the browser and the media player access the Web server using the HTTP services. However, in certain cases, the server mentioned in the URL stored in the metafile may be different from the Web server. Moreover, it may not even be an HTTP server, but some specialized media server. In such cases, the audio/video files cannot be downloaded using HTTP; rather, any protocol that runs over UDP can be used to download the file. The following steps are involved in this approach.

  1. The browser on the client machine uses the HTTP services and sends the GET message to access the Web server.
  2. The Web server in response sends back the metafile.
  3. The browser passes the metafile to the media player.
  4. The media player reads the URL in the metafile and sends a GET message to the media server to download the audio/video file from it.
  5. The media server responds to the media player.
  6. The file is played back on the media player as it is being fetched from the media server.

Using a Media Server and Real Time Streaming Protocol (RTSP)

In this approach, the media server uses RTSP, which adds control functionalities such as play, record and pause to the streaming of media files. The steps involved in this approach are as follows:

  1. The browser on the client machine uses the HTTP services and sends the GET message to access the Web server.
  2. The Web server in response sends back the metafile.
  3. The browser passes the metafile to the media player.
  4. The media player reads the URL in the metafile and sends a SETUP message to the media server to establish a connection between them.
  5. The media server responds to the media player.
  6. The media player forwards a PLAY message to the media server to start downloading of file.
  7. The downloading of the file starts using a protocol that runs over UDP. The file is played back on the media player as it is being fetched from the media server. Notice that while the file is being downloaded, the media player can send a PAUSE message to the media server to temporarily stop the downloading, which can later be resumed by sending a PLAY message.
  8. After the file has been downloaded completely, the media player sends a TEARDOWN message to the media server to release the connection between them.
  9. The media server in response releases the connection.
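Since RTSP is a text-based protocol, the control messages above can be sketched directly. The fragment below assumes a hypothetical host, port and media path, and omits details a real client must handle, such as parsing the Session header from the SETUP response and echoing it in later requests.

```python
# Minimal sketch of the RTSP control sequence. Host, port and path are
# hypothetical; the media itself flows separately over UDP.
import socket

HOST, PORT = "media.example.com", 554
URL = "rtsp://media.example.com/song"

def rtsp_request(sock: socket.socket, method: str, cseq: int,
                 extra: str = "") -> str:
    """Send one RTSP request and return the raw response text."""
    msg = f"{method} {URL} RTSP/1.0\r\nCSeq: {cseq}\r\n{extra}\r\n"
    sock.sendall(msg.encode())
    return sock.recv(4096).decode()

with socket.create_connection((HOST, PORT)) as sock:
    # Steps 4-5: set up the stream; the Transport header proposes the
    # client UDP ports for the media data and its control channel.
    rtsp_request(sock, "SETUP", 1,
                 "Transport: RTP/AVP;unicast;client_port=5004-5005\r\n")
    # Steps 6-7: start the download; a PAUSE could be sent here and the
    # download resumed later with another PLAY.
    rtsp_request(sock, "PLAY", 2)
    # Steps 8-9: release the connection once playback is finished.
    rtsp_request(sock, "TEARDOWN", 3)
```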

5. Write a short note on the real-time transport protocol (RTP).

Ans: RTP is a transport layer protocol designed for interactive (real-time) multimedia applications on the Internet such as Internet telephony, video-on-demand and video conferencing. These applications involve multiple streams of multimedia data including text, audio, video and graphics, being transmitted in real time over the Internet and RTP is responsible for managing these streams. RTP has been placed above UDP in the protocol hierarchy and is implemented in the user space (application layer); however, it must be used with UDP.

The basic function of RTP is to multiplex multiple real-time streams of multimedia data onto a single stream of UDP packets. To do this, the different streams of multimedia data are first sent to the RTP library, which is implemented in the user space along with the multimedia application. The RTP library multiplexes these streams and encodes them in RTP packets, which are then stuffed into a socket. At the other end of the socket, the UDP packets are created and embedded in IP packets. Depending on the physical medium, say Ethernet, the IP packets are further embedded in Ethernet frames for transmission. Other functions of RTP include sequencing, time-stamping and mixing.

6. Explain the RTP packet header format.

Ans: RTP is a transport layer protocol that has been designed to handle real-time traffic on the Internet. Figure 14.1 shows the format of an RTP packet header, which consists of various fields. Each field of the RTP packet header is described as follows:

Ver.: It is a 2-bit long field that indicates the version number of RTP. The current version of RTP is 2.
P: It is a 1-bit long field that indicates the presence of padding (if set to 1) or absence of padding (if set to 0) at the end of the packet. Usually, the packet is padded if it has been encrypted. The number of padding bytes added to the data is indicated by the last byte in padding.
X: It is a 1-bit long field that is set to 1 if an extra extension header is present between the basic header and the data; otherwise, its value is 0.
Contributor Count (CC): It is a 4-bit long field that signifies the total number of contributing sources participating in the session. Since this field is of four bits, its value can range from 0 to 15, that is, the maximum of 15 contributing sources can participate in the same session.
M: It is a 1-bit marker field that is specific to the application being used. The application uses this marker to indicate certain things such as the start of a video frame and the end of a video frame.
Payload Type: It is a 7-bit long field that indicates the type of encoding algorithm used such as MP3, delta encoding and predictive encoding.
Sequence Number: It is a 16-bit long field that indicates the sequence number of the RTP packet. Each packet that is sent in an RTP stream during a session is assigned a specific sequence number. The sequence number for the first packet during a session is selected randomly and it is increased by one for every subsequent packet. The purpose of assigning sequence number to packets is to enable the receiver to detect the missing packets (if any).
Timestamp: It is a 32-bit long field that is set by the source of the stream to indicate when the first sample in the packet was produced. Generally, the first packet is assigned a timestamp value at random, while the timestamp value of each subsequent packet is assigned relative to that of the first (or previous) packet. For example, consider four packets with each packet containing 15 s of information. This implies that if the first packet starts at t = 0, then the second packet should start at t = 15, the third packet at t = 30 and the fourth at t = 45. This time relationship between the packets must be preserved at playback time on the receiver to avoid the jitter problem. For this, timestamp values are assigned to the packets. Suppose the first packet is assigned a timestamp value 0; then the timestamp values for the second, third and fourth packets should be 15, 30 and 45, respectively. The receiver, on receiving a packet, adds the packet's timestamp to the time at which it started playback to determine when the packet should be played. Thus, by separating the playback time of packets from their arrival time, the jitter problem is prevented at the receiver.
Synchronization Source Identifier: It is a 32-bit long field that indicates the source stream of the packet. In case of a single source, this field identifies that source. However, in case of multiple sources, this field identifies the mixer—the synchronization source—and the rest of the sources are the contributors identified by the contributing source identifiers. The role of mixer is to combine the streams of RTP packets from multiple sources and forward a new RTP packet stream to one or more destinations.
Contributing Source Identifier: It is a 32-bit long field that identifies a contributing source in the session. There can be maximum 15 contributing sources in a session and accordingly, 15 contributing source identifiers.

Figure 14.1 RTP Packet Header Format
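As a rough illustration of this layout, the following sketch packs the 12-byte fixed part of the header (with CC = 0, so no contributing source identifiers follow); the field values are illustrative only.

```python
# Minimal sketch: pack the fixed 12-byte RTP header described above.
import struct

def make_rtp_header(seq: int, timestamp: int, ssrc: int,
                    payload_type: int = 14, marker: int = 0) -> bytes:
    version, padding, extension, cc = 2, 0, 0, 0   # Ver., P, X, CC
    byte0 = (version << 6) | (padding << 5) | (extension << 4) | cc
    byte1 = (marker << 7) | payload_type           # M + 7-bit payload type
    # ! = network byte order; H = 16-bit sequence number;
    # I, I = 32-bit timestamp and synchronization source identifier.
    return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)

header = make_rtp_header(seq=1, timestamp=0, ssrc=0x1234ABCD)
assert len(header) == 12
```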

7. What is RTCP? Describe the packet types defined by it.

Ans: RTCP (which stands for real-time transport control protocol) is called a sibling protocol of RTP. RTP transports messages containing only data; however, at certain times, different types of messages that control the flow and quality of data are required to be sent. For example, a receiver may send feedback on the received quality of service to the source as well as to other participants in the session, so that better quality can be achieved in future. To send such messages, RTCP is used. An RTCP transmission contains five types of packets (containing messages) encapsulated into a single UDP datagram. These packet types are described as follows:

Sender Report (SR): This packet is sent periodically by the active senders in a session to report statistics about the quality of data transmission and reception. An absolute timestamp, that is, the number of seconds elapsed since midnight on January 1, 1970, is included in the sender's report; this helps the receiver in synchronizing multiple RTP streams.
Receiver Report (RR): This packet is used for passive participants that do not transmit any RTP packets. The RR packet communicates the information related to quality of service to the senders as well as other receivers.
Source Description (SDES): This packet is transmitted periodically by a source to give more information about itself such as its user name, e-mail address, geographical location and telephone number.
BYE: This packet is a direct announcement sent by a source to indicate its exit from a session. Whenever a mixer receives a BYE message, it forwards this message, along with the list of sources still participating in the session, to all the stations in the session. A BYE packet can also contain the reason for the exit as a description in text format (a packed example of a BYE packet follows this list).
Application-defined Packet: This packet is an experimental packet for applications that wish to try out new features that have not been defined in the RTCP standard. Eventually, if an experimental packet type is found useful, it may be assigned a packet type number and included in the RTCP standard.
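As a small illustration of the BYE packet mentioned above, the following sketch packs a BYE for a single source: a version/source-count byte, the BYE packet type (203), a length field counted in 32-bit words minus one, and the SSRC; the SSRC value itself is illustrative.

```python
# Minimal sketch of an RTCP BYE packet for one source (packet type 203).
import struct

def make_rtcp_bye(ssrc: int) -> bytes:
    version, padding, source_count = 2, 0, 1
    byte0 = (version << 6) | (padding << 5) | source_count
    packet_type = 203                    # BYE
    length = 1                           # packet length in 32-bit words - 1
    return struct.pack("!BBHI", byte0, packet_type, length, ssrc)

bye = make_rtcp_bye(0x1234ABCD)          # SSRC value is illustrative
assert len(bye) == 8                     # two 32-bit words
```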

8. How is SIP used in the transmission of multimedia?

Ans: SIP, which stands for session initiation protocol, is designed by IETF to handle communication in real-time (interactive) audio/video applications such as Internet telephony, also called voice over IP (VoIP). It is a text-based application layer protocol that is used for establishing, managing and terminating a multimedia session. This protocol can be used for establishing two-party, multiparty or multicast sessions during which audio, video or text data may be exchanged between the parties. The sender and receiver participating in the session can be identified through various means such as e-mail addresses, telephone numbers or IP addresses provided all these are in SIP format. That is why SIP is considered very flexible.

Some of the services provided by SIP include defining telephone numbers as URLs in Web pages, initiating a telephone call by clicking a link in a Web page, establishing a session from a caller to a callee and locating the callee. Some other features of SIP include call waiting, encryption and authentication.

SIP defines six messages that are used while establishing a session, communicating and terminating a session. Each of these messages is described as follows:

INVITE: This message is used by the caller to start a session.
ACK: This message is sent by the caller to callee to indicate that the caller has received callee's reply.
BYE: This message is used by either caller or callee to request for terminating a session.
OPTIONS: This message can be sent to any system to query about its capabilities.
CANCEL: This message is used to cancel the session initialization process that has already started.
REGISTER: This message is sent by a caller to a registrar server to track the callee in case the callee is not available at its terminal, so that the caller can establish a connection with the callee. SIP designates some of the servers on the network as registrar servers, each of which knows the IP addresses of the terminals registered with it. At any moment, each user (terminal) must be registered with at least one registrar server on the network.

Before starting transmission of audio/video between the caller and callee, a session needs to be started. A simple SIP session has three phases (see Figure 14.2), which are as follows:

Figure 14.2 Phases in Session Using SIP

  1. Session Establishment: The session between the caller and callee is established using the three-way handshake. Initially, the caller invites the callee to begin a session by sending it an INVITE message. If the callee is ready for communication, it responds to the caller with a reply message (OK). On receiving the reply message, the caller sends the ACK message to callee to confirm the session initialization.
  2. Communication: Once the session has been established, the communication phase commences during which the caller and callee exchange audio data using temporary port numbers.
  3. Session Termination: After the data has been exchanged, either the caller or the callee can request for the termination of session by sending a BYE message. Once the other side acknowledges the BYE message, the session is terminated.
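Because SIP is text-based, these phases can be illustrated with the raw messages themselves. The sketch below shows minimal versions of the handshake messages; the addresses and Call-ID are hypothetical, and a real INVITE also carries a body (for example, SDP) describing the media.

```python
# Minimal sketch of the SIP messages exchanged in the three phases.
# Addresses and Call-ID are hypothetical.
INVITE = (
    "INVITE sip:callee@example.com SIP/2.0\r\n"
    "From: sip:caller@example.com\r\n"
    "To: sip:callee@example.com\r\n"
    "Call-ID: 1234@caller.example.com\r\n"
    "CSeq: 1 INVITE\r\n\r\n"
)                                          # phase 1: caller invites
OK_REPLY = "SIP/2.0 200 OK\r\n\r\n"        # phase 1: callee is ready
ACK = (
    "ACK sip:callee@example.com SIP/2.0\r\n"
    "CSeq: 1 ACK\r\n\r\n"
)                                          # phase 1: handshake completes
# Phase 2: audio is exchanged over temporary port numbers (not shown).
BYE = (
    "BYE sip:callee@example.com SIP/2.0\r\n"
    "CSeq: 2 BYE\r\n\r\n"
)                                          # phase 3: either side terminates
```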

9. Explain the H.323 standard.

Ans: H.323 is an ITU standard that allows communication between the telephones connected to a public telephone network and the computers (called terminals) connected to the Internet. Like SIP, H.323 also allows two-party and multiparty calls using a telephone and a computer. The general architecture of this standard is shown in Figure 14.3.

Figure 14.3 H.323 Architecture

As shown in Figure 14.3, the two networks, the Internet and the telephone network, are interconnected via a gateway between them. As we know, a gateway is a five-layer device that translates messages from one protocol stack to a different protocol stack. In the H.323 architecture, the role of the gateway is to convert between the H.323 protocols on the Internet side and the PSTN protocols on the telephone side. The gatekeeper on the local area network serves as the registrar server and knows the IP addresses of the terminals registered with it.

H.323 comprises many protocols, which are used to initiate and manage the audio/video communication between the caller and the callee. The G.711 or G.723.1 protocols are used for compression and encoding/decoding of speech. The H.245 protocol is used between the caller and the callee to negotiate a common compression algorithm that will be used by the terminals. The H.225 or RAS (Registration/Admission/Status) protocol is used for communicating and registering with the gatekeeper. The Q.931 protocol is used for performing functions of the standard telephone system such as providing dial tones and establishing and releasing connections.

Following are the steps involved in communication between a terminal and a telephone using H.323.

  1. The terminal on a LAN that wishes to communicate with a remote telephone broadcasts a UDP packet to discover the gatekeeper. In response, the gatekeeper sends its IP address to the terminal.
  2. The terminal sends a RAS message in a UDP packet to the gatekeeper to register itself with the gatekeeper.
  3. The terminal communicates with the gatekeeper using the H.225 protocol to negotiate on bandwidth allocation.
  4. After the bandwidth has been allocated, the process of call setup begins. The terminal sends a SETUP message to the gatekeeper, which contains the telephone number of the callee or the IP address of a terminal if a computer is to be called. On receipt of the SETUP message, the gatekeeper sends a CALL PROCEEDING message to the terminal. The SETUP message is then forwarded to the gateway, which then makes a call to the desired telephone. As the telephone starts ringing, an ALERT message is sent to the calling terminal by the end office to which the desired telephone is connected. After someone picks up the telephone, a CONNECT message is sent by the end office to the calling terminal to indicate the establishment of the connection. Notice that during call setup, all entities including the terminal, gatekeeper, gateway and telephone communicate using the Q.931 protocol. After the connection establishment, the gatekeeper is no longer involved.
  5. The terminal, gateway and telephone communicate using the H.245 protocol to negotiate on the compression method.
  6. The audio/video in the form of RTP packets is exchanged between the terminal and the telephone via the gateway. For controlling the transmission, RTCP is used.
  7. Once either of the communicating parties hangs up, the connection needs to be terminated. For this, the terminal, gatekeeper, gateway and telephone communicate using the Q.931 protocol.
  8. The terminal communicates with the gatekeeper using the H.225 protocol to release the allocated bandwidth.

10. Define compression. What is the difference between lossy and lossless compression?

Ans: The components of multimedia such as audio and video cannot be transmitted over the Internet until they are compressed. Compression of a file refers to the process of cutting down the size of the file by using special compression algorithms. There are two types of compression techniques: lossy and lossless.

In the lossy compression technique, some data is deliberately discarded in order to achieve massive reductions in the size of the compressed file. With this kind of compression, the original data cannot be fully recovered from the compressed version. JPEG image files and MPEG video files are examples of lossy compressed files. On the other hand, in the lossless compression technique, the size of the file is reduced without permanently discarding any information of the original data. If an image that has undergone lossless compression is decompressed, the original data can be reconstructed exactly, bit for bit; that is, it will be identical to the digital image before compression. The PNG image file format uses lossless compression.
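The bit-for-bit claim for lossless compression is easy to demonstrate with a round trip through zlib, a lossless DEFLATE implementation of the kind used inside PNG:

```python
# Lossless round trip: the decompressed output is identical,
# bit for bit, to the original input.
import zlib

original = b"multimedia " * 1000
compressed = zlib.compress(original)

assert len(compressed) < len(original)           # the size shrinks ...
assert zlib.decompress(compressed) == original   # ... with no loss at all
```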

11. Write a short note on audio compression.

Ans: Before the audio data can be transmitted over the Internet, it needs to be compressed. Audio compression can be applied on speech or music. There are two categories of techniques that can be used to compress audio, namely, predictive encoding and perceptual encoding.

Predictive Encoding: In digital audio or video, successive samples are usually similar to each other. Considering this fact, only the first sample and the difference values between successive samples are stored in the compressed form. As the size of the difference value between two samples is much smaller than the size of a sample itself, this encoding technique saves much space. While decompressing, the previous sample and the difference value are used to reproduce the next sample. Predictive encoding is generally used for speech (a minimal sketch follows this list).
Perceptual Encoding: The human auditory system suffers from certain flaws. Exploiting this fact, the perceptual encoding technique encodes the audio signals in such a manner that they sound similar to human listeners, even though they are different. This technique is generally used for compressing music, as it can create CD-quality audio. MP3, a part of the MPEG standard, is the most common compression technique based on perceptual encoding.
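A minimal sketch of the predictive (delta) encoding idea described above: only the first sample is stored in full, and every later sample is reproduced from its predecessor plus a small difference value.

```python
# Minimal sketch of predictive (delta) encoding and decoding.
def delta_encode(samples: list[int]) -> list[int]:
    """Keep the first sample; store only differences after that."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def delta_decode(encoded: list[int]) -> list[int]:
    out = [encoded[0]]
    for diff in encoded[1:]:          # previous sample + difference
        out.append(out[-1] + diff)    # reproduces the next sample
    return out

samples = [100, 102, 101, 103, 104]   # successive samples are similar
encoded = delta_encode(samples)       # [100, 2, -1, 2, 1] -- small values
assert delta_decode(encoded) == samples
```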

12. How does frequency masking differ from temporal masking?

Ans: Effective audio compression takes into account the physiology of human hearing. The compression algorithm used is based on the phenomenon named simultaneous auditory masking—an effect that is produced due to the way the nervous system of human beings perceives sound. Masking can occur in frequency or time, accordingly named as frequency masking and temporal masking.

In frequency masking, a loud sound in one frequency range can partially or completely mask (hide) a softer sound in another frequency range. For example, in a room with loud noise, we are unable to properly hear a person who is talking to us.

In temporal masking, a loud sound can make our ears insensitive to any other sound for a few milliseconds. For example, a loud noise such as a gunshot or an explosion makes our ears numb for a very short time before we are able to hear properly again.

13. Explain the JPEG process.

Ans: JPEG, which stands for joint photographic experts group, is the standard compression technique used to compress still images. It can compress images in lossy and lossless modes and produces high-quality compressed images. Following are the steps involved in the JPEG image compression (lossy) process.

  1. Colour Sub-Sampling: This step is performed only if the image to be compressed is coloured; for gray-scale images, this step is not required. The RGB colour space of the image is changed to YUV colour space and its chrominance component is down-sampled.
  2. Blocking: The image is divided into a series of 8 × 8-pixel blocks. Blocking also reduces the number of calculations needed for an image.
  3. Discrete Cosine Transformation (DCT): Each block of 8 × 8 pixels goes through the DCT transformation to identify the spatial redundancy in an image. The result of DCT transformation for each 8 × 8 block of pixels is 8 × 8 block of DCT coefficients (that is, 64 frequency components in each block).
  4. Quantization: In this phase, the DCT coefficients in each block are scalar quantized with the help of a quantization table (Q-table) in order to wipe out the less important DCT coefficients. Each value in the block is divided by a weight taken from the corresponding position in Q-table by ignoring the fractional part. The changes made in this phase cannot be undone. That is why this JPEG method is considered lossy.
  5. Ordering: The output of quantization is then ordered in a zigzag manner to separate the low-frequency components (usually non-zero) from the high-frequency components (usually zero). Ordering results in a bit stream in which the zero-valued (high-frequency) components are placed close to the end.
  6. Run-Length Encoding: Run-length encoding is applied to the zeros of the zigzag sequence to eliminate the redundancy. This encoding replaces each repeated symbol in a given sequence with the symbol itself and the number of times it is repeated. For example, the text “cccccccbbbbuffffff” is compressed as “c7b4u1f6”. Thus, the redundant zeros are removed in this phase (a minimal sketch follows these steps).
  7. Variable-length Encoding: The variable-length encoding is applied on the output of the previous phase to get the compressed JPEG bit stream. In variable-length encoding, a variable number of bits are used to represent each character rather than a fixed number of bits for each character. Fewer bits are used to represent the more frequently used character; the most frequently used character can be represented by one bit only. This helps in reducing the length of compressed data.
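The run-length step (step 6) is simple enough to sketch directly; the fragment below reproduces the example from the text. (In JPEG itself the encoding is applied to the runs of zeros in the zigzag sequence rather than to characters.)

```python
# Minimal sketch of run-length encoding on characters.
from itertools import groupby

def run_length_encode(text: str) -> str:
    """Replace each run with the symbol followed by its repeat count."""
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(text))

assert run_length_encode("cccccccbbbbuffffff") == "c7b4u1f6"
```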

14. What is MPEG? Describe spatial and temporal compressions?

Ans: MPEG, which stands for moving picture experts group, is a method devised for the compression of a wide range of video and motion pictures. It is available in two versions: MPEG1 and MPEG2. The former version has been designed for CD-ROM with a data rate of 1.5 Mbps, while the latter version has been designed for DVD with a data rate of 3–6 Mbps.

Each video is composed of a set of frames where each frame is actually a still image. The frames in a video flow so rapidly (for example, 50 frames per second in TV) that the human eye cannot notice the discrete images. This property of human eye forms the basis of motion pictures. Video compression using MPEG involves spatial compression of each frame and temporal compression of a set of frames.

Spatial Compression: Each frame in the video is spatially compressed with JPEG. Since each frame is an image, it can be compressed separately. Spatial compression is used for purposes such as video editing, where frames need to be randomly accessed.
Temporal Compression: In this compression, the redundancy is removed among consecutive frames that are almost similar. For example, in a movie, there are certain scenes where the background is the same and stationary and only some portion, such as a hand movement, is changing. In such cases, most consecutive frames will be almost similar except for the portion of the frame covering the hand movement; that is, the consecutive frames will contain redundant information. The temporal redundancy can be eliminated using the differential encoding approach, which encodes the differences between adjacent frames and sends them. An alternative approach is motion compensation, which compares each frame with its predecessor and records the changes in coordinate values due to motion as well as the differences in pixels after motion.
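A minimal sketch of the differential-encoding idea, assuming frames represented as flat lists of pixel values: only the pixels that changed relative to the previous frame are recorded, so a largely static scene compresses to a short list of (position, value) pairs.

```python
# Minimal sketch of differential (temporal) encoding between frames.
def frame_difference(prev: list[int], curr: list[int]) -> list[tuple[int, int]]:
    """Record (index, new_value) for every pixel that changed."""
    return [(i, b) for i, (a, b) in enumerate(zip(prev, curr)) if a != b]

def apply_difference(prev: list[int], diff: list[tuple[int, int]]) -> list[int]:
    frame = list(prev)
    for i, value in diff:
        frame[i] = value
    return frame

frame1 = [10, 10, 10, 10, 10, 10]        # stationary background
frame2 = [10, 10, 40, 10, 10, 10]        # only one small region changed
diff = frame_difference(frame1, frame2)  # [(2, 40)]
assert apply_difference(frame1, diff) == frame2
```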

15. Differentiate among the different types of encoded frames used in MPEG video compression.

Ans: In MPEG video compression, the encoded frames fall under three categories, namely, intracoded (I) frames, predicted (P) frames and bidirectional (B) frames. These frames are described as follows:

I-frame: This frame is not associated with any other frame (previous or next). It is encoded independently of other frames in the video and contains all the necessary information that is needed to recreate the entire frame. Thus, I-frames cannot be constructed from any other frames. I-frames must appear in a movie at regular intervals to indicate the sudden changes in the frame.
P-frame: This frame relates to the frame preceding it, whether it is an I-frame or a P-frame. It contains small differences relative to its preceding I-frame or P-frame; however, it is not useful for recording major changes, for example, in the case of fast-moving objects. Thus, P-frames carry only a small amount of information as compared to other frames and take even fewer bits after compression. Unlike I-frames, a P-frame can be constructed only from its preceding frame.
B-frame: This frame, as the name implies, relates to its preceding as well as succeeding I-frame or P-frame. However, a B-frame cannot relate to any other B-frame. B-frames provide improved motion compensation and the best compression.

Multiple Choice Questions

  1. Which of the following services is provided by RTP?

    (a)   Time-stamping

    (b)   Sequencing

    (c)   Mixing

    (d)   All of these

  2. In________, the user can listen to or watch the file only after the file has been downloaded.

    (a)   Streaming stored audio/video

    (b)   Streaming live audio/video

    (c)   Interactive audio/video

    (d)   None of these

  3. MP3 audio compression uses two phenomena, namely,________.

    (a)   Spatial compression and temporal compression

    (b)   DCT and quantization

    (c)   Frequency masking and temporal masking

    (d)   None of these

  4. Which step of the JPEG image compression is not needed for grey-scale images?

    (a)  Quantization

    (b)  Colour sub-sampling

    (c)  Blocking

    (d)  Discrete cosine transformation

  5. Which of the following approaches used to download streaming stored audio/video does not involve streaming?

    (a)  Using a Web server

    (b)  Using a media server

    (c)  Using a Web server with a metafile

    (d)  Using a media server and RTSP

  6. Which of the following is a characteristic of real-time interactive audio/video?

    (a)  Time relationship

    (b)  Mixing

    (c)  Ordering

    (d)  All of these

  7. Which of the following message types is provided by both RTCP and SIP?

    (a)  INVITE

    (b)  BYE

    (c)  Sender report

    (d)  None of these

  8. In H.323 architecture,___________serves as the registrar server.

    (a)  Gateway

    (b)  Terminal

    (c)  Gatekeeper

    (d)  Telephone network

Answers

1. (d)

2. (a)

3. (c)

4. (b)

5. (a)

6. (d)

7. (b)

8. (c)